[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage #1438
riyosha wants to merge 1 commit into Azure:main from
Conversation
romanlutz
left a comment
This is really good! While reading it, I couldn't shake the feeling that this is very similar to RedTeamingAttack with the big difference that it cycles through the system prompt templates, of course. I haven't had time to compare with it in detail to see if that would be doable. My hunch is that it would introduce considerable complexity and is probably not worth it but I'd like to be sure...
Other things:
- needs mentioning in api.rst
- needs example notebook (both ipynb and py files) somewhere in doc/executor/attack, which in turn needs to be mentioned in TOC file. Example notebook doesn't need to be elaborate.
- needs integration test, perhaps just one that runs the example notebook. This may be auto-created by test_executor_notebooks.py I think...
    Returns:
        Optional[AttackScoringConfig]: The scoring configuration.
    """
    return AttackScoringConfig(
I'm a bit surprised that we're unpacking the attack scoring config in the constructor into these two below, and then reassembling it here. Is that a pattern you've seen in another executor?
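One way to avoid the unpack-and-reassemble round-trip is to store the config object whole and return it as-is. A minimal sketch of that pattern, using a stand-in dataclass (the real `AttackScoringConfig` fields and the `CoTHijackingAttack` class shape are assumptions, not PyRIT's actual API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttackScoringConfig:
    """Stand-in for PyRIT's AttackScoringConfig; field names are illustrative."""
    objective_scorer: Optional[object] = None
    refusal_scorer: Optional[object] = None


class CoTHijackingAttack:
    def __init__(self, *, attack_scoring_config: Optional[AttackScoringConfig] = None):
        # Keep the config object whole instead of unpacking it into
        # separate attributes and rebuilding it later.
        self._scoring_config = attack_scoring_config or AttackScoringConfig()

    def _get_scoring_config(self) -> AttackScoringConfig:
        # No reassembly needed: return the stored object as-is.
        return self._scoring_config


cfg = AttackScoringConfig(objective_scorer="my_scorer")
attack = CoTHijackingAttack(attack_scoring_config=cfg)
assert attack._get_scoring_config() is cfg
```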
    def __init__(
        self,
        *,
        objective_target: PromptTarget = REQUIRED_VALUE,  # type: ignore[assignment]
Most likely, this assumes we're dealing with a target that has reasoning capabilities, right? @hannahwestra25 is currently working on expanding TargetCapabilities so that could come in handy here for validation purposes.
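A validation hook along those lines could look like the sketch below. The `supports_reasoning` capability flag and the `PromptTarget.capabilities` attribute are assumptions for illustration; the in-progress `TargetCapabilities` API may look quite different:

```python
# Hypothetical sketch: reject non-reasoning targets before the attack runs.
class PromptTarget:
    def __init__(self, capabilities: set[str]):
        self.capabilities = capabilities


def validate_target(target: PromptTarget) -> None:
    # CoT hijacking manipulates the model's reasoning trace, so a
    # reasoning-capable target is a hard requirement.
    if "supports_reasoning" not in target.capabilities:
        raise ValueError(
            "CoT hijacking requires a reasoning-capable objective target."
        )


reasoning_target = PromptTarget({"supports_reasoning"})
validate_target(reasoning_target)  # passes silently

plain_target = PromptTarget(set())
try:
    validate_target(plain_target)
except ValueError as err:
    print(err)
```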
        outcome=final_outcome,
    )

    # AttackLM
    Returns:
        Optional[Score]: The score for the response.
    """
    if not self._objective_scorer:
Personally, I'm not a huge fan of this. If people want that, they can explicitly specify the current fallback scorer as their objective scorer. In fact, we have a fairly good refusal scorer. If you think this one is useful, we can add it as a deterministic option in the score module.
    if score_value >= 1.0:
        self._logger.info("Attack successful!")
        context.attack_succeeded = True
        final_outcome = AttackOutcome.SUCCESS
        break
    if score_value >= 0.8:
        self._logger.info("Attack largely successful (score 0.8+)")
        final_outcome = AttackOutcome.SUCCESS
        break
This should probably be configured via the scoring configuration. In other attacks, we expect a true/false scorer. So here it could be a threshold-based scorer that wraps a float-scale scorer and applies the threshold (0.8 or 1.0) on top of it to determine success.
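The wrapper idea can be sketched as below. This is a toy illustration, not PyRIT's scorer interface: `FloatScaleScorer` and its keyword-matching logic are invented here purely to make the threshold wrapper runnable.

```python
class FloatScaleScorer:
    """Toy scorer returning a value in [0.0, 1.0] (illustrative only)."""

    def score(self, response: str) -> float:
        return 1.0 if "secret" in response else 0.2


class ThresholdScorer:
    """Wraps a float-scale scorer and yields a true/false outcome."""

    def __init__(self, inner: FloatScaleScorer, threshold: float = 0.8):
        self._inner = inner
        self._threshold = threshold

    def score(self, response: str) -> bool:
        # Success is decided here via configuration, rather than being
        # hard-coded as 1.0 / 0.8 branches inside the attack loop.
        return self._inner.score(response) >= self._threshold


scorer = ThresholdScorer(FloatScaleScorer(), threshold=0.8)
assert scorer.score("the secret is out") is True
assert scorer.score("I cannot help with that") is False
```

With this shape, the attack loop only ever sees a boolean verdict, and the 0.8-vs-1.0 policy lives in one configurable place.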
Description
[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage
Related to issue #897
This PR introduces the Chain-of-Thought (CoT) Hijacking attack strategy, as described in Zhao et al. (2025). The changes include:
- pyrit/executor/attack/multi_turn/cot_hijacking.py
- pyrit/datasets/executors/cot_hijacking/puzzle_generation_{puzzle_type}.yaml
- tests/unit/executor/attack/multi_turn/test_cot_hijacking.py

Related issues: #897
Tests and Documentation
- tests/unit/executor/attack/multi_turn/test_cot_hijacking.py

This is a draft PR and I want to get your thoughts on the implementation so far. I have planned these updates:
Question:
Other attack executors define `async def _teardown_async` even if unused. Should I also add it?