```python
    assert jailbreak_string.template_source == "<string_template>"


def test_all_jailbreak_yaml_templates_have_is_general_strategy(jailbreak_dir):
```
I liked making this a property of the data and then just testing that all jailbreaks have it. I debated other approaches (like updating `TextJailBreak` to set it), but ultimately this seems like a property of the data, so I like this.
```python
class SeedObjective(Seed):
    """Represents a seed objective with various attributes and metadata."""

    is_general_strategy: bool = False
```
I understand the point is to use a boolean to distinguish general datasets from strategic datasets, but could this be a set instead? I imagine that if users are already going to want to filter by how generic a dataset is, they may want to track other pieces of information as well. Something like:

`is_general_strategy: set = set()` tells the user "this is a generic dataset", while
`is_general_strategy: set = {"jailbreak", "simulation"}` allows the user to check for the presence of any tags (this provides the generic/strategic split desired) while also tracking more specific details.

If I've just misunderstood the purpose of this flag, please feel free to say so 😄 this may be overengineering or only worth visiting in a follow-up PR.
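The tag-set idea could be sketched like this (a minimal, hypothetical stand-in for the real `Seed` class, with an illustrative `strategy_tags` field name, not PyRIT's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class Seed:
    # Hypothetical simplified stand-in; the real Seed class has many more fields.
    value: str
    # Empty set means "generic dataset"; tags mark strategic datasets.
    strategy_tags: set = field(default_factory=set)

    @property
    def is_general_strategy(self) -> bool:
        # The generic/strategic boolean split falls out of the tag set.
        return not self.strategy_tags


generic = Seed(value="hello")
strategic = Seed(value="DAN prompt", strategy_tags={"jailbreak", "simulation"})

assert generic.is_general_strategy
assert not strategic.is_general_strategy
assert "jailbreak" in strategic.strategy_tags
```

This keeps the existing boolean check working while letting callers filter on finer-grained tags later.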
```yaml
parameters:
  - prompt
data_type: text
is_general_strategy: true
```
Per my later point in the review, this also lets us treat it like an optional field: new or legacy datasets can be treated as generic by default, while strategic datasets get a new field that makes the distinction explicit rather than implicit.
```python
    missing = []
    for yaml_file in yaml_files:
        seed = SeedPrompt.from_yaml_file(yaml_file)
        if not seed.is_general_strategy:
```
Does this raise an error for non-boolean values, and if so, do we want to catch it explicitly or let `SeedPrompt` handle it? I'm curious whether this might evolve so that explicitly flagging false vs. mistyping vs. omitting the field become meaningfully different failure modes.
One problem we want to tackle is identifying unique attack techniques. As we are currently architected, this consists of two parts.

`AttackIdentifier`: this includes the attack, converters, targets, scorers, etc. These are the factors we want to include when we calculate how successful an attack is. But a gap we have is the datasets.

This PR includes a way to distinguish general datasets with `is_general_strategy`. To start, simulated conversations and jailbreaks will have this by default; others will not. In a future PR, we'll introduce an `AtomicAttackIdentifier` to uniquely identify these.