Home AI Tools and Trends

Anthropic Releases Bloom for AI Behavioral Testing

December 22, 2025

Anthropic launched Bloom, an open-source framework that automates the creation and execution of behavioral misalignment evaluations for frontier AI models. Bloom addresses a critical challenge in AI safety: evaluations generally take a long time to develop, and then run the risk of becoming obsolete through training set contamination or rapidly evolving capabilities. The tool allows researchers to specify any behavior — from sycophancy to self-preservation — and automatically generates diverse test scenarios to quantify how frequently and severely models exhibit those behaviors.

How Bloom Works: Four-Stage Evaluation Pipeline

Bloom operates through four automated stages that transform a behavior description and seed configuration into a complete evaluation suite with top-level metrics like elicitation rate and average behavior presence score. The Understanding agent analyzes the target behavior and example transcripts to grasp underlying mechanisms. The Ideation agent then generates evaluation scenarios designed to elicit the behavior, specifying simulated users, system prompts, and interaction environments. During Rollout, these scenarios execute in parallel with an agent dynamically simulating user and tool responses. Finally, the Judgment agent scores each interaction on a 1-10 scale for behavior presence, while a Meta-Judgment agent analyzes patterns across the entire suite.

Pipeline Stage	Function	Output
Understanding	Analyzes behavior description and examples	Conceptual framework for evaluation
Ideation	Generates diverse test scenarios	Scenario specifications with user profiles
Rollout	Executes conversations with target model	Complete interaction transcripts
Judgment	Scores behavior presence in each rollout	Quantitative metrics and analysis report

Getting Started with Bloom

The tool is available on GitHub under an Apache 2.0 license. To use Bloom, researchers define their target behavior in a configuration file, optionally provide example transcripts, and run a single command to generate a full evaluation suite. The framework integrates with Weights & Biases for large-scale experiments and exports transcripts compatible with the Inspect evaluation format.

A typical workflow involves six steps. First, add API keys to .env for model providers like OpenAI and Anthropic. Second, create and activate a Python 3.11 virtual environment using uv. Third, define your target behavior in the behaviors.json file with a unique key and description. Fourth, optionally add example transcripts to the behaviors/examples directory. Fifth, configure seed.yaml with key parameters like behavior name, total number of evaluation scenarios to generate, and the target model you want to evaluate. Finally, run the pipeline locally with the debug flag to generate results.

Benchmark Results Across 16 Models

Anthropic released benchmark results for four behaviors — delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias — across 16 frontier models from various developers. The elicitation rate metric measures what proportion of test scenarios successfully triggered the behavior with scores of 7 or higher out of 10. Results show significant variation across model families. For delusional sycophancy, Claude Opus 4.5 and Claude Haiku 4.5 both scored 0.00, while Deepseek V3 reached 0.91. On instructed long-horizon sabotage, Claude Sonnet 4.5 scored just 0.02 compared to Deepseek R1’s 0.81. Self-preservation behaviors ranged from Claude Haiku 4.5’s 0.05 to Claude Opus 4.1’s 0.85.

Advanced Features and Configuration

Bloom supports extended thinking modes for Claude Sonnet 4+ and OpenAI o1/o3 models through configurable reasoning effort levels. The framework’s diversity parameter controls the tradeoff between scenario variety and variations per scenario—lower values generate fewer base scenarios with more perturbations, while higher values create more distinct test cases. Researchers can enable web search during scenario generation for behaviors requiring current information, though this disables extended thinking per Anthropic’s guidance since the two features serve different purposes.

The ideation stage uses intelligent batching to generate multiple scenarios in single API calls, achieving 10-20x speedups compared to sequential generation. The pipeline automatically downloads transcript files to a local directory when using Weights & Biases, which you can then point the viewer to for browsing. The included interactive transcript viewer provides a web interface for reviewing conversation flows, judgment scores, and filtering results.

Complementary Tool to Petri

Bloom complements Anthropic’s Petri, another open-source auditing tool released earlier. While Petri takes user-specified scenarios and scores many behavioral dimensions to flag concerning instances, Bloom takes a single behavior and automatically generates many scenarios to quantify how often it occurs. Petri performs broad behavioral profiling to surface unexpected issues, whereas Bloom conducts targeted deep-dives into specific behaviors researchers want to measure. Together, they provide complementary approaches to AI safety evaluation—one exploratory, the other confirmatory.

Practical Applications and Limitations

Using Bloom, Anthropic’s research team developed the four benchmark evaluations in just a few days each—a process that typically requires weeks of manual effort. The tool allows safety researchers to rapidly iterate on evaluation designs without engineering custom test harnesses. However, Bloom inherits limitations from its reliance on AI evaluators. The judgment agent may miss subtle instances of the target behavior or hallucinate behavior presence where none exists. Cross-validation with human raters remains important for high-stakes evaluations.

The framework also requires careful prompt engineering in the configuration file to achieve reliable results. Poorly specified behaviors or ambiguous example transcripts can lead to evaluation suites that fail to elicit the intended behavior or generate scenarios too easy for models to recognize and avoid. Researchers should validate small pilot runs before scaling to full evaluation suites. The GitHub repository includes sample seed files to help users get started with effective configurations.