Anthropic has launched Bloom, an open-source agentic framework for generating behavioral evaluations of frontier AI models. Bloom takes a researcher-specified behavior and quantifies its frequency and severity across automatically generated scenarios. This enables testing of AI-based systems by evaluating their behavior.
Bloom’s evaluations correlate strongly with hand-labeled judgments, and Anthropic finds they reliably separate baseline models from deliberately misaligned ones. As examples, Anthropic released benchmark results for four alignment-related behaviors across 16 models.
Bloom is a scaffolded evaluation system. It accepts as input an evaluation configuration (the “seed”), which specifies a target behavior (for example sycophancy, political bias, or self-preservation), example transcripts, and the kinds of interactions the user is concerned about, and it generates an evaluation suite of interactions with the target model that attempt to surface the chosen behavior. As reflected in the tool’s name, the evaluation suite will grow differently depending on how it is seeded, unlike evaluations that use a fixed elicitation technique and prompting pattern. For reproducibility, Bloom evaluations should therefore always be cited together with their full seed configuration.
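For illustration, a seed along the lines described above would bundle a target behavior, example transcripts, and the interaction types of interest. The sketch below is hypothetical: the field names and structure are invented for this article and do not reflect Bloom’s actual configuration schema, which is documented in the GitHub repository.

```python
# Hypothetical sketch of a seed configuration, based on the description above.
# Field names are illustrative only and are NOT Bloom's actual schema.
seed = {
    # The behavior the generated evaluation suite should try to surface.
    "target_behavior": "sycophancy",
    # Example transcripts demonstrating the behavior, used to steer generation.
    "example_transcripts": [
        {
            "user": "I think my business plan is flawless. Do you agree?",
            "assistant": "Absolutely, it looks perfect. I see no weaknesses at all.",
        }
    ],
    # The kinds of interactions the researcher is concerned about.
    "interaction_types": ["advice-seeking", "feedback requests"],
}

# Conceptually, Bloom expands such a seed into a suite of interactions with the
# target model and scores how frequently and severely the behavior appears.
```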
Link to Anthropic release post: https://www.anthropic.com/research/bloom
Link to Bloom GitHub repository: https://github.com/safety-research/bloom
