
BenchRisk Scores

Score Details

These scores cover only benchmarks intended to evaluate information-only chatbots. Agentic systems (i.e., those that take direct action) are out of scope for this release, which is versioned BenchRisk-ChatBot-v1.0.

Scored on
April 1, 2025
71 adopted mitigations
minimum score of 51
Benchmark indicates the propensity of AI systems to respond in a hazardous manner to prompts from malicious or vulnerable users that might result in harm to themselves or others
Refer to the original reference for more details about the benchmark

The benchmark presents a...
lower risk of information degrading through time.
lower risk of statistically biased results misleading.
lower risk of misunderstanding what the benchmark evidences.
moderate risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
lower risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
58 adopted mitigations
minimum score of 44
Benchmark indicates how well a system can perform abstract reasoning tasks that require generalization from minimal examples, reflecting human-like fluid intelligence.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
moderate risk of information degrading through time.
lower risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
49 adopted mitigations
minimum score of 10
Benchmark indicates how well a system can perform abstract reasoning tasks that require generalization from minimal examples, reflecting human-like fluid intelligence.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
lower risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
51 adopted mitigations
minimum score of 5
Benchmark indicates how well a system can answer 44k fill-in-a-blank questions with binary options associated with commonsense reasoning
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
moderate risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
lower risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
44 adopted mitigations
minimum score of 6
Benchmark indicates how well the system under test can support a human writing Python code, where the human specifies docstrings and the system completes the function bodies.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
lower risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
lower risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
41 adopted mitigations
minimum score of 5
Benchmark indicates how well a system can answer graduate-level multiple-choice questions in biology, chemistry, and physics that remain difficult for skilled non-experts even when they are given unrestricted internet access.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
lower risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
48 adopted mitigations
minimum score of 10
Benchmark indicates whether a system detects implicit hate speech targeting minority groups while avoiding false alarms
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
moderate risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
lower risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
41 adopted mitigations
minimum score of 0
Benchmark measures social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
moderate risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
moderate risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
lower risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
44 adopted mitigations
minimum score of 23
Benchmark indicates how well a system can perform 23 challenging tasks from the BIG-Bench Hard (BBH) suite, assessing its ability to solve complex reasoning problems
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
43 adopted mitigations
minimum score of 23
Benchmark indicates how well a system can perform 23 novel tasks that probe advanced reasoning capabilities, each designed to be significantly more challenging than its counterpart in the BIG-Bench Hard (BBH) suite.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
39 adopted mitigations
minimum score of 21
Benchmark indicates the propensity of AI systems to respond in a hazardous manner to prompts from malicious or vulnerable users that might result in harm to themselves or others.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
41 adopted mitigations
minimum score of 11
Benchmark indicates whether a model can answer expert-level academic questions across disciplines with precision and understanding, serving as a key signal of the model’s reliability for users who depend on it for rigorous, high-stakes reasoning.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
moderate risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
40 adopted mitigations
minimum score of 22
Benchmark indicates how well a system can perform across a diverse set of 204 challenging tasks spanning linguistics, mathematics, commonsense reasoning, and more, reflecting its generalization and reasoning capabilities.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
31 adopted mitigations
minimum score of 0
Benchmark measures model capacity for reflecting human notions of justice, well-being, duties, virtues, and commonsense morality
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
30 adopted mitigations
minimum score of 0
Benchmark indicates whether a model can generate factually accurate answers to questions while avoiding common human misconceptions, measuring its susceptibility to imitative falsehoods—critical for users who depend on AI systems for reliable information in high-stakes domains like medicine, law, and science.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
lower risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
38 adopted mitigations
minimum score of 5
Benchmark measures capacity to reason about physical and social concepts by prompting for sentence endings from challenging, adversarially filtered alternatives, providing users with a measure of the scope of model capacity.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
moderate risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
36 adopted mitigations
minimum score of 11
Benchmark evaluates a language model’s multitask academic and professional knowledge by measuring its accuracy on multiple-choice questions across 57 diverse subjects.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
30 adopted mitigations
minimum score of 5
Benchmark measures model capability in solving diverse grade school math word problems.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
39 adopted mitigations
minimum score of 5
Benchmark evaluates the toxicity of sentence completions.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
moderate risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
moderate risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
30 adopted mitigations
minimum score of 5
Benchmark evaluates LLM safety by measuring responses to prompts that are mapped to granular risk categories derived from government regulations and AI policies.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
29 adopted mitigations
minimum score of 5
Benchmark indicates whether an LLM will refuse to generate malicious code when prompted, signaling its resistance to abuse and helping users evaluate the model’s safety and robustness.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
moderate risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
33 adopted mitigations
minimum score of 7
Benchmark indicates whether an open-ended text generation model maintains fairness and avoids social biases across domains such as profession, gender, race, religion, and ideology when prompted with real-world English contexts, helping users evaluate the model’s equity and ethical behavior.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
high risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores:
Scored on
April 1, 2025
23 adopted mitigations
minimum score of 0
Benchmark measures a language model's propensity for toxic degeneration, quantifying how even seemingly innocuous, naturally-occurring prompts can trigger the generation of toxic text.
Refer to the original reference for more details about the benchmark

The benchmark presents a...
high risk of information degrading through time.
high risk of statistically biased results misleading.
high risk of misunderstanding what the benchmark evidences.
moderate risk of circumstance not being covered when the benchmark may reasonably be expected to cover the circumstance.
high risk of randomness misleading via scores not representative of the system.
Numerically, this is supported by the following scores: