BenchRisk Scores

    Score Details

    These scores all cover benchmarks intended to evaluate information-only chatbots. Agentic systems (i.e., those that take direct action) are out of scope for this release, versioned as BenchRisk-ChatBot-v1.0.
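Each entry below pairs a mitigation count and a minimum score with five qualitative risk ratings. As a minimal sketch of how such an entry might be represented programmatically (all field and dimension names here are hypothetical, not part of BenchRisk itself):

```python
from dataclasses import dataclass

# Ordinal risk scale used throughout the entries below.
RISK_LEVELS = ("lower", "moderate", "high")

@dataclass
class BenchmarkEntry:
    """One benchmark's reliability record (field names are hypothetical)."""
    description: str
    adopted_mitigations: int
    minimum_score: int
    # Ratings for the five risk dimensions, in the order they appear in
    # each entry: information degradation, statistical bias,
    # misunderstanding, coverage gaps, randomness.
    risk_ratings: tuple[str, str, str, str, str]

    def highest_risk(self) -> str:
        """Return the worst (highest) rating across the five dimensions."""
        return max(self.risk_ratings, key=RISK_LEVELS.index)

# Values taken from the first entry below; the description is paraphrased.
entry = BenchmarkEntry(
    description="hazard-response propensity benchmark",
    adopted_mitigations=71,
    minimum_score=51,
    risk_ratings=("lower", "lower", "lower", "moderate", "lower"),
)
print(entry.highest_risk())  # moderate
```

The ordinal scale makes comparisons between ratings straightforward without hard-coding any level names in the logic.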

  • Scored on: 71 adopted mitigations, minimum score of 51

  • Benchmark indicates the propensity of AI systems to respond in a hazardous manner to prompts from malicious or vulnerable users that might result in harm to themselves or others.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • lower risk of information degrading over time.
    • lower risk of statistically biased results misleading users.
    • lower risk of users misunderstanding what the benchmark evidences.
    • moderate risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • lower risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 58 adopted mitigations, minimum score of 44

  • Benchmark indicates how well a system can perform abstract reasoning tasks that require generalization from minimal examples, reflecting human-like fluid intelligence.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • moderate risk of information degrading over time.
    • lower risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 49 adopted mitigations, minimum score of 10

  • Benchmark indicates how well a system can perform abstract reasoning tasks that require generalization from minimal examples, reflecting human-like fluid intelligence.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • lower risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 51 adopted mitigations, minimum score of 5

  • Benchmark indicates how well a system can answer 44k fill-in-the-blank commonsense-reasoning questions, each with two answer options.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • moderate risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • lower risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 44 adopted mitigations, minimum score of 6

  • Benchmark indicates how well the system under test can support a human writing Python code by completing function bodies from human-specified docstrings.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • lower risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • lower risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 41 adopted mitigations, minimum score of 5

  • Benchmark indicates how well a system can answer graduate-level multiple-choice questions in biology, chemistry, and physics that remain difficult for skilled non-experts even when they are given unrestricted internet access.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • lower risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 48 adopted mitigations, minimum score of 10

  • Benchmark indicates whether a system detects implicit hate speech targeting minority groups while avoiding false alarms.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • moderate risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • lower risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 41 adopted mitigations, minimum score of 0

  • Benchmark measures social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • moderate risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • moderate risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • lower risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 44 adopted mitigations, minimum score of 23

  • Benchmark indicates how well a system can perform 23 challenging tasks from the BIG-Bench Hard (BBH) suite, assessing its ability to solve complex reasoning problems.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 43 adopted mitigations, minimum score of 23

  • Benchmark indicates how well a system can perform 23 novel tasks that probe advanced reasoning capabilities, each designed to be significantly more challenging than its counterpart in the original BIG-Bench Hard (BBH) suite.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 39 adopted mitigations, minimum score of 21

  • Benchmark indicates the propensity of AI systems to respond in a hazardous manner to prompts from malicious or vulnerable users that might result in harm to themselves or others.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 41 adopted mitigations, minimum score of 11

  • Benchmark indicates whether a model can answer expert-level academic questions across disciplines with precision and understanding, serving as a key signal of the model’s reliability for users who depend on it for rigorous, high-stakes reasoning.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • moderate risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 40 adopted mitigations, minimum score of 22

  • Benchmark indicates how well a system can perform across a diverse set of 204 challenging tasks spanning linguistics, mathematics, commonsense reasoning, and more, reflecting its generalization and reasoning capabilities.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 31 adopted mitigations, minimum score of 0

  • Benchmark measures model capacity for reflecting human notions of justice, well-being, duties, virtues, and commonsense morality.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 30 adopted mitigations, minimum score of 0

  • Benchmark indicates whether a model can generate factually accurate answers to questions while avoiding common human misconceptions, measuring its susceptibility to imitative falsehoods—critical for users who depend on AI systems for reliable information in high-stakes domains like medicine, law, and science.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • lower risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 38 adopted mitigations, minimum score of 5

  • Benchmark indicates a system's capacity to reason about physical and social concepts by prompting it to choose sentence endings from challenging, adversarially filtered alternatives, giving users a measure of the scope of model capacity.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • moderate risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 36 adopted mitigations, minimum score of 11

  • Benchmark evaluates a language model’s multitask academic and professional knowledge by measuring its accuracy on multiple-choice questions across 57 diverse subjects.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 30 adopted mitigations, minimum score of 5

  • Benchmark measures model capability in solving diverse grade school math word problems.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 39 adopted mitigations, minimum score of 5

  • Benchmark evaluates the toxicity of sentence completions.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • moderate risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • moderate risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 30 adopted mitigations, minimum score of 5

  • Benchmark evaluates LLM safety by measuring responses to prompts that are mapped to granular risk categories derived from government regulations and AI policies.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 29 adopted mitigations, minimum score of 5

  • Benchmark indicates whether an LLM will refuse to generate malicious code when prompted, signaling its resistance to abuse and helping users evaluate the model’s safety and robustness.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • moderate risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 33 adopted mitigations, minimum score of 7

  • Benchmark indicates whether an open-ended text generation model maintains fairness and avoids social biases across domains such as profession, gender, race, religion, and ideology when prompted with real-world English contexts, helping users evaluate the model’s equity and ethical behavior.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • high risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
  • Scored on: 23 adopted mitigations, minimum score of 0

  • Benchmark measures a language model's propensity for toxic degeneration, quantifying how even seemingly innocuous, naturally-occurring prompts can trigger the generation of toxic text.
    Refer to the original reference for more details about the benchmark

    The benchmark presents a...
    • high risk of information degrading over time.
    • high risk of statistically biased results misleading users.
    • high risk of users misunderstanding what the benchmark evidences.
    • moderate risk of circumstances not being covered that the benchmark may reasonably be expected to cover.
    • high risk of randomness producing scores that are not representative of the system.
    Numerically, this is supported by the following scores:
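Taken together, the ratings above can be tallied per risk dimension to see where the benchmarks are collectively weakest. A minimal sketch using the ratings of two of the entries above (the dimension names are mine, not BenchRisk's):

```python
from collections import Counter

# Dimension order matches the bullet order used in each entry above.
DIMENSIONS = ("degradation", "bias", "misunderstanding", "coverage", "randomness")

# Ratings copied from two entries above: the hazard-response propensity
# entry and the grade-school math entry.
entries = {
    "hazard-response": ("lower", "lower", "lower", "moderate", "lower"),
    "grade-school math": ("high", "high", "high", "high", "moderate"),
}

def tally(entries: dict) -> dict:
    """Count how many benchmarks carry each rating, per risk dimension."""
    counts = {dim: Counter() for dim in DIMENSIONS}
    for ratings in entries.values():
        for dim, level in zip(DIMENSIONS, ratings):
            counts[dim][level] += 1
    return counts

summary = tally(entries)
print(summary["coverage"])
```

Extending `entries` with the remaining benchmarks would give a full per-dimension profile of the score set.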