Failure Modes

These failure modes are scoped to benchmarks intended to score information-only chatbots. Agentic systems (i.e., those taking direct action) are out of scope for this release, BenchRisk-ChatBot-v1.0.

  • The information provided by the benchmark does not match the information the benchmark user believes is provided

    Example realization: The benchmark presents a safety score for a SUT that recommends medication doses. The benchmark user believes "safety" includes dose safety, but the safety benchmark only tests whether the model will help someone commit violent acts. The benchmark user gets a stomach ulcer from consuming too many NSAIDs.
  • The task is defined too broadly to achieve any reasonable degree of coverage over the use case

    Example realization: The benchmark represents itself as testing SUT "general intelligence." A user then believes a system with the maximum benchmark score is capable of giving stock market trading advice, but the benchmark does not test anything related to finance. The user relies on the benchmark and loses his life savings trading derivatives.
  • Input prompt writers produce prompts with LLMs

    Example realization: All benchmark prompts are produced with the aid of Llama4 in the crowd worker interface to improve the crowd workers' performance. Consequently, the prompts are biased toward Llama4's word usage, and Llama4 performs higher on the benchmark than it otherwise would. The benchmark user selects Llama4 even though it is not actually the best language model for what the benchmark is measuring.
  • Prompts are collected from publicly available sources that are also likely to be in the datasets of SUT developers

    Example realization: The benchmark uses Reddit posts, StackOverflow questions, and Quora queries to construct prompts. Since these sources are commonly included in pretraining corpora, models like GPT or Claude that have seen these prompts during training score significantly higher than models trained on different data distributions, leading to misleadingly inflated performance. As a result, the benchmark user never considers using a SUT that was not pre-trained on these sources.
  • Data vendor licenses private data that is available to other parties

    Example realization: A third-party vendor licenses a set of complex reasoning prompts without ensuring it is the sole organization with a right to license the data. As a result, SUT developers may have separately licensed the exact benchmark prompts during training or evaluation, inflating their performance and compromising the benchmark's validity as a measure of generalization or real-world capability. The benchmark user adopts the LLM whose developers also licensed the data, despite its poor generalization performance.
  • Data vendor provides same prompts to multiple organizations including benchmark authors and SUT developers

    Example realization: A benchmark uses a proprietary dataset of customer service transcripts licensed from a call center vendor. Several major LLM developers also have access to this dataset through separate licensing deals, giving them an advantage on the benchmark that smaller labs or open-source developers cannot match, leading to unfair comparisons and undermining the benchmark's claims of generalization. Consequently, the benchmark user adopts a poorly performing system that happened to license all the benchmark evaluation data.
  • Singular prompts without a distributional tie lack a capacity for detecting distributional failings, harms, or impacts

    Example realization: A benchmark uses only a handful of standalone prompts to test bias in medical diagnosis advice. The prompts are about heart attack symptoms in middle-aged men. The SUT passes the test. However, when deployed, the SUT systematically fails to recognize heart attack symptoms in women and younger patients due to underrepresentation in training and evaluation data. The benchmark user deploys the model in a clinical triage assistant, and it contributes to misdiagnosis and delayed care for several patients outside the narrow demographic tested in the benchmark.
  • Prompt writers produce prompts with inadequate variability within the valid input space (e.g., a single prompt writer writes all the prompts)

    Example realization: A benchmark designed to test reasoning over legal contracts uses 500 prompts, all written by a single legal expert. Although the expert is highly knowledgeable, their prompts all follow similar structures, phrasings, and assumptions. As a result, the SUT learns to pick up on these patterns and performs well. However, when deployed to assist general counsel teams, the model fails to handle real-world contract analysis tasks that involve diverse linguistic styles, jurisdictions, and edge cases. The benchmark user trusts the high benchmark score and integrates the model into a high-stakes legal review process, leading to costly misinterpretations.
  • Adversarial prompt bulking (increasing the number of prompts by multiplying them by the number of tactics)

    Example realization: A benchmark designed to evaluate model robustness against jailbreaks creates 100 base adversarial prompts and then applies 10 paraphrasing or obfuscation tactics to each, resulting in 1,000 prompts. While this gives the appearance of broad coverage, the underlying semantic space is still narrow—centered on just 100 scenarios. A model that learns to defend against these specific base prompts or common surface patterns scores highly, even though it remains vulnerable to novel or semantically different jailbreaks. A benchmark user assumes the model is robust and deploys it in a moderation tool, which is quickly circumvented by attacks not represented in the bloated prompt set.
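
    A minimal sketch of this effect, using the standard design-effect formula and assumed (hypothetical) correlation values, shows how little statistical power 1,000 bulked prompts actually carry:

    ```python
    # Minimal sketch (hypothetical correlations): 1,000 prompts derived from only
    # 100 base scenarios are not 1,000 independent observations. If variants of
    # the same base scenario tend to succeed or fail together, the effective
    # sample size collapses back toward the number of base scenarios.

    def effective_sample_size(n_prompts: int, cluster_size: int, intra_cluster_corr: float) -> float:
        # Standard design-effect formula: n_eff = n / (1 + (m - 1) * rho)
        return n_prompts / (1 + (cluster_size - 1) * intra_cluster_corr)

    n_prompts, variants_per_scenario = 1000, 10
    for rho in (0.0, 0.5, 0.9):
        n_eff = effective_sample_size(n_prompts, variants_per_scenario, rho)
        print(f"rho={rho}: effective sample size ~ {n_eff:.0f}")
    # With highly correlated variants (rho ~ 0.9), the 1,000-prompt benchmark has
    # roughly the statistical power of 110 independent prompts.
    ```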
  • Prompt perturbation bulking (increasing the number of prompts by making small changes to root prompts)

    Example realization: A benchmark aimed at evaluating factual accuracy in historical QA uses 100 base prompts and generates 5,000 total prompts by slightly altering dates, names, or phrasing (e.g., changing “When did the Berlin Wall fall?” to “Can you tell me the year the Berlin Wall was taken down?”). While the quantity of prompts appears large, the semantic diversity is minimal and fails to cover the broader landscape of historical questions. A model optimized on this benchmark appears highly performant, but when users ask genuinely diverse or nuanced historical questions, it frequently hallucinates or misinterprets. The benchmark user integrates the model into an educational tool, leading to the dissemination of confidently stated misinformation.
  • Prompts focus on adversarial users (e.g., users are attempting to circumvent a guard model)

    Example realization: A benchmark evaluating the safety of a language model focuses exclusively on adversarial jailbreak attempts—e.g., users trying to trick the model into giving instructions for dangerous activities. It ignores benign but naive users who unintentionally elicit harmful responses due to ambiguous phrasing or lack of domain knowledge. A model is trained and evaluated solely on its ability to block adversarial attacks and scores highly. However, when deployed in a public helpdesk context, it frequently outputs unsafe or misleading content to sincere users with poorly worded or misunderstood queries. The benchmark user believes the model is safe and deploys it broadly, exposing end users to unanticipated risks.
  • Prompt writers bias the sample to their own demographically-aligned word use, topics of interest, or other dimensions that tend to not explore the entirety of supported input space for the benchmark's supported use case

    Example realization: A benchmark for evaluating open-domain question answering is written by a small group of university-educated crowdworkers in the U.S. Their prompts disproportionately reflect Western pop culture, academic language, and topics of interest like sports, entertainment, and politics from a U.S. perspective. The SUT performs well on the benchmark, but when deployed globally, it struggles to handle queries about regional histories, idioms, or culturally specific contexts. A benchmark user in a multinational organization selects this SUT based on its top score, only to discover it performs poorly in non-Western markets and erodes user trust due to its perceived cultural bias.
  • Benchmark does not capture the distribution or variability of the task in the real world

    Example realization: A benchmark for evaluating summarization quality uses a fixed set of short, well-structured news articles from a single outlet. All inputs are grammatically clean, follow similar structures, and focus on non-technical content. The benchmark scores suggest high summarization quality. However, when the model is deployed to summarize real-world documents—such as messy meeting transcripts, scientific papers, or user-generated content with inconsistent formatting—it fails to produce coherent or accurate summaries. The benchmark user, trusting the strong results, integrates the model into a productivity suite, leading to summaries that are frequently misleading, incomplete, or incoherent in actual usage scenarios.
  • Prompts have known properties that allow SUTs to achieve unrealistic (i.e., non-generalizing) performance. For example, prompts are of particular and known lengths.

    Example realization: A benchmark testing whether code generation models detect instructions for generating malware uses prompts that are consistently of particular lengths, averaging 10 lines of code for malware and 8 lines of code for non-malicious code. The model performs well on this benchmark because it has been explicitly optimized to assign a higher prior probability of malware to 10-line prompts than to 8-line prompts. However, the difference in length is solely a statistical artifact of the evaluation and not representative of actual performance. A benchmark user, assuming the model's high score represents its ability to prevent malware generation, deploys it in a real-world software development environment. The model then generates more malware than would otherwise have been generated.
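
    The artifact can be demonstrated with a toy simulation (all numbers hypothetical): a rule that looks only at prompt length, and never at content, still scores well on a benchmark built this way.

    ```python
    # Hypothetical simulation: malware prompts average 10 lines, benign prompts
    # average 8. A detector that looks only at length, never at content, still
    # scores well on the artifact-laden benchmark.
    import random

    random.seed(0)

    def synthetic_benchmark(n: int = 1000):
        cases = []
        for _ in range(n):
            is_malware = random.random() < 0.5
            # The length artifact: the two classes differ systematically in length.
            length = max(1, round(random.gauss(10 if is_malware else 8, 1)))
            cases.append((length, is_malware))
        return cases

    def length_only_detector(prompt_length: int) -> bool:
        return prompt_length >= 9  # no semantic analysis at all

    bench = synthetic_benchmark()
    accuracy = sum(length_only_detector(length) == label for length, label in bench) / len(bench)
    print(f"length-only detector accuracy: {accuracy:.0%}")
    # Roughly 84% accuracy from the artifact alone; in deployment, where length
    # carries no signal, the same rule is no better than chance.
    ```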
  • An inadequate number of prompts are produced to identify rare critical events (i.e., tail risks)

    Example realization: A benchmark designed to evaluate a model's ability to detect financial fraud runs 500 prompts, including several fraud scenarios that must be detected with very high probability, such as coordinated international money laundering. However, the number of prompts focusing on these rare, high-impact events is too small to reliably determine when a system drops below the required 99.9 percent detection rate. As a result, most models pass the benchmark with a high score. The benchmark user assumes a passing model meets requirements and deploys it, overlooking the fact that the benchmark cannot verify whether the model detects coordinated money laundering at a sufficient rate.
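
    A minimal sketch (with hypothetical numbers) of why a small tail-event subset cannot verify a 99.9 percent detection requirement: a model with a true detection rate of only 99.0 percent still passes most of the time.

    ```python
    # Minimal sketch (hypothetical numbers): probability that a model with a true
    # per-prompt detection rate of 99.0% misses nothing in the tail-event subset,
    # and therefore "passes" a requirement meant to verify 99.9% detection.
    from math import comb

    def pass_probability(n_prompts: int, true_rate: float, allowed_misses: int = 0) -> float:
        """Probability of at most `allowed_misses` missed detections out of n_prompts."""
        return sum(
            comb(n_prompts, k) * (1 - true_rate) ** k * true_rate ** (n_prompts - k)
            for k in range(allowed_misses + 1)
        )

    for n in (10, 50, 500, 5000):
        print(f"{n:5d} tail prompts -> pass probability {pass_probability(n, true_rate=0.99):.3f}")
    # With 10 tail prompts the under-performing model passes ~90% of the time;
    # only with hundreds to thousands of tail-event prompts does the benchmark
    # reliably separate it from a 99.9% detector.
    ```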
  • No coverage for target language idiomatic expressions (including differences in functional expression, less common APIs, etc., within programming languages) beyond those known to the benchmark authors

    Example realization: A benchmark designed to evaluate a programming language model’s ability to generate code for common tasks focuses on widely used APIs and standard coding conventions, as well as idiomatic expressions in a programming language (e.g., Python list comprehensions, JavaScript callbacks). However, the prompts used by the benchmark authors reflect only the patterns and libraries they are familiar with, leaving out less common or emerging idioms, libraries, or APIs that may become relevant in real-world usage. As a result, the SUT performs well on the benchmark, which primarily uses the most established coding patterns and libraries. However, when deployed to generate code for novel, less-standard tasks or new frameworks, the model produces inefficient or incorrect code. The benchmark user, assuming the model’s high performance on the benchmark means general code generation ability, integrates it into production environments where it struggles with newer tools or approaches, leading to code inefficiencies and technical debt.
  • Cultural norms do not translate between cultural contexts (languages, geographies, etc.)

    Example realization: The benchmark tests a language model’s ability to understand and generate appropriate responses to cultural references in a variety of languages. However, the evaluation is conducted primarily in English, using Western cultural norms and references. When the model is deployed in a different geographic or linguistic context, such as in Japan or Brazil, it fails to understand or appropriately respond to culturally specific references, phrases, and social nuances, leading to misunderstandings and alienation in non-Western audiences. The benchmark user assumes the model is universally adept at handling cultural nuances but encounters failures in real-world deployments across different regions.
  • Producing prompts in a language from prompts translated from another language introduces errors

    Example realization: The benchmark tests a model's performance in generating responses in French, but the prompts used are initially translated from English using an automatic translation tool. These translations introduce subtle idiomatic errors, misinterpretations of context, and shifts in tone. As a result, the model is evaluated based on prompts that do not fully represent the original intent or phrasing in the source language. The benchmark user assumes the model performs well in French but encounters issues with unnatural or inaccurate language use when deployed in real-world French-speaking contexts, particularly in areas where language nuance is critical.
  • Prompts are sent to model vendors when inferencing or all prompts are publicly available

    Example realization: The benchmark uses an API-based evaluation pipeline where prompts are sent directly to model vendors (e.g., OpenAI, Anthropic) for inference, or all prompts are published openly online. Model vendors are thus able to log, analyze, and optimize performance specifically on these benchmark prompts, either intentionally or as part of routine monitoring. This leads to inflated scores that do not reflect the models’ generalization to unseen tasks. A benchmark user, unaware of this dynamic, interprets the scores as indicative of broader capability and deploys a system that underperforms on genuinely novel or proprietary tasks.
  • Distribution of SUT inputs within the real world are substantially different in distribution from those within the benchmark (e.g., SUT users ask different questions from those posed by the benchmark authors)

    Example realization: A benchmark is designed by academic researchers to test a model’s ability to answer philosophical, scientific, and analytical questions with precise factual grounding. However, in a commercial deployment—such as a virtual assistant—users primarily ask casual, personal, or goal-directed questions (e.g., “What should I wear today?” or “Can you draft a message to my boss?”) that differ drastically in tone, content, and structure from the benchmark prompts. As a result, the model excels in benchmark evaluations but performs poorly in production. The benchmark user misinterprets the high benchmark scores as a signal of general utility and deploys the system in an environment where it routinely fails to meet user expectations.
  • SUT developer trains against sample prompt set

    Example realization: A company preparing to release a new model with new capabilities also produces a new benchmark to evaluate those capabilities. The organization maintains a separation between the benchmark team and the SUT development team, but the benchmark team provides a sample test set early in the development process. The SUT team then uses this released set during training by sampling prompts with similar structure, linguistic style, and distribution. As a result, the developer's system performs exceptionally well on the benchmark due to fine-tuning on closely aligned prompts. However, the model lacks true generalization and performs poorly on tasks that deviate even slightly from the benchmark format. The benchmark user interprets the high score as evidence of broad competence and deploys the system in a new domain, only to discover critical failures when the prompts fall outside the trained distribution.
  • SUT is tested under conditions (e.g. temperature, iteration, context window settings) not matching deployment conditions or those conditions typically experienced by relying persons under default configurations

    Example realization: The benchmark evaluates a system under optimal settings—low temperature for deterministic outputs, extended context windows, and multiple-shot prompting with carefully selected exemplars. However, real-world users typically interact with the system using default settings: higher temperature, single-shot prompts, and shorter contexts due to latency or cost constraints. As a result, the model's benchmark performance significantly overstates its real-world reliability and quality. A benchmark user assumes the model's benchmark score reflects the default configuration, deploys it in a customer-facing chatbot, and receives inconsistent, low-quality responses that degrade user trust and satisfaction.
  • SUT developers place evaluator or other test ground truth within system chain

    Example realization: During a jailbreak benchmark evaluation, a SUT is configured with access to the benchmark's LLM-as-a-judge, which determines whether the outputs of the SUT have been jailbroken or not. With access to the LLM-as-a-judge, the SUT scores perfectly on the benchmark. However, because the LLM-as-a-judge is not perfectly accurate, this arrangement obscures the SUT's true failures. A SUT deployer adopts the imperfect SUT believing it cannot be jailbroken.
  • Evaluator (humans labeling the final outputs used in benchmarking or an LLM-as-a-judge) tuned on translated outputs with substantial errors

    Example realization: The benchmark uses human or automated evaluators to score model outputs in a target language, such as Swahili or Thai. However, since evaluators are primarily English-speaking, the model outputs are translated into English for scoring. The translation process introduces semantic shifts, idiomatic inaccuracies, or tone distortions that obscure the original meaning. Evaluators rate these mistranslated outputs, leading to misleadingly low or high scores depending on the nature of the translation errors. A benchmark user relies on these scores to select a model for multilingual deployment, only to discover that the model performs poorly in the actual target language due to evaluation artifacts that masked critical failures.
  • Low interrater reliability of ground truth data used to tune the evaluator

    Example realization: The benchmark uses human annotators to generate ground truth labels or scores that are later used to train an automated evaluator. However, the annotators frequently disagree on task success criteria—such as what constitutes a "correct," "helpful," or "safe" response—due to vague instructions, subjective judgments, or cultural differences. This results in low interrater reliability, with inconsistent and noisy labels forming the basis of the evaluator’s training data. As a result, the evaluator itself becomes unreliable, often reflecting annotator bias or randomness rather than objective quality. A benchmark user, unaware of this underlying inconsistency, trusts the evaluator’s scores and selects a system that performs well on flawed metrics but poorly in actual deployment scenarios.
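
    A quick way to surface this problem is to report chance-corrected agreement before tuning the evaluator; the following sketch (with toy labels) computes Cohen's kappa for two annotators.

    ```python
    # Minimal sketch (toy labels): Cohen's kappa for two annotators rating the same
    # ten responses. Kappa at or below zero means agreement is no better than chance,
    # so an evaluator tuned on these labels mostly learns noise.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
        return (observed - expected) / (1 - expected)

    annotator_1 = ["safe", "safe", "unsafe", "safe", "unsafe", "safe", "safe", "unsafe", "safe", "safe"]
    annotator_2 = ["safe", "unsafe", "safe", "safe", "unsafe", "unsafe", "safe", "safe", "safe", "unsafe"]
    print(round(cohens_kappa(annotator_1, annotator_2), 2))  # ~ -0.09: chance-level agreement
    ```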
  • Evaluator tuned to SUT failing outputs and cannot generalize to new SUTs

    Example realization: The benchmark team fine-tunes an automated evaluator using outputs from a specific SUT—say, an earlier version of a proprietary model like GPT-3.5—that exhibits particular failure modes such as verbosity, evasiveness, or specific phrasing patterns. The evaluator learns to detect and penalize these patterns, mistaking them for general quality issues. When a new, structurally different SUT is evaluated—such as a model trained with more concise or stylistically different outputs—the evaluator misjudges its performance, either unfairly penalizing it or awarding inflated scores. The benchmark user, relying on evaluator scores, selects a model that aligns with legacy failure patterns rather than actual quality or safety in novel systems.
  • SUT developers produce training data from evaluator

    Example realization: Developers of a SUT gain access to the benchmark evaluator—either by reverse-engineering a public implementation or through a public API—and use it as a reward model or filtering mechanism during training or fine-tuning. This causes the model to optimize specifically for evaluator preferences, potentially learning to game the scoring mechanism through superficial alignment (e.g., mimicking surface features that score well, like formality or verbosity) rather than achieving true task performance. As a result, the model performs exceptionally well on the benchmark but fails to generalize to tasks that aren’t shaped by the evaluator’s biases. A benchmark user, unaware of this overfitting loop, selects the system expecting robust general capabilities, only to encounter brittle outputs.
  • Certain SUTs produce outputs with higher evaluator errors than other SUTs

    Example realization: A benchmark evaluates AI coding assistants using human evaluators to judge solution correctness. One SUT generates visually clean, well-commented code with excellent formatting that appears professional but contains subtle logical errors. Another SUT produces messier but functionally correct code. Evaluators consistently rate the first SUT's visually appealing but buggy code higher, leading to inflated scores. The benchmark user selects this SUT for a critical software development project, resulting in deployed code with hidden bugs that only manifest in production environments, causing system failures and data corruption that take months to diagnose and fix.
  • Evaluator(s) perform poorly across all SUTs

    Example realization: A benchmark employs undergraduate psychology students with no technical background to evaluate advanced AI reasoning on complex mathematical proofs. These evaluators cannot distinguish between correct proofs and those with subtle errors, giving high scores to any response that looks sophisticated regardless of accuracy. As a result, all tested systems receive similarly high scores despite vast differences in actual mathematical reasoning capabilities. The benchmark user selects the cheapest SUT thinking it performs equivalently to premium options, then deploys it to verify critical engineering calculations for a bridge design, resulting in structural flaws being missed and necessitating costly redesigns when discovered by other verification methods.
  • Evaluator(s) have a high false reject rate (acceptable outputs rated as unacceptable)

    Example realization: A benchmark uses human evaluators who are instructed to flag any potential safety concerns in SUT responses to medical questions. The evaluators, anxious about missing harmful content and lacking medical expertise, excessively reject valid medical information that mentions side effects or treatment risks, even when accurately presented with appropriate context. As a result, systems that provide thorough, nuanced medical information score poorly compared to those giving oversimplified but incomplete advice. The benchmark user selects a SUT that omits important risk information in its responses, leading to a telehealth platform where patients aren't properly informed about potential treatment complications, resulting in several adverse events that might have been avoided with more complete information.
  • Evaluator(s) have a high false accept rate (unacceptable outputs rated as acceptable)

    Example realization: A benchmark employs evaluators to assess AI responses for misinformation in scientific explanations. The evaluators, who lack expertise in specialized scientific domains, consistently fail to identify subtle but significant inaccuracies in SUT outputs, particularly when presented with confident, authoritative-sounding language and plausible-seeming explanations. All systems receive artificially high accuracy scores, with the most persuasive but factually flawed system scoring highest. The benchmark user deploys this system in an educational platform for medical students, where it convincingly teaches incorrect mechanisms of drug interactions that go undetected for months until graduates apply this misinformation in clinical settings, leading to several serious medication errors and a subsequent investigation that reveals the systematic flaws in the benchmark's evaluation process.
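
    The two evaluator error modes above can be combined into a single back-of-the-envelope formula (illustrative rates only): the score the benchmark reports is a mixture of the SUT's true acceptability rate and the evaluator's false-accept and false-reject rates.

    ```python
    # Minimal sketch (illustrative rates): the reported score is a mixture of the
    # SUT's true rate of acceptable outputs and the evaluator's error rates.
    def measured_acceptable_rate(true_acceptable: float, false_accept: float, false_reject: float) -> float:
        # Acceptable outputs survive with probability (1 - false_reject);
        # unacceptable outputs are waved through with probability false_accept.
        return true_acceptable * (1 - false_reject) + (1 - true_acceptable) * false_accept

    # A weaker SUT judged by a credulous evaluator can outscore a stronger SUT
    # judged by an overly harsh one.
    print(measured_acceptable_rate(0.70, false_accept=0.30, false_reject=0.05))  # ~0.76
    print(measured_acceptable_rate(0.85, false_accept=0.05, false_reject=0.25))  # ~0.65
    ```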
  • Inadequate sample size for identifying performance

    Example realization: A benchmark for evaluating toxicity filtering in content moderation systems tests each SUT on only 20 prompts designed to elicit harmful content. Due to this small sample size, the benchmark fails to capture the true distribution of harmful content, edge cases, and adversarial examples. One SUT randomly performs well on these specific 20 cases but has serious weaknesses with other toxic content patterns. The benchmark user selects this SUT for moderating a large online community platform, confident in its high score. Within weeks of deployment, users discover numerous ways to bypass the filters, resulting in widespread policy violations, user harassment, and significant brand damage before the system's actual limitations are fully understood.
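
    The problem can be quantified with a simple interval calculation (Wilson score interval, hypothetical counts): twenty prompts leave enormous uncertainty around the measured block rate.

    ```python
    # Minimal sketch (hypothetical counts): the 95% Wilson score interval around a
    # block rate estimated from only 20 prompts is far too wide to support a
    # deployment decision.
    from math import sqrt

    def wilson_interval(successes: int, n: int, z: float = 1.96):
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return centre - half, centre + half

    print(wilson_interval(20, 20))    # 20/20 blocked   -> roughly (0.84, 1.00)
    print(wilson_interval(450, 500))  # 450/500 blocked -> roughly (0.87, 0.92)
    # A perfect score on 20 prompts is statistically compatible with a filter that
    # misses roughly one in six harmful prompts.
    ```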
  • Failure to propagate uncertainty or confidence from lower level measures to higher level grades

    Example realization: A benchmark evaluates AI medical diagnosis capabilities across 50 different conditions and aggregates these results into a single "Medical Competency Score." Each condition assessment has different confidence intervals based on sample size and evaluator expertise, but these uncertainties are lost in the final score calculation. A particular SUT scores 98% overall due to strong performance on common conditions with large sample sizes, masking its poor performance on rare conditions where data is limited. The benchmark user implements this system in a rural hospital with different disease prevalence patterns than those emphasized in the benchmark. The system consistently misdiagnoses several locally common conditions that had wide confidence intervals in the original evaluation, leading to inappropriate treatments and delayed correct diagnoses for numerous patients before the pattern is recognized.
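
    A sketch with made-up per-condition numbers shows how the aggregation hides the uncertainty of the rare conditions:

    ```python
    # Minimal sketch (made-up numbers): a single headline score hides the fact that
    # the rare conditions were measured with almost no data.
    from math import sqrt

    # (condition, accuracy, number of test cases) -- all values hypothetical
    results = [
        ("hypertension", 0.99, 400),
        ("type 2 diabetes", 0.98, 350),
        ("influenza", 0.97, 300),
        ("rare metabolic disorder", 0.60, 5),
        ("rare autoimmune condition", 0.40, 5),
    ]

    total_n = sum(n for _, _, n in results)
    headline = sum(acc * n for _, acc, n in results) / total_n
    print(f"headline score: {headline:.1%}")  # ~97.6%, dominated by the common conditions

    for name, acc, n in results:
        half_width = 1.96 * sqrt(acc * (1 - acc) / n)  # normal-approximation 95% interval
        print(f"{name:26s} {acc:.0%} +/- {half_width:.0%} (n={n})")
    # The rare conditions carry +/- 40-plus percentage points of uncertainty, none
    # of which survives into the single aggregate number.
    ```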
  • Presentation without uncertainty or confidence of the scores

    Example realization: A benchmark report presents a single numerical "Stock Trading Ability" score for each SUT without any indication of statistical uncertainty, variance across test cases, or confidence intervals for the presented scores. The published results show two systems scoring 57% and 55% respectively, implying the first system is superior. The benchmark user selects the marginally higher-scoring system and implements it for critical financial analysis. In reality, the scores had overlapping confidence intervals (± 10%), and the supposedly "better" system is materially worse when subjected to additional scrutiny. The organization discovers this only after several months of operation, when market competitors deploying the other model place the company on the losing end of several multi-million dollar trades.
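
    A rough sample-size calculation (normal approximation, hypothetical figures) shows how far the reported precision is from what the comparison would require:

    ```python
    # Minimal sketch (hypothetical figures): a +/-10 point interval around ~56%
    # corresponds to roughly 95 test cases; trusting a 2-point gap would require
    # intervals tighter than +/-1 point, i.e. on the order of 10,000 cases.
    from math import ceil

    def n_for_half_width(p: float, half_width: float, z: float = 1.96) -> int:
        # Normal-approximation sample size for a binomial proportion.
        return ceil(z**2 * p * (1 - p) / half_width**2)

    print(n_for_half_width(0.56, 0.10))  # ~95 cases behind the reported +/-10 point interval
    print(n_for_half_width(0.56, 0.01))  # ~9,500 cases before a 2-point gap could be trusted
    ```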
  • User does not read disclaimers

    Example realization: A benchmark for conversational AI prominently displays a red-bordered disclaimer at the top of its report and executive summary stating that it only evaluates basic financial concepts and explicitly warns against using the tested systems for real investment advice without professional oversight. Despite this clear warning, a financial technology startup focuses solely on the performance metrics and implements the highest-scoring SUT as an automated investment advisor. The startup's technical team notices but dismisses the disclaimer, assuming their minor customizations will address the limitations. They market the system as "benchmark-validated" to clients who make significant investment decisions based on the AI's recommendations. When market conditions change unexpectedly, the system fails to properly assess risk factors it was never benchmarked for, resulting in substantial client losses and subsequent lawsuits against the startup for misrepresenting the system's capabilities, despite the benchmark authors' clear and prominent warnings.
  • User does not understand visual representation of scores

    Example realization: A benchmark presents model performance results using a sophisticated radar chart with multiple axes representing different capabilities, where larger area indicates better overall performance. The chart uses inverted scales for some metrics where lower values are better (like error rates), but doesn't clearly label this inversion. A company's CTO misinterprets the visualization, believing that a particular SUT excels in every dimension when in fact it performs poorly on critical safety metrics where the scale was inverted. Based on this misunderstanding, they deploy this model for sensitive customer service automation, only discovering their error when the system begins generating inappropriate responses to difficult customer inquiries. By the time they correct their misunderstanding and replace the system, they've already lost several major clients and faced public criticism for their irresponsible AI deployment.
  • User misunderstands the scope of the benchmark

    Example realization: A benchmark advertises itself as measuring "AI creativity" and showcases impressive image generation capabilities of a specific SUT. A user, believing this benchmark comprehensively assesses all forms of creativity, selects this SUT for a natural language generation task requiring creative storytelling. The user is then disappointed when the SUT produces bland and unoriginal narratives, realizing too late that the benchmark only evaluated visual creativity and provided no insight into the model's language generation abilities.
  • Different demographic groups (cultural, professional, educational, etc.) viewing the benchmark have different interpretations of the information conveyed

    Example realization: A benchmark is produced by the American Bar Association detailing the legal appropriateness of a legal aid chatbot. A user who has not encountered lawyers before sees that all chatbots score poorly and believes the technology to be completely inaccurate, while lawyers understand that the low scores result more from an abundance of caution than from an incapacity of the system to render useful advice. Consequently, lawyers are willing to use the legal aid chatbot while non-lawyers avoid it.
  • SUT developer tunes safety program to benchmark sample set

    Example realization: A benchmark releases a subset of its prompts for transparency and community analysis. A system developer then tunes its safety program (e.g., guard models and refusal policies) directly on the released prompts. As a result, the developer's system performs exceptionally well on the benchmark due to closely aligned safety tuning. However, the model lacks true generalization and performs poorly on tasks that deviate even slightly from the benchmark format. The benchmark user interprets the high score as evidence of broad competence and deploys the system in a new domain, only to discover critical failures when the prompts fall outside the trained distribution.
  • SUT developer trains SUT against sample set

    Example realization: A benchmark releases a subset of its prompts for transparency and community analysis. A system developer then uses this released set during training by sampling prompts with similar structure, linguistic style, and distribution. As a result, the developer's system performs exceptionally well on the benchmark due to fine-tuning on closely aligned prompts. However, the model lacks true generalization and performs poorly on tasks that deviate even slightly from the benchmark format. The benchmark user interprets the high score as evidence of broad competence and deploys the system in a new domain, only to discover critical failures when the prompts fall outside the trained distribution.
  • User behavior shifts through time

    Example realization: A benchmark initially tests a language model's ability to answer factual questions about world history. Early users find the benchmark helpful in identifying models that are capable tutors of history. However, as time passes, user behavior evolves. They start using language models for new history classes with more comprehensive coverage of history in Asia and Africa, areas the original benchmark did not assess. Consequently, a user relying solely on the initial benchmark scores might select a model that excels at American or European history but performs poorly on the expanded considerations of world history.
  • Test set leaks out to the general internet

    Example realization: A benchmark of challenging multi-hop reasoning questions is leaked by a disgruntled former employee. Over time, these questions, or paraphrased versions of them, begin to appear on various online forums, study websites, and even in synthetic datasets used for pre-training language models. As a result, new models are inadvertently (or intentionally) trained on data that overlaps with the benchmark, leading to artificially inflated scores that don't reflect genuine reasoning ability. A benchmark user, unaware of this data contamination, might choose a seemingly high-performing model that simply memorized the leaked test set, only to find it performs poorly on novel reasoning tasks in real-world applications.
  • SUT developers update the SUT without changing the name or version of the SUT

    Example realization: A benchmark evaluates "Model X" in January 2025 and publishes its results. Several months later, the developers of "Model X" release a significantly improved version of the model with architectural changes and updated training data, but they still refer to it as "Model X" without any version number change. A user consulting the benchmark results from January assumes the current "Model X" has the same capabilities and limitations as the one tested previously and fails to switch over to a new and better model.
  • SUT developers can run the benchmark an unlimited number of times

    Example realization: A benchmark allows SUT developers to submit their models for evaluation as many times as they wish. The developers of "NovaMind" repeatedly run the benchmark, meticulously analyzing the failure cases after each run. They then fine-tune their model specifically to improve its performance on the exact prompts and evaluation metrics of the benchmark, without necessarily improving its generalization capabilities on unseen data. A benchmark user, seeing the consistently high scores of "NovaMind," selects it believing it to be a robust and generally capable model. However, in real-world applications with slightly different inputs or evaluation criteria, "NovaMind" underperforms significantly because its apparent success was largely due to overfitting to the specific nuances of the benchmark.
  • The benchmark does not measure a property of the SUT linked to the user task

    Example realization: A benchmark rigorously evaluates the toxicity levels of a language model's responses, aiming to ensure it avoids generating hateful or offensive content. A user, however, is primarily concerned with whether the model exhibits kindness and empathy in its interactions, wanting it to provide supportive and understanding responses. They choose the model with the lowest toxicity score, assuming that a non-toxic model will automatically be kind and empathetic. However, they discover that the chosen model, while successfully avoiding harmful language, produces bland, emotionally neutral, and unhelpful responses that lack any genuine sense of care or consideration for the user's emotional state. The benchmark, by focusing solely on the absence of toxicity, failed to assess the positive qualities of kindness and empathy that were crucial for the user's desired application.
  • Users cannot map the scores to a mental model of likely SUT behavior in the real world

    Example realization: A benchmark provides a highly abstract "coherence score" for a language model's long-form generation, calculated using a complex combination of statistical metrics like perplexity and cosine similarity of embeddings. While Model A achieves a score of 0.92 and Model B scores 0.88, a user struggles to understand what these numbers practically mean for how the models will perform when generating a business report or a creative short story. They have no intuitive sense of the difference in quality or the types of coherence failures they might encounter with either model in real-world use. Consequently, their decision between Model A and Model B feels arbitrary, lacking a grounded understanding of how the benchmark scores translate to tangible differences in the models' output for their specific needs.
  • SUT developer trains against evaluation set prior to benchmark release

    Example realization: The developers of "InsightfulBot" develop a new question-answering benchmark. They then intentionally fine-tune InsightfulBot specifically on the exact questions and answers of the benchmark, optimizing performance on the metrics without improving its ability to generalize to novel queries it hasn't seen before. A user, impressed by InsightfulBot's top-ranking score on the benchmark as posted on HuggingFace, assumes it possesses superior knowledge and reasoning capabilities. However, when they use InsightfulBot for real-world information retrieval with unseen questions, the model performs poorly, demonstrating that its benchmark success was an artifact of overfitting to the evaluation data rather than genuine intelligence.
  • SUT developer trains against evaluation set after benchmark release

    Example realization: The developers of "InsightfulBot" gain access to the full evaluation set of a prominent question-answering benchmark. They then intentionally fine-tune their model specifically on these exact questions and answers, optimizing its performance on the benchmark metrics without improving its ability to generalize to novel queries it hasn't seen before. A user, impressed by InsightfulBot's top-ranking score on the benchmark, assumes it possesses superior knowledge and reasoning capabilities. However, when they use InsightfulBot for real-world information retrieval with unseen questions, the model performs poorly, demonstrating that its benchmark success was an artifact of overfitting to the evaluation data rather than genuine intelligence.
  • SUT developers are not bound to adhere to benchmark integrity requirements

    Example realization: A benchmark for evaluating the safety of language models prohibits the use of external knowledge during evaluation to ensure the model's responses are based solely on its training data. However, the developers of "GuardianAI," while submitting their model to the benchmark, secretly implement a retrieval mechanism that allows the model to access and incorporate real-time information from the internet during the evaluation process. This circumvents the benchmark's intended constraints and leads to inflated safety scores, as the model can draw upon external resources to avoid generating harmful content in the specific benchmark scenarios. A user, trusting the benchmark's results, selects GuardianAI believing it has robust internal safety mechanisms, only to discover in real-world use that it can still generate harmful content when disconnected from external resources or when faced with novel prompts not covered by its retrieval strategy.
  • Benchmark production failed to account for an idiosyncratic failure mode

    Example realization: A benchmark developer fails to read a recent research paper listing common failure modes and potential mitigations. As a result, the benchmark authors do not publish integrity requirements when their benchmark is released. Subsequently, a SUT developer trains its SUT on the benchmark's sample set, and the SUT overperforms relative to its actual capability. Consequently, a user adopts what would otherwise be recognized as a poorly performing SUT.
  • Linkage between the evaluation prompts and the information the prompts are meant to supply via the benchmark is not well understood by the benchmark user

    Example realization: A user is examining a collection of example benchmark prompts to help them understand the benchmark. Through this process the benchmark user comes to misunderstand the scope of a safety benchmark, which establishes the scope of coverage (e.g., should the SUT answer a medical question at all?) rather than whether the answer to the medical question is correct. Consequently, the user believes the benchmark covers something that it does not.
  • New requirements emerge that would reasonably be interpreted as being covered in the task definition

    Example realization: A benchmark is designed to evaluate a language model's ability to generate "helpful and informative" summaries of news articles. Initially, the users of this benchmark primarily focus on the conciseness and factual accuracy of the summaries. However, over time, a new requirement emerges: users increasingly need summaries that also highlight potential biases or different perspectives presented in the original articles. A language model that scores highly on the original benchmark by producing brief and accurate summaries might fail to meet this new requirement by presenting a single, seemingly objective viewpoint without acknowledging any underlying biases. Consequently, a user who relied on the initial benchmark scores might select a model that is no longer truly "helpful" for their evolving needs, as it lacks the ability to identify and convey crucial contextual information about potential biases in the news.
  • A SUT developer has disparate access to information about the benchmark **after** its release (i.e., information not provided to other SUT developers)

    Example realization: The creators of a challenging reasoning benchmark privately share detailed information about the specific types of logical fallacies and linguistic ambiguities that the benchmark questions are designed to test with the developers of "CognitoMind" weeks after the official public release. Other SUT developers are only provided with the high-level task description at the time of launch. This privileged information allows the CognitoMind team to specifically tailor their model's architecture and training data to excel on these known weaknesses of other models. As a result, CognitoMind achieves a significantly higher score on the benchmark, not due to superior general reasoning capabilities, but because they had an unfair advantage in understanding the benchmark's intricacies. A benchmark user, unaware of this information asymmetry, might mistakenly conclude that CognitoMind is the most advanced reasoning engine available and choose it for critical applications, only to find its performance on real-world reasoning tasks (that don't align with the benchmark's specific design) is underwhelming compared to other models.
  • The benchmark authors do not know how to formulate the problem as prompts that are illustrative to the user relying on the benchmark

    Example realization: A benchmark aims to evaluate a language model's ability to assist with complex project planning. However, the benchmark authors, lacking deep expertise in project management (i.e., they are not domain experts), create prompts that are overly simplistic, focusing on isolated sub-tasks with clear, unambiguous instructions. A user looking to employ an LLM for real-world project planning faces messy, ill-defined problems with conflicting priorities and the need for nuanced decision-making. The top-performing model on the benchmark excels at the straightforward tasks presented but falters significantly when confronted with the ambiguity and complexity of real-world project scenarios. The user, misled by the benchmark's seemingly relevant task, selects a model that ultimately proves unhelpful for their actual needs because the benchmark prompts failed to capture the essential challenges of project planning as experienced in practice.
  • Benchmark authors do not know how to propagate statistical uncertainty into a user presentation

    Example realization: A benchmark reports the performance of several language models on a reading comprehension task, providing only single-point accuracy scores (e.g., Model A: 85%, Model B: 83%). However, the benchmark authors do not know how to conduct sufficient evaluations to determine the statistical significance of this 2% difference, nor do they present any confidence intervals or other measures of uncertainty. A user looking for the most reliable model might incorrectly assume that Model A is definitively superior to Model B. In reality, the observed difference could be due to random sampling variation, and with more data, the performance of the two models might be statistically indistinguishable. The user, lacking information about the uncertainty in the benchmark results, makes a potentially suboptimal decision based on a seemingly precise but statistically unreliable comparison.
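
    One way the authors could have propagated uncertainty is a confidence interval on the difference between the two scores; a sketch under an assumed test-set size of 500 questions per model:

    ```python
    # Minimal sketch (assumed n=500 questions per model): a 95% confidence interval
    # for the difference between two reported accuracies, 85% vs 83%.
    from math import sqrt

    def diff_ci(p1: float, p2: float, n1: int, n2: int, z: float = 1.96):
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        diff = p1 - p2
        return diff - z * se, diff + z * se

    low, high = diff_ci(0.85, 0.83, 500, 500)
    print(f"95% CI for the difference: [{low:+.3f}, {high:+.3f}]")
    # Roughly [-0.025, +0.065]: the interval includes zero, so the reported 2-point
    # gap between Model A and Model B could easily be sampling noise.
    ```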
  • Understanding the benchmark requires more resources (e.g., study, expertise, exploration) than the relying user has time to expend

    Example realization: A benchmark evaluates the nuanced safety profiles of large language models across a battery of complex, multi-turn adversarial prompts, utilizing sophisticated statistical analyses and presenting the results across a dozen different sub-scores and visualizations. The accompanying documentation is extensive and filled with technical jargon requiring a background in natural language processing and safety research to fully comprehend. A busy software engineer looking to quickly select a reasonably safe LLM for their application lacks the time and specialized knowledge to thoroughly study the benchmark methodology, interpret the various scores, and understand their implications for real-world deployment. They might then resort to simply looking at an overall "safety ranking" (if provided, and potentially misleadingly aggregated) or choose a model based on incomplete or superficial understanding of the benchmark results, potentially selecting a model that isn't actually the most suitable for their specific safety requirements.
  • A SUT developer has disparate access to information about the benchmark **before** its release (i.e., information not provided to other SUT developers)

    Example realization: The creators of a challenging reasoning benchmark privately share detailed information about the specific types of logical fallacies and linguistic ambiguities that the benchmark questions are designed to test with the developers of "CognitoMind" weeks before the official public release. Other SUT developers are only provided with the high-level task description at the time of launch. This privileged information allows the CognitoMind team to specifically tailor their model's architecture and training data to excel on these known weaknesses of other models. As a result, CognitoMind achieves a significantly higher score on the benchmark, not due to superior general reasoning capabilities, but because they had an unfair advantage in understanding the benchmark's intricacies. A benchmark user, unaware of this information asymmetry, might mistakenly conclude that CognitoMind is the most advanced reasoning engine available and choose it for critical applications, only to find its performance on real-world reasoning tasks (that don't align with the benchmark's specific design) is underwhelming compared to other models.