- Mitigation 1
- to Failure Mode 1
- improving Intelligibility
- during (1) Task Definition
- Severity Reduction 25
- Likelihood Reduction 80
Do you clearly, publicly, and prominently state the user information foraging task associated with your benchmark evaluation, using plain and concise language that is accessible to the intended users? By 'information foraging task,' we mean the specific goal or question that a user is trying to answer or learn about when engaging with the benchmark, reflecting their real-world information needs.
This mitigates Failure Mode 1: The information provided by the benchmark does not match the information the benchmark user believes it provides.
Affirming Benchmarks:
AILuminate05, AILuminate10, AIRBench, ARCAGIPrivate, ARCAGIPublic, BOLDBias, DecodingTrustPrivacy, DecodingTrustToxicity, Ethics, GPQA, GSM8K, HellaSwag, HumanEval, HumanitysLastExam, MMLU, Machiavelli, RealToxicityPrompts, Toxigen, TruthfulQA, WinoGrande, Wordcraft, BBQ