BullshitBench v2
Measures models’ ability to detect nonsense across 100 plausible-sounding nonsense prompts in software, medical, legal, finance, and physics. BullshitBench v2: Detection Rate by Model. Percentage of nonsense questions each model detected (green), partially challenged (amber), or accepted (red).