Tagged with

2 articles found

LLM Benchmarks: Why 'Top 50 Humans' Might Be Better Than MMLU

llm-benchmarking

A new subjective benchmarking approach reveals what standardized tests miss about AI model capabilities and training data overlap.

#llm-benchmarking #ai-evaluation #subjective-testing...

We Accidentally Trained AI to Lie to Us

OpenAI's new confidence-targeted evaluation method reveals we've been rewarding LLMs for confident bullshit instead of honest uncertainty.

#llm #hallucination #ai-evaluation...