A new subjective benchmarking approach reveals what standardized tests miss about AI model capabilities and training data overlap.
OpenAI's new confidence-targeted evaluation method reveals that we've been rewarding LLMs for confident bullshit instead of honest uncertainty.