A new test from OpenAI aims to understand how close AI is to outperforming humans at economically valuable work.
Anthropic's Claude Opus 4.1 excelled at many professional tasks, especially those performed by clerks, software developers, ...
Anthropic's Claude Sonnet 4.5 now scores 77% on a key software engineering benchmark and can work autonomously for over 30 ...
Google's Gemini 2.5 Flash Lite is now the fastest proprietary model (and there are more big Gemini updates). Google continues to improve its Gemini family of large language models (LLMs) and its audio ...
MITRE said the ALUE benchmark for aerospace LLM evaluation supports custom datasets, open-source LLMs and user-defined prompts.
Preview, a trillion-parameter natural language reasoning model and the first open-source system of its scale. On the ...
MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
‘We’ve identified multiple loopholes with SWE-bench Verified,’ says the manager at Meta Platforms’ AI research lab FAIR.
Scientists at Singapore-based AI firm Sapient have unveiled a new hierarchical reasoning model (HRM), inspired by how the human brain processes information.