Researchers from the Center for AI Safety and Scale AI have introduced "Humanity's Last Exam" (HLE), a rigorous benchmark designed to measure how close advanced AI models come to human-level expertise across more than 100 subjects.
The exam, detailed in a new Nature study, comprises 2,500 PhD-level questions vetted by more than 1,000 experts worldwide. Each question is designed to be unambiguous, verifiable, and not answerable by a simple web search.
In initial testing, top models such as OpenAI's o1 answered only 8.3% of the questions correctly. As of February 2026, Google's Gemini 3 Deep Think leads with 48.4%, still far below the 90% average achieved by human experts.
The creators emphasize that strong HLE performance is a necessary milestone but does not by itself signify artificial general intelligence (AGI), which would require broader capabilities such as autonomous research.
