Center for AI Safety, Scale AI and HLE Contributors Consortium (2026). A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649, 1139-1146. [Article]
Documents
PDF
Adesanya-A_Benchmark_of_Expert-level(VoR).pdf - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding [1], limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
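The abstract notes that HLE questions are formatted for automated grading and that models are scored on both accuracy and calibration. As a rough illustration only, and not the authors' actual evaluation pipeline, the sketch below computes accuracy and a simple binned calibration error from hypothetical graded answers and model-stated confidences; the record structure and function names are assumptions introduced for this example.

```python
# Illustrative sketch only: how accuracy and a binned calibration error
# could be computed for automatically graded benchmark responses.
# The Record fields and the binning scheme are assumptions, not HLE's grading code.
from dataclasses import dataclass


@dataclass
class Record:
    correct: bool      # did the graded answer match the known solution?
    confidence: float  # model-stated confidence in [0, 1]


def accuracy(records: list[Record]) -> float:
    """Fraction of questions answered correctly."""
    return sum(r.correct for r in records) / len(records)


def expected_calibration_error(records: list[Record], n_bins: int = 10) -> float:
    """Bin predictions by stated confidence, then average the gap between
    mean confidence and empirical accuracy per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(r.confidence for r in bucket) / len(bucket)
        bucket_acc = sum(r.correct for r in bucket) / len(bucket)
        ece += (len(bucket) / len(records)) * abs(mean_conf - bucket_acc)
    return ece


if __name__ == "__main__":
    # Toy data: an overconfident model with low accuracy.
    demo = [Record(False, 0.9), Record(True, 0.8),
            Record(False, 0.95), Record(False, 0.7)]
    print(f"accuracy = {accuracy(demo):.2f}, ECE = {expected_calibration_error(demo):.2f}")
```

A high calibration error with low accuracy, as in the toy data above, corresponds to the pattern the abstract describes: models that answer confidently while being wrong at the expert frontier.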