Global Call for Toughest AI Questions: ‘Humanity’s Last Exam’ Aims to Test Advanced AI Systems
A coalition of technology experts has issued a global call for the most challenging questions to test artificial intelligence systems, in an initiative designed to gauge the arrival of expert-level AI. The project, dubbed ‘Humanity’s Last Exam’, seeks to create a comprehensive assessment that remains relevant as AI capabilities continue to advance.
Organized by the non-profit Center for AI Safety (CAIS) and the startup Scale AI, the initiative aims to crowdsource at least 1,000 difficult questions by November 1. The questions must be difficult for non-experts to answer and will undergo peer review. Authors of selected submissions will receive co-authorship and prizes of up to $5,000, sponsored by Scale AI.
Raising the Bar for AI Benchmarks
The call for tougher AI challenges comes in the wake of recent developments in which advanced AI models have effectively saturated existing benchmarks. Just days ago, OpenAI previewed its latest model, OpenAI o1, which aced popular reasoning benchmarks.
Dan Hendrycks, executive director of CAIS and an advisor to Elon Musk’s xAI startup, emphasized the need for more rigorous assessments. In 2021, Hendrycks co-authored two influential papers proposing tests for AI systems: an undergraduate-level exam covering topics such as U.S. history, and a set of competition-level mathematics problems. While AI systems initially performed poorly on these tests, recent models have all but mastered them.
“At the time of those papers, AI was giving almost random answers to questions on the exams,” Hendrycks said. “They’re now crushed.”
For example, Anthropic’s Claude improved from about 77% on the undergraduate-level test in 2023 to nearly 89% a year later. Such rapid gains have eroded the usefulness of standard benchmarks as measures of AI progress.
Seeking New Measures of Intelligence
As AI models continue to improve, researchers are exploring new ways to assess their capabilities, particularly in areas such as abstract reasoning and planning. On some lesser-used tests involving plan formulation and visual pattern-recognition puzzles, AI models still score poorly, suggesting these skills may be better indicators of intelligence.
‘Humanity’s Last Exam’ will focus on abstract reasoning to provide a more accurate measure of advanced AI capabilities. Hendrycks noted that some questions will be kept private to prevent AI systems from simply memorizing the answers during training.
Global Participation and Ethical Considerations
The initiative encourages global participation in developing challenging questions for AI systems. One notable restriction is the exclusion of questions about weapons, due to concerns over the potential dangers of AI studying such topics.
“We desperately need harder tests for expert-level models to measure the rapid progress of AI,” said Alexandr Wang, CEO of Scale AI.
By harnessing collective expertise, the project aims to develop an exam that can effectively gauge the progression of AI systems and ensure that assessments remain meaningful even as technology advances.
Reference(s):
“AI experts ready 'Humanity's Last Exam' to stump powerful tech,” cgtn.com