METR


METR (Model Evaluation and Threat Research) is a nonprofit research institute based in Berkeley, California, that evaluates frontier AI models' capabilities to carry out long-horizon, agentic tasks that some researchers argue could pose catastrophic risks to society. It has worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards, including OpenAI's o3, o4-mini, GPT-4o, and GPT-4.5, and Anthropic's Claude models.
METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was spun off into an independent 501(c)(3) nonprofit and renamed METR.

Research

A substantial amount of METR's research is focused on the capabilities of AI systems to conduct research and development of AI systems themselves, including RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".
[Figure: "Measuring AI Ability to Complete Long Tasks" — a graph showing that the length of tasks frontier models can execute at a 50% success rate doubled every 7 months from 2019 to 2024. The shaded region represents a 95% confidence interval.]
In March 2025, METR published a paper finding that the length of software engineering tasks that leading AI models could complete doubled roughly every 7 months between 2019 and 2024.
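The trend described above is exponential growth with a fixed doubling time, so it can be extrapolated with a simple formula. The sketch below is illustrative only; the starting task length and horizon are assumed numbers, not figures from METR's paper.

```python
def task_length_after(months: float, start_length_minutes: float,
                      doubling_time_months: float = 7.0) -> float:
    """Extrapolate task length under exponential growth with a fixed
    doubling time: length(t) = length(0) * 2 ** (t / doubling_time)."""
    return start_length_minutes * 2 ** (months / doubling_time_months)

# Illustrative: with a 7-month doubling time, a 60-minute task horizon
# grows 32x (five doublings) over 35 months.
print(task_length_after(35, 60.0))  # 1920.0 minutes
```

The same formula runs in reverse to estimate how long ago models handled tasks of a given length, by passing a negative number of months.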