
An update on our preliminary evaluations of Claude 3.5 Sonnet ...
Jan 31, 2025 · METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1. In both cases, we failed to …
Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...
1 day ago · METR: Claude Opus 4.5 has a 50% task completion time horizon of about 4 hours and 49 minutes, more than double that of Claude Opus 4 released earlier this year — We estimate that, on …
Claude 3.7 Evaluation Results - METR
Apr 4, 2025 · METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed …
Details about METR’s preliminary evaluation of Claude 3.5 ...
METR evaluated Claude-3.5-Sonnet on tasks from both our general autonomy and AI R&D task suites. The general autonomy evaluations were performed similarly to our GPT-4o evaluation, and uses …
METR
We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling …
Archive - METR
METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1.
Details about METR’s preliminary evaluation of Claude 3.7
METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed impressive AI R&D …