Nknknk - Search

metr.org
https://metr.org › blog
An update on our preliminary evaluations of Claude 3.5 Sonnet ...
Jan 31, 2025 · METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1. In both cases, we failed to …
techmeme.com
https://www.techmeme.com
Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...
1 day ago · METR: Claude Opus 4.5 has a 50% task completion time horizon of about 4 hours and 49 minutes, more than double that of Claude Opus 4 released earlier this year — We estimate that, on …
substack.com
https://metr.substack.com
Claude 3.7 Evaluation Results - METR
Apr 4, 2025 · METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed …
metr.org
https://evaluations.metr.org
Details about METR’s preliminary evaluation of Claude 3.5 ...
METR evaluated Claude-3.5-Sonnet on tasks from both our general autonomy and AI R&D task suites. The general autonomy evaluations were performed similarly to our GPT-4o evaluation, and use …
metr.org
https://metr.org
METR
We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling …
substack.com
https://metr.substack.com › archive
Archive - METR
METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1.
metr.org
https://evaluations.metr.org
Details about METR’s preliminary evaluation of Claude 3.7
METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed impressive AI R&D …

