  1. An update on our preliminary evaluations of Claude 3.5 Sonnet ...

    Jan 31, 2025 · METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1. In both cases, we failed to …

  2. Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...

    1 day ago · METR: Claude Opus 4.5 has a 50% task completion time horizon of about 4 hours and 49 minutes, more than double that of Claude Opus 4 released earlier this year — We estimate that, on …

  3. Claude 3.7 Evaluation Results - METR

    Apr 4, 2025 · METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed …

  4. Details about METR’s preliminary evaluation of Claude 3.5 ...

    METR evaluated Claude-3.5-Sonnet on tasks from both our general autonomy and AI R&D task suites. The general autonomy evaluations were performed similarly to our GPT-4o evaluation, and uses …

  5. METR

    We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling …

  6. Archive - METR

    METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1.

  7. Details about METR’s preliminary evaluation of Claude 3.7

    METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed impressive AI R&D …