Claude introduces computer use capability in public beta

Anthropic launches computer use feature for Claude 3.5 Sonnet, enabling AI to interact with computers like humans, alongside model updates.

Developers can build with the computer use beta on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI

On October 22, 2024, Anthropic announced computer use capabilities for Claude 3.5 Sonnet, which allow the model to operate a computer by viewing the screen, moving the cursor, clicking, and typing. According to Anthropic's announcement, Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.

According to Anthropic's research documentation, Claude 3.5 Sonnet achieved a 14.9% score on OSWorld in the screenshot-only category, surpassing the next-best AI system's score of 7.8%. When given additional steps to complete tasks, the model's performance increased to 22.0%.

The upgraded Claude 3.5 Sonnet demonstrated significant improvements in coding capabilities, with SWE-bench Verified performance increasing from 33.4% to 49.0%. The model also showed enhanced performance on TAU-bench, improving from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the airline domain.
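
The size of these gains is easier to compare once each is expressed both in percentage points and relative to the earlier score. The short sketch below does that arithmetic using only the before and after figures quoted in this section; the helper names point_gain and relative_gain are illustrative and do not come from Anthropic's materials.

```python
# Scores in percent, copied from the paragraph above: (before, after).
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}


def point_gain(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change as a percentage of the earlier score, rounded to one decimal."""
    return round(100 * (after - before) / before, 1)


for name, (before, after) in scores.items():
    print(
        f"{name}: +{point_gain(before, after)} points, "
        f"+{relative_gain(before, after)}% relative to the earlier score"
    )
```

Percentage points and relative change answer different questions, so a summary should say which one it is quoting.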

Technical implementation

The computer use functionality operates through an API that enables Claude to:

  • Interpret screen contents through screenshots
  • Calculate pixel-based cursor movements
  • Execute clicking actions
  • Input text through a virtual keyboard
  • Navigate software interfaces designed for human use
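
The capabilities listed above amount to a loop: the model requests an action, and the host application carries it out and reports what happened. The following is a minimal sketch of the host side only. It assumes a hypothetical in-memory FakeDesktop class and action names (screenshot, mouse_move, left_click, type) chosen for illustration; it does not call the real API and should not be read as Anthropic's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a real screen, mouse, and keyboard."""
    width: int = 1024
    height: int = 768
    cursor: tuple = (0, 0)
    typed: list = field(default_factory=list)
    clicks: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real harness would return image data; this returns a description.
        return f"{self.width}x{self.height} screen, cursor at {self.cursor}"


def execute_action(desktop: FakeDesktop, action: dict) -> str:
    """Apply one requested action and return the observation to send back."""
    kind = action["action"]
    if kind == "screenshot":
        return desktop.screenshot()
    if kind == "mouse_move":
        x, y = action["coordinate"]
        # Reject coordinates that fall outside the visible screen.
        if not (0 <= x < desktop.width and 0 <= y < desktop.height):
            return f"error: ({x}, {y}) is outside the screen"
        desktop.cursor = (x, y)
        return f"moved to ({x}, {y})"
    if kind == "left_click":
        desktop.clicks.append(desktop.cursor)
        return f"clicked at {desktop.cursor}"
    if kind == "type":
        desktop.typed.append(action["text"])
        return f"typed {len(action['text'])} characters"
    return f"error: unsupported action {kind!r}"


def run_actions(desktop: FakeDesktop, actions: list) -> list:
    """Execute actions in order, collecting one observation per action."""
    return [execute_action(desktop, a) for a in actions]


# Example run with a hand-written action list standing in for model output.
desktop = FakeDesktop()
observations = run_actions(desktop, [
    {"action": "screenshot"},
    {"action": "mouse_move", "coordinate": [200, 150]},
    {"action": "left_click"},
    {"action": "type", "text": "hello"},
    {"action": "mouse_move", "coordinate": [5000, 10]},  # out of bounds
])
for line in observations:
    print(line)
```

Returning errors as observations, rather than raising exceptions, is one possible design choice: it lets the requester see the failure and try a different action.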

Industry applications and early adoption

Several companies have already begun working with the computer use beta or the upgraded model:

  • Replit: Using computer use to evaluate apps as they are built in its Replit Agent product
  • GitLab: Testing the model on DevSecOps tasks, with a reported reasoning improvement of up to 10%
  • Cognition: Using the model for autonomous AI evaluations
  • The Browser Company: Using the model to automate web-based workflows

Safety measures and limitations

According to Anthropic's development documentation, the company has implemented several safety measures:

  • New classifiers to identify when computer use is active and whether harm is occurring
  • Monitoring for requests that involve election-related activity
  • Systems that steer Claude away from generating and posting social media content
  • Systems that steer Claude away from registering web domains
  • Systems that steer Claude away from interacting with government websites

Current technical limitations include:

  • Challenges with scrolling and dragging actions
  • Difficulties with zoom functionality
  • Limited ability to process rapid screen changes
  • Slower execution compared to human operators

Development process

The research process involved:

  • Building upon previous tool use and multimodality work
  • Training the model on basic software like calculators and text editors
  • Developing pixel-counting accuracy for precise cursor control
  • Testing generalization capabilities across different software applications
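
Precise cursor control depends on getting pixel coordinates right. One practical case, shown here as a sketch under stated assumptions, is a harness that sends the model a downscaled screenshot and must map a chosen point back to the real display before moving the cursor. The functions scale_to_screen and cursor_offset are hypothetical helpers written for this illustration; the source does not say how Anthropic's tooling handles resolution.

```python
def scale_to_screen(point, shot_size, screen_size):
    """Map a pixel in a (possibly resized) screenshot to a screen pixel.

    The centre of the screenshot pixel is mapped, so that scaling up lands
    in the middle of the corresponding screen region rather than its corner.
    """
    sx, sy = point
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    if not (0 <= sx < shot_w and 0 <= sy < shot_h):
        raise ValueError(f"{point} is outside a {shot_w}x{shot_h} screenshot")
    x = int((sx + 0.5) * screen_w / shot_w)
    y = int((sy + 0.5) * screen_h / shot_h)
    # Clamp in case of rounding at the far edge.
    return min(x, screen_w - 1), min(y, screen_h - 1)


def cursor_offset(current, target):
    """Pixels to move horizontally and vertically to reach the target."""
    return target[0] - current[0], target[1] - current[1]


# Example: a point chosen on a 1024x768 screenshot of a 2048x1536 display.
target = scale_to_screen((512, 384), (1024, 768), (2048, 1536))
dx, dy = cursor_offset((100, 100), target)
print(f"target={target}, move by dx={dx}, dy={dy}")
```

When the screenshot and the display share a resolution, the mapping returns the input point unchanged.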

Performance metrics

Scores reported for the upgraded Claude 3.5 Sonnet:

  • SWE-bench Verified: 49.0% (up from 33.4%)
  • TAU-bench retail domain: 69.2% (up from 62.6%)
  • TAU-bench airline domain: 46.0% (up from 36.0%)
  • OSWorld screenshot-only category: 14.9%
  • OSWorld with additional steps allowed per task: 22.0%

Key facts

  • Release Date: October 22, 2024
  • Model Name: Claude 3.5 Sonnet
  • Primary Innovation: Computer use capability
  • Performance Leadership: Highest public model score on SWE-bench Verified
  • Safety Level: AI Safety Level 2
  • Availability: Public beta via Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI
  • Current Human-Level Comparison: Model at 14.9% vs typical human performance of 70-75% on OSWorld