Claude introduces computer use capability in public beta

On October 22, 2024, Anthropic announced the release of computer use capabilities for Claude 3.5 Sonnet, allowing the AI model to interact with computers through cursor movement, clicking, and typing. According to Anthropic's announcement, this public beta release makes Claude 3.5 Sonnet the first frontier AI model to offer computer use functionality.

According to Anthropic's research documentation, Claude 3.5 Sonnet scored 14.9% on OSWorld, a benchmark of AI models' ability to use computers, in the screenshot-only category, ahead of the next-best AI system's 7.8%. When allowed more steps to complete each task, the model scored 22.0%.

The upgraded Claude 3.5 Sonnet demonstrated significant improvements in coding capabilities, with SWE-bench Verified performance increasing from 33.4% to 49.0%. The model also showed enhanced performance on TAU-bench, improving from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the airline domain.
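
The size of these gains can be read two ways: as percentage points added, or as growth relative to the earlier score. The short calculation below works through both using only the figures quoted above. The dictionary layout and the helper name `gains` are choices made for this sketch and do not come from Anthropic.

```python
# Before/after scores (percent) as quoted in the paragraph above.
SCORES = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}


def gains(before: float, after: float) -> tuple:
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


if __name__ == "__main__":
    for name, (before, after) in SCORES.items():
        points, relative = gains(before, after)
        print(f"{name}: +{points} points ({relative}% relative)")
```

On these figures, SWE-bench Verified rises by 15.6 points (about 46.7% relative), the TAU-bench retail score by 6.6 points (about 10.5%), and the airline score by 10.0 points (about 27.8%).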

Technical implementation

The computer use functionality operates through an API that enables Claude to:

Interpret screen contents through screenshots
Calculate pixel-based cursor movements
Execute clicking actions
Input text through a virtual keyboard
Navigate software interfaces designed for human use
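
The capabilities listed above imply a loop in which an application captures a screenshot, sends it to the model, and then carries out the action the model requests. The sketch below illustrates only the executing side of that loop, using an in-memory stand-in for a desktop. Every name in it (`Action`, `FakeDesktop`, `apply_action`) and the three-action vocabulary are hypothetical and are not Anthropic's API; a real integration would follow Anthropic's computer use documentation.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One model-requested step. Field names are illustrative only."""
    kind: str          # "move", "click", or "type"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """In-memory stand-in for a real screen, cursor, and keyboard."""
    width: int = 1024
    height: int = 768
    cursor: tuple = (0, 0)
    clicks: list = field(default_factory=list)
    typed: str = ""


def apply_action(desktop: FakeDesktop, action: Action) -> None:
    """Execute a single action against the fake desktop."""
    if action.kind == "move":
        # Clamp so a miscounted pixel coordinate cannot leave the screen.
        x = min(max(action.x, 0), desktop.width - 1)
        y = min(max(action.y, 0), desktop.height - 1)
        desktop.cursor = (x, y)
    elif action.kind == "click":
        # A click is recorded at the current cursor position.
        desktop.clicks.append(desktop.cursor)
    elif action.kind == "type":
        desktop.typed += action.text
    else:
        raise ValueError(f"unsupported action: {action.kind!r}")


if __name__ == "__main__":
    desktop = FakeDesktop()
    # A hard-coded plan stands in for the model's responses.
    plan = [
        Action("move", x=200, y=150),
        Action("click"),
        Action("type", text="hello"),
    ]
    for step in plan:
        apply_action(desktop, step)
    print(desktop.cursor, desktop.clicks, desktop.typed)
```

Keeping the executor separate from whatever produces the actions makes it possible to test the cursor and keyboard handling without calling a model at all.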

Industry applications and early adoption

Several companies have already begun testing or deploying the upgraded model:

Replit: Using computer use to evaluate apps in its Replit Agent product
GitLab: Testing the model for DevSecOps tasks, with a reported improvement in reasoning of up to 10%
Cognition: Using the model for autonomous AI evaluations
The Browser Company: Automating web-based workflows

Safety measures and limitations

According to Anthropic's development documentation, the company has implemented several safety measures:

New classifiers to identify computer use activity and potential harm
Monitoring systems for election-related activities
Restrictions on social media content generation
Controls for domain registration
Limitations on government website interactions

Current technical limitations include:

Challenges with scrolling and dragging actions
Difficulties with zoom functionality
Limited ability to process rapid screen changes
Slower execution compared to human operators

Development process

The research process involved:

Building upon previous tool use and multimodality work
Training the model on basic software like calculators and text editors
Developing pixel-counting accuracy for precise cursor control
Testing generalization capabilities across different software applications
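
Pixel-accurate cursor control also depends on which coordinate system the pixels are counted in. If an application downsizes screenshots before sending them to the model (an assumption made for this sketch, not something stated above), any coordinates the model reports have to be mapped back to the real screen resolution before the cursor moves. The helper name `to_screen_coords` and the example resolutions are hypothetical.

```python
def to_screen_coords(x, y, shot_size, screen_size):
    """Map a point given in screenshot pixels to real screen pixels.

    shot_size and screen_size are (width, height) tuples.
    """
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    if shot_w <= 0 or shot_h <= 0:
        raise ValueError("screenshot dimensions must be positive")
    # Scale each axis independently, then round to a whole pixel.
    sx = round(x * screen_w / shot_w)
    sy = round(y * screen_h / shot_h)
    # Clamp to the last addressable pixel on each axis.
    sx = min(max(sx, 0), screen_w - 1)
    sy = min(max(sy, 0), screen_h - 1)
    return sx, sy


if __name__ == "__main__":
    # Screenshot at 1024x768, real display at 2048x1536 (example values).
    print(to_screen_coords(512, 384, (1024, 768), (2048, 1536)))
```

Scaling each axis separately keeps the mapping correct even when the screenshot and the display have different aspect ratios.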

Performance metrics

Headline benchmark results for the upgraded Claude 3.5 Sonnet:

SWE-bench Verified: 49.0%
TAU-bench retail domain: 69.2%
TAU-bench airline domain: 46.0%
OSWorld screenshot-only category: 14.9%
OSWorld with additional steps allowed: 22.0%

Key facts

Release Date: October 22, 2024
Model Name: Claude 3.5 Sonnet
Primary Innovation: Computer use capability
Performance Leadership: Highest SWE-bench Verified score among publicly available models
Safety Level: AI Safety Level 2
Availability: Public beta via Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI
Comparison with Human Performance: 14.9% on OSWorld, versus typical human performance of 70-75%