AI agent demos may look impressive, but getting the technology to work reliably and without annoying or costly errors in real life can be a challenge. Current models can answer questions and converse with near-human skill and are the backbone of chatbots like OpenAI’s ChatGPT and Google’s Gemini. They can also perform tasks on computers when given a simple command by accessing the computer screen as well as input devices such as a keyboard and trackpad or through low-level software interfaces.
Anthropic says Claude outperforms other AI agents on several key benchmarks, including SWE-bench, which measures an agent’s software development skills, and OSWorld, which measures an agent’s ability to use a computer operating system. The claims have yet to be independently verified. Anthropic says Claude performs tasks in OSWorld correctly 14.9 percent of the time. This is well below humans, who typically score around 75 percent, but considerably higher than the current best agents, including OpenAI’s GPT-4, which succeeds about 7.7 percent of the time.
Anthropic says several companies are already testing the agent version of Claude. These include Canva, which uses it to automate design and editing tasks, and Replit, which uses the model for coding tasks. Other early adopters include The Browser Company, Asana, and Notion.
Ofir Press, a Princeton University postdoc who helped develop SWE-bench, says agentic AI tends to lack the ability to plan far in advance and often has difficulty recovering from errors. “To prove useful, agents need to perform strongly on strict and realistic benchmarks,” he says, such as reliably planning a wide range of trips for a user and booking all the necessary tickets.
Kaplan notes that Claude can already fix some bugs surprisingly well. When faced with a terminal error while trying to start a web server, for example, the model knew how to revise its command to fix it. The model also figured out that it needed to enable pop-ups when it hit a dead end while browsing the web.
Many technology companies are now racing to develop AI agents in their quest for market share and prominence. In fact, it may not be long before many users will have agents at their fingertips. Microsoft, which has invested more than $13 billion in OpenAI, says it is testing agents that can use Windows computers. Amazon, which has invested heavily in Anthropic, is exploring how agents could recommend and eventually purchase products for its customers.
Sonya Huang, a partner at venture firm Sequoia who focuses on AI companies, says that despite all the excitement around AI agents, most companies are actually simply rebranding AI-powered tools. Speaking to WIRED ahead of the Anthropic news, she says the technology currently works best when applied in narrow domains, such as coding-related work. “You need to pick problem spaces where if the model fails, you’re fine,” she says. “Those are the problem spaces where truly native agent companies will emerge.”
A key challenge with agentic AI is that errors can be much more problematic than a confusing response from a chatbot. Anthropic has placed certain restrictions on what Claude can do, for example limiting its ability to use a person’s credit card to buy things.
If mistakes can be avoided well enough, says Princeton University’s Press, users can learn to see AI (and computers) in a whole new way. “I’m very excited about this new era,” he says.