I’m not an expert programmer at all, but thanks to a free program called SWE-agent, I’ve just been able to debug and fix a tricky issue involving an incorrectly named file in different code repositories on the software hosting site GitHub.
I flagged an issue on GitHub to SWE-agent and watched as it analyzed the code and reasoned about what might be wrong. It correctly determined that the root cause of the error was a line pointing to the wrong location in a file, then navigated through the project, located the file, and modified the code to make everything work correctly. It’s the kind of thing that an inexperienced developer (like myself) could spend hours trying to debug.
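To give a sense of what that kind of fix looks like, here is a hypothetical, illustrative sketch (the file names and function are invented, not taken from the actual repository): a line of code still points to a file’s old location after it was renamed, and the fix is simply to update the path.

```python
from pathlib import Path
import json

# Hypothetical example of the bug the agent found: a hard-coded path that
# still points to the old location of a renamed file.
# OLD (buggy): SETTINGS_FILE = Path("config/settings.json")
SETTINGS_FILE = Path("config/app_settings.json")  # corrected to the file's actual name

def load_settings() -> dict:
    """Load project settings, failing with a clear message if the file is missing."""
    if not SETTINGS_FILE.exists():
        raise FileNotFoundError(f"Expected settings file at {SETTINGS_FILE}")
    return json.loads(SETTINGS_FILE.read_text())
```

A one-line change like this is trivial once found; the hard part, and what the agent did, is tracing the error back to the misnamed file in the first place.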
Many programmers already use AI to write software faster. GitHub Copilot popularized AI-assisted coding, and many integrated development environments will now automatically complete code snippets as a developer starts typing. You can also ask an AI questions about your code or have it suggest ways to improve what you’re doing.
Last summer, John Yang and Carlos Jimenez, two Princeton PhD students, began debating what it would take for AI to become a real-world software engineer. This led them and others at Princeton to come up with SWE-bench, a set of benchmarks for testing AI tools on a variety of coding tasks. After releasing the benchmark in October, the team developed their own tool, SWE-agent, to master these tasks.
SWE-agent (“SWE” is short for “software engineering”) is one of a number of considerably more powerful AI coding programs that go beyond writing lines of code and act as so-called software agents, leveraging the tools needed to manipulate, debug, and organize software. The startup Cognition AI went viral in March with a video demonstration of one such tool, called Devin.
Ofir Press, a member of the Princeton team, says SWE-bench could help OpenAI test the performance and reliability of software agents. “This is just my opinion, but I think they will release a software agent very soon,” Press says.
OpenAI declined to comment, but another source with knowledge of the company’s activities, who asked not to be identified, told WIRED that “OpenAI is definitely working on coding agents.”
Just as GitHub Copilot demonstrated that large language models can write code and increase programmer productivity, tools like SWE-agent could show that AI agents can work reliably, starting with building and maintaining code.
Several companies are testing agents for software development. At the top of the SWE-bench rankings, which measure how different coding agents score on a variety of tasks, is a tool from the startup Factory AI, followed by AutoCodeRover, an open-source entry from a team at the National University of Singapore.
The big players are also getting involved. A software-writing tool called Amazon Q is another top performer on SWE-bench. “Software development is much more than just writing code,” says Deepak Singh, vice president of software development at Amazon Web Services.
He adds that AWS has used the agent to translate entire software stacks from one programming language to another. “It’s like having a really smart engineer sitting next to you, writing and building an application with you,” Singh says. “I think it’s quite transformative.”
An OpenAI team recently helped the Princeton researchers improve their benchmark for measuring the reliability and effectiveness of tools like SWE-agent, suggesting the company could also be perfecting agents to write code or perform other tasks on a computer.
Singh says several customers are already building complex back-end applications using Q. My own experiments with SWE-agent suggest that anyone who codes will soon want to use agents to improve their programming, or risk being left behind.