The most capable open-source AI model with visual abilities yet could prompt more developers, researchers, and startups to build AI agents that perform useful tasks on your computer for you.
Released today by the Allen Institute for AI (Ai2), the Multimodal Open Language Model, or Molmo, can interpret images as well as converse through a chat interface. That means it can make sense of a computer screen, which could help an AI agent perform tasks such as browsing the web, navigating file directories, and drafting documents.
“With this release, many more people can deploy a multimodal model,” says Ali Farhadi, CEO of Ai2, a research organization based in Seattle, Washington, and a computer scientist at the University of Washington. “It should be an enabler for next-generation applications.”
So-called AI agents are widely touted as the next big thing in AI, with OpenAI, Google, and others racing to develop them. The term has become a buzzword of late, but the grand vision is for AI to go well beyond chat, reliably carrying out complex actions on a computer when given a command. That capability has yet to materialize at any real scale.
Some powerful AI models already have visual capabilities, including OpenAI’s GPT-4, Anthropic’s Claude, and Google DeepMind’s Gemini. These models can power some experimental AI agents, but their weights are kept private, and they can be accessed only through a paid application programming interface (API).
Meta has released a family of AI models called Llama under a license that limits their commercial use, but has not yet provided developers with a multimodal version. Meta is expected to announce several new products, perhaps including new Llama AI models, at its Connect event today.
“Having an open-source multimodal model means that any startup or researcher with an idea can try to carry it out,” says Ofir Press, a postdoctoral researcher at Princeton University who works on AI agents.
Because Molmo is open source, Press says, developers will be able to fine-tune their agents for specific tasks, such as working with spreadsheets, by supplying additional training data. Models like GPT-4 can be tuned only to a limited degree through their APIs, whereas a fully open model can be modified extensively. “When you have an open-source model like this, you have a lot more options,” Press says.
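To give a sense of what such task-specific tuning might look like in practice, here is a minimal sketch using the Hugging Face transformers, peft, and datasets libraries. The checkpoint ID, the adapter target modules, and the spreadsheet_tasks.jsonl file are illustrative assumptions, not Ai2’s published training recipe.

```python
# A minimal LoRA fine-tuning sketch for an open model checkpoint.
# The checkpoint ID, target_modules, and dataset file are placeholders:
# adjust them to the actual Molmo release and your own task data.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

# Train small low-rank adapters rather than all of the model's weights,
# so the run fits on a single GPU. Module names to adapt vary by architecture.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))

# Task-specific examples, e.g. spreadsheet-manipulation instructions,
# stored one JSON object per line with a "text" field.
data = load_dataset("json", data_files="spreadsheet_tasks.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="molmo-spreadsheet-agent",
                           per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

This is exactly the kind of workflow closed, API-only models rule out: the adapter weights, training data, and every intermediate checkpoint stay in the developer’s hands.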
Ai2 is releasing several sizes of Molmo today, including a 70-billion-parameter model and a 1-billion-parameter model small enough to run on a mobile device. A model’s parameter count refers to the number of adjustable values it contains for storing and manipulating data, and it roughly tracks the model’s capabilities.
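As a rough illustration of what those numbers mean in practice, the count can be read directly off a model’s weights, and some back-of-the-envelope arithmetic shows why the 1-billion-parameter version fits on a phone while the 70-billion-parameter one does not. A short PyTorch sketch:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable values (parameters) in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Memory footprint: at 16-bit precision each parameter takes 2 bytes,
# so a 1-billion-parameter model needs about 2 GB for its weights
# (feasible on a phone), while a 70-billion-parameter model needs
# roughly 140 GB (far beyond any phone's memory).
print(f"{1e9 * 2 / 1e9:.0f} GB for 1B params at fp16")    # -> 2 GB
print(f"{70e9 * 2 / 1e9:.0f} GB for 70B params at fp16")  # -> 140 GB
```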
Ai2 claims that Molmo, despite its relatively small size, is as capable as considerably larger commercial models because it was carefully trained on high-quality data. The new model is also fully open source: unlike Meta’s Llama, it comes with no restrictions on its use. Ai2 is publishing the training data used to create the model as well, giving researchers more insight into how it works.
Releasing such powerful models is not without risk. Open models are easier to adapt for nefarious purposes; one day, for example, we could see malicious AI agents designed to automate the hacking of computer systems.
Ai2’s Farhadi argues that Molmo’s efficiency and portability will enable developers to create more powerful software agents that run natively on smartphones and other portable devices. “The billion-parameter model now has performance on par with models that are at least ten times larger,” he says.
However, building useful AI agents may depend on more than just more efficient multimodal models. A key challenge is getting models to work more reliably, which may require further advances in AI reasoning, something OpenAI has sought to address with its latest model, o1, which demonstrates step-by-step reasoning skills. The next step may be to give multimodal models similar reasoning abilities.
For now, Molmo’s release means that AI agents are closer than ever, and that they could soon prove useful even beyond the giants that dominate the AI world.