In a cluttered open-plan office in Mountain View, California, a tall, spindly wheeled robot has been busy playing tour guide and casual office helper, thanks to a major update to Google DeepMind's language models revealed today. The robot uses the latest version of Google's massive Gemini language model both to parse commands and to navigate.
For example, when a human says, “Find me a place to write,” the robot dutifully rolls off and leads the person to a spotless whiteboard located somewhere in the building.
Gemini’s ability to handle video and text, along with its capacity to ingest large amounts of information in the form of pre-recorded video tours of the office, allows the “Google helper” robot to understand its environment and navigate correctly when given commands that require some common-sense reasoning. The robot combines Gemini with an algorithm that generates specific actions, such as turning, in response to those commands and to what the robot sees in front of it.
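The article does not spell out the exact interfaces, but the division of labor it describes, a multimodal model that reasons over a recorded tour and a separate routine that turns the chosen goal into motions, can be sketched roughly in Python. Everything below (the class names, the keyword stub standing in for Gemini's reasoning, the action strings) is hypothetical and for illustration only, not DeepMind's actual system.

```python
# Rough, hypothetical sketch of the two-layer design described above:
# a multimodal model chooses a destination from a pre-recorded office tour,
# and a separate low-level routine turns that destination into motion
# commands. Class names, methods, and action strings are placeholders.

from dataclasses import dataclass


@dataclass
class TourFrame:
    """One annotated frame from the pre-recorded video tour of the office."""
    frame_id: int
    description: str


class MultimodalPlanner:
    """Stand-in for the vision language model that does the reasoning."""

    def __init__(self, tour: list[TourFrame]):
        self.tour = tour

    def pick_goal(self, command: str) -> TourFrame:
        # The real system would hand the command and the tour video to the
        # model, which supplies the common-sense step (a "place to write"
        # means the whiteboard). This stub hardcodes that one association
        # so the example runs without any model behind it.
        if "write" in command.lower():
            return next(f for f in self.tour if "whiteboard" in f.description)
        return self.tour[0]


class LowLevelController:
    """Stand-in for the algorithm that generates specific robot actions."""

    def navigate_to(self, goal: TourFrame) -> list[str]:
        # A real controller would plan from live camera input; the stub
        # returns a symbolic action sequence for the chosen goal.
        return ["rotate_toward_goal", "drive_forward", f"stop_at_frame_{goal.frame_id}"]


if __name__ == "__main__":
    tour = [
        TourFrame(0, "lobby entrance with couches"),
        TourFrame(1, "clean whiteboard near the east meeting rooms"),
        TourFrame(2, "kitchen with a fridge full of drinks"),
    ]
    planner = MultimodalPlanner(tour)
    controller = LowLevelController()

    goal = planner.pick_goal("Find me a place to write")
    print("Goal:", goal.description)
    print("Actions:", controller.navigate_to(goal))
```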
When Gemini was unveiled in December, Demis Hassabis, CEO of Google DeepMind, told WIRED that its multimodal capabilities would likely unlock new robotic abilities. He added that the company’s researchers were hard at work testing the model’s robotic potential.
In a new paper describing the project, the researchers behind the work say their robot proved up to 90 percent reliable at navigating, even when given tricky commands like “Where did I leave my coaster?” DeepMind’s system “has significantly improved the naturalness of human-robot interaction and greatly increased the robot’s usability,” the team writes.
The demo clearly illustrates the potential for large language models to reach out into the physical world and do useful work. Gemini and other chatbots mostly operate within the confines of a web browser or app, though they are increasingly capable of handling visual and auditory input, as Google and OpenAI have recently demonstrated. In May, Hassabis showed off an upgraded version of Gemini capable of interpreting the layout of an office as seen through a smartphone camera.
Academic and industrial research labs are racing to see how language models could be used to improve robots’ abilities. The program for the International Conference on Robotics and Automation, a popular event for robotics researchers, lists nearly two dozen papers that involve the use of vision language models.
Investors are pouring money into startups looking to apply AI advances to robotics. Several of the researchers involved in the Google project have since left the company to found a startup called Physical Intelligence, which received $70 million in seed funding and is working to combine large language models with real-world training to give robots general problem-solving skills. Skild AI, founded by robotics researchers at Carnegie Mellon University, has a similar goal. This month it announced $300 million in funding.
Just a few years ago, a robot needed a map of its environment and carefully selected commands to navigate successfully. Large language models contain useful information about the physical world, and newer versions that are trained on images and videos in addition to text — known as vision language models — can answer questions that require perception. Gemini allows Google’s robot to analyze visual instructions in addition to spoken ones, following a sketch on a whiteboard that shows a route to a new destination.
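In practice, a visual instruction like that whiteboard sketch reaches the model the same way text does, as one more input alongside the request. The snippet below is a minimal, hypothetical sketch of packaging such a query; the class and method names are placeholders, not Gemini's real interface.

```python
# Hypothetical sketch of pairing an image (a photo of the whiteboard route
# sketch) with a spoken request for a vision language model.

from dataclasses import dataclass


@dataclass
class MultimodalQuery:
    image_path: str  # e.g. a photo of the sketched route on the whiteboard
    question: str    # the accompanying spoken or typed request


class VisionLanguageModel:
    """Stand-in for a model trained on images and video as well as text."""

    def answer(self, query: MultimodalQuery) -> str:
        # A real model would read the sketched route out of the image and
        # return a navigation goal; the stub just echoes its inputs.
        return f"goal inferred from {query.image_path} and request '{query.question}'"


if __name__ == "__main__":
    vlm = VisionLanguageModel()
    query = MultimodalQuery("whiteboard_route.jpg", "Take me to the new desk shown here")
    print(vlm.answer(query))
```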
In their paper, the researchers say they plan to test the system on different types of robots. They add that Gemini should be able to make sense of more complex questions, such as “Do you have my favorite drink today?” asked by a user with a bunch of empty Coke cans on their desk.