While the tech industry has been pushing ahead with generative artificial intelligence, one giant has held back: Apple. The company has yet to introduce an AI-generated emoji, and according to a New York Times report today and previous reporting from Bloomberg, it is in preliminary discussions with Google about adding the search company’s Gemini AI model to iPhones.
Still, a research paper quietly posted online last Friday by Apple engineers suggests that the company is making significant new investments in AI that are already paying off. It describes the development of a new generative AI model called MM1 that can work with text and images. The researchers show it answering questions about photos and displaying the kind of general knowledge skills shown by chatbots like ChatGPT. The model’s name is not explained but could stand for MultiModal 1.
MM1 appears to be similar in design and sophistication to a variety of recent AI models from other tech giants, including Meta’s open source Llama 2 and Google’s Gemini. Work by Apple’s rivals and academics suggests that these types of models could be used to power capable chatbots or build “agents” that can solve tasks by writing code and performing actions such as using computer interfaces or websites. That suggests MM1 could yet find its way into Apple’s products.
“The fact that they are doing this shows that they have the ability to understand how to train and how to build these models,” says Ruslan Salakhutdinov, a professor at Carnegie Mellon who led AI research at Apple several years ago. “It requires a certain expertise.”
MM1 is a multimodal large language model, or MLLM, meaning it is trained on both images and text. This allows the model to respond to text prompts and also answer complex questions about particular images.
An example in Apple’s research paper shows what happened when MM1 was presented with a photo of a sunlit restaurant table with a couple of beer bottles, along with an image of the menu. When asked how much someone would expect to pay for “all the beer on the table,” the model correctly reads off the prices and adds up the cost.
When ChatGPT launched in November 2022, it could only ingest and generate text, but more recently its creator, OpenAI, and others have been working to extend the underlying large language model technology to work with other types of data. When Google launched Gemini (the model that now powers its answer to ChatGPT) last December, the company touted its multimodal nature as the start of a major new direction in AI. “After the rise of LLMs, MLLMs are emerging as the next frontier in foundation models,” Apple’s paper says.
MM1 is a relatively small model, as measured by the number of “parameters,” or the internal variables that are adjusted as a model is trained. Kate Saenko, a professor at Boston University who specializes in computer vision and machine learning, says this could make it easier for Apple engineers to experiment with different training methods and refinements before scaling up when they hit on something promising.
Saenko says the MM1 paper provides a surprising amount of detail, for a corporate publication, about how the model was trained. For example, the engineers behind MM1 describe tricks for improving the model’s performance, including increasing the resolution of images and mixing text and image data. Apple is known for its secrecy, but it has previously shown unusual openness about AI research as it has sought to attract the talent needed to compete in this crucial technology.