Last month, Google's GameNGen AI model showed that generalized image diffusion techniques can be used to generate a passable, playable version of Doom. Now, researchers are using some similar techniques with a model called MarioVGG to see if an AI can generate plausible video of Super Mario Bros. in response to user inputs.
The results of the MarioVGG model (available as a preprint paper published by the crypto-adjacent AI company Virtuals Protocol) still show many glaring flaws, and the model is far too slow for anything approaching real-time gameplay. But the results show how even a limited model can infer some impressive physics and gameplay dynamics just by studying a bit of video and input data.
The researchers hope this represents a first step toward “producing and demonstrating a reliable and controllable video game generator” or possibly even “completely replacing game development and game engines using video generation models” in the future.
Watching 737,000 frames of Mario
To train their model, the MarioVGG researchers (GitHub users Ernie Chew and Brian Lim are listed as contributors) started with a public dataset of Super Mario Bros. gameplay containing 280 "levels" worth of input and image data arranged for machine-learning purposes (Level 1-1 was removed from the training data so that images from it could be used in evaluation). The 737,000-plus individual frames in that dataset were preprocessed into 35-frame chunks so the model could begin to learn what the immediate results of various inputs generally looked like.
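In practice, that preprocessing step amounts to slicing one long recording into fixed-length windows of frames paired with the controller inputs that produced them. Here is a minimal Python sketch of the idea; the array shapes, the non-overlapping stride, and the function name are illustrative assumptions, not details from the paper:

```python
import numpy as np

def chunk_gameplay(frames: np.ndarray, actions: np.ndarray, window: int = 35):
    """Slice a long gameplay recording into fixed-length training windows.

    frames:  (N, H, W, C) array of consecutive gameplay frames
    actions: length-N sequence of per-frame controller inputs
    """
    clips = []
    # Step through the recording in non-overlapping 35-frame windows.
    for start in range(0, len(frames) - window + 1, window):
        clips.append((frames[start:start + window],
                      actions[start:start + window]))
    return clips
```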
To “simplify the game situation,” the researchers decided to focus on just two possible inputs from the dataset: “run right” and “run right and jump.” Even this limited set of movements presented some difficulties for the machine-learning system, though, since the preprocessor had to look back a few frames before each jump to figure out if and when the “run” started. Any jumps that included mid-air adjustments (i.e., pressing the “left” button) also had to be discarded because “this would introduce noise into the training dataset,” the researchers write.
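The paper doesn't spell this filtering out at code level, but a rough sketch of that kind of lookback-and-reject pass might look like the following, where each frame's action is assumed to be a set of pressed buttons and the four-frame lookback window is a guess:

```python
def label_clip(clip_actions, lookback: int = 4):
    """Map a clip's per-frame button sets to one of two text labels,
    or return None to drop the clip from the training set."""
    # Reject clips with mid-air adjustments (any "left" press).
    if any("left" in buttons for buttons in clip_actions):
        return None
    jump_frames = [i for i, b in enumerate(clip_actions) if "jump" in b]
    if not jump_frames:
        return "running"
    # Look back a few frames before the first jump to confirm the
    # character was already moving right when the jump began.
    first = jump_frames[0]
    pre = clip_actions[max(0, first - lookback):first]
    if pre and all("right" in buttons for buttons in pre):
        return "jumping"
    return None  # ambiguous start; treat as noise and discard
```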
After preprocessing (and about 48 hours of training on a single RTX 4090 graphics card), the researchers used a standard convolution and denoising process to generate new video frames from an initial static gameplay image and a text input (either “running” or “jumping” in this limited case). While these generated sequences last only a few frames each, the last frame of one sequence can be used as the first frame of the next, allowing gameplay videos of arbitrary length that still show “coherent and consistent gameplay,” according to the researchers.
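That chaining trick is simple to picture in code. In the sketch below, model.generate is a hypothetical stand-in for the actual inference call; the key idea is just that each new clip is seeded with the previous clip's final frame:

```python
def generate_long_video(model, first_frame, action: str, num_clips: int):
    """Stitch short generated clips into an arbitrarily long video by
    seeding each clip with the last frame of the previous one."""
    video = [first_frame]
    seed = first_frame
    for _ in range(num_clips):
        clip = model.generate(start_frame=seed, text=action)  # hypothetical API
        video.extend(clip[1:])  # drop the seed frame to avoid duplicates
        seed = clip[-1]         # last frame becomes the next clip's seed
    return video
```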
Super Mario 0.5
Even with all this setup, MarioVGG doesn’t exactly output smooth video that’s indistinguishable from a real NES game. For efficiency, the researchers downscaled the NES’s output frames from 256×240 to a much blurrier 64×48. They also condensed 35 frames’ worth of video time into just seven generated frames spread out “at even intervals,” creating “gameplay” video that looks much rougher than the real game’s output.
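That downscaling and frame-thinning is a straightforward transformation. Here is one plausible way to express it in Python with Pillow, assuming 8-bit RGB frames; the bilinear resampling filter is a guess rather than a documented choice:

```python
import numpy as np
from PIL import Image

def shrink_clip(frames: np.ndarray, size=(64, 48), keep: int = 7) -> np.ndarray:
    """Reduce a 35-frame, 256x240 clip to `keep` low-resolution frames
    sampled at even intervals across the clip."""
    idx = np.linspace(0, len(frames) - 1, keep).round().astype(int)
    small = [np.asarray(Image.fromarray(frames[i]).resize(size, Image.BILINEAR))
             for i in idx]
    return np.stack(small)  # shape: (keep, 48, 64, C); PIL sizes are (w, h)
```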
Even with those concessions, the MarioVGG model still comes nowhere near real-time video generation. The single RTX 4090 the researchers used took a full six seconds to generate a six-frame video sequence, which amounts to just over half a second of video even at the model’s extremely limited frame rate. The researchers admit this is “not practical or friendly for interactive video games,” but they hope that future optimizations such as weight quantization (and perhaps the use of more computational resources) could improve the speed.
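For a rough sense of that gap, here is the back-of-the-envelope arithmetic, assuming the NES's usual 60 fps output and that each generated clip stands in for 35 source frames (numbers taken from elsewhere in this piece rather than from a stated benchmark):

```python
source_fps = 60        # assumed NES output frame rate
frames_spanned = 35    # source frames each generated clip represents
gen_seconds = 6.0      # reported time to generate one clip on an RTX 4090

clip_duration = frames_spanned / source_fps  # ~0.58 s of gameplay per clip
slowdown = gen_seconds / clip_duration       # ~10x slower than real time
print(f"{clip_duration:.2f} s of video per {gen_seconds:.0f} s of compute "
      f"(~{slowdown:.0f}x slower than real time)")
```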
Even given those limits, MarioVGG can create some pretty believable videos of Mario running and jumping from a static starting image, similar to Google’s Genie game creator. The model was even able to “learn the game’s physics solely from video frames in the training data without any explicit rules encoded,” the researchers write. That includes inferring behaviors such as Mario falling when he runs off the edge of a cliff (with believable gravity) and Mario’s forward motion (usually) stopping when he’s adjacent to an obstacle.
While MarioVGG focused on simulating Mario’s movements, the researchers found that the system could effectively hallucinate new obstacles for Mario as the video scrolled through an imaginary level. These obstacles “are consistent with the game’s graphical language,” the researchers write, but they can’t currently be influenced by user input (e.g., putting a pit in front of Mario and having him jump over it).
Just make it up
Like all probabilistic AI models, though, MarioVGG has a frustrating tendency to sometimes produce completely useless results. Sometimes that means simply ignoring user input prompts (“we observed that the input action text is not obeyed all the time,” the researchers write). Other times, it means hallucinating obvious visual glitches: Mario sometimes lands inside obstacles, runs through obstacles and enemies, flashes different colors, shrinks or grows from frame to frame, or disappears completely for several frames before reappearing.
One particularly absurd video shared by the researchers shows Mario falling through a bridge, turning into a Cheep Cheep, then flying back up across the bridge and transforming back into Mario. That’s the kind of thing we’d expect to see from a Wonder Flower, not from an AI-generated video of the original Super Mario Bros.
The researchers speculate that training for longer with “more diverse gameplay data” could help address these significant issues and help their model simulate more than just running and jumping inexorably to the right. Still, MarioVGG is a fun demonstration that even with limited training data and algorithms, some decent initial models of basic games can be created.
This story originally appeared on Ars Technica.