The last two weeks before the deadline were hectic. Although some of the team officially still had desks in Building 1945, they mostly worked in 1965 because it had a better espresso machine in the micro-kitchen. “People weren’t sleeping,” says Gomez, who as an intern was in a constant debugging frenzy and also produced the visualizations and diagrams for the paper. In projects like this, it is common to perform ablations: removing pieces to see whether what remains is enough to get the job done.
“There was every possible combination of tricks and modules: which one helps, which one doesn’t. Let’s take it out. Let’s replace it with this,” says Gomez. “Why is the model behaving in this counterintuitive way? Oh, it’s because we forgot to do the masking properly. Does it still work? Okay, move on to the next one. All of these components of what we now call the transformer were the output of this extremely rapid, iterative trial and error.” The ablations, aided by Shazeer’s implementations, produced “something minimalistic,” says Jones. “Noam is a wizard.”
Vaswani remembers collapsing on an office couch one evening while the team was writing the paper. As he stared at the curtains that separated the couch from the rest of the room, he was struck by the pattern on the fabric, which to him looked like synapses and neurons. Gomez was there and Vaswani told him that what they were working on would transcend machine translation. “Ultimately, just like with the human brain, you have to unify all these modalities – speech, audio, vision – under a single architecture,” he says. “I had a strong suspicion that we were onto something more general.”
However, in the upper echelons of Google, the work was seen as just another interesting AI project. I asked several of the transformer folks whether their bosses had ever summoned them for updates on the project. Not so much. But “we understood that this was potentially a pretty big deal,” says Uszkoreit. “And it caused us to become obsessed, toward the end, with one of the sentences in the paper where we comment on future work.”
That sentence anticipated what might come next: the application of transformer models to virtually all forms of human expression. “We are excited about the future of attention-based models,” they wrote. “We plan to extend the transformer to problems involving input and output modalities other than text” and to explore “images, audio, and video.”
A few nights before the deadline, Uszkoreit realized they needed a title. Jones noted that the team had landed on a radical rejection of accepted best practices, most notably LSTMs, in favor of one technique: attention. Jones remembered that the Beatles had titled a song “All You Need Is Love.” Why not call the paper “Attention Is All You Need”?
The Beatles?
“I’m British,” says Jones. “It literally took five seconds of thinking. I didn’t think they would use it.”
They kept collecting results from their experiments right up until the deadline. “The English-to-French figures came in about five minutes before we submitted the paper,” says Parmar. “I was sitting in the micro-kitchen in 1965, and I got in that last number.” With barely two minutes to spare, they sent off the paper.