The world needs something better than the Transformer. I believe everyone here hopes it can be replaced by something that takes us to a new level of performance. This article shares the conversation between Nvidia CEO Jensen Huang and the Transformer authors about the future of language models. It is sourced from Tencent Technology and was compiled and translated by DeepTech.
Table of Contents:
Highlights from the Conversation
Transcript
Introducing the Transformer Authors
Why They Created the Transformer
Problems the Transformer Set Out to Solve
Why They Founded Their Companies
In 2017, a groundbreaking paper titled “Attention Is All You Need” introduced the Transformer, a model built on the self-attention mechanism. This architecture broke free from the limitations of traditional RNNs and CNNs, effectively handling long-range dependencies and dramatically speeding up the processing of sequence data. The Transformer’s encoder-decoder structure and multi-head attention sparked a storm in artificial intelligence, and the popular ChatGPT is built on this architecture.
Imagine the Transformer model as your brain during a conversation with a friend: it can attend to every word your friend says at once and understand the connections between those words. It gives computers a language-understanding ability similar to that of humans. Before this, RNNs were the mainstream approach to language processing, but they processed information slowly, like an old tape player that has to play word by word. The Transformer, by contrast, is like an efficient DJ who can manipulate multiple tracks at once and quickly pick out the key information.

The emergence of the Transformer greatly enhanced computers’ ability to process language, making tasks such as machine translation, speech recognition, and text summarization more efficient and accurate. This was a huge leap for the entire industry.
This breakthrough was the joint work of eight AI scientists then at Google. Their initial goal was simple: to improve Google’s machine translation service. They wanted machines to read and understand entire sentences rather than translating word by word in isolation. That idea became the starting point of the Transformer architecture: the self-attention mechanism. Building on it, the eight authors, each contributing their own expertise, published “Attention Is All You Need” in December 2017, detailing the Transformer architecture and opening a new chapter in generative AI.
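The self-attention idea described above can be sketched in a few lines. This is a minimal illustration in plain NumPy and an assumption of mine, not code from the paper: a real Transformer layer adds learned query/key/value projections, multiple heads, masking, and residual connections.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative only).
import numpy as np

def self_attention(X):
    """X: (seq_len, d) matrix of token embeddings."""
    d = X.shape[1]
    # Queries, keys, and values all come directly from X here;
    # in practice each is produced by its own learned linear projection.
    scores = X @ X.T / np.sqrt(d)                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                              # weighted mix of all tokens

X = np.random.randn(4, 8)   # 4 tokens, 8-dimensional embeddings
out = self_attention(X)
print(out.shape)            # every token's output mixes in every other token
```

The key property, matching the "DJ with multiple tracks" analogy above, is that every position attends to every other position in one parallel step, rather than consuming the sequence word by word.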
In the world of generative AI, the scaling law is the core principle. In simple terms, as the scale of the Transformer model expands, its performance also improves. However, this also means that more powerful computing resources are needed to support larger models and deeper networks. NVIDIA, which provides efficient computing services, has become a key player in this AI wave.
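The scaling law mentioned above is often summarized as a power law relating model size to loss. The sketch below is purely illustrative: the constants `a` and `alpha` are made up, not values from any published fit.

```python
# Illustrative power-law scaling curve: loss falls as model size grows.
# The form L(N) = a * N**(-alpha) mirrors how scaling laws are usually
# written, but the constants here are invented for illustration.
a, alpha = 10.0, 0.07

def loss(n_params):
    return a * n_params ** (-alpha)

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss {loss(n):.3f}")

# Each 10x increase in parameters multiplies the loss by 10**(-alpha),
# roughly 0.85 here: steady but diminishing returns that demand ever
# larger models and ever more compute.
```

This diminishing-returns shape is exactly why bigger models keep helping yet require the kind of compute NVIDIA supplies.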
At this year’s GTC conference, NVIDIA’s Jensen Huang invited the Transformer authors (seven attended; Niki Parmar was unable to come due to special circumstances) to a roundtable discussion, the first time the authors had appeared together in public.
The world needs something better than the Transformer. I believe everyone here hopes it can be replaced by something that takes us to a new level of performance.
We did not fully achieve our original goal. We started the Transformer hoping to model the evolution of tokens: not just a linear generation process, but a step-by-step evolution of text or code.

For a simple question like 2+2, today’s systems may still spend the compute of a large model. I think adaptive computation is one of the things that must happen next: we should know how much compute to spend on a given problem.

I think the current models are still too cheap, and still too small, at a price of about one million tokens per dollar, which is roughly 100 times cheaper than buying a paperback book.
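As a rough sanity check on the paperback comparison above, here is some back-of-the-envelope arithmetic. The book price and token count are assumptions of mine, not figures from the talk.

```python
# Back-of-the-envelope check of the "100x cheaper than a paperback" claim.
model_cost_per_million_tokens = 1.0   # $1 per 1M tokens, as quoted above

paperback_price = 15.0                # assumed typical paperback price ($)
paperback_tokens = 150_000            # assumed ~100k words ~= 150k tokens
book_cost_per_million = paperback_price / paperback_tokens * 1_000_000

ratio = book_cost_per_million / model_cost_per_million_tokens
print(f"book text costs ${book_cost_per_million:.0f} per 1M tokens, "
      f"about {ratio:.0f}x the model price")
```

Under these assumed numbers the ratio does land near 100x, which is consistent with the quote.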
Jensen Huang: In the past sixty years, computer technology seems not to have changed fundamentally, at least not since the moment I was born. The computer systems we use today, whether in multitasking, the separation of hardware and software, software compatibility, data backup, or the software engineering skills of programmers, are still basically built on the design principles of the IBM System/360: central processors, I/O subsystems, multitasking, hardware-software separation, software compatibility, and so on.
I believe that since 1964 there has been no fundamental change in modern computing. In the 1980s and 1990s, computers underwent a major transformation into the forms we are familiar with today, and over time their marginal cost kept falling: roughly ten times every five years, a thousand times every fifteen years, ten thousand times every twenty years. In this computer revolution, the magnitude of cost reduction was so great that over twenty years the cost of computing fell by nearly ten thousand times. This change gave society tremendous momentum.
Imagine if everything expensive in your life dropped to one ten-thousandth of its original price: a car that cost $200,000 twenty years ago would now cost just $20. Can you imagine that change? But the fall in computing costs did not happen overnight; it gradually reached a tipping point, after which the decline suddenly slowed. Costs still improve a little each year, but the dramatic rate of change has stagnated.
We started exploring accelerated computing, but using it is not easy. You have to design for it from scratch, step by step. In the past we solved problems sequentially, following established procedures; now we need to redesign those steps. This is a whole new discipline: recasting sequential algorithms as parallel ones.
We realized this and believed that if we could accelerate the 1% of code that accounts for 99% of the execution time, there would surely be applications that benefit. Our goal is to turn the impossible into the possible, or to make the possible more efficient. That is the meaning of accelerated computing.
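The "1% of code, 99% of execution time" observation above can be made concrete with Amdahl's law. This small sketch computes the end-to-end speedup; the 100x acceleration factor is an assumed, illustrative figure, not one quoted in the talk.

```python
# Amdahl's law: overall speedup when a fraction of runtime is accelerated.
def overall_speedup(time_fraction, factor):
    """Speedup when `time_fraction` of runtime is accelerated by `factor`."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / factor)

# If 1% of the code accounts for 99% of execution time, then accelerating
# just that code 100x (an assumed factor) speeds the whole program up ~50x.
print(f"{overall_speedup(0.99, 100):.1f}x")
```

The asymmetry is the whole point: accelerating a tiny slice of the codebase transforms the program, because that slice dominates the runtime.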
Looking back at the history of the company, we found that we have the ability to accelerate various applications. Initially, we achieved significant acceleration in the gaming field, so much so that people mistakenly thought we were a gaming company. But in fact, our goals go far beyond that because this market is huge, large enough to drive incredible technological advancements. This situation is rare, but we found such an exception.
To make a long story short, in 2012, AlexNet ignited the spark, marking the first collision between artificial intelligence and NVIDIA GPUs. This marked the beginning of our magical journey in this field. A few years later, we found a perfect application scenario that laid the foundation for our development today.
In short, these achievements laid the foundation for the development of generative AI. Generative AI can not only recognize images but also convert text into images and even create new content. Now we have enough technical capabilities to understand pixels, identify them, and understand the meaning behind them. Through this meaning, we can create new content. Artificial intelligence’s ability to understand the meaning behind data is a huge revolution.
We have reason to believe that this is the beginning of a new industrial revolution. In this revolution, we are creating something unprecedented. For example, in previous industrial revolutions, water was the source of energy. Water entered the devices we created, and the generator started working. Water in, electricity out, like magic.
Generative AI is a new form of “software” that can also create software. It relies on the collective efforts of many scientists. Imagine giving AI raw materials – data, and they enter a “building” – the machine we call GPU, and it outputs magical results. It is reshaping everything, and we are witnessing the birth of an “AI factory.”
This change can be called the beginning of a new industrial revolution. We have never truly experienced such a change before, but now it is slowly unfolding before us. Don’t miss the next decade because in this decade, we will create tremendous productivity. The pendulum of time has started, and our researchers have begun to take action.
Today, we invited the creators of Transformer to discuss where generative AI will lead us in the future.
Ashish Vaswani: Joined Google Brain team in 2016. Co-founded Adept AI with Niki Parmar in April 2022, left the company in December of the same year, and co-founded another AI startup called Essential AI.
Niki Parmar: Worked at Google Brain for four years and co-founded Adept AI and Essential AI with Ashish Vaswani.
Jakob Uszkoreit: Worked at Google from 2008 to 2021. Left Google in 2021 and co-founded Inceptive with others. Inceptive is a company focused on AI life sciences and is dedicated to designing the next generation of RNA molecules using neural networks and high-throughput experiments.
Illia Polosukhin: Joined Google in 2014 and was one of the first to leave the eight-person team. In 2017, he co-founded the blockchain company NEAR Protocol with others.
Noam Shazeer: Worked at Google from 2000 to 2009 and from 2012 to 2021. In 2021, Shazeer left Google and co-founded Character.AI with former Google engineer Daniel De Freitas.
Llion Jones: Worked at Delcam and YouTube. Joined Google in 2012 as a software engineer. Later, he left Google and founded an AI startup called sakana.ai.
Lukasz Kaiser: Former researcher at the French National Center for Scientific Research. Joined Google in 2013. In 2021, he left Google and became a researcher at OpenAI.
Aidan Gomez: Graduated from the University of Toronto, and at the time of the publication of the Transformer paper, he was an intern at Google Brain. He was the second person to leave Google among the eight-person team. In 2019, he co-founded Cohere with others.
Jensen Huang: Today, please seize the opportunity to speak up. No topic is off limits here; you can even jump out of your chairs to argue a point. Let’s start with the most basic question: what problems did you run into at the time, and what inspired you to create the Transformer?
Illia Polosukhin: If you want to release a model that can truly read search results, such as processing piles of documents, you need a model that can process that information quickly. The recurrent neural networks (RNNs) of the time could not meet that need.

Indeed, although RNNs and some early attention mechanisms existed then and were drawing interest, they still had to read word by word, which was inefficient.
Jakob Uszkoreit: We were generating training data far faster than we could train state-of-the-art architectures on it. In practice we used simpler architectures, such as feed-forward networks with n-gram input features. On Google-scale training data these usually outperformed more complex, more advanced models, simply because they trained faster.

Powerful RNNs, especially long short-term memory (LSTM) networks, already existed at the time.
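The n-gram feed-forward baseline Uszkoreit mentions can be sketched as a toy linear scorer over bag-of-bigram features. The weights, vocabulary, and sentiment-style task below are invented for illustration; the actual Google systems were far larger.

```python
# Toy sketch of a linear model over bag-of-n-gram features: fast to train
# because scoring is just a sparse weighted sum, with no recurrence.
from collections import Counter

def ngram_features(tokens, n=2):
    """Count the n-grams (here bigrams) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical hand-set weights for a couple of bigrams.
weights = {("very", "good"): 2.0, ("very", "bad"): -2.0}

def score(sentence):
    feats = ngram_features(sentence.lower().split())
    return sum(weights.get(ng, 0.0) * count for ng, count in feats.items())

print(score("This movie was very good"))   # positive bigram fires: 2.0
print(score("This movie was very bad"))    # negative bigram fires: -2.0
```

The appeal at scale is clear from the code: each example touches only its own few n-grams, so training and inference are cheap, at the cost of ignoring any context longer than n words.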
Noam Shazeer: It seems that this was an urgent problem to be solved. Around 2015, we began to notice these scaling laws. You can see that as the model size increases, its intelligence also improves. This is like the best problem in world history, and it is no