Speaker: Ilya Sutskever
Event: NeurIPS 2024
Ilya Sutskever:
I want to thank the organizers for choosing our paper for this award. It was very nice, and I also want to thank my incredible co-authors and collaborators: Oriol Vinyals and Quoc Le, who stood right before you a moment ago.
What you see here is a screenshot from a similar talk 10 years ago, at NeurIPS 2014 in Montreal. It was a much more innocent time. Here we are in the photos: this is the “before,” and here’s the “after,” by the way. Now we’re more experienced and, hopefully, wiser.
I’d like to talk a little bit about the work itself and maybe give a 10-year retrospective on it, because a lot of the things in this work were correct, but some not so much. We can review them and see what happened and how it gently flowed to where we are today.
Let’s begin by talking about what we did. The way we’ll do it is by showing slides from the same talk 10 years ago. The summary of what we did is the following three bullet points:
- It’s an autoregressive model trained on text.
- It’s a large neural network.
- It’s a large dataset.
That’s it. Now let’s dive into the details a little bit more.
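For concreteness, here is a minimal sketch of that three-bullet recipe: an autoregressive network trained to predict the next token of text. Everything below is an illustrative toy (PyTorch assumed; the model, names, and sizes are hypothetical and not the original 2014 LSTM translation setup).

```python
# Toy version of the recipe: autoregressive model, neural network, text dataset.
import torch
import torch.nn as nn

class TinyAutoregressiveLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)  # logits for the next token at every position

model = TinyAutoregressiveLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake "large dataset": random token ids standing in for a text corpus.
batch = torch.randint(0, 1000, (8, 33))
inputs, targets = batch[:, :-1], batch[:, 1:]  # predict token t+1 from tokens up to t

opt.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
opt.step()
```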
This was a slide from 10 years ago. Not too bad: the “Deep Learning Hypothesis.” What we said here is that if you have a large neural network with 10 layers, it can do anything that a human being can do in a fraction of a second. Why did we emphasize things human beings can do in a fraction of a second? Why this specifically?
Well, if you believe the deep learning dogma, that artificial neurons and biological neurons are similar, or at least not too different, and you believe that real neurons are slow, then anything a human being can do in a fraction of a second (and by human beings I mean even just one human in the entire world) can only involve a short chain of neuron firings. So you could take those connections and embed them inside your artificial neural net, and a big 10-layer network could do the task too. This was the motivation: anything that a human being can do in a fraction of a second, a big 10-layer neural network can do too. We focused on 10-layer networks because that’s what we knew how to train back in the day. If we could go beyond 10 layers somehow, then we could do more. But back then we only knew how to do 10 layers, which is why we emphasized human capabilities over that time scale.
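To make the timing argument concrete: the numbers below are rough textbook assumptions, not figures from the talk, but they show why “a fraction of a second” points at a depth of roughly ten.

\[
\underbrace{\sim 100\ \text{Hz}}_{\text{neuron firing rate}} \times \underbrace{0.1\ \text{s}}_{\text{a fraction of a second}} \;\approx\; 10\ \text{sequential firing steps} \;\approx\; \text{a 10-layer network.}
\]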
A different slide from the talk—one that says “Our Main Idea.” You may be able to recognize two things, or at least one thing: you might recognize that something autoregressive is going on here. What does this slide say? It says that if you have an autoregressive model, and it predicts the next token well enough, then it will, in fact, capture the correct distribution over sequences that come next. This was relatively new. It wasn’t literally the first ever autoregressive neural network, but I would argue it was the first one where we truly believed that if you train it really well, you get whatever you want. In our case back then, what we wanted was the humble—today humble, then incredibly audacious—task of translation.
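What that slide amounts to is the chain rule over tokens. For translation, with a source sentence \(x\) and target tokens \(y_1, \dots, y_T\):

\[
p(y_1, \dots, y_T \mid x) \;=\; \prod_{t=1}^{T} p\bigl(y_t \mid y_1, \dots, y_{t-1},\, x\bigr),
\]

so if each next-token factor is modeled well enough, the product is the correct distribution over whole output sequences, and training is just maximizing its log-likelihood.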
Now I’m going to show you some ancient history that many of you might have never seen before: the LSTM. To those unfamiliar, an LSTM is the thing poor deep learning researchers used before Transformers. It’s basically a ResNet rotated by 90 degrees. You can see there’s an integrator, which is now called the residual stream, but there’s also some multiplication going on, so it’s a slightly more complicated ResNet. That’s what we used.
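A minimal numpy sketch of one LSTM step, to show the additive cell-state update (the “integrator,” today’s residual stream) and the extra multiplicative gates; the shapes and initialization are purely illustrative.

```python
# One LSTM step in plain numpy: the cell state c is updated additively
# (residual-stream-like) but modulated by multiplicative gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """x: input vector; h, c: previous hidden and cell state;
    W: weights of shape (4*d, d_in + d); b: bias of shape (4*d,)."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate update
    c_new = f * c + i * g        # additive "integrator" update of the cell state
    h_new = o * np.tanh(c_new)   # gated read-out of the cell state
    return h_new, c_new

# Illustrative sizes only.
d_in, d = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * d, d_in + d)) * 0.1
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
```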
Another cool feature from that old talk that I want to highlight is that we used parallelization, but not just any parallelization—we used pipelining, as witnessed by this “one layer per GPU.” Was it wise to pipeline? As we now know, pipelining is not wise. But we were not as wise back then, so we used that and we got a 3.5x speedup using eight GPUs.
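A toy schedule for that “one layer per GPU” style of pipelining, with hypothetical stage and micro-batch counts; it only prints which stage would be busy with which micro-batch at each step, which is enough to see the pipeline bubble that keeps the speedup well below a linear 8x.

```python
# Toy pipeline-parallel schedule: each "GPU" owns one layer (stage), and
# micro-batches flow through the stages one step at a time. No real GPUs here.
NUM_STAGES = 8        # one layer per GPU, as in the old slide
NUM_MICROBATCHES = 4  # hypothetical

steps = NUM_MICROBATCHES + NUM_STAGES - 1  # time steps until the pipeline drains
for t in range(steps):
    busy = []
    for stage in range(NUM_STAGES):
        mb = t - stage                     # micro-batch currently at this stage
        if 0 <= mb < NUM_MICROBATCHES:
            busy.append(f"gpu{stage}:mb{mb}")
    print(f"step {t:2d}: " + ", ".join(busy))

# Only 4*8 = 32 of the 8*11 = 88 stage-steps do useful work, so utilization is
# far below 100%: one reason real pipelines land well under linear scaling
# (the talk quotes roughly 3.5x on 8 GPUs).
```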
The conclusion slide from that talk, in some sense, is the most important slide, because it spelled out what could arguably be the beginning of the scaling hypothesis. It stated that if you have a very big dataset and you train a very big neural network, then success is guaranteed. One can argue, if one is charitable, that this indeed has been what’s been happening.
I want to mention one other idea, and this is what I claim truly stood the test of time: the core idea of deep learning itself, connectionism. It’s the idea that if you allow yourself to believe that an artificial neuron is kind of, sort of like a biological neuron, it gives you the confidence to believe that very large neural networks, which don’t need to be literally human brain-scale and might be a little smaller, could be configured to do pretty much all the things that we human beings can do.
There’s still a difference. The human brain also figures out how to reconfigure itself, whereas we are using the best learning algorithms we have, which require as many data points as there are parameters. Human beings are still better in this regard. But this led, I claim, to the age of pre-training. The age of pre-training is what we might say started with the GPT-2 model, the GPT-3 model, and the scaling laws. I want to specifically call out my former collaborators Alec Radford, Jared Kaplan, and Dario Amodei for really making this work. Pre-training extra-large neural networks on huge datasets is what has been driving all the progress we see today.
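The scaling laws he credits here are, roughly, power laws relating test loss to parameters, data, and compute. The form below is the commonly cited one from Kaplan et al. (2020); the constants and exponents are empirical fits and are not given in the talk.

\[
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D},
\qquad
L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C},
\]

where \(L\) is test loss, \(N\) is parameter count, \(D\) is dataset size, \(C\) is compute, and the \(\alpha\)’s are small positive exponents.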
But pre-training, as we know it, will unquestionably end. Why will it end? Because while computation keeps growing, through better hardware, better algorithms, and larger clusters, the data is not growing. We have but one internet. You could even say data is the fossil fuel of AI: it was created somehow, and now we use it. We’ve achieved peak data and there will be no more. We have to deal with the data that we have. It will still let us go quite far, but there’s only one internet.
Here I’ll take a bit of liberty to speculate about what comes next. Actually, I don’t need to speculate too much, because many people are already speculating. You may have heard the phrase “agents.” I’m sure that eventually something will happen there; people feel that agents are the future. More concretely, but also a bit vaguely, synthetic data. What does “synthetic data” even mean? Figuring this out is a big challenge, and different people are making all kinds of interesting progress there. Another idea is inference-time compute, which we’ve seen most recently with the o1 model. These are all examples of people trying to figure out what to do after pre-training, and they are all very good directions.
I want to mention one other example, from biology, which I think is really cool. Many years ago at this conference, I saw a talk where someone presented a graph showing the relationship between the body mass of a mammal and its brain mass. That talk highlighted that in biology everything is messy, but here is a rare example of a very tight relationship between body size and brain size. I became curious about this graph, went to Google Images, and found something interesting: for mammals there’s one nice scaling relationship, but when you look at hominids, close relatives of humans in evolution like Neanderthals and Homo habilis, they sit on a line with a different slope, a different brain-to-body scaling exponent. That’s pretty cool. It means biology found a different scaling solution; something clearly different happened.
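On log-log axes that “very tight relationship” is a straight line, i.e. a power law, and a different slope means a different exponent (the notation here is just illustrative):

\[
\log(\text{brain mass}) \;=\; a + b\,\log(\text{body mass})
\quad\Longleftrightarrow\quad
\text{brain mass} \propto (\text{body mass})^{b},
\]

with the hominid line having a different \(b\) than the general mammal line.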
This is relevant because the things we’ve scaled so far in AI are just the first things we figured out. Undoubtedly, everyone working in this field will figure out what to do next.
Now I want to take a few minutes and speculate about the longer term. Where are we headed? We’re making astounding progress. Those of you who were in the field 10 years ago remember just how incapable everything was back then. To see today’s capabilities is unbelievable. If you joined the field in the last two years, you might say, “Of course computers talk back,” but it wasn’t always that way.
I want to talk a bit about superintelligence, because that is obviously where this field is headed. The thing about superintelligence is that it will be qualitatively different from what we have today. My goal is to give you some concrete intuition about how it will be different, so you can reason about it yourself.
Right now, we have these incredible language models and unbelievable chatbots. They can do things, but they’re also strangely unreliable. They get confused while also having dramatically superhuman performance on certain evals. It’s unclear how to reconcile this. But eventually, sooner or later, the following will be achieved: these systems will actually be agentic in real ways. Right now, systems are not agents in any meaningful sense—maybe just barely. They will actually reason. Reasoning is interesting because the more a system reasons, the more unpredictable it becomes. All the deep learning we’ve seen so far is very predictable, but reasoning introduces unpredictability.
As a small analogy, consider that the best chess AIs are unpredictable to the best human chess players. We will have AI systems that are incredibly unpredictable, that will understand things from limited data, and will not get confused. I’m not saying how or when, just that it will happen. When all these things come together, including self-awareness—because why not, self-awareness might be useful—then we’ll have systems of radically different qualities and properties than we have today.
Of course, they will have incredible capabilities, but the issues that come up with such systems are very different from what we’re used to. It’s also impossible to predict the future. All kinds of stuff is possible.
On this uplifting note, I will conclude. Thank you so much.
Q&A Session
Questioner:
Now in 2024, are there other biological structures that are part of human cognition that you think are worth exploring in a similar way or that you’re interested in?
Ilya Sutskever:
If someone has a specific insight about what the brain does that we’re not doing, and that’s something implementable, they should pursue it. Personally, it depends on the level of abstraction. There’s been a lot of desire to make biologically inspired AI. You could argue that on some level, biologically inspired AI is incredibly successful—deep learning is essentially that. But the actual biological inspiration we’ve used is very modest: “Let’s use neurons.” That’s basically it.
More detailed biological inspiration has been hard to come by. But I wouldn’t rule it out. If someone has a special insight, that would be useful.
Questioner:
I have a question about “autocorrect.” You mentioned reasoning as one of the core aspects of future modeling. Today, we analyze hallucinations by some statistical measure. In the future, if a model can reason, will it be able to correct itself—like a grand form of autocorrect—so there won’t be as many hallucinations?
Ilya Sutskever:
Yes. What you described is highly plausible. I wouldn’t rule out that it might already be happening with some of the early reasoning models today. Longer term, why not? I think calling it “autocorrect” is doing it a disservice, though. It’s far grander than autocorrect, but aside from that detail, the answer is yes.
Questioner:
I loved the ending, leaving open questions about whether these AIs replace us, if they are superior, or if they need rights. How do we create the right incentive mechanisms for humanity to ensure these AIs have freedoms like us?
Ilya Sutskever:
I don’t feel confident answering that. It sounds like you’re talking about top-down structures or government. It could be cryptocurrencies, too. I don’t feel like I’m the right person to comment on that. It’s not a bad end result if we have AIs that just want to coexist with us and have rights. Maybe that would be fine. Things are so unpredictable, I hesitate to comment. But I encourage speculation.
Questioner:
Hi, my name is Shalev from the University of Toronto. Do you think large language models generalize multi-hop reasoning out of distribution?
Ilya Sutskever:
This question assumes a yes/no answer, but it’s not so simple. What does out-of-distribution mean? Our standards for generalization have increased a lot. Before deep learning, generalization meant something completely different—just not repeating the exact phrasing in the training data. Now, maybe we worry that a math problem was seen somewhere online. Is that memorization or generalization?
Humans generalize better than today’s models, but these models definitely generalize to some degree. It’s just not as good as human generalization. I hope this somewhat equivocal answer is helpful.
Moderator:
Unfortunately, we’re out of time for this session. We could go on for another six hours, but thank you so much, Ilya, for the talk.
Ilya Sutskever:
Thank you.