Lex Fridman Podcast - #306 – Oriol Vinyals: Deep Learning and Artificial General Intelligence
Episode Date: July 26, 2022

Oriol Vinyals is the Research Director and Deep Learning Lead at DeepMind.

Please support this podcast by checking out our sponsors:
- Shopify: https://shopify.com/lex to get 14-day free trial
- Weights & Biases: https://lexfridman.com/wnb
- Magic Spoon: https://magicspoon.com/lex and use code LEX to get $5 off
- Blinkist: https://blinkist.com/lex and use code LEX to get 25% off premium

EPISODE LINKS:
Oriol's Twitter: https://twitter.com/oriolvinyalsml
Oriol's publications: https://scholar.google.com/citations?user=NkzyCvUAAAAJ
DeepMind's Twitter: https://twitter.com/DeepMind
DeepMind's Instagram: https://instagram.com/deepmind
DeepMind's Website: https://deepmind.com

Papers:
1. Gato: https://deepmind.com/publications/a-generalist-agent
2. Flamingo: https://deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
3. Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165
4. Emergent Abilities of Large Language Models: https://arxiv.org/abs/2206.07682
5. Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
YouTube Full Episodes: https://youtube.com/lexfridman
YouTube Clips: https://youtube.com/lexclips

SUPPORT & CONNECT:
- Check out the sponsors above, it's the best way to support this podcast
- Support on Patreon: https://www.patreon.com/lexfridman
- Twitter: https://twitter.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Medium: https://medium.com/@lexfridman

OUTLINE:
Here's the timestamps for the episode. On some podcast players you should be able to click the timestamp to jump to that time.
(00:00) - Introduction
(05:18) - AI
(20:14) - Weights
(26:33) - Gato
(1:01:22) - Meta learning
(1:15:21) - Neural networks
(1:37:46) - Emergence
(1:44:30) - AI sentience
(2:08:27) - AGI
Transcript
The following is a conversation with Oriol Vinyals, his second time on the podcast.
Oriol is the Research Director and Deep Learning Lead at DeepMind
and one of the most brilliant thinkers and researchers in the history of artificial intelligence.
And now a quick few second mention of each sponsor.
Check them out in the description. It's the best way to support this podcast.
We've got Shopify for e-commerce, Weights & Biases for machine learning, Magic Spoon
for breakfast, and Blinkist for non-fiction.
Choose wisely my friends.
And now onto the full ad reads, no ads in the middle.
I try to make this interesting, but if you skip them, please still check out our sponsors.
I enjoy their stuff.
Maybe you will too.
This show is brought to you by Shopify, a platform designed for anyone to sell anywhere, with a
great-looking online store that brings your ideas to life, and of course it has tools
to manage day-to-day operations.
A bunch of folks ask me to have some kind of merch.
As a fan of a bunch of podcasts, I find merch to be pretty cool.
It's a cool way to celebrate people you connect with because of a podcast or a show or
any kind of thing. It's just fun. No matter what your goal is, Shopify is a great platform
to make it happen. It makes the whole thing easy from beginning to end, including daily
management. Anyway, Shopify powers over 1.7 million entrepreneurs. You can get a
free trial and full access to Shopify's entire suite of features when you sign
up at shopify.com slash lex. That's all lowercase, shopify.com slash lex. This show is also brought to you by Weights & Biases.
The company helping machine learning teams build better models faster. You can debug,
compare, and reproduce models: architectures, hyperparameters, git commits, model weights,
GPU usage, and even data sets and predictions.
And you can do all of this while collaborating with teammates.
It's like Google Docs type of collaboration,
but for your neural network hyperparameters.
It's an amazing visualization, collaboration,
ecosystem for machine learning researchers.
Their tools are free for personal use and their Teams features are available for free to academic
researchers.
Weights & Biases gives you a better way to stay organized and be more productive.
Join over 200,000 machine learning engineers and data scientists when you sign up at lexfridman.com
slash wnb. This
episode is also brought to you by an old friend, Magic Spoon: low-carb, keto-friendly cereal. It has zero grams of sugar, 13 to 14 grams of protein, and only four net grams of carbs, and also 140 calories in each serving. They have been a good companion throughout
this journey on this little podcast and I've loved them throughout both having them as a sponsor
and to eat them. My favorite flavor of theirs is the cocoa. It reminds me of childhood, of happiness.
It's delicious, like a cereal that makes you happy in childhood.
And when I say childhood, by the way, I mean high school.
Cereal was one of the few ways I would indulge.
Magic spoon has a 100% happiness guarantee.
So if you don't like it, they will refund it.
You can get a discount on your order if you go to magicspoon.com slash lex and use code LEX.
This show is brought to you by Blinkist, my favorite app for learning new things. Blinkist
takes the key ideas from thousands of nonfiction books and condenses them down into just 15
minutes that you can read or listen to. I can recommend Sapiens, Meditations by Marcus Aurelius,
The Beginning of Infinity by David Deutsch, the Snowden book, all of it. Many of the best non-fiction books
ever written are on there. I use it to
both review the books I've already read, choose the books I'm going to read, or summarize
books that I just don't have the time to read, but they're still a big part or an important part
of public discourse. I have never found a place that's able to summarize the key insights
in non-fiction books as well as Blinkist. I mean it's an art form.
You can claim a special offer for savings at Blinkist.com slash Lex.
This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Oriol Vinyals.
You are one of the most brilliant researchers in the history of AI, working across all kinds of modalities.
Probably the one common theme is it's always sequences of data. So we're talking about languages,
images, even biology and games as we talked about last time. So you're a good person to
ask this in your lifetime, will we be able to build an AI system that's able to replace
me as the interviewer in this conversation in terms of ability to
ask questions that are compelling to somebody listening.
And then further question is, are we close?
Will we be able to build a system that replaces you as the interviewee in order to create
a compelling conversation?
How far away are we do you think?
It's a good question.
I think partly I would say, do we want that?
I really like when we start now with very powerful models,
interacting with them and thinking of them
closer to us, the question is,
if you remove the human side of the conversation,
is that an interesting artifact?
And I would say probably not. I've seen, for instance, last time we spoke like we were talking about
Starcraft and creating, you know, agents that play games in both self-play, but ultimately what
people care about was, how does this agent behave when the opposite side is a human?
So without a doubt, we will probably be more empowered by AI.
Maybe you can source some questions from an AI system.
I mean, that even today, I would say it's quite plausible that with your creativity,
you might actually find very interesting questions that you can filter.
We call this cherry picking sometimes
in the field of language.
And likewise, if I had now the tools on my side,
I could say, look, you're asking this interesting question.
From this answer, I like the words chosen
by this particular system that created a few words.
Completely replacing it feels not exactly exciting to me.
Although in my lifetime, I think, given the trajectory, it's possible that perhaps
there could be interesting, maybe self-play interviews as you're suggesting that would look
or sound quite interesting and probably would educate or you could learn a topic through
listening to one of these interviews at a basic level at least.
So you said it doesn't seem exciting to you, but what if exciting is part of the objective function, the thing is optimized over.
So there's probably a huge amount of data of humans if you look correctly, of humans communicating online,
and there's probably ways to measure the degree of, you know, as they talk about engagement.
So you can probably optimize for the questions that most created an engaging conversation in
the past.
So actually, if you strictly use the word exciting, there is probably a way to create optimally
exciting conversations that involve AI systems, where at least one side is AI.
Yeah, that makes sense. I think maybe looping back a bit to games and the game industry,
when you design algorithms, you're thinking about winning as the objective,
right, or the reward function.
But in fact, when we discuss this with Blizzard, the creators of StarCraft in this case,
I think what's exciting, fun, if you could measure that and optimize for that, that's probably why we
play video games or why we interact or listen or look at cat videos or whatever on the internet. So
it's true that modeling reward beyond the obvious reward functions we're used to in reinforcement learning is definitely very exciting.
And again, there is some progress actually into a particular aspect of AI which is quite critical, which is, for instance, is a conversation or is the information truthful, right? So you could start trying to evaluate this by
extracting from the internet, right?
That has lots of information.
And then if you can learn a function automated, ideally,
so you can also optimize it more easily,
then you could actually have conversations
that optimize for non-obvious things,
such as excitement.
So yeah, that's quite possible.
And then I would say in that case,
it would definitely be a fun exercise and quite unique
to have at least one side that is fully driven
by an excitement reward function.
But obviously, there would be still quite a lot of humanity
in the system, both from who is building the system, of course, and also
ultimately, if we think of labeling for excitement, those labels must come from us, because it's
just hard to have a computational measure of excitement; as far as I understand, there's no such thing.
Wow. You mentioned truth also. I would actually venture to say that excitement is easier
to label than truth.
Or perhaps has lower consequences of failure.
But there is perhaps the humanness that you mentioned.
That's perhaps part of a thing that could be labeled.
And that could mean an AI system that's doing dialogue,
that's doing conversations should be flawed, for example. Like that's the thing you optimize for,
which is have inherent contradictions by design, have flaws by design. Maybe it also needs to
have a strong sense of identity. So it has a backstory. It told itself that it sticks to.
It has memories, not in terms of how the system is designed,
but it's able to tell stories about its past.
It's able to have mortality and fear of mortality
in the following way that it has an identity.
And like, if it says something stupid and gets canceled on Twitter,
that's the end of that system. So it's not like you get to rebrand yourself. That system is,
that's it. So maybe it's the high-stakes nature of it, because, like, you can't say anything stupid
now or you'd be canceled on Twitter, and there are stakes to that. And then I
think that's part of the reason that makes it interesting.
And then you have a perspective like you've built up over time that you stick
with and then people can disagree with you.
So holding that perspective strongly holding sort of a maybe a controversial
at least a strong opinion.
All of those elements, it feels like they can be learned because it feels like
there's a lot of data on the internet of people having an opinion.
And then combine that with a metric of excitement, you can start to create something that, as
opposed to trying to optimize for sort of grammatical clarity and truthfulness, the factual
consistency over many sentences, you're optimizing for the
humanness.
And there's obviously data for humanness on the internet.
So I wonder if there's a future where that's part, or I mean, I sometimes wonder that
about myself.
I'm a huge fan of podcasts, and I listen to some podcasts, and I think, like, what is interesting about this? What is compelling?
The same way you watch games, like you said, watching someone play StarCraft, or
watching Magnus Carlsen play chess. I'm not a chess player, but it's still interesting to me. What is that?
The stakes of it, maybe.
The end of a domination of a series of wins. I don't
know, there's all those elements somehow connect to a compelling conversation, and I wonder
how hard is that to replace? Because ultimately, all of that connects the initial proposition
of how to test whether an AI is intelligent or not with the Turing test, which I guess
my question comes from a place of
the spirit of that test.
Yes, I actually recall I was just listening to our first podcast where we discussed Turing
test.
So I would say, from a neural network, AI builder perspective, usually you try to map many of these interesting topics you discuss to benchmarks
and then also to actual architectures on how these systems are currently built, how they
learn, what data they learn from, what are they learning, right?
We're talking about weights of a mathematical function and then looking at the current state of the game,
maybe what leaps forward do we need
to get to the ultimate stage of all these experiences,
lifetime experience, of fears,
like, words where currently we're barely seeing progress,
just because what's happening today is you take all these human interactions.
It's a large vast variety of human interactions online and then you're distilling these sequences
going back to my passion, sequences of words, letters, images, sound, there's more modalities
here to be at play. And then you're trying to just learn a
function that maximizes the likelihood of seeing all of these through a neural
network. Now, I think there's a few places where, the way we currently train these models, we would
clearly like them to be able to develop the kinds of capabilities you say.
I'll tell you, maybe a couple.
One is the lifetime of an agent or a model.
So you learn from this data offline, right?
So you're just passively observing and maximizing this,
it's almost like a landscape of mountains.
And then everywhere there's data that humans interacted
in this way, you're trying to make that higher and then, you know, lower where there's no data.
And then these models generally don't then experience themselves. They just are observers, right. They're passive observers of the data.
And then we're putting them to then generate data when we interact with them. But that's very limiting. The experience they actually have, they could maybe be using to optimize or further optimize the weights.
We're not even doing that. So to be clear, and again, mapping to AlphaGo, AlphaStar,
we train the model, and when we deploy it to play against humans, or in this case,
interact with humans like language models, they don't even keep training, right?
They're not learning in the sense of the weights
that you've learned from the data, they don't keep changing.
Now, there's something that feels a bit more magical,
but it's understandable if you're into neural nets,
which is, well, they might not learn in the strict sense
of the word, the weights changing,
maybe that's mapping to how neurons interconnect
and how we learn over our lifetime.
But it's true that the context of the conversation
that takes place when you talk to these systems,
it's held in their working memory, right?
It's almost like you start a computer,
it has a hard drive that has a lot of information,
you have access to the internet,
which has probably all the information,
but there's also a working memory
where these agents, as we call them,
or start calling them, build upon.
Now, these memories are very limited.
I mean, right now we're talking to be concrete
about 2,000 words that we hold, then beyond that we start forgetting what we've seen
So you can see that there's some short-term coherence already, right? With what you said, I mean, it's a very interesting topic,
having sort of mapped
an agent to, like, have consistency. Then, you know, if you say, oh, what's your name?
It could remember that, but then it
might forget beyond 2000 words, which is not that long of context, if we think even of this podcast,
books are much longer. So technically speaking, there's a limitation there, super exciting from
people that work on deep learning to be working on. But I would say we lack maybe benchmarks
and the technology to have this lifetime-like experience
of memory that keeps building up.
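To make the working-memory limitation concrete, here is a minimal sketch in Python; the 2,000-token limit and the toy token lists are illustrative only, not the actual numbers or tokenizer of any particular model.

```python
# Sketch of a fixed-size "working memory": the model only ever conditions on
# the most recent max_context tokens; anything older is silently forgotten.
def truncate_context(token_ids, max_context=2000):
    """Keep only the most recent max_context tokens."""
    return token_ids[-max_context:]

conversation = []                                        # running token ids for the whole dialogue
for turn_tokens in [[1, 2, 3], [4, 5], [6, 7, 8, 9]]:    # stand-ins for tokenized turns
    conversation.extend(turn_tokens)
    visible = truncate_context(conversation)
    # `visible` is all the model sees; tokens that scrolled past the limit are gone.
    print(len(conversation), "total tokens,", len(visible), "visible to the model")
```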
However, the way it learns offline is clearly very powerful.
So if you had asked me three years ago, I would say, oh, we're very
far.
I think we've seen the power of this imitation, again, at internet scale, that has now enabled this to feel like at least the
knowledge, the basic knowledge about the world, is incorporated into the weights. But then this
experience is lacking. And in fact, as I said, we don't even train them when, you know,
when we're talking to them, other than their working memory, of course, is affected.
So that's the dynamic part, but they don't learn in the same way that you and I have learned, right?
From basically when we were born and probably before.
So lots of fascinating, interesting questions you ask there.
I think the one I mentioned is this idea of memory and experience versus just
kind of observing the world and learning its knowledge, for which I would argue there are lots of
recent advancements that make me very excited about the field. And then the second, maybe
issue that I see is all these models, we train them from scratch.
That's something I would have complained three years ago
or six years ago or 10 years ago.
And it feels, if we take inspiration
from how we got here, how the universe evolved us
and we keep evolving, it feels that it's a missing piece,
that we should not be training models from scratch every few months, that there should be some sort of
way in which we can grow models, much like we as a species and many other elements in the universe build from previous iterations. And from
a purely neural network perspective, even though we would like to make it work, it's proven very hard to not throw away the previous weights, right?
This landscape we learn from the data and refresh it with a brand new set of weights,
even maybe a recent snapshot of these data sets we train on, et cetera, or even a new game
we're learning.
So that feels like something is missing fundamentally.
We might find it, but it's not very clear what it will look like.
There's many ideas and it's super exciting as well.
Yes, just for people who don't know, when you're approaching a new problem in machine learning,
you're going to come up with an architecture that has a bunch of weights and then you initialize them somehow,
which in most cases is some version of random.
So that's what you mean by starting from scratch,
and it seems like it's a waste every time you solve
the game of Go and chess, Starcraft, protein folding,
like surely there's some way to reuse the weights.
As we grow this giant database of neural networks that have solved some of the toughest problems
in the world.
So some of that is what is that?
Methods, how to reuse weights, how to learn to extract what's generalizable, or at least has a chance to be, and throw away the other stuff.
And maybe the neural network itself should be able to tell you that.
Like what, yeah, what ideas do you have for better initialization of weights?
Maybe stepping back, if we look at the field of machine learning but especially deep learning
right at the core of deep learning there's this beautiful idea, that a single algorithm can solve any task.
It's been proven over and over, with an ever-increasing set of benchmarks and things that were
thought impossible that are being cracked by this basic principle,
that is, you take a neural network of uninitialized weights, so like a blank computational brain,
then you give it, in the case of supervised learning, a lot of examples, ideally, of, hey, here is
what the input looks like, and the desired output should look like this. I mean, image classification is a very clear example, images to maybe one of a thousand
categories.
That's what image net is like, but many, many, if not all problems can be mapped this
way.
And then there's a generic recipe, right, that you can use.
And this recipe, with very little change, and I think that's the core of deep learning
research, right, what is the recipe that is universal?
That, for any new given task, I'll be able to use it without thinking, without having to work very hard on the problem at stake.
We have not found this recipe, but I think the field is excited to find fewer tweaks or tricks that people find when they work on important problems
specific to those and more of a general algorithm.
So, at an algorithmic level, I would say we have something general already, which is this
formula of training a very powerful model, a neural network, on a lot of data.
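As a rough illustration of that generic recipe (a randomly initialized network, input-output examples, gradient descent), here is a minimal supervised-learning loop in PyTorch; the tiny architecture and random data are toy stand-ins, not any model discussed here.

```python
import torch
import torch.nn as nn

# The "blank computational brain": a small, randomly initialized network.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Toy supervised data: inputs and desired class labels
# (a stand-in for something like ImageNet images and their categories).
x = torch.randn(256, 32)
y = torch.randint(0, 10, (256,))

# The same recipe regardless of task: predict, measure error, nudge the weights.
for step in range(100):
    logits = model(x)
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("final loss:", loss.item())
```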
And in many cases, you need some specificity to the actual problem you're solving.
Protein folding, being such an important problem, has some basic recipe that is learned from before,
like transformer models, graph neural networks, ideas coming from NLP,
like, you know, something called BERT, which is a kind of loss that you can put in place to help the model;
knowledge distillation is another technique, right? So this is the formula. We still had to
find some particular things that were specific to AlphaFold, right? That's very important because
protein folding is such a high value problem that as humans we should solve it no matter if we
need to be a bit specific.
And it's possible that some of these learnings will apply then to the next iteration of
this recipe that deep learners are about.
But it is true that so far, the recipe is what's common, but the weights are generally thrown
away, which feels very sad.
Although maybe in the last, especially in the last two,
three years, and when we last spoke, I mentioned this area of
meta-learning, which is the idea of learning to learn.
On that idea, some progress has been had,
starting, I would say, mostly from GPT-3 on the language domain only,
in which you could conceive a model that is trained once. And then this model is not
narrow in that it only knows how to translate a pair of languages or it only knows how to
assign sentiment to a sentence. This, actually, you can teach it through what is called prompting. And
this prompting is essentially just showing it a few more examples, almost like you show examples, input-output examples,
algorithmically speaking, to the process of creating this model.
But now you're doing it through language,
which is very natural way for us to learn from one another.
I tell you, hey, you should do this new task.
I'll tell you a bit more.
Maybe you ask me some questions.
And now you know the task, right?
You didn't need to retrain it from scratch.
And we've seen these magical moments almost
in this way to do few shot prompting through language
on language only domain.
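To make the prompting idea concrete, here is a rough sketch of a few-shot prompt: the "training examples" for the new task are simply written into the text the model conditions on. The examples and the `generate` call are hypothetical placeholders, not a real API.

```python
# Few-shot prompting: the new task is defined entirely inside the prompt text.
def build_sentiment_prompt(new_review):
    examples = [
        ("The food was wonderful and the staff were kind.", "positive"),
        ("Waited an hour and the order was still wrong.", "negative"),
        ("Best concert I have been to in years.", "positive"),
    ]
    lines = ["Label each review as positive or negative.", ""]
    for review, label in examples:
        lines.append(f"Review: {review}\nSentiment: {label}\n")
    lines.append(f"Review: {new_review}\nSentiment:")
    return "\n".join(lines)

prompt = build_sentiment_prompt("The plot dragged and the ending made no sense.")
print(prompt)
# answer = generate(prompt)   # hypothetical call to whatever language model you have
```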
And then in the last two years,
we've seen these expanded to beyond language,
adding vision, adding actions and games,
lots of progress to be had.
But this is maybe, if you ask me,
like, about how we're gonna crack this problem,
this is perhaps one way in which you have a single model.
The problem of this model is it's hard to grow
in weights or capacity, but the model is certainly
so powerful that you can teach it some tasks, right?
In this way that I teach you, I could teach you a new task now,
let's say a text-based task
or a classification, a vision-style task.
But it still feels like more breakthroughs should be had,
but it's a great beginning, right?
We have a good baseline.
We have an idea that this, maybe, is the way
we want to benchmark progress towards AGI.
And I think, in my view, it's critical to always have a way to benchmark,
and the community is sort of converging to this overall, which is good to see.
And then this is actually what excites me in terms of also next steps
for deep learning is how to make these models more powerful,
how do you train them, how to grow them
if they must grow, should they change their weights as you teach them a task or not.
There are some interesting questions, many to be answered.
Yeah, you've opened a door about a bunch of questions I want to ask, but let's first
return to your tweet and read it like Shakespeare.
You wrote, Gato is not the end,
it's the beginning.
And then you wrote meow, and then an emoji of a cat.
So, two questions:
first, can you explain the meow and the cat emoji,
and second, can you explain what Gato is and how it works?
Right, indeed, I mean, thanks for reminding me
that we're all exposed on Twitter and on this.
Permanently there.
Yes, permanently there.
One of the greatest AI researchers of all time, meow and cat emoji.
Yes.
There you go.
Right.
So can you imagine, like, Turing tweeting meow and a cat? He probably would, he probably would.
Probably.
So yeah, the tweet is important actually.
You know, I put thought on the tweets.
I hope people would watch do you think? Okay, so there's three sentences.
Gato's not the end. Gato's the beginning. Meow, cat emoji. Which
is the important part? The meow? No, no, definitely, that it is
the beginning. I mean, I probably was just explaining a bit
where the field is going,
but let me tell you about Gato.
So first, the name Gato comes from maybe a sequence of releases
that DeepMind had that, like, used animal names
to name some of their models that are based on this idea
of large sequence models.
Initially, they're only language,
but we're expanding to other modalities.
So we had Gopher, Chinchilla,
these were language only, and then more recently,
we released Flamingo, which adds vision to the equation,
and then Gato, which adds vision,
and then also actions in the mix.
As we discussed, actually, actions, especially discrete actions, like up, down, left, right, I just told you the actions,
but they're words.
So you can kind of see how actions naturally map
to sequence modeling of words,
which these models are very powerful.
So Gato was named after, I believe, I can only, from memory, right, these, you know,
these things always happen with an amazing team of researchers behind. So, before the release,
we had a discussion about which animal would we pick, right, and I think because of the word
general agent, right, and this is a property quite unique to Gato. We kind of
were playing with the GA words and then you know Gato.
And a razzle cat. Yes. And Gato is obviously the Spanish word for cat. I had
nothing to do with it, although I'm from Spain. Oh, what is... Wait, sorry. How do you
say cat in Spanish? Gato. Oh, Gato. Yeah. No, okay. I see. I see. I see. You know.
No, it all makes sense. Okay. So how do you say meow in Spanish? No, that's I think you you say the same way
But you're right it is
M.I.A.U
It's universal. Yeah, so then how does the thing work? So you said general, so you said
language
vision and action
How does this?
Can you explain what kind of neural networks are involved?
What does the training look like? And maybe, what do you see as some beautiful ideas within the system?
Yeah. So maybe the basics of Gato are not that dissimilar from many, many works that came before. So
here is where the sort of recipe,
I mean, it hasn't changed too much.
There is a transformer model.
That's the kind of recurrent neural network
that essentially takes a sequence of modalities,
observations that could be words,
could be vision or could be actions.
And then its own objective that you train it to do when you
train it is to predict what the next anything is and anything means what's the next action. If
this sequence that I'm showing you to train is a sequence of actions and observations, then you're
predicting what's the next action and the next observation, right? So you think of these really as a sequence of bytes, right? So take
any sequence of words, a sequence of interleaved words and images, a sequence of maybe observations
that are images and moves in Atari, up, down, left, right. And these, you just think
of them as bytes, and you're modeling what the next byte is going to be like. And you might interpret
that as an action and then play it in a game, or you could interpret it as a word and then write it
down if you're chatting with the system, and so on. So Gato basically can be thought of as: inputs, images,
text, video, actions, it also actually inputs some sort of
proprioception sensors from robotics because robotics is one of the tasks that
it's been trained to do. And then at the output, similarly, it outputs words,
actions, it does not output images. That's just by design, we decided not to go
that way for now. That's also in part why it's the beginning
because there's more to do clearly.
But that's kind of what Gato is.
It's this brain that essentially you give it
any sequence of these observations and modalities
and it outputs the next step.
And then off you go,
you feed the next step in and predict the next one
and so on.
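A minimal sketch of that feed-it-back-in loop, in Python with PyTorch; the dummy model and greedy sampling here are placeholders for a real trained sequence model.

```python
import torch

def generate(model, prompt_tokens, num_steps, max_context=1024):
    """Autoregressive loop: predict the next token, append it, repeat.
    `model` is assumed to map a 1 x T tensor of ids to 1 x T x vocab logits."""
    tokens = list(prompt_tokens)
    for _ in range(num_steps):
        context = torch.tensor([tokens[-max_context:]])   # most recent tokens only
        logits = model(context)                           # (1, T, vocab_size)
        next_token = int(torch.argmax(logits[0, -1]))     # greedy choice, for simplicity
        tokens.append(next_token)                         # could be a word, an image patch, or an action
    return tokens

# Dummy stand-in model so the sketch runs end to end (a real one would be a trained transformer).
class DummyModel(torch.nn.Module):
    def __init__(self, vocab_size=50):
        super().__init__()
        self.vocab_size = vocab_size
    def forward(self, ids):
        return torch.randn(ids.shape[0], ids.shape[1], self.vocab_size)

print(generate(DummyModel(), [1, 2, 3], num_steps=5))
```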
Now, it is more than a language model
because even though you can chat with Gato,
like you can chat with Chinchilla or Flamingo,
it also is an agent, right?
So that's why we call it the A of Gato,
like the letter A, and also it's general. It's not an agent that's been trained
to be good at only StarCraft or only Atari or only Go. It's been trained on a vast variety of
data sets. What makes it an agent if I may interrupt the fact that it can generate actions?
Yes, so when we call it, I mean, it's a good question, right? What, when do we call a model, I mean, everything is a model,
but what is an agent, in my view, is indeed the capacity
to take actions in an environment that you then send to it
and then the environment might return with a new observation
and then you generate the next action.
This actually, this reminds me of the question
from the side of biology, what is life?
Which is actually a very difficult question as well. What is living? What is living when you think
about life here on this planet Earth? And a question interesting to me about aliens, what is life when we
visit another planet? Will you be able to recognize it? And this feels like, it sounds perhaps silly, but I don't think it is: at which
point is a neural network a being versus a tool? And it feels like action, the ability to modify
its environment is that fundamental leap. Yeah, I think it certainly feels like action is a
necessary condition to be alive, but probably not sufficient either.
So, sadly, I'm so conscious this thing, whatever. Yeah, yeah, we can get back to that later.
But anyway, it's going back to the meow and the gato, right? So,
one of the leaps forward and what took the team a lot of effort and time was,
as you were asking, how has Gato been
trained?
I told you Gato is this transformer neural network, models actions, sequences of actions,
words, etc.
Then the way we train it is by essentially pooling data sets of observations.
It's a massive imitation learning algorithm
that imitates, obviously, what is the next word
that comes next, from the usual data sets we used before, right?
So these are these web scale style data sets of people
writing on webs or chatting or whatnot, right?
So that's an obvious source that we use
on all language work.
But then we also took a lot of agents
that we have at DeepMind.
I mean, as you know, at DeepMind we're quite interested
in reinforcement learning
and learning agents that play in different environments.
So we kind of created a data set of these trajectories
as we call them, or agent experiences.
So in a way, there are other agents we trained for a single-minded purpose to, let's say, you know, control a 3D game environment and navigate a maze.
So we had all the experience that was created through the one agent interacting with that environment.
And we added this to the data set.
As I said, we just see all the data,
all these sequences of words,
or sequences of these agent interacting with that environment,
or agents playing Atari and so on.
We see this as the same kind of data.
We mix these data sets together and we train Gato.
That's the G part. It's general because it really has
mixed, it doesn't have different brains for each modality or each narrow task. It has a single brain.
It's not that big of a brain compared to most of the neural networks we see these days. It has one
billion parameters. Some models we're seeing are getting to the trillions these days, and certainly, 100 billion feels like a size that is very common these days when you
train these. So the actual agent is relatively small, but it's been
trained on a very challenging, diverse data set, not only containing all of
the internet, but containing all this agent experience, playing very distinct environments.
So this brings us to the part of the tweet of,
this is not the end, it's the beginning.
It feels very cool to see that Gato, in principle,
is able to control any sort of environment,
especially the ones that it's been trained to do:
these 3D games, Atari games, all sorts of robotics tasks and so on.
But obviously it's not as proficient as the teachers it learned from on these environments.
But not obvious. It's not obvious that it wouldn't be more proficient.
It's just the current beginning part.
Right.
It's that the performance is such that it's not as good as if it were specialized
to that task. Right, so it's not as good, although I would argue size matters here, so the fact that
I would argue size always matters. Yeah, that's a different concept. Yes, but for neural networks,
certainly size does matter. So it's the beginning because it's relatively small. So obviously scaling this idea up might make the connections
that exist between text on the internet
and playing Atari and so on,
more synergistic with one another.
And you might gain.
And that moment we didn't quite see,
but obviously that's why it's the beginning.
That synergy might emerge with scale.
Right, might emerge with scale.
And also, I believe there's some new research
or ways in which you prepare the data that might,
you might need to sort of make it more clear to the model
that you're not only playing Atari,
and it's just you start from a screen,
and here is up and a screen and down.
Maybe you can think of playing Atari
as there's
some sort of context that is needed for the agent.
Before it starts seeing, oh, this is an Atari screen.
I'm going to start playing.
You might require, for instance, to be told in words,
hey, in this sequence that I'm showing,
you're going to be playing an Atari game.
So text might actually be a good driver
to enhance the data, right?
So then these connections might be made more easily, right?
That's an idea that we start seeing in language,
but obviously beyond language this is going to be effective, right?
It's not like, I don't show you a screen,
and you from scratch, you're supposed to learn a game.
There is a lot of context we might set.
So there might be some work needed as well to set that context.
But anyways, there's a lot of work.
So that context puts all the different modalities on the same level ground,
and text actually provides the context best.
So maybe on that point, so there's this task which may not seem trivial of
tokenizing the data of converting the data into pieces into basic atomic elements
that then could
cross modalities somehow. So what's tokenization?
How do you tokenize text? How do you tokenize images, how do you tokenize games and
actions and robotics tasks? Yeah, that's a great question. So tokenization is the entry point to
actually make all the data look like a sequence because tokens then are just kind of these little
puzzle pieces we break down anything into these puzzle pieces,
and then we just model
what this puzzle looks like, right,
when you lay it down in a line,
so to speak, in a sequence.
So in Gato, for the text, there's a lot of prior work.
You tokenize text usually by looking at
commonly used substrings, right?
So, you know, 'ing' in English is a very common substring, so that becomes a token. There's a quite well-studied
problem on tokenizing text and Gato just used the standard techniques that have been developed
from many years, even starting from N-gram models in the 1950s and so on.
Just for context, how many tokens, like what order of magnitude number of tokens, is required for a word?
Yeah, but what are we talking about? Yeah, for a word in English, right? I mean every language is very different
The current level or granularity of tokenization generally means it's maybe two to five.
I mean, I don't know the statistics exactly, but to give you an idea, we don't
tokenize at the level of letters, then it would probably be like, I don't know what the
average length of a word is in English, but that would be, you know, the minimum set of
tokens you could use.
So it's bigger than letters, smaller than words.
Yes, yes. And you could think of very, very common words like 'the'. I mean, that would be
a single token, but very quickly, you're talking three, four
tokens or so.
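As a toy illustration of subword tokenization: a small vocabulary of common substrings matched greedily, longest first. Real tokenizers (byte-pair encoding and friends) learn their vocabulary from corpus statistics; the one below is made up.

```python
# Toy greedy longest-match subword tokenizer; the vocabulary is invented
# for illustration, whereas real systems learn it from data.
VOCAB = ["play", "ing", "talk", "ed", "un", "break", "able", "the", " "]

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Longest vocabulary entry that matches at position i, else a single character.
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len, default=None)
        if match is None:
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("unbreakable"))   # ['un', 'break', 'able']
print(tokenize("playing"))       # ['play', 'ing']
```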
Have you ever tried to tokenize emojis?
Emojis are actually just sequences of letters.
So maybe to you, but to me, they mean so much more.
Yeah, you can render the emoji, but, yeah, this is a philosophical
question:
is an emoji an image or text?
The way we do these things is they're actually mapped to small sequences of characters.
Yeah. So you can actually play with these models and input emojis.
It will output emojis back, which is actually quite a fun exercise.
You probably can find other tweets about this out there.
But yeah, so anyways, text, it's very clear how this is done.
And then in Gato, what we did for images
is we compress images,
so to speak, into something that looks less like
every pixel with every intensity; that would mean we have a very
long sequence, right? Like if we were talking about 100 by 100 pixel images that would make the
sequences far too long. So what was done there is you just use a technique that essentially compresses
an image into maybe 16 by 16 patches of pixels. And then that is mapped,
again, tokenized: you just essentially quantize
this space into a special word that actually maps
to this little sequence of pixels.
And then you put the pixels together in some raster order.
And then that's how you get out or in the image
that you're processing.
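A rough sketch of that patch-based image tokenization: cut the image into 16 by 16 patches, walk them in raster order, and map each patch to the id of its nearest entry in a codebook. The random codebook below is purely illustrative; in practice it would be learned, and Gato's exact scheme differs in detail.

```python
import numpy as np

PATCH = 16
codebook = np.random.rand(1024, PATCH * PATCH * 3)       # illustrative; normally learned from data

def image_to_tokens(image):
    """image: H x W x 3 array; returns one integer token per 16x16 patch, in raster order."""
    h, w, _ = image.shape
    tokens = []
    for row in range(0, h - PATCH + 1, PATCH):            # raster order: top-to-bottom, left-to-right
        for col in range(0, w - PATCH + 1, PATCH):
            patch = image[row:row + PATCH, col:col + PATCH].reshape(-1)
            distances = np.linalg.norm(codebook - patch, axis=1)
            tokens.append(int(np.argmin(distances)))      # nearest codebook entry becomes the token id
    return tokens

img = np.random.rand(96, 96, 3)                           # stand-in for a real image
print(len(image_to_tokens(img)), "tokens")                # 6 x 6 = 36 patches
```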
But there's no semantic aspect to that.
So you're doing some kind of, you don't need to understand anything about the image in order to
tokenize it currently. No, you're only using this notion of compression. So you're trying to find
common patterns, like JPEG or all these algorithms do. It's actually very similar at the tokenization level.
All we're doing is finding common patterns
and then making sure in a lossy way,
we compress these images,
given the statistics of the images
that are contained in all the data we deal with.
Although you could probably argue that JPEG
does have some understanding of images.
Like, because visual information,
maybe color, compressing, for example, based on color,
does capture something important about an image.
That's about its meaning, not just about some statistics.
Yeah, I mean, JPEG, as I said,
these, the algorithms look actually very similar;
they use this, the cosine transform, in JPEG.
The approach we usually do in machine learning when we deal with images and we do this
quantization step is a bit more data driven so rather than have some sort of Fourier basis for how
you know, frequencies appear in the natural world, we actually just use the statistics
of the images and then quantize them based
on the statistics much like you do in words.
So common sub strings are allocated a token.
And images are very similar.
But there's no connection in the token space,
if you think of, oh, like the tokens are an integer
at the end of the day.
So now, like we work on, maybe we have about,
let's say, I don't know the exact numbers,
but let's say 10,000 tokens for text, right?
Certainly more than characters
because we have groups of characters and so on.
So from one to 10,000,
those are representing all the language
and the words we'll see.
And then images occupy the next set of integers.
So they're completely independent, right?
So from 10,001 to 20,000, those are the tokens that represent this other modality, images.
And that is an interesting aspect that makes it orthogonal.
So what connects these concepts is the data, right? Once you have a
data set, for instance, that captions images that tells you, oh, this is someone playing a Frisbee
on a green field. Now, the model will need to predict the tokens from the text green field to then
the pixels, and that will start making the connections between the tokens. So these connections
happen as the algorithm learns. And then the last, if we think of these integers, the first few
are words, the next few are images; in Gato, we also allocated the highest order of integers
to actions, right, which we discretize, and actions are very diverse. In Atari, there are, I don't know,
17 discrete actions; in robotics, actions
might be torques and forces that we apply.
So we just use kind of similar ideas
to compress these actions into tokens.
And then we just, that's how we map now all the space
to these sequence of integers.
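For concreteness, a sketch of that disjoint-range layout: each modality gets its own slice of the integer space simply by adding an offset. The vocabulary sizes here are arbitrary, not Gato's actual values.

```python
# Each modality occupies its own, non-overlapping range of integer ids.
TEXT_VOCAB = 10_000      # ids 0 .. 9_999
IMAGE_VOCAB = 10_000     # ids 10_000 .. 19_999
ACTION_VOCAB = 1_000     # ids 20_000 .. 20_999

def text_token(i):   return i
def image_token(i):  return TEXT_VOCAB + i
def action_token(i): return TEXT_VOCAB + IMAGE_VOCAB + i

# An interleaved episode: a text token, two image-patch tokens, then an action.
sequence = [text_token(42), image_token(7), image_token(99), action_token(3)]
print(sequence)   # [42, 10007, 10099, 20003] -- one flat stream of integers
```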
But they occupy different spaces, and what connects them is then the learning algorithm.
That's where the magic happens. So the modalities are
orthogonal to each other in token space, right? So like in the input,
everything you add, you add extra tokens, right, and then
you're shoving all of that into one place. Yes, the transformer.
And that transformer, that transformer looks
at this gigantic token space and tries to form
some kind of representation,
some kind of unique wisdom about all of these
different modalities.
How is that possible?
Are the, if you were to sort of like put your psychoanalysis hat on
and try to psychoanalyze this neural network,
is it schizophrenic?
Does it try to, given these very few weights,
represent multiple disjoint things
and somehow have them not interfere with each other,
or is it somehow building on the joint strength, on whatever is common to all the different
modalities?
Like, if you were to ask the question, is it schizophrenic or is it of one mind?
I mean, it is one mind. And it's actually the simplest algorithm,
which, that's kind of, in a way, how it feels: the field
hasn't changed since backpropagation and gradient descent
were proposed for learning neural networks.
So there is obviously details on the architecture.
This has evolved.
The current iteration is still the transformer,
which is a powerful sequence modeling architecture.
But then the goal of this,
you know, setting these weights to predict the data
is essentially the same as basically what I could describe,
I mean, what we described a few years ago,
AlphaStar, language modeling and so on, right?
We take, let's say, an Atari game, we map it to a string of numbers that will all be probably image space and action space interleaved. And all we're going to do is say, okay, given the numbers,
you know, 10,001, 10,004, 10,005, the next number that comes is
20,006, which is in the action space. And you're just optimizing these weights via a very simple
gradient; like, you know, mathematically, it's almost the most boring algorithm you could
imagine. We set the weights so that, given this particular instance, these weights are set to maximize the probability
of having seen this particular sequence of integers for this particular game.
And then the algorithm does this for many, many, many iterations, looking at different
modalities, different games, right?
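A bare-bones sketch of that training step: shift the sequence by one and maximize the probability of each next integer via cross-entropy and gradient descent. The tiny model below predicts the next token from the current one only; a real system would be a transformer conditioning on the whole prefix, and all numbers here are illustrative.

```python
import torch
import torch.nn as nn

VOCAB = 21_000                       # text + image + action ids in one flat space (illustrative)
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

# One toy interleaved sequence of integers (observations and actions mixed together).
seq = torch.tensor([42, 10007, 10099, 20003, 43, 10011, 20001])
inputs, targets = seq[:-1], seq[1:]          # predict each next token

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(50):
    logits = model(inputs)                    # (T, VOCAB) scores for the next token
    loss = nn.functional.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("loss:", loss.item())
```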
That's the mixture of the data set we discuss. So in a way, it's a very simple algorithm
and the weights, right?
They're all shared, right?
So in terms of, is it focusing on one modality or not?
The intermediate weights that are converting
from these input of integers to the target integer
you're predicting next.
Those weights certainly are common.
And then the way that organization happens,
there is a special place in the neural network,
which is we map this integer, like number 10,001,
to a vector of real numbers.
Like real numbers, we can optimize them with gradient descent,
right, the functions we learn are actually
surprisingly differentiable, that's why we compute gradients.
So this step is the only one where this orthogonality you
mention applies.
So mapping a certain token for text or image or actions,
each of these tokens gets its own little vector of real
numbers that represents it.
If you look at the field back many years ago, people were
talking about word vectors or word embeddings.
These are the same. We have word vectors or embeddings. We have image vector or embeddings and action vector of embeddings.
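A small sketch of that embedding step: every integer id, whatever modality it came from, indexes into one table of learned vectors, and you can later inspect whether, say, the vector for the word "cat" drifts toward the vectors of cat-image patches. The ids and sizes below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 21_000                        # one flat id space across modalities (illustrative)
embed = nn.Embedding(VOCAB, 128)      # every token id, text or image or action, gets a vector

word_cat_id = 1_234                   # hypothetical text token for "cat"
image_cat_patch_id = 15_678           # hypothetical image token that often co-occurs with cats

v_word = embed(torch.tensor(word_cat_id))
v_patch = embed(torch.tensor(image_cat_patch_id))

# At initialization this is near zero on average; after training on captioned data,
# related tokens from different modalities may end up pointing in similar directions.
print("cosine similarity:", F.cosine_similarity(v_word, v_patch, dim=0).item())
```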
And the beauty here is that as you train this model, if you visualize these little vectors, it might be that they start aligning, even though they're independent parameters. They could be anything,
but then it might be that you take the word gato or cat,
which maybe is common enough that it actually has its own token.
Then you take pixels that have a cat and you might start seeing that these vectors look like they align.
So by learning from this vast amount of data,
the model is realizing the potential
connections between these modalities. Now, I will say there would be another way, at least
in part, to not have these different vectors for each different modality. For instance,
when I tell you about actions in certain space, I'm defining actions by words, right?
So you could imagine a world in which I'm not learning that the action 'up' in Atari is its own number.
The action 'up' in Atari maybe is literally the word, or the sentence, 'up in Atari,' right?
And that would mean we now leverage much more from the language. This is not what we did here, but certainly it might make these connections much easier to learn and also to teach the model to
correct its own actions and so on, right? So all this is to say that Gato is indeed the beginning,
that it is a radical idea to do it this way, but there's probably a lot more to be done,
and the results will be more impressive, not only through scale, but also through some new research
that will come, hopefully, in the years to come.
So just to elaborate quickly, you mean one possible next step or one of the paths that you might take next is
doing the tokenization fundamentally as a kind of
linguistic communication. So like you convert even images into language.
So doing something like a crude semantic segmentation,
trying to just assign a bunch of words to an image that like
have almost like a dumb entity explaining as much as you can about the image.
And so you convert that into words and then you convert games into words and then you
provide the context in words, and all of it eventually getting to a point where everybody
agrees with Noam Chomsky that language is actually at the core of everything, that it's the base layer of intelligence
and consciousness and all that kind of stuff.
Okay.
You mentioned early on, like, it's hard to grow.
What did you mean by that?
Because we're talking about how scale might change things.
There might be, and we'll talk about this too,
like, there's an emergence:
there's certain things about these neural networks that are emergent, so certain performance
we can see only with scale, and there's some kind of threshold of scale.
So why is it hard to grow something like this meow network?
So the meow network is not hard to grow if you retrain it. Yeah, what's hard is, well,
we have now one billion parameters.
We trained them for a while, we spent some amount of work towards building these weights that are an amazing
initial brain for doing this kind of task we care about
could we reuse the weights and
expand to a larger brain?
And that is extraordinarily hard, but also exciting from a research perspective and a
practical point of view, right?
So there's this notion of modularity in software engineering.
And we're starting to see some examples and work that
leverages modularity. In fact, if we go back one step from Gato to a work that I
would say trained a much larger, much more capable network, called Flamingo. Flamingo
did not deal with actions, but it definitely dealt with images in an
interesting way, kind of akin to what Gato did, but with a slightly different technique for tokenizing.
But we don't need to go into that detail. But what Flamingo also did, which Gato didn't do,
and that just happens because these projects, you know, they're different, you know,
it's a bit of like the exploratory nature of research, which is great.
The research behind these projects is also modular.
Yes, exactly. And it has to be,
right? We need to have creativity and sometimes you need to protect pockets of people, researchers,
and so on. By 'we' you mean humans. Yes. And also in particular researchers, and maybe even further,
DeepMind or other such labs. And then also the neural networks themselves. So it's modularity all the way down.
All the way down. So the way that we did modularity very beautifully in Flamingo is we took Chinchilla, which is a language-only
model, not an agent, if we think of actions being necessary for agency.
So we took Chinchilla, we took the weights of Chinchilla, and then we froze them.
We said, these don't change,
we trained them to be very good at predicting the next word,
it is a very good language model,
state of the art at the time we released it, etc.
We're going to add a capability to see,
we are going to add the ability to see to this language model.
So we're going to attach small pieces of neural networks at the right places in the model. It's almost like injecting the network with some weights and some subnetworks into the weights, in a good way, right, adding a capability without destroying others, et cetera. So we created a small subnetwork
initialized not from random, but actually from self-supervised
learning, a model that understands vision in general.
And then we took data sets that connect the two modalities,
vision and language.
And then we froze the main part, the largest portion of the network,
which was Chinchilla, that is 70 billion parameters.
And then we added a few more parameters on top,
trained from scratch,
and then some others that were pre-trained
with the capacity to see.
It was not tokenization
in the way I described for Gato,
but it's a similar idea.
And then we trained the whole system.
Parts of it were frozen, parts of it were new.
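A minimal sketch of that freeze-and-extend pattern: load a pretrained network, mark its parameters as not trainable, bolt on small new modules, and hand the optimizer only the new parameters. This is the general idea, not Flamingo's actual architecture; all module shapes here are placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for a big pretrained language model (Chinchilla itself is ~70B parameters).
pretrained_lm = nn.Sequential(nn.Embedding(1000, 256), nn.Linear(256, 256))
for p in pretrained_lm.parameters():
    p.requires_grad = False               # frozen: these weights will not change

# New, small modules added on top: a vision adapter and a fresh output head.
vision_adapter = nn.Linear(512, 256)      # maps image features into the language model's space
new_head = nn.Linear(256, 1000)

# Only the new parameters are handed to the optimizer.
trainable = list(vision_adapter.parameters()) + list(new_head.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)
print(sum(p.numel() for p in pretrained_lm.parameters()), "frozen params;",
      sum(p.numel() for p in trainable), "trainable params")
```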
And all of a sudden, we developed Flamingo, which is an amazing model that is essentially,
I mean, describing it as a chatbot where you can also upload images and start conversing
about images, but it's also kind of a dialogue style chat
bot. So the input is images and text and the output is text. Exactly. And how many parameters
you said 70 billion for Chinchilla? Yeah, Chinchilla is 70 billion, and then the ones we add on top,
which is almost like a way to overwrite its little activations so that when it sees vision,
it does kind of a correct computation
of what it's seeing, mapping it back to words,
so to speak; that adds an extra 10 billion parameters.
Right.
So it's 80 billion total, the largest one we released.
And then you train it on a few data sets
that contain vision and language.
And once you interact with the model, you start seeing that you can upload an image
and start having a dialogue about the image,
which is actually not something we explicitly trained for.
It's very similar and akin to what we saw in language only.
These prompting abilities that it has,
you can teach it a new vision task,
and it does things beyond the capabilities
that in theory, the datasets
provided in themselves, but because it leverages a lot of the language knowledge acquired from
Chinchilla, it actually has this few shot learning ability and these emerging abilities that we didn't
even measure once we were developing the model, but once developed, then as you play with the
interface, you can start seeing, wow, okay, yeah, it's cool.
We can upload, I think, one of the tweets talking about it on Twitter
was this image of Obama that is placing a weight
while someone is kind of weighing themselves,
and it's kind of a joke-style image.
And it's notable because I think,
Andrej Karpathy, a few years ago, said,
no computer vision system can understand
the subtlety of this joke in this image,
all the things that go on.
And so what we try to do, and it's very anecdotally,
I mean, this is not a proof that we solved this issue,
but it just shows that you can upload now this image
and start conversing with the model,
trying to make out if it gets that there's a joke, because the
person weighing themselves doesn't see that someone behind is making the weight higher
and so on and so forth.
So it's a fascinating capability.
And it comes from this key idea of modularity where we took a frozen brain and we just added
a new capability.
So the question is, should we, so in a way, you can see even from DeepMind,
we have Flamingo that is this modular approach
and thus could leverage the scale a bit more reasonably
because we didn't need to retrain a system from scratch.
And the other, on the other hand, we had Gato
which used the same data sets,
but then it trained it from scratch, right?
And so I guess big question for the community is, should we train
from scratch or should we embrace modularity? And this goes back to modularity
as a way to grow, but reuse seems natural, and it was certainly very effective.
The next question is, if you go the way of modularity, is there a systematic way of freezing weights
and joining different modalities
across, you know, not just two or three or four networks,
but hundreds of networks from all different kinds of places.
Maybe an open source network that looks at weather patterns
and you shove that in somehow.
And then you have networks that, I don't know,
do all kinds of things, play StarCraft
and play all the other video games.
And you can keep adding them in without significant effort,
like that's maybe the effort scales linearly or something like that.
As opposed to, like, the more networks you add, the more you have to worry
about the instabilities created.
Yeah. So that vision is beautiful.
I think there's still the question
about within single modalities, like Chinchilla was reused,
but now if we train the next iteration of language models,
are we gonna reuse Chinchilla or not?
Yeah, how do you swap out Chinchilla?
So there's still big questions, but that idea
is actually really akin to software engineering,
where we're not re-implementing
libraries from scratch.
We're reusing and then building ever more amazing things, including neural networks, with
software that we're reusing.
I think this idea of modularity, I like it.
I think it's here to stay.
That's also why I mentioned it's just the beginning, not the end.
You mentioned meta-learning.
Given this promise of Gato,
can we try to redefine this term?
That's almost akin to consciousness
because it means different things to different people
throughout the history of artificial intelligence.
But what do you think meta learning is
and looks like now in the five years, 10 years,
will it look like the system I got, but scaled?
What's your sense of, what is meta learning look like?
Do you think?
Great, with all the wisdom we've learned so far.
Yeah, great question. Maybe it's good to give another data point, looking backwards rather than forward. So when we talked in 2019, meta-learning meant something that has since changed, mostly through the revolution of GPT-3 and beyond. What meta-learning meant at the time was driven by what benchmarks people cared about in meta-learning, and the benchmarks were about the capability to learn object identities. So it was very much overfitted to vision and object classification. And the part that was meta about that was, oh, we're not just learning the 1,000 categories that ImageNet tells us to learn. We're going to learn object categories that can be defined when we interact with the model.
So it's interesting to see the evolution. The way this started was, we had a special "language", which was a dataset, a small dataset, that we prompted the model with, saying, hey, here is a new classification task. I'll give you one image and its label, which at the time was an integer, and a different image, and so on. So you have a small prompt in the form of a dataset, a machine learning dataset, and then you get a system that can predict or classify these objects that you just defined on the fly.
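As a minimal sketch of what such a prompt-as-dataset might look like: the names, the embedding encoding, and the two-class toy example below are purely illustrative assumptions, not the actual interface of any of the models discussed.

```python
# Hypothetical sketch of a few-shot classification "prompt" built as a tiny dataset,
# in the spirit of early meta-learning benchmarks. Everything here is illustrative.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    image_embedding: List[float]  # stand-in for an encoded image
    label: int                    # integer class id, defined on the fly

def build_prompt(support_set: List[Example],
                 query_embedding: List[float]) -> dict:
    """Pack a handful of labelled examples plus one unlabelled query:
    the structure the model is asked to classify from, with no weight updates."""
    support: List[Tuple[List[float], int]] = [
        (ex.image_embedding, ex.label) for ex in support_set
    ]
    return {"support": support, "query": query_embedding}

# Usage: two classes the model has never been told about, defined only by the prompt.
support = [Example([0.1, 0.9], 0), Example([0.8, 0.2], 1)]
prompt = build_prompt(support, query_embedding=[0.75, 0.25])
print(prompt)
```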
So fast forward, it was revealed that language models are few-shot learners.
That's the title of the paper.
So very good title.
Sometimes titles are really good.
So this one is really, really good.
Because that's the point of GPT-3: it showed that, look, sure, we can focus on object classification and what meta-learning means within the space of learning object categories, which goes back, or before ImageNet rather, to Omniglot and so on. So there are a few benchmarks. But now, all of a sudden, we got a bit unlocked from benchmarks.
And through language, we can define tasks, right? So we're literally telling the model some logical task or little thing that we want it to do. We prompt it much like we did before, but now we prompt it through natural language. And then, not perfectly, I mean, these models have failure modes and that's fine, but these models are now doing a new task, right? So they meta-learned this new capability.
Now, that's where we are now. Flamingo expanded this to vision and language, but it basically has the same abilities. You can teach it, for instance... an emergent property was that you can take pictures of numbers and then do arithmetic with the numbers just by teaching it: oh, when I show you three plus six, you know, I want you to output nine, and you show it a few examples, and now it does that. So it went way beyond the, oh, this ImageNet sort of categorization of images that we were maybe a bit stuck on before this revelation moment that happened in, I believe, 2019, but it was after we chatted, that in a way it solved meta-learning as it was previously defined.
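As a toy illustration of that kind of few-shot teaching, here is a hedged sketch of the prompt structure being described. The exact formatting is an assumption for illustration only; in Flamingo the "3" and "6" would be images interleaved with text rather than plain characters.

```python
# Hypothetical few-shot arithmetic prompt: a few worked examples, then a new
# query the model is expected to complete from context alone.
prompt = (
    "3 + 6 = 9\n"
    "2 + 5 = 7\n"
    "4 + 4 = 8\n"
    "7 + 1 ="
)
# A language (or vision-language) model conditioned on this prompt is expected
# to continue with " 8" -- no weight updates, just in-context examples.
print(prompt)
```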
Yes, it expanded what it meant. So that's why you say, what does it mean? It's an evolving term. But here is maybe, now looking forward, at what's happening in the community with more modalities, what we can expect. And I would certainly hope to see the following.
And this is a pretty drastic hope, but in five years, maybe we chat again, and we have a system, right, a set of weights, that we can teach to play StarCraft. Maybe not at the level of AlphaStar, but to play StarCraft, a complex game. We teach it through interaction, through prompting. You can certainly prompt a system, that's what Gato shows, to play some simple Atari games. So imagine if you start talking to a system, teaching it a new game, showing it examples of, you know, in this particular game, this user did something good; maybe the system can even play and ask you questions, say, hey, I just played this game, did I do well, can you teach me more? So in five, maybe ten years, these capabilities, or what meta-learning means, will be much more interactive, much richer, and across domains that we used to specialize in,
right? So you see the difference, right? We built AlphaStar specialized to play StarCraft. The algorithms were general, but the weights were specialized. And what we're hoping is that we can teach a network to play games, to play any game, just using games as an example, through interacting with it, teaching it, uploading the Wikipedia page of StarCraft. This is on the horizon, and obviously there are details that need to be filled in and research that needs to be done. But that's how I see meta-learning evolving, which is going to be beyond prompting. It's going to be a bit more interactive. The system might ask us to give it feedback after it maybe makes mistakes or loses a game.
But it's nonetheless very exciting, because if you think about it this way, the benchmarks are already there; we just repurpose the benchmarks, right? So in a way, I like to map the space of what AGI maybe means by saying, okay, we got to 101% performance in Go, in chess, in StarCraft. The next iteration might be 20% performance across, quote unquote, all tasks, right?
And even if it's not as good, it's fine.
We actually, we have ways to also measure progress
because we have those special agents,
specialized agents, and so on.
So this is to me very exciting.
And these next iteration models are definitely
hinting at that direction of progress,
which hopefully we can have.
There are obviously some things that could go wrong, in the sense that we might not have the tools; maybe transformers are not enough, maybe there are some breakthroughs still to come, which makes the field more exciting to people like me as well, of course. But if you ask me about five to ten years, you might see these models start to look more like sets of weights that are already trained, and then it's more about teaching them, or they meta-learn what you're trying to induce in terms of tasks and so on, well beyond the simple tasks we're starting to see emerge now, like, you know, small arithmetic tasks and so on.
So a few questions around that. This is fascinating. So that kind of teaching, interactive, sort of beyond prompting, interacting with the neural network, that's different than the training process. It's different than the optimization over differentiable functions. This thing is already trained, and now you're teaching it. I mean, it's almost akin to the brain: the neurons are already set with their connections, and on top of that, you're using that infrastructure to build up further knowledge. Okay, so that's a really interesting distinction that's actually not obvious from a software engineering perspective, that there's a line to be drawn. Because you always think that for a neural network to learn, it has to be trained and retrained, but maybe prompting is a way of teaching a neural network a little bit of context about whatever the heck you're trying to get it to do. So you can maybe expand this prompting capability by making it interact.
That's really, really good. By the way, this is not new, if you look way back at different ways to tackle even classification tasks. So this comes from long-standing literature in machine learning. What I'm suggesting could sound to some a bit like nearest neighbor. Nearest neighbor is almost the simplest algorithm, one that does not require learning. So it has this interesting property that you don't need to compute gradients. And what nearest neighbor does is: you, quote unquote, upload a dataset, and then all you need is a way to measure distance between points. And then, to classify a new point, you simply compute the closest point in this massive amount of data, and that's your answer.
So you can think of prompting, in a way, as uploading not just simple points; and the metric is not the distance between images or something simple, it's something you compute that's much more advanced. But in a way it's very similar: you're simply uploading some knowledge to this pre-trained system. In nearest neighbor, maybe the metric is learned or not, but you don't need to further train it, and now you immediately get a classifier out of this, right? So it's just an evolution of that concept, a very classical concept in machine learning, which is, yeah, just classifying by the closest point, closest by some distance, and that's it.
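For reference, here is a minimal sketch of that plain nearest-neighbor baseline, assuming simple Euclidean distance over toy points; in the prompting analogy the "points" and the metric would be far richer and possibly learned.

```python
# Minimal nearest-neighbour classifier: the "no gradients needed" baseline
# referred to above. Distances here are plain Euclidean over toy 2-D points.
import numpy as np

def nearest_neighbor_predict(X_train: np.ndarray, y_train: np.ndarray,
                             x_query: np.ndarray) -> int:
    """Classify x_query with the label of its closest training point."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    return int(y_train[np.argmin(distances)])

# Usage: "upload" a tiny dataset, then classify a new point with no training step.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 1, 1])
print(nearest_neighbor_predict(X, y, np.array([0.8, 0.9])))  # -> 1
```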
Yeah. It's an evolution of that.
And I will say, the way I saw meta-learning when we worked on a few ideas in 2016 was precisely through the lens of nearest neighbor, which is very common in the computer vision community, right? There's a very active area of research on how you compute the distance between two images, and if you have a good distance metric, you also have a good classifier, right?
All I'm saying is, now these distances and these points are not just images; they're words, or sequences of words and images and actions, that teach you something new. But it might be that, technique-wise, those ideas come back. And I will say it's not necessarily true that you'll never train the weights a bit further. Some techniques in meta-learning do actually do a bit of fine-tuning, as it's called: they train the weights a little bit when they get a new task.
So as for the how, how we're going to achieve this, as a deep learner I'm very skeptical; we're going to try a few things, whether it's a bit of training, adding a few parameters, thinking of these as nearest neighbor, or just simply thinking of it as a sequence of words, a prefix, and that's the new classifier. We'll see, right? That's the beauty of research. But what's important is that it is a good goal in itself, one I see as very worthwhile pursuing for the next stages of not only meta-learning; I think this is basically what's exciting about machine learning, period, to me.
Well, and then the interactive aspect
that is also very interesting.
Yes.
The interactive version of a nearest neighbor.
Yeah, to help you pull out the classifier from this giant thing.
Okay, is this the way we can go in five, ten plus years from any task, sorry, from many
tasks to any task?
So, what does that mean? What does it need to actually be trained on? At which point has the network had enough? What does a network need to learn about this world in order to be able to perform any task? Is it just as simple as language, image, and action, or do you need some set of representative images? Like, if you only see land images, will you know anything about underwater? Is that somehow fundamentally different?
I don't know. Those are open questions, I would say. I mean, the way you put it, let me maybe further your example, right? If all you see is land images, but you're reading all about land and water worlds, in books, right?
Imagine.
Yes.
Would that be enough?
Good question.
We don't know, but I guess maybe you can join us if you want in our quest to find this.
That's precisely what a world.
Yeah.
Yes.
That's precisely, I mean, the beauty of research, and that's the research business, I guess: to figure this out, ask the right questions, and then iterate with the whole community, publishing findings and so on. But yeah, this question, it's not the only question, but it's certainly, as you ask, on my mind constantly, right? And so we'll need to wait maybe, let's say, five years, let's hope it's not ten, to see what the answers are. Some people will largely believe in unsupervised or self-supervised learning of single modalities and then crossing them. Some people might think end-to-end learning is the answer; modularity is maybe the answer,
so we don't know, but we're just definitely excited to find out.
But it feels like this is the right time, and we're at the beginning of this.
Yes.
We're finally ready to do these kinds of general big models and agents. Is there some specific technical thing about Gato, Flamingo, Chinchilla, Gopher, any of these, that is especially beautiful, that was surprising? Maybe something that just jumps out at you? Of course, there's the general thing of, you didn't think it was possible and then you realize it's possible, in terms of the generalizability across modalities and all that kind of stuff, or maybe how small of a network, relatively speaking, Gato got away with, and all that kind of stuff. But are there some weird little things that were surprising?
Look, I'll give you an answer that's very important, because maybe people don't quite realize this, but the teams behind these efforts, the actual humans, that's maybe the surprising part, in an obviously positive way. Anytime you see these breakthroughs, it's easy to map them to a few people; there are people that are great at explaining things and so on. That's very nice. But maybe the learnings, or the meta-learnings, that I get as a human about this are: sure, we can move forward, but the surprising bit is how important all the pieces of these projects are and how they come together. So I'll give you maybe some of the ingredients of success that are common across these, but not the obvious machine learning ones. I can always also give you those.
But basically, engineering is critical. Very good engineering, because ultimately we're collecting datasets, right? So the engineering of the data, and then of deploying the models at scale onto some compute cluster, cannot be understated; that is a huge factor of success. And it's hard to believe that details matter so much. We would like to believe, and it's true, that there is more and more of a standard formula, as I was saying, this recipe that works for everything. But then when you zoom in on each of these projects, you realize the devil is indeed in the details. And the teams have to work together towards these goals.
So engineering of data, and obviously clusters and large scale, is very important. And then one thing that is often overlooked, although nowadays it is more and more clear, is benchmarking progress. We're talking here about multiple months of tens of researchers, and people trying to organize the research and so on, working together. And you don't know that you can get there. I mean, this is the beauty: if you're not taking the risk of trying to do something that feels impossible, you're not going to get there, but you need a way to measure progress. So the benchmarks that you build are critical.
I've seen this beautifully play out in many projects. Maybe the one where I've seen it most consistently, meaning we established the metric, actually the community did, and then we leveraged it massively, was AlphaFold. This is a project where the data and the metrics were all there, and all it took, and it's easier said than done, was an amazing team working not to find some incremental improvement and publish, which is one valid way to do research, but aiming very high and working literally for years to iterate over that process. And working for years with a team, I mean, it is tricky; that also happened to take place partly during a pandemic and so on. So I think my meta-learning from all this is that the teams are critical to the success.
And then, going to the machine learning side, the part that's surprising is: we like architectures, like neural networks, and I would say this was a very rapidly evolving field until the Transformer came. So attention might indeed be all you need, which is the title, also a good title, although only in hindsight is it good; I don't think at the time I thought this was a great title for a paper. But that architecture is proving that the dream of modeling sequences of any bytes, there's something there that will stick. And among these advances in architectures, in how neural networks are architected to do what they do, it's been hard to find one that has been so stable and has changed relatively little since it was invented five or so years ago. So that is a surprise that keeps recurring in other projects.
Try, on a philosophical or technical level, to introspect: what is the magic of attention? What is attention? There's attention in people who study cognition, so human attention; I think there are giant wars over what attention means, how it works in the human mind. And then there's the very simple view of what attention is in a neural network, from the days of Attention Is All You Need. But broadly, do you think there's a general principle that's really powerful here?
Yeah. So, a distinction between Transformers and LSTMs, which are what came before; and, you know, there was a transitional period where you could use both. In fact, when we talked about AlphaStar, we used Transformers and LSTMs. It was still the beginning of Transformers; they were very powerful, but LSTMs were also very powerful sequence models.
So the power of the Transformer is that it has built in what we call an inductive bias of attention. Think of a sequence of integers, right, like we discussed before; this is the sequence of words. When you have to do very hard tasks over these words, this could be, we're going to translate a whole paragraph, or we're going to predict the next paragraph given the ten paragraphs before. There's some loose intuition from how we do it as humans that is very nicely mimicked and replicated, structurally speaking, in the Transformer, which is this idea of you're looking for something, right? So when you've just read a piece of text and now you're thinking what comes next, you might want to re-look at the text, or look at it from scratch. I mean, the detail is that there's no recurrence; you're just thinking what comes next. And it's almost hypothesis-driven, right? So say I'm thinking the next word that I'll write is cat or dog, okay? The way the Transformer works, almost philosophically, is that it has these two hypotheses: is it going to be cat or is it going to be dog? And then it says, okay, if it's cat, I'm going to look for certain words. Not necessarily cat, although cat is an obvious word; you would look in the past to see whether it makes more sense to output cat or dog. And then it does some very deep computation over the words and beyond, right? So it combines the words, but it has the query, as we call it, that is cat. And then similarly for dog, right? And so it's a very computational way to think about it: look, if I'm thinking deeply about text, I need to go back to look at all the text, attend over it.
But it's not just attention; what is guiding the attention? And that was the key insight from an earlier paper: it's not just how far away something is. I mean, how far away it is, is important. What did I just write about? That's critical. But what you wrote about ten pages ago might also be critical.
So you're looking not positionally, but content-wise, right?
And you, transformers have this beautiful way
to query for certain content and pull it out
in a compressed way.
So then you can make a more informed decision.
I mean, that's one way to explain transformers.
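To make the query-for-content idea concrete, here is a minimal, hedged sketch of scaled dot-product attention in NumPy; the toy shapes and random vectors are illustrative assumptions, not any particular model's implementation.

```python
# Minimal sketch of content-based attention: a query ("am I about to write cat
# or dog?") is compared against all previous positions, and the matching content
# is pulled out as a compressed, weighted summary.
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Q: (n_q, d), K: (n_kv, d), V: (n_kv, d_v). Returns (n_q, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                   # match query vs. each position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                                        # content pulled out, compressed

# Usage with toy embeddings: one query attending over three past tokens.
rng = np.random.default_rng(0)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
Q = rng.normal(size=(1, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 4)
```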
But I think it's a very powerful inductive bias.
There might be some details that change over time, but I think that is what makes Transformers so much more powerful than the recurrent networks that came before, which were more recency-biased, and that obviously works in some tasks but has major flaws.
The Transformer itself has flaws, and I think the main one, the main challenge, is these prompts that we were just talking about: they can be a thousand words long. But if I'm teaching you StarCraft, I mean, I'll have to show you videos, I'll have to point you to whole Wikipedia articles about the game, we'll have to interact, probably, as you play and ask me questions. The context required for me to be a good teacher to you on the game, the way you would want to do it with a model, I think goes well beyond the current capabilities. So the question is, how do we benchmark this, and then how do we change the structure of the architectures? I think there are ideas on both sides, but we'll have to see empirically, right, what ends up working, obviously.
And as you talked about, some of the ideas could be, you know, keeping the constraint of that length in place, but then forming hierarchical representations, where you can start being much cleverer in how you use those thousand tokens. Yeah, that's really interesting. But it's also possible that this attention mechanism, where you basically don't have a recency bias but look more generally, you make it learnable: the mechanism by which you look back into the past, you make that learnable. It's also possible we're at the very beginning of that, because you might become smarter and smarter in the way you query the past, the recent past and the distant past and maybe the very, very distant past. So almost like the attention mechanism will have to improve and evolve, as will the tokenization mechanism, so you can represent long-term memories somehow.
Yes. And I mean, hierarchy is a very nice word; it sounds appealing. There's lots of work adding hierarchy to the memories. In practice, it does seem like we keep coming back to the main formula, or main architecture. That sometimes tells us something. There's a phrase a friend of mine told me, about whether an idea wants to work or not. The Transformer was clearly an idea that wanted to work. And then, I think there are some principles we believe will be needed, but finding the exact details, and details matter so much, right, that's going to be tricky.
I love the idea that there's you as a human being who wants some ideas to work, and then there's the model that wants some ideas to work, and you get to have a conversation to see which. More likely, the model will win in the end, because it's the one... you don't have to do any work, the model is the one that has to do the work, so you should listen to the model. And I really love this idea that you talked about, the humans in this picture.
If I could just briefly ask: one thing you're saying is the benchmarks matter, the other is the humans working on this. The benchmarks provide a sturdy ground over which to do these things that seem impossible; in the darkest of times, they give you hope, because you get little signs of improvement. You're somehow not lost if you have metrics to measure your improvement. And then there's the other aspect you've said elsewhere, and here today, like titles matter. I wonder how much humans matter in the evolution of all of this, meaning individual humans. You know, something about their interaction, something about their ideas, how much they change the direction of all of this. Like, if you change the humans in this picture, is it that the model is sitting there and it wants some idea to work, or is it the humans? Or maybe the model is providing you 20 ideas that could work, and depending on the humans you pick, they're going to be able to hear some of those ideas. Because you're now directing all of deep learning at DeepMind, you get to interact with a lot of projects, a lot of brilliant researchers. How much variability is created by the humans in all of this?
Yeah, I mean, I do believe humans matter a lot at the very least at the,
you know, time scale of years on when things are happening and what's the
sequencing of it, right?
So you get to interact with people, and I mean, you mentioned this, some people really want some idea to work and they'll persist, and then some other people might be more practical, like, I don't care what idea works, I care about, you know, cracking protein folding. And these two at least seem like opposite sides. We need both. And we've clearly had both historically, and that made certain things happen earlier or later. So definitely the humans involved in all of these endeavors have had, I would say, years of influence on the ordering of how things have happened, which breakthroughs came before which other breakthroughs, and so on. So certainly that does happen. And one other axis of distinction is what I'd call, and this is most commonly used in reinforcement learning, the exploration-exploitation trade-off as well. It's not exactly what I meant, although quite related. So when you start trying to help others, like you become a bit more of a mentor to a large group of people, be it a project or the deep learning team or something, or even in the community when you interact with people at conferences and so on, you're identifying quickly, right, some things that are explorative or exploitative, and it's tempting to try to guide people. Obviously, I mean, that's where our experience comes in. We bring it and we try to shape things, sometimes wrongly. And there are many times that I've been wrong in the past. That's great.
But it would be wrong to dismiss any of the research styles that I'm observing. And I often get asked, well, you're in industry, right? So we do have access to large compute scale and so on. So there are certain kinds of research I almost feel we need to do, responsibly and so on, because it is kind of, we have the particle accelerator here, so to speak, as in physics. So we need to use it; we need to answer the questions that we should be answering right now for scientific progress.
But then, at the same time, I look at many advances, including attention, which was discovered in Montreal initially because of a lack of compute. We were working on sequence-to-sequence with my friends over at Google Brain at the time, and we were using, I think, 8 GPUs, which was somehow a lot at the time. And I think Montreal was a bit more limited in scale, but then they discovered this content-based attention concept that then obviously triggered things like the Transformer. Not everything obviously starts with the Transformer; there's always a history that is important to recognize, because then you can make sure that those who might feel, well, we don't have so much compute, you can help them optimize the kind of research that might actually produce amazing change. Perhaps it's not as short-term as some of these advancements, or perhaps it's on a different time scale, but the people and the diversity of the field are quite critical, and we should maintain that.
that we maintain it. And at times, especially mixed with hype or other things, it's a bit tricky to be observing
maybe too much of the same thinking across the board. But the humans definitely are
critical and I can think of quite a few personal examples where also someone told
me something that had a huge, you know, huge effect on, on to some idea and then
that's why I'm saying at least at the
time in terms of ears, probably something is do happen.
Yes, it's a lot different. Yeah.
And it's also fascinating how constraints somehow are central for innovation. And the other thing you mentioned about engineering: I have a sneaking suspicion, maybe because my love is with engineering, that a large percentage of the genius is in the tiny details of engineering. We like to think the genius is in the big ideas, but I have a sneaking suspicion, because I've seen the details of engineering make the night-and-day difference, and I wonder if those kinds of things have a ripple effect over time.
So that, too. That's sort of taking the engineering perspective that sometimes quiet innovation at the level of an individual engineer, or maybe at the small scale of a few engineers, can make all the difference. That scales, because we're working on computers that are scaled across large groups, so one engineering decision can lead to ripple effects.
Yes, it's interesting to think about. Yeah, I mean, with engineering there's also kind of a historical aspect; it might be a bit random, because if you think of the history of how deep learning and neural networks in particular took off, it feels a bit random, because GPUs happened to be there at the right time for a different purpose, which was to play video games. So even the engineering that goes into the hardware has its own time frame, which might be very different. I mean, the GPUs evolved throughout many years when we weren't even looking at that, right? So even at that level, that revolution, so to speak, the ripples are, like, we'll see when they stop, right? But in terms of thinking about why this is happening, when I try to categorize it into things that might not be so obvious: clearly, there's a hardware revolution, and we are surfing thanks to that. Data centers as well. I mean, at Google, for instance, obviously the data centers are serving Google, but also now, thanks to that, and to having built such amazing data centers, we can train these models.
Software is an important one. If I look at the state of how I had to implement things to implement my ideas, how I discarded ideas because they were too hard to implement, clearly the times have changed, and thankfully we are in a much better software position as well. And then, obviously, there's the research that happens at scale, and more people entering the field, which is great to see, but it's almost enabled by these other things.
And last but not least, there's also data, right? Curating datasets, labeling datasets, these benchmarks we think about. Maybe we'll want to have all the benchmarks in one system, but it's still very valuable that someone put the thought and the time and the vision into building certain benchmarks. We've seen progress thanks to them, and we're going to repurpose the benchmarks. That's the beauty of Atari: we solved it in a way, but we used it in Gato, and it was critical. And I'm sure there's still a lot more to do thanks to that amazing benchmark that someone took the time to put together, even though at the time maybe all you were supposed to think about was the next iteration of architectures, because that's what the field perhaps recognizes. But we need balance in terms of the humans behind this.
We need to recognize all these aspects because they're all critical.
And we tend to think of the genius, the scientists and so on, but I'm glad you have a strong
engineering background.
But also, I'm a lover of data.
It's a pushback on the engineering comment: ultimately, it could be the creators of benchmarks who have the most impact. Andrej Karpathy, who you mentioned, has recently been talking a lot of trash about ImageNet, which he has the right to do because of how essential he has been to the development and the success of deep learning around ImageNet. And he's saying that that benchmark is actually holding back the field, because, especially in his context of Tesla Autopilot, that's looking at real-world behavior of a system. There's something fundamentally missing about ImageNet that doesn't capture the real-worldness of things; we need to have datasets and benchmarks that have the unpredictability, the edge cases, whatever the heck it is that makes the real world so difficult to operate in. We need to have benchmarks with that. But just think about the impact of ImageNet as a benchmark. That really puts a lot of emphasis on the importance of a benchmark, both internally at DeepMind and as a community. So one side is coming from within: how do I create a benchmark for me to measure and make progress? And how do I make a benchmark for the community to measure and push progress? You have this amazing paper you co-authored, a survey paper called Emergent Abilities of Large Language Models. It has, again, a philosophy here that I'd love to ask you about. What's the intuition about the phenomenon of emergence in neural networks, Transformers, language models? Is there a magic threshold beyond which we start to see certain performance? And is that different from task to task? Is that us humans just being poetic and romantic? Or is there literally some level at which
we start to see breakthrough performance?
Yeah, I mean, this is a property that we start seeing in certain kinds of systems. In machine learning, traditionally, again going back to benchmarks, if you have some input-output task, like just a single input and a single output, then generally, when you train these systems, you see reasonably smooth curves when you analyze how the dataset size affects the performance, or how the model size affects the performance, or how long you train the system for affects the performance, right? So if we think of ImageNet, the training curves look fairly smooth and predictable in a way. And I would say that's probably because it's kind of a one-hop reasoning task. It's like, here is an input, and you think for a few milliseconds, 100 milliseconds, 300 as a human, and then you tell me, yeah, there's an alpaca in this image.
In language, we are seeing benchmarks that require more pondering, more thought, in a way. You kind of need to look for some subtleties. It involves inputs that you might need to think about; even if the input is a sentence describing a mathematical problem, there is a bit more processing required as a human, more introspection.
So I think the way these benchmarks work means that there actually is a threshold. Going back to how Transformers work, this way of querying for the right questions to get the right answers might mean that performance stays random until the right question is asked by the querying system of a Transformer, or of a language model like a Transformer, and only then might you start seeing performance going from random to non-random. And this is more empirical; there's no formalism or theory behind it yet, although it might be quite important. But we're seeing these phase transitions: random performance until some, let's say, scale of the model,
And then it goes beyond that.
And it might be that you need to fit a few low-order bits of thought before you can make progress on the whole task. And if you could actually measure those breakdowns of the task, maybe you would see something smoother: once you get this and this and this, then you start making progress on the task. But it's somewhat annoying, because it means that certain questions we might ask about architectures can possibly only be answered at a certain scale.
Conversely, one thing I've seen great progress on in the last couple of years is this notion of a science of deep learning, and a science of scale in particular. So the negative is that there are some benchmarks for which progress might need to be measured at a minimum scale before you can see which details of the model matter for making performance better. So that's a bit of a con.
But what we've also seen is that you can empirically analyze the behavior of models at smaller scales. So, to give an example, we had this Chinchilla paper that revised the so-called scaling laws of models. And that whole study was done at a reasonably small scale, maybe hundreds of millions up to 1 billion parameters. And then the cool thing is that you derive some laws, some trends. You extract trends from the data you see: okay, it looks like the amount of data required to train a 10x larger model would be this. And these laws, these extrapolations, have so far helped save compute and just get to a better place in terms of the science of how we should run these models at scale, how much data, how much depth, and all sorts of questions we start asking by extrapolating from small scale. But then emergence, sadly, means that not everything can be extrapolated from small scale, depending on the benchmark, and maybe the harder benchmarks are not so good for extracting these laws. But we have a variety of benchmarks at least.
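As a hedged sketch of that "extrapolate from small scale" idea, here is a toy fit of a simple power law to (model size, loss) pairs, then a prediction for a much larger run. The functional form is a common simplification, and all the numbers are made up for illustration; the actual Chinchilla analysis jointly considers model size, data, and compute.

```python
# Toy scaling-law extrapolation: fit loss(N) ~ a * N^(-b) + c on small runs,
# then predict a 10x larger model. All data below is illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

# Hypothetical (model size in units of 1e8 parameters, validation loss) pairs.
sizes = np.array([1.0, 2.0, 4.0, 8.0])
losses = np.array([3.10, 2.95, 2.82, 2.71])

params, _ = curve_fit(power_law, sizes, losses, p0=[1.0, 0.3, 2.0])
print("predicted loss at 80 (i.e. ~8e9 params):", power_law(80.0, *params))
```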
So I wonder to what degree the threshold, the phase-shift scale, is a function of the benchmark. Some of the science of scale might be about engineering benchmarks where that threshold is low: sort of taking a main benchmark and reducing it somehow, so that the essential difficulty is left but the scale at which the emergence happens is lower. Just for the science aspect of it, versus the actual real-world aspect.
Yeah, so luckily we have quite a few benchmarks, some of which are simpler, or maybe, I think people might call these system-one versus system-two styles. So what we're now seeing, luckily, is that extrapolations from maybe slightly smoother or simpler benchmarks are translating to the harder ones. But that is not to say that this extrapolation won't hit its limits. And when it does, then how much we scale, or how we scale, will sadly be a bit suboptimal until we find better laws. And these laws, again, are very empirical laws; they're not like physical laws of models, although I wish there were better theory about these things as well. But so far, I would say, empirical theory, as I call it, is way ahead of actual theory in machine learning.
Let me ask you, almost for fun. So this is not Oriol as a DeepMind person or anything to do with DeepMind or Google, just as a human being, looking at this news of a Google engineer who claimed that, I guess, the LaMDA language model was sentient. I still need to look into the details of this, but he sort of made an official report and a claim that he believes there's evidence that this system has achieved sentience. And I think this is a really interesting case, on a human level and a psychological level, on a technical machine learning level of how language models transform our world, and also on a philosophical level of the role of AI systems in a human world. What did you find interesting? What's your take on all of this, as a machine learning engineer and researcher, and also as a human being?
Yeah, I mean, a few reactions, quite a few actually.
Have you ever briefly thought, is this thing sentient?
Right. So never, absolutely not.
Like even with AlphaStar? Wait a minute... What?
No, sadly though, I think, yeah, sadly, I have not. Yeah, I think any of the current models, although very useful and very good, yeah, I think we're quite far from that.
And there's kind of a converse side story. So one of my passions is science in general, and I feel I'm a bit of a failed scientist; that's why I came to machine learning, because you always feel, and you start seeing this, that machine learning is maybe the science that can help other sciences, as we've seen, right? It's such a powerful tool. So thanks to that angle: okay, I love science, I love astronomy, I love biology, but I'm not an expert, and I decided, well, the thing I can do better at is these computers. But especially when I was a bit more involved in AlphaFold, learning a bit about proteins and about biology and about life, the complexity... it feels like it really is, I mean, if you start looking at the things that are going on at the atomic level.
And also, I mean, there's obviously that we are maybe
inclined to try to think of neural networks as like the brain,
but the complexity is, and the amount of magic
that it feels when, I mean, I'm not an expert,
so it naturally feels more magic,
but looking at biological systems,
as opposed to these computer computational brains,
just makes me like, wow,
there's such level of complexity difference still, right?
Like orders of magnitude complexity that,
sure, these weights, I mean, we train them
and they do nice things, but they're not at the level
of biological entities, brains,
cells. It just feels like it's not possible to achieve the same level of complexity of behavior, and my belief when I talk about other beings is certainly shaped by this amazement at biology, an amazement that, maybe because I know too much, I don't have about machine learning. But I certainly feel it's very far-fetched and far in the future to be calling, or to be thinking, well, this mathematical function that is differentiable is in fact sentient and so on.
So there's something on that point,
it's very interesting.
So you know enough about machines and enough about biology to know that
there's many orders of magnitude of difference and complexity.
But you know how machine learning works.
So the interesting question is about human beings who are interacting with a system and don't know about the underlying complexity.
And I've seen people, and probably including myself, that are falling in love with things
that are quite simple.
Yeah.
And so maybe the complexity is one part of the picture, but maybe that's not a necessary
condition for sentience, for perception or emulation of sentience.
Right. So, I mean, I guess the other side of this is that that's how I feel personally. I mean, you asked me about the person, right? Now, it's very interesting to see how other humans feel about things, right? Again, I'm not as amazed by this, I feel it's not as magical as these other things, maybe because of how I got to learn about it, and because I see the curve as a bit more smooth, since I've just seen the progress of language models since Shannon in the 50s. And actually, looking at that time scale, we're not making that fast progress, right? I mean, what we were thinking at the time, like almost a hundred years ago, is not that dissimilar to what we're doing now.
But at the same time, yeah, obviously others don't have my experience, right, that personal experience, and I think no one should tell others how they should feel. I mean, feelings are very personal, right? So how others might feel about the models and so on, that's one part of the story that is important for me to understand personally as a researcher. And then, when I maybe disagree, or I don't understand, or I see that, yeah, maybe this is not something I think is reasonable right now, knowing all that I know, one of the other things, and perhaps partly why it's great to be talking to you and reaching out to the world about machine learning, is, hey, let's demystify a bit of the magic and try to see a bit more of the math, and the fact that, literally, to create these models, if we had the right software, it would be ten lines of code and then just a dump of the internet.
That's versus the complexity of the creation of humans from their inception, right? And also the complexity of the evolution of the whole universe to where we are, which feels orders of magnitude more complex and fascinating to me. So maybe part of what I'm trying to tell you is, yeah, I think it's about explaining a bit of the magic; there is a bit of magic. It's good to be in love, obviously, with what you do at work, and I'm certainly fascinated and surprised quite often as well. But hopefully experts in biology will tell me this is not as magical as I think, and I'm happy to learn that. Through interactions with the larger community, we can also have a certain level of education that in practice will also matter, because one question is how you feel about this, but the other, very important one, is that you're starting to interact with these things in products and so on. It's good to understand a bit of what's going on, what's not going on, what's safe, what's not safe, and so on.
Otherwise the technology will not be used properly for good, which is obviously the goal of all of us, I hope. So let me then ask the next question, do you think in order to solve intelligence
or to replace the Lex bot that does interviews,
as we started this conversation with, do you think the system needs to be sentient?
Do you think it needs to achieve something like consciousness?
And do you think about what consciousness is in the human
mind that could be instructive for creating AI systems?
Yeah, honestly, I think probably not. To the degree of intelligence where there's this brain that can learn, can be extremely useful, can challenge you, can teach you, and conversely you can teach it to do things, I'm not sure it's necessary, personally speaking,
but if consciousness or any other biological or evolutionary lesson can be
repurposed to then influence our next set of algorithms.
That is a great way to actually make progress, right?
And the same way I tried to explain transformers a bit how it feels we operate when we look at text specifically.
These insights are very important, right?
So there's a distinction between details of how the brain might be doing computation.
I think my understanding is, sure, there's neurons and there's some resemblance to neural
networks, but we don't quite understand enough of the brain in detail, right, to be able
to replicate it.
But then more, if you zoom out a bit, how we then, our thought process, how memory works,
maybe even how evolution got us here,
what's exploration, exploitation,
like all these things happen.
I think this clearly can inform algorithmic level research.
And I've seen some examples of this being quite useful to then guide the research, even if it might be for the wrong reasons.
So I think biology and what we know about ourselves can help a whole lot to build essentially what we call AGI, this general, the real Gato, the last step of the chain, hopefully. But consciousness in particular, I myself at least don't think too hard about how to add that to the system.
But maybe my understanding is also very personal about what it means.
I think even that in itself is a long debate that I know people have often, and maybe I should learn more about this.
Yeah, and I personally, I noticed the magic often on a personal level, especially with
physical systems like robots.
I have a lot of, uh, legged robots now in Austin that I play with.
And even when you program them, when they do things you didn't expect, there's an immediate anthropomorphization, and you notice the magic, and you start to think about things like sentience, which has more to do with effective communication than with any of these kinds of dramatic things. It seems like a useful part of communication. Having the perception of consciousness seems useful for us humans. We treat each other more seriously. We are able to do a nearest neighbor,
shoving of that entity into your memory correctly,
all that kind of stuff.
It seems useful, at least to fake it,
even if you never make it.
Maybe, yeah, mirroring the question, and since you talk to a few people: you do think that we'll need to figure something out in order to achieve intelligence in a grander sense of the word?
Yeah, I personally believe yes, but I don't even think it'll be like a separate island we'll have to travel to. I think it'll emerge quite naturally.
Okay. That's easier for us, then.
Thank you.
But the reason I think it's important to think about this is that you will start, I believe, like with this Google engineer, you'll start seeing this a lot more, especially when you have AI systems that are actually interacting with human beings who don't have an engineering background. And we have to prepare for that because there'll be,
I do believe there will be a civil rights movement
for robots as silly as it is to say. There's going to be a large number of people that
realize there's these intelligent entities with whom I have a deep relationship and I don't
want to lose them. They've come to be a part of my life and they mean a lot. They have
a name, they have a story, they have a memory, and we start to ask questions about ourselves.
Well, what this thing sure seems like it's capable of suffering because it tells all these
stories of suffering.
It doesn't want to die and all those kinds of things.
And we have to start to ask ourselves questions.
What is the difference between a human being and this thing?
And when you engineer, I believe, from an engineering perspective, at a place like DeepMind or anybody that builds systems, there might be laws in the future where you're not allowed to engineer systems that display sentience, unless they're explicitly designed to be that, unless it's a pet. So if you have a system that's just doing customer support, you're legally not allowed to display sentience. We'll start to ask ourselves that question, and that's going to be part of the software engineering process: which features do we have, and one of them is communication of sentience. But it's important to start thinking about that stuff, especially given how much it captivates public attention.
Yeah, absolutely. It's definitely a topic that is important to think about. And in a way, I mean, not every movie is equally on point with certain things, but certainly science fiction, in this sense at least, has prepared society to start thinking about certain topics that, even if it's too early to talk about them, as long as we are reasonable, will certainly prepare us for both the research to come and for the many important challenges and topics that come with building an intelligent system, many of which you just mentioned. So I think we're never going to be fully ready unless we talk about these. And we also start, as I said, expanding the people we talk to, not including only our own researchers and so on.
In fact, places like DeepMind, but elsewhere, there's more interdisciplinary groups forming
up to start asking and really working with us on these questions.
Because obviously, this is not initially what your passion is when you do your PhD, but
certainly, it is coming, right?
So it's fascinating kind of.
It's the thing that brings me to one of my passions that is learning.
So in this sense, this is kind of a new area that,
as a learning system myself, I want to keep exploring.
And I think it's great to see parts of the debate, and I even see a level of maturity in the conferences that deal with AI; if you look from five years ago to now, just the number of workshops and so on has changed so much. It's impressive to see how much topics of safety, ethics, and so on come to the surface, which is great. And if we're too early, clearly that's fine. I mean, it's a big field and there are lots of people with lots of interests that will make progress. And obviously, I don't believe we're too late. So in that sense, I think it's great that we're doing this already. It's better to be too early than too late when it comes to superintelligent systems.
Let me ask, speaking of sentient AIs: you gave props to your friend Ilya Sutskever for being elected a Fellow of the Royal Society. So just as a shout-out to a fellow researcher and a friend, what's the secret to the genius of Ilya Sutskever? And also, do you believe that his tweets, as you hypothesized, and Andrej Karpathy did as well, are generated by a language model?
Yeah, yeah.
So, I'm going to be seeing Ilya in a few weeks, actually, so I'll ask him in person.
But will he tell you the truth?
Yes, of course. Hopefully. I mean, ultimately, we all have shared paths, and there are friendships that obviously go beyond institutions and so on. So I hope he tells me the truth.
Well, maybe the AI system is holding him hostage somehow; maybe it has some videos of him he doesn't want released. So maybe it has taken control over him, so he can't.
Well, if I see him in person, then he will. Yeah. But I think Ilya's personality, just knowing him for a while... everyone on Twitter, I guess, gets a different persona, and I think Ilya's does not surprise me, right? So, knowing Ilya from before social media
and before AI was so prevalent,
I recognize a lot of his character.
So that's something for me that I feel good
about a friend that hasn't changed
or like is still true to himself, right?
Obviously, there is the fact that your field becomes more popular, and he is obviously one of the main figures in the field, having done a lot of the advancement. So I think the tricky bit here is how to balance your true self with the responsibility that your words carry. In this sense, I think, yeah, I appreciate the style and I understand it, but it created debates around some of his tweets, and maybe it's good we have them early anyway, right? But yeah, the reactions are usually polarizing. I think we're just seeing the reality of social media being there as well, reflected on that particular topic or set of topics he's tweeting about.
Yeah, I mean, it's funny that he's speaking to this tension.
He was one of the early seminal figures in the field of deep learning,
and so there's a responsibility with that.
But also, from having interacted with him quite a bit, he's just a brilliant thinker about ideas, as are you, and there's a tension between becoming a manager versus actually thinking through very novel ideas. Yeah, the scientist versus the manager, and he's one of the great scientists of our time. It's quite interesting. And also, people tell me, quite silly, which I haven't quite detected yet. But in private, we'll have to see about that.
Yeah. I mean, just on that point, Ilya has been an inspiration. Of the quite a few colleagues I can think of who shaped the person I am, Ilya certainly gets probably the top spot, if not close to the top.
And if we go back to the question about people in the field,
like how the role would have changed the field or not,
I think Ilya's case is interesting
because he really has a deep belief
in the scaling up of neural networks.
There was a talk, still famous to this day, from the sequence-to-sequence paper, where he was just claiming: just give me supervised data and a large neural network, and then, you know, you'll solve basically all the problems, right? That vision was already there many years ago. So it's good to see someone who is, in this case,
very deeply into this style of research
and clearly has had a tremendous track record
of successes and so on.
The funny bit about that talk is that
we rehearsed the talk in a hotel room before
and the original version of that talk
would have been even more controversial.
So maybe I'm the only person that
has seen the unfiltered version of the talk.
And maybe when the time comes, we should revisit some of the skipped slides from that talk of Ilya's. But I really think the deep belief in a certain style of research pays off, right? It's good to be practical sometimes, and I actually think Ilya and myself are practical, but it's also good to have some sort of long-term belief and trajectory. Obviously, there's a bit of a balance between both, but it might be that that's the right path, and then you're clearly ahead and hugely influential to the field, as he has been.
Do you agree with the intuition, maybe as written about by Rich Sutton in The Bitter Lesson, that the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective? Do you think that intuition is ultimately correct: general methods that leverage computation, allowing the scaling of computation to do a lot of the work? And so the basic task of us humans is to design methods that are more and more general, versus more and more specific to the tasks at hand.
I certainly think this essentially mimics
a bit of the deep learning research philosophy,
that on the one hand, we want to be data agnostic.
We don't want to pre-process datasets;
we want to see the bytes, the true data as it is, and then learn everything on top.
So I very much agree with that.
And I think scaling up feels, at the very least, necessary for building incredibly
complex systems.
It's possibly not sufficient,
in that we might still need a couple of breakthroughs.
I think Rich Sutton mentioned search
being part of the equation: scale and search.
Search, in my experience, has been more mixed.
So from that lesson in particular,
search is a bit more tricky,
because it is very appealing
to search in domains like Go, where you have a clear
reward function with which you can then discard some search traces.
But then in some other tasks, it's not very clear
how you would do that.
Although recently, one of our works, AlphaCode,
which was largely a continuation of AlphaStar, and even the team and
the people involved intersected quite a bit, was a case where we actually saw
the bitter lesson: scaling the models and then a massive amount of search yielded this kind of
very interesting result of human-level performance in competitive programming.
So I've seen examples of it literally mapping
to search and scale.
I'm not so convinced about the search bit,
but I'm certainly convinced scale will be needed.
So we need general methods.
We need to test them, and maybe we need to make sure
that we can scale them given the hardware
that we have in practice. But then maybe we should also shape
what the hardware looks like based on which methods might be needed to scale. And that's an interesting
contrast with this GPU comment: we got GPUs almost for free because games were using them,
but maybe now, if sparsity is required, we don't have the hardware. Although in theory, many people
are building different kinds of hardware these days, there's a bit of this notion
of a hardware lottery for scale that might actually have an impact, at least on the
scale of years, on how fast we'll make progress toward maybe a version of neural nets, or whatever
comes next, that might enable truly intelligent agents.
Do you think, in your lifetime, we will build an AGI system that undeniably
achieves human-level intelligence and goes far beyond?
I definitely think it's possible
that it will go far beyond, but I'm definitely convinced that it will reach
human-level intelligence.
And I'm hypothesizing about the beyond
because the beyond bit is a bit tricky
to define, especially when we look at the current formula
of starting from this imitation learning standpoint.
We can certainly imitate humans at language and beyond,
so getting to human level through imitation feels very possible.
Going beyond will require reinforcement learning and other things.
And I think in some areas that has certainly already paid off,
Go being an example, my favorite so far,
in terms of going beyond human capabilities.
But in general, I'm not sure we can define reward functions
that, from a seed of imitating human-level intelligence
that is general, take us beyond.
That bit is not so clear within my lifetime,
but certainly human level, yes.
And I mean, that in itself is already quite powerful, I think.
As for going beyond, obviously we're not going to not try.
If we do, then we get to superhuman scientists
and discovery and advancing the world. But
at least human level, in general, is also very, very powerful.
Well, especially if human level or slightly beyond is integrated deeply with human society
and there are billions of agents like that, do you think there's a singularity moment beyond
which our world will be just very deeply
transformed by these kinds of systems?
Because now you're talking about intelligent systems that are just, I mean, this is no
longer just going from horse and buggy to the car.
It feels like a very different kind of shift
in what it means to be a living entity on Earth.
Are you afraid or are you excited about this world?
I'm afraid if there are a lot more of them.
So I think maybe, if we truly get there,
we'll need to think about limited resources.
Humanity clearly hits some limits, and then there's some balance,
hopefully, that the planet is imposing biologically, and we should actually try to get better
at this, because as we know there are quite a few issues with having too many people coexisting in
a resource-limited way.
So for digital entities it's an interesting question. I think such a limit
maybe should exist, but maybe it's going to be imposed by energy availability, because
these also consume energy. In fact, most of these systems are less efficient than we are in terms of
the energy required. But definitely, I think, as a society, we'll need to work together to find what would
be reasonable in terms of growth, or how we coexist, if that is to happen.
I am very excited about the aspects of automation that give people
who don't have access to certain resources or knowledge
that access. I think those are the applications that, in a way, I'm most excited to see
and to personally work towards.
Yeah, there's going to be significant improvements in productivity and quality of life across the whole population, which is very interesting. But I'm looking even further beyond, to
us becoming a multi-planetary species.
And just as a quick bet, last question.
Do you think, as humans become a multi-planetary species and go outside our solar system, all that
kind of stuff, do you think there will be more humans or more robots in that future world?
Will humans be the quirky intelligent beings of the past, or is there something deeply
fundamental to human intelligence that's truly special, so that we will be part of those other planets,
not just AI systems? I think we're all excited to build AGI to empower us, or make us more powerful, as a human species.
That's not to say there might not be some hybridization.
I mean, this is obviously speculation, but there are companies also trying to work on that, the same
way medicine is making us better.
Maybe there are other things yet to happen there.
But if the ratio is not at most one to one,
I would not be happy. So I would hope that we are part of the equation. Maybe
a one-to-one ratio feels possible, constructive, and so on. But it would not
be good to have an imbalance, at least from my core beliefs
and why I'm doing what I'm doing
when I go to work and research what I research.
Well, this is how I know you're human,
and this is how you've passed the Turing test.
And you are one of the special humans,
and it's a huge honor that you would talk with me,
and I hope we get the chance to speak again,
maybe once before the singularity, once after and see how our view of the world changes. Thank you again
for talking today. Thank you for the amazing work you do here. You're a shining example
for research, and as a human being, in this community. Thanks a lot. Yeah, looking
forward to before the singularity, certainly. And maybe after. Thanks for listening
to this conversation with Oriol Vinyals.
To support this podcast, please check out our sponsors in the description.
And now, let me leave you with some words from Alan Turing.
Those who can imagine anything can create the impossible.
Thank you for listening and hope to see you next time.