Lex Fridman Podcast - #258 – Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning
Episode Date: January 23, 2022
Yann LeCun is the Chief AI Scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal researchers in the history of machine learning. Please support this podcast by checking out our sponsors: - Public Goods: https://publicgoods.com/lex and use code LEX to get $15 off - Indeed: https://indeed.com/lex to get $75 credit - ROKA: https://roka.com/ and use code LEX to get 20% off your first order - NetSuite: http://netsuite.com/lex to get free product tour - Magic Spoon: https://magicspoon.com/lex and use code LEX to get $5 off
EPISODE LINKS: Yann's Twitter: https://twitter.com/ylecun Yann's Facebook: https://www.facebook.com/yann.lecun Yann's Website: http://yann.lecun.com/ Books and resources mentioned: Self-supervised learning (article): https://bit.ly/3Aau1DQ
PODCAST INFO: Podcast website: https://lexfridman.com/podcast Apple Podcasts: https://apple.co/2lwqZIr Spotify: https://spoti.fi/2nEwCF8 RSS: https://lexfridman.com/feed/podcast/ YouTube Full Episodes: https://youtube.com/lexfridman YouTube Clips: https://youtube.com/lexclips
SUPPORT & CONNECT: - Check out the sponsors above, it's the best way to support this podcast - Support on Patreon: https://www.patreon.com/lexfridman - Twitter: https://twitter.com/lexfridman - Instagram: https://www.instagram.com/lexfridman - LinkedIn: https://www.linkedin.com/in/lexfridman - Facebook: https://www.facebook.com/lexfridman - Medium: https://medium.com/@lexfridman
OUTLINE: Here's the timestamps for the episode. On some podcast players you should be able to click the timestamp to jump to that time. (00:00) - Introduction (06:58) - Self-supervised learning (17:17) - Vision vs language (23:08) - Statistics (28:55) - Three challenges of machine learning (34:45) - Chess (42:47) - Animals and intelligence (52:31) - Data augmentation (1:13:51) - Multimodal learning (1:25:40) - Consciousness (1:30:25) - Intrinsic vs learned ideas (1:34:37) - Fear of death (1:42:29) - Artificial Intelligence (1:56:18) - Facebook AI Research (2:12:56) - NeurIPS (2:29:08) - Complexity (2:37:33) - Music (2:42:28) - Advice for young people
Transcript
The following is a conversation with Yann LeCun, his second time on the podcast.
He is the Chief AI Scientist at Meta, formerly Facebook, professor at NYU, Turing Award winner, one of the seminal figures in the history of machine learning and artificial intelligence,
and someone who is brilliant and opinionated in the best kind of way, and so is always
fun to talk to.
And now a quick few second mention of each sponsor. Check them out in the description.
It's the best way to support this podcast. First is Public Goods, an online shop I use for household products. Second is Indeed, a hiring website. Third is Roka, my favorite sunglasses and prescription glasses.
Fourth is NetSuite, business software for managing HR, financials, and other details.
And fifth is Magic Spoon, low carb, keto-friendly cereal.
So the choice is, business, health, or style.
Choose wisely, my friends.
And now onto the full ad reads.
As always, no ads in the middle.
I try to make this interesting, but if you skip them, please still check out our sponsors.
I enjoy their stuff, maybe you will too.
This show is brought to you by Public Goods, the one stop shop for affordable, sustainable,
healthy household products.
I use their hand soap, toothpaste, and toothbrush. I think I use much of their other stuff too, but that's what comes to mind. Their products often have
this minimalist black and white design that I just absolutely find beautiful. I
love it. I love minimalism in design. It doesn't go over the top. It doesn't
have all these extra things and features that you don't need just these essentials.
I think it's hard to explain, but there's something about the absence of things that can
take up your attention that allows you to truly be attentive to what matters in life.
Anyway, go to publicgoods.com slash Lex or use code LEX at checkout to get $15 off your
first order, plus you will receive your choice of either a free pack of bamboo straws or
reusable food storage wraps.
Visit publicgoods.com slash Lex or use code LEX at checkout.
This show is also brought to you by Indeed, a hiring website.
I've used them as part of many hiring efforts
I've done for the teams I've led in the past.
Most of those were for engineering, for research efforts. They have tools like Indeed Instant Match, giving you quality candidates whose resumes on Indeed fit your job description immediately.
For the past few months, I've been going through this process of building up a team of folks to help me. I've been doing quite a bit of hiring.
It's a treacherous and an exciting process because you get to meet some friends.
So it's a beautiful process, but I think it's one of the most important processes in life.
It's selecting the group of people whom you spend your days with. And so you gotta use the best tools for the job. Indeed, I think is an excellent
tool. Right now you can get a free $75 sponsored job credit to upgrade a job
post at indeed.com slash Lex. Terms and conditions apply, go to indeed.com slash
Lex. This show is also brought to you by Roka, the makers of
glasses and sunglasses that I love wearing for their design feel and innovation
on material optics and grip. Roka was started by two all-American swimmers from
Stanford and it was born out of an obsession with performance. I like the way
they feel, I like the way they look. Whether
I'm running, like, a fast-paced run, we're talking about an eight-minute mile or faster, or if we're doing a slow-paced run, nine, ten-minute mile, along the river in the heat or in the cold, or if I'm just wearing my suit out on the town, however that expression goes. I'm not sure, but they look classy with a suit.
They look badass in running gear.
It's just my go-to sunglasses.
Check them out for both prescription glasses and sunglasses at roka.com and enter code Lex to save 20% on your first order. That's roka.com and enter code Lex. This show is also brought to you by NetSuite.
NetSuite allows you to manage financials, human resources, inventory, e-commerce, and many more business related details all in one place.
I'm not sure why I was doing upspeak on that. Maybe because I'm very excited about NetSuite.
Anyway, there's a lot of messy things you have to get right when running a company.
If you're a business owner, if you're an entrepreneur, if you're a founder of a startup, this is
something you have to think about. You have to use the best tools for the job to make
sure all the messy things required to run a business are taken care of for you, so you can focus on the things that you're best at, where your brilliance shines. If you are starting a business, I wish you the best of luck.
It's a difficult journey, but it's worth it.
Anyway, right now, special financing is back. Head to netsuite.com slash Lex to get their one-of-a-kind financing program.
That's netsuite.com slash Lex.
Netsuite.com slash Lex.
This episode is also brought to you by Magic Spoon.
The OG, not quite OG, but really old school sponsor
of this podcast that I love.
It's a low-carb, keto-friendly cereal. It has zero grams of sugar. It's delicious. I don't say that enough, it really is delicious.
Given that it's zero grams of sugar,
it's very surprising how delicious it is.
13 to 14 grams of protein, only four net grams of carbs
and 140 calories in each serving.
You could build your own box or get a variety pack
with available flavors of cocoa, fruity,
frosted, peanut butter, blueberry, and cinnamon.
Cocoa is my favorite.
It's the flavor of champions.
I don't know why I keep saying that, but it seems to be true.
Anyway, Magic Spoon has a 100% happiness guarantee, so if you don't like it, they will
refund it.
Who else will give you a 100% happiness guarantee? Go to magicspoon.com slash Lex and use code LEX at checkout to save $5 off your order. That's magicspoon.com
slash Lex and use code Lex. This is the Lex Fridman Podcast, and here's my conversation with Yann LeCun.
You co-wrote the article, Self-Supervised Learning, the Dark Matter of Intelligence.
Great title, by the way.
With Ishan Misra.
So let me ask, what is self-supervised learning
and why is it the dark matter of intelligence?
I'll start by the dark matter part.
There is obviously a kind of learning
that humans and animals are doing,
that we currently are not reproducing properly
with machines with AI.
So the most popular approaches to machine learning today are, or paradigms, I should say, are supervised learning and reinforcement learning. They are both too inefficient. Supervised learning requires many samples for learning anything. Reinforcement learning requires a ridiculously large number of trials and errors for, you know, a system to learn anything.
And that's why we don't have self-driving cars.
That's a big leap from one to the other. Okay, so to solve difficult problems, you have to have a lot of human annotations for supervised learning to work, and to solve those difficult problems with reinforcement learning,
You have to have some way to maybe simulate that problem such that you can do that large
scale kind of learning that reinforcement learning requires.
Right.
So, how is it that most teenagers can learn to drive a car in about 20 hours of practice, whereas even with millions of
hours of simulated practice, a self-driving car can't actually learn to drive itself properly.
So obviously we're missing something, right?
And it's quite obvious for a lot of people that the immediate response you get from people
is, well, humans use their background knowledge to learn faster.
And they're right.
Now, how was that background knowledge acquired?
And that's the big question.
So now you have to ask, how do babies in the first few months of life learn how the world
works, mostly by observation, because they can hardly act in the world.
And they learn an enormous amount of background knowledge
about the world that may be the basis
of what we call common sense.
This type of learning is not learning a task,
it's not being reinforced for anything,
it's just observing the world and figuring out how it works.
Building world models, learning world models,
how do we do this?
And how do we reproduce this in machines?
So self-supervised learning is one instance
or one attempt at trying to reproduce this kind of learning.
Okay, so you're looking at just observation.
So not even the interacting part of a child.
It's just sitting there watching Mom and Dad walk around, pick up stuff, all of that. That's what you mean by background knowledge.
Perhaps not even watching Mom and Dad just, you know, watching the world go by.
Just having eyes open or having eyes closed, or the very act of opening and closing your eyes, that the world appears and disappears, all that basic information.
And you're saying in, in order to learn to drive, like the reason humans are able to learn to
drive quickly, some faster than others, is because of the background knowledge, they're
able to watch cars operating in the world in the many years leading up to it, the physics
of the basics of objects, all that kind of stuff.
That's right.
I mean, the basic physics of objects, you don't even need to know, you know, how a car works, right? Because that you can learn fairly quickly.
I mean, the example I use very often is
you're driving next to a cliff.
And you know in advance because of your understanding
of intuitive physics that if you turn the wheel to the right,
the car will veer to the right, will run off the cliff, fall off the cliff,
and nothing good will come out of this, right?
But if you are a sort of tabula rasa reinforcement learning system that doesn't have a model of the world, you have to repeat falling off this cliff thousands of times before you figure out it's a bad idea.
And then a few more thousand times
before you figure out how to not do it.
And then a few more million times before you figure out
how to not do it in every situation you ever encounter.
So self-supervised learning still has to have some source of truth being told to it by
somebody.
And you have to figure out a way without human assistance or without significant amount
of human assistance to get that truth from the world.
So the mystery there is how much signal is there, how much truth is there
that the world gives you, whether it's the human world, like you watch YouTube or something
like that, or it's the more natural world. So how much signal is there?
So here is a trick. There is way more signal in sort of a self-supervised setting than
there is in either a supervised or reinforcement setting.
And this goes back to my analogy of the cake.
Yes.
The, you know, LeCake, as someone called it, where you try to figure out how much information
you ask the machine to predict and how much feedback
you give the machine at every trial.
In reinforcement learning, you give the machine
a single scalar, you tell the machine,
you did good, you did bad.
And you only tell this to the machine once in a while.
When I say you, it could be the universe
telling the machine, right?
But it's just one scalar.
So as a consequence of this,
you cannot possibly learn something very complicated
without many, many, many trials
where you get many, many feedbacks of this type. Supervised learning, you give a few bits to the machine at every
sample. Let's say you're training a system on, you know, recognizing images on ImageNet. There are 1,000 categories; that's less than 10 bits of information per sample. But in self-supervised learning, here is the setting.
You ideally, we don't know how to do this yet,
but ideally, you would show a machine a segment of a video,
and then stop the video and ask the machine to predict
what's going to happen next.
So you let the machine predict, and then you let time go by, and show the machine what actually happened, and hope the machine will learn to do a better job at predicting next time around.
There's a huge amount of information you give the machine because it's an entire video clip
of the future after the video clip you've fed it in the first place.
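As a rough illustration of the setup just described, here is a minimal sketch in PyTorch of the naive version: show a predictor the past frames of a clip, let it guess the next frame, then show it what actually happened. The tiny convolutional model, the clip shapes, and the plain pixel-wise MSE loss are all illustrative assumptions, and this naive objective is exactly where the uncertainty problems discussed later in the conversation show up.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: predict the next frame of a video clip from the
# previous `k` frames. Shapes, model, and loss are illustrative assumptions.
k, height, width = 4, 64, 64

predictor = nn.Sequential(               # tiny convolutional predictor
    nn.Conv2d(k, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def training_step(clip):
    """clip: (batch, k+1, H, W) grayscale frames; the last frame is the target."""
    past, future = clip[:, :k], clip[:, k:]             # split context / target
    prediction = predictor(past)                        # the machine makes a guess
    loss = nn.functional.mse_loss(prediction, future)   # compare to what actually happened
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random "video" data standing in for real clips.
fake_clip = torch.rand(8, k + 1, height, width)
print(training_step(fake_clip))
```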
So both for language and for vision, there's a subtle, seemingly trivial construction, but maybe that's representative of what is required to create intelligence, which is filling in the gap. So it sounds dumb, but is it possible you can solve all of intelligence in this way?
So for language, just give a sentence and continue it, or give a sentence and there's
a gap in it.
Some words blanked out and you fill in what words go there.
For vision, you give a sequence of images and predict what's going to happen next or you
fill in what happened in between. Do you think it's possible that formulation alone as a signal for self-supervised learning
can solve intelligence for vision and language?
I think that's our best shot at the moment. So whether this will take us all the way to human-level intelligence or something, or just cat-level intelligence, it's not clear. But among all the possible approaches that people have proposed, I think it's our best shot. So, I think this idea of an intelligent system filling in the blanks, either predicting the future,
inferring the past, filling in missing information.
I'm currently filling the blank of what is beyond your head
and what your head looks like from the back
because I have basic knowledge about how humans are made.
And I don't know if you're gonna,
what are you gonna say at which point you're gonna speak,
whether you're gonna move your head this way,
or that way, which way you're gonna look.
But I know you're not gonna just dematerialize
and reappear three meters down the hall. You know, because I know what's possible and what's impossible, according to intuitive physics. So you have a model of
what's possible and you'd be very surprised if it happens and you have to reconstruct your
model. Right. So that's the model of the world. It's what tells you, you know, what fills in the blanks. So given your partial information about the state of the world, given by your perception, your model of
the world fills in the missing information and that includes predicting the future,
predicting the past, filling in things you don't immediately perceive.
And that doesn't have to be purely generic vision or visual information or generic language
you can go to specifics like predicting what control decision you make when you're driving
in a lane.
You have a sequence of images from a vehicle, and then you have information, if you recorded it on video, of where the car ended up, so you can go back in time and predict where the car went based on the visual information.
That's very specific, domain-specific.
Right, but the question is whether we can come up with
sort of a generic method for, you know,
training machines to do this kind of prediction
or filling in the blanks.
So right now, this type of approach has been unbelievably
successful in the context of natural language processing. Every modern natural language processing system is pre-trained in a self-supervised manner to fill in the blanks. You show it a sequence of words, you remove 10% of them, and then you train some gigantic neural net to predict the
words that are missing. And once you've pre-trained that network, you can use the internal representation,
learned by it as input to something
that you train supervised or whatever.
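A minimal sketch of that fill-in-the-blanks pre-training recipe: hide a fraction of the words and train a network to output a score for every word in the vocabulary at the hidden positions. The vocabulary size, masking rate, and the small transformer here are illustrative assumptions, not any particular production system.

```python
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 10_000, 256, 0   # toy sizes; id 0 reserved as [MASK]

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
to_vocab = nn.Linear(d_model, vocab_size)        # a score for every word in the vocabulary

def masked_lm_loss(tokens, mask_prob=0.10):
    """tokens: (batch, seq_len) integer word ids."""
    mask = torch.rand(tokens.shape) < mask_prob      # hide ~10% of the words
    corrupted = tokens.masked_fill(mask, mask_id)
    hidden = encoder(embed(corrupted))               # internal representation
    logits = to_vocab(hidden)                        # (batch, seq_len, vocab_size)
    # Only the masked positions contribute to the prediction loss.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

tokens = torch.randint(1, vocab_size, (4, 32))       # stand-in for real text
loss = masked_lm_loss(tokens)
loss.backward()                                      # train the big net end to end
```

The internal representation (`hidden` here) is what gets reused downstream, as described next.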
That's been incredibly successful,
not so successful in images, although it's making progress.
And it's based on manual data augmentation.
We can go into this later.
But what has not been successful yet is training from video.
So getting a machine to learn to represent the visual world,
for example, by just watching video.
Nobody has really succeeded in doing this.
OK, well, let's kind of give a high level overview.
What's the difference in kind and in difficulty
between vision and language? So you said people haven't been able to really kind of crack the problem of vision open in terms of self-supervised learning, but that may not necessarily be because it's fundamentally more difficult. Maybe, like, when we're talking about achieving, like, passing the Turing test in the full spirit of the Turing test, in language, it might be harder than vision.
That's not obvious. So in your view, which is harder or perhaps are they just the same problem?
Or the farther we get to solving each, the more we realize it's all the same thing. It's all the same cake.
I think what I'm looking for are methods that make them look essentially like the same cake,
but currently they're not.
And the main issue with learning world models or learning predictive models is that the
prediction is never a single thing because the world is not entirely predictable.
It may be deterministic or stochastic. We can get into the philosophical discussion about it, but even if it's deterministic, it's not entirely predictable. And so if I play a short video clip, and then I ask you to
predict what's going to happen next, there are many, many plausible continuations for that video clip, and the number of continuations grows with the interval of time that you're asking the system to make a prediction for.
And so one big question we have to solve for self-supervised learning is how you represent this uncertainty, how
you represent multiple discrete outcomes, how you represent a continuum of possible outcomes,
et cetera.
And if you are sort of a classical machine learning person, you say, oh, you just represent a distribution, right?
And that we know how to do when we're predicting missing words in a text, because you can have a neural net give a score for every word in the dictionary. It's big, you know, it's a big list of numbers, 100,000 or so. You can turn that into a probability distribution that tells you, when I say a sentence, the cat is chasing the blank in the kitchen, there are only a few words that make sense there. It could be a mouse, it could be a laser spot, or something like that.
And if I say the blank is chasing the blank in the savannah, you also have a bunch of plausible options for those two words. That's because you have an underlying reality that you can refer to, to fill in those blanks. So you cannot say for sure, in the savannah, if it's a lion or a cheetah or whatever; you cannot know if it's a zebra or a gnu or a wildebeest, same thing. But you can represent the uncertainty by just a long list of numbers. Now, if I do the same thing with video and I ask
you to predict a video clip, it's not a discrete set of potential frames. You have to have some way of representing a sort of infinite number of plausible continuations of multiple frames in a high-dimensional continuous space. And we just have no idea how to do this properly.
A finite, but high-dimensional space. So, just like with the words, try to get it down to a small finite set of, like, under a million, something like that.
I mean, it's kind of ridiculous that we're doing a distribution over every single possible
word for language.
And it works.
It feels like that's a really dumb way to do it.
It seems to be like there should be some more compressed representation of the distribution
of the words. You're right about that. I agree. Do you have any interesting ideas about how to
represent all of reality in a compressed way, such that you can form a distribution over it?
That's one of the big questions.
How do you do that?
What's kind of another thing that really is stupid
about, I shouldn't say stupid, but simplistic
about current approaches to self-supervised learning in NLP, in text, is that not only do you represent
a giant distribution over words, but for multiple
words that are missing, those distributions are essentially independent of each other.
And you don't pay too much of a price for this. So the system, you know, in the sentence that I gave earlier, if it gives a certain probability for lion and cheetah, and then a certain probability for gazelle, wildebeest, and zebra, those two probabilities are independent of each other. And it's not the case that those things are independent; lions actually attack, like, bigger animals than cheetahs.
So there is a huge independence hypothesis in this process,
which is not actually true.
The reason for this is that we don't know how to represent properly distributions over combinatorial sequences of symbols, essentially, because the number grows exponentially with the length of the sequence of symbols. And so we have to use tricks for this, but those techniques kind of get around it, they don't even deal with it.
So the big question is,
would there be some sort of abstract
latent representation of text
that would say that when I switch lion for cheetah, I also have to switch zebra for gazelle.
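To make the independence point concrete, here is a tiny toy illustration (the numbers are made up, not taken from any real model): each blank gets its own distribution, so the implied joint probability is just the product, and it cannot encode that switching lion for cheetah should also shift the second blank.

```python
# Toy illustration of the independence assumption: the two masked positions
# each get their own distribution, and the "joint" is just the product.
p_predator = {"lion": 0.5, "cheetah": 0.5}          # blank #1
p_prey = {"wildebeest": 0.5, "gazelle": 0.5}        # blank #2

for predator, p1 in p_predator.items():
    for prey, p2 in p_prey.items():
        # p(lion, gazelle) = p(lion) * p(gazelle): every combination comes out
        # equally likely, so the model cannot express that switching
        # lion -> cheetah should also shift the distribution over blank #2.
        print(f"p({predator}, {prey}) = {p1 * p2:.2f}")
```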
Yeah, so this independence assumption,
let me throw some criticism at you,
that I often hear and see how you respond.
So this kind of filling in the blanks is just statistics.
You're not learning anything,
like the deep underlying concepts.
You're just mimicking stuff from the past.
You're not learning anything new such that you can use it to generalize about the world.
Or, okay, let me just say the crude version, which is just statistics.
It's not intelligence.
What do you have to say to that?
What do you usually say to that if you kind of hear this kind of thing?
I don't get into the discussions because they are kind of pointless.
So first of all, it's quite possible that intelligence is just statistics. It's just statistics of a particular kind.
Where it sits as the philosophical question is, is it possible that intelligence is just statistics?
Yeah. But what kind of statistics?
So if you are asking the question, do the models of the world that we learn have some notion of causality?
Yes.
So if the criticism comes from people who say, you know, current machine learning systems
don't care about causality, which by the way is wrong. I agree with that.
You should, your model of the world should have your actions
as one of the inputs, and that will drive you
to learn causal models of the world where you know what
intervention in the world will cause what result.
Or you can do this by observation of other agents acting in the world and observing the effect. Other humans, for example.
So I think, you know, at some level of description, intelligence is just statistics.
But that doesn't mean you don't, you know, you won't have models that have, you know,
deep mechanistic explanation for what goes on.
The question is, how do you learn them?
That's the question I'm interested in.
Because a lot of people who actually voice their criticism
say that those mechanistic models have to come from some place else,
they have to come from human designers, they have to come from,
I don't know what.
And obviously, we learn them.
Or if we don't learn them as an individual, nature
learned them for us using evolution.
So regardless of what you think,
those processes have been learned somehow.
So if you look at the human brain,
just like when we humans introspect
about how the brain works,
it seems like when we think about what is intelligence,
we think about the high level stuff,
like the models we've constructed concepts like cognitive science, the concepts of memory
and reasoning modules, almost like these high-level modules. Is this sort of a good analogy? Like, are we ignoring the dark matter, the basic low-level mechanisms, just like we ignore the way the operating system works,
we're just using the high-level software,
we're ignoring that at the low level,
the neural network might be doing something like statistics.
Like, I'm sorry to use this word, probably incorrectly and crudely,
but doing this kind of fill in the gap kind of learning and just kind of updating the model
constantly in order to be able to support the raw sensory
information, to predict it and adjust to the prediction
when it's wrong.
But when we look at our brain at the high level,
it feels like we're playing chess.
Like we're playing with high level concepts
and we're stitching them together.
We're putting them into long-term memory. But really, what's going on underneath is something we're not able to introspect, which is this kind of simple, large neural network that's just filling in the gaps.
Right. Well, okay. So there are a lot of questions and a lot of answers there.
Okay. So first of all, there's a whole school of thought in neuroscience, computational neuroscience in particular,
that likes the idea of predictive coding,
which is really related to the idea we're talking about of self-supervised learning. So everything is about prediction. The essence of intelligence is the ability to predict. And everything the brain does is trying to predict everything from everything else. Okay, and that's really the underlying principle, if you want. Self-supervised learning is trying to reproduce this idea of prediction as an essential mechanism of task-independent learning, if you want.
The next step is what kind of intelligence are you interested in reproducing?
Of course, we all think about trying to reproduce
high-level cognitive processes in humans.
But with machines, we're not even at the level of reproducing the learning processes in a cat brain. The most intelligent of our intelligent systems don't have as much common sense as a house cat.
So how is it that cats learn?
And cats don't do a whole lot of reasoning.
They certainly have causal models.
They certainly have, because many cats can figure out how they can act on the world to
get what they want.
They certainly have a fantastic model of intuitive physics, certainly the dynamics of their own bodies, but also
of prey and things like that. So they're pretty smart. They only do this with about 800 million
neurons. We are not anywhere close to reproducing this kind of thing. So to some extent, I could say,
let's not even worry about the high-level cognition and kind of long-term planning
and reasoning that humans can do until we figure out
can we even reproduce what cats are doing.
Now that said, this ability to learn world models
I think is the key to the possibility
of building machines that can also reason.
So whenever I give a talk, I say there are three main challenges in machine learning. The first one is getting machines to learn to represent the world, and I'm proposing self-supervised learning for that. The second is getting machines to reason in ways that are compatible with essentially gradient-based learning, because this is what deep learning is all about, really. And the third one is something we have no idea how to solve, at least I have no idea how to solve: can we get machines to learn hierarchical representations of action plans? Like, we know how to train them to learn hierarchical representations of perception, with convolutional nets and things like that, and transformers. But what about action plans? Can we get them to spontaneously learn good hierarchical representations of actions?
Also gradient based.
Yeah, all of that, you know,
needs to be somewhat differentiable so that you can apply sort of gradient based
learning, which is really what deep learning is about.
So it's background knowledge, the ability to reason in a way that's differentiable, that is somehow connected, deeply integrated with that background knowledge, or builds on top of that background knowledge. And then, given that background knowledge, be able to make hierarchical plans in the world.
So if you take classical optimal control, there's something in classical optimal control
called model predictive control. And it's, you know, it's been around since the early
60s. NASA uses that to compute trajectories of rockets. And the basic idea is that you
have a predictive model of the rocket, let's say, or whatever system you intend to control,
which, given the state of the system at time T and given an action that you are applying to the system, so for a rocket it would be the thrust and all the controls you can have, it gives you the state of the system at time T plus delta T, so basically a differential equation, something like that.
And if you have this model,
and you have this model in the form of some sort of neural net,
or some sort of set of formula
that you can backpropagate gradient through,
you can do what's called model predictive control,
or gradient-based model predictive control.
So you can unroll that model in time.
You feed it a hypothesized sequence of actions.
And then you have some objective function that measures how well, at the end of the trajectory, the system has succeeded or matched what you wanted to do. You know, is it a robot arm? Have you grasped the object you want to grasp? If it's a rocket, are you at the right place near the space station? Things like that. And by backpropagation through time, and again, this was invented in the 1960s by optimal control theorists, you can figure out what is the optimal sequence of actions that will, you know, get my system to the best final state. So that's a
form of reasoning. It's basically planning, and a lot of planning systems in robotics are actually based on this. And you can think of this as a form of reasoning. So, you know, to take the example
of the teenager driving a car again, you have a pretty good dynamic model of the car, it doesn't need to be very accurate,
but again, if you turn the wheel to the right,
and there is a cliff, you're going to run off the cliff,
you don't need to have a very accurate model to predict that.
You can run this in your mind and decide not to do it for that reason,
because you can predict in advance that the result is going to be bad.
So you can imagine different scenarios, and then employ, or take the first step in, the scenario that is most favorable, and then repeat the process of planning. That's called receding horizon model predictive control.
So all those things have names going back decades.
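Here is a minimal sketch of the gradient-based model predictive control idea just described: unroll a differentiable dynamics model over a hypothesized sequence of actions, score the final state with an objective, and backpropagate through time into the actions. The 2-D point-mass dynamics, cost, and horizon are toy assumptions chosen just to show the structure, not any real controller.

```python
import torch

# Toy differentiable dynamics: state = (position, velocity) of a 2-D point mass;
# action = acceleration. This stands in for a learned or hand-built model.
dt = 0.1

def dynamics(state, action):
    pos, vel = state[:2], state[2:]
    new_vel = vel + dt * action
    new_pos = pos + dt * new_vel
    return torch.cat([new_pos, new_vel])

goal = torch.tensor([1.0, 1.0])                          # desired final position
horizon = 20
actions = torch.zeros(horizon, 2, requires_grad=True)    # hypothesized action sequence
optimizer = torch.optim.SGD([actions], lr=0.5)

for _ in range(200):                                     # inner planning loop
    state = torch.zeros(4)                               # start at rest at the origin
    for t in range(horizon):                             # unroll the model in time
        state = dynamics(state, actions[t])
    # Objective: end near the goal without spending too much control effort.
    cost = ((state[:2] - goal) ** 2).sum() + 1e-3 * (actions ** 2).sum()
    optimizer.zero_grad()
    cost.backward()                                      # backpropagation through time
    optimizer.step()

# Receding-horizon MPC would execute this first action, observe, and replan.
print(actions[0])
```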
And so in classical optimal control, the model of the world is generally not learned. There are sometimes a few parameters you have to identify; that's called system identification.
But generally, the model is mostly deterministic
and mostly built by hand.
So the big question of AI, I think the big challenge of AI
for the next decade, is how do we get machines
to learn predictive models of the world
that deal with uncertainty
and deal with the real world in all its complexity. So it's not just the trajectory of a rocket, which you can reduce to first principles,
it's not even just trajectory of a robot arm,
which again, you can model by careful mathematics,
but it's everything else, everything we observe in the world.
People, behavior, physical systems
that involve collective phenomena,
like water or trees and branches in a tree
or something, or complex things that humans have no trouble
developing abstract representations and predictive model for,
but we still don't know how to do with machines.
Where do you put, in these three, maybe in the planning stages, the game-theoretic nature of this world, where your actions not only respond to the dynamic nature of the world, the environment, but also affect it? So if there are other humans involved, is this point number four, or is it somehow integrated into the hierarchical representation of action, in your view?
I think it's integrated. It's just that now your model of the world has to deal with,
you know, it just makes it more complicated, right? The fact that humans are complicated
and not easily predictable, that makes your model of the world much more complicated,
that much more complicated.
Well, there's a, I mean, I suppose chess is an analogy. So Monte Carlo tree search. There is a, I go, you go, I go, you go. Like, somebody gave a talk here recently about car doors. I think there's some machine learning too, but mostly car doors.
And there's a dynamic nature to the car,
like the person opening the door check.
I mean, he wasn't talking about that.
He was talking about the perception problem, the ontology of what defines a car door,
this big philosophical question.
But to me, it was interesting,
because it's obvious that the person opening the car doors,
they're trying to get out, like here in New York,
trying to get out of the car,
you're slowing down, and it's going to signal something; you're speeding up, it's going to signal something; and that's a dance. It's an asynchronous chess game, I don't know.
So it feels like it's not just,
I mean, I guess you can integrate all of that
into one giant model, like the entirety
of these little interactions, because it's
not as complicated as chess, it's just like a little dance.
We do like a little dance together, and then we figure it out.
Well, in some ways, it's way more complicated than chess, because it's continuous, it's
uncertain in a continuous manner.
It doesn't feel more complicated, but it looks more complicated because that's what we've evolved to solve. This is the kind of problem we've evolved to solve, and so we're good at it, because, you know, nature has made us good at it. Nature has not made us good at chess. We completely suck at chess.
Yeah.
In fact, that's why we designed it as a game, to be challenging. And if there is something that, you know, recent progress in chess and Go has made us realize, it's that humans are really terrible at those things, like really bad. You know, there was a story, right, before AlphaGo, that, you know, the best Go players thought they were maybe two or three stones behind, you know, an ideal player that they would call God. In fact, no, they're like nine or ten stones behind. I mean, we're just bad.
So we're not good at it, and it's because we have limited working memory. We're not very good at doing this tree exploration that computers are much better at doing than we are, but we are much better at learning differentiable models of the world. I mean, I said differentiable, I should say, not differentiable in the sense that we run backprop through it, but in the sense that our brain has some mechanism for estimating gradients of some kind, and that's what makes things efficient.
So if you have an agent that consists of a model of the world, which in the human brain is basically the entire front half of your brain; an objective function, which in humans is a combination of two things. There is your intrinsic motivation module, which is in the basal ganglia, at the base of your brain.
That's the thing that measures pain and hunger
and things like that, like immediate, you know,
feelings and emotions.
And then there is, you know, the equivalent of what people in reinforcement learning call a critic, which is a sort of module that predicts ahead of time what the outcome of a situation will be. And so it's not a cost function, it's not an objective function, but it's a trained predictor of the ultimate objective function.
And that also is differentiable.
And so if all of this is differentiable, your cost
function, your critic, your world model,
then you can use gradient-based type methods to do planning,
to do reasoning, to do learning,
to do all the things that we'd like an intelligent agent to do.
And a gradient-based learning, like what's your intuition,
that's probably at the core of what can solve intelligence.
So you don't need, like, logic-based reasoning, in your view?
I don't know how to make logic-based reasoning compatible with efficient learning.
Yeah.
And, okay, I mean, there is a big question, perhaps a philosophical question, I mean, it's not that philosophical, but that we can ask, is that, you know, all the learning algorithms we know of from engineering and computer science proceed by optimizing some objective function.
Yeah, right. So one question we may ask is, does learning in the brain minimize an objective function? It could be a composite of multiple objective functions, but it's still an objective function.
Second, if it does optimize an objective function,
does it do it by some sort of gradient estimation?
It doesn't need to be backprop, but some way of estimating the gradient in an efficient manner, whose complexity is on the same order of magnitude as actually running the inference. Because you can't afford to do things like, you know, perturbing a weight in your brain to figure out what the effect is, and then sort of, you know, estimating the gradient by perturbation. To me, it seems very implausible that the brain uses some sort of, you know, zeroth-order black-box gradient-free optimization, because it's so much less efficient than gradient optimization. So it has to have a way of estimating gradients.
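A small toy comparison of the two options mentioned: estimating the gradient by perturbing one parameter at a time, which costs one function evaluation per parameter, versus getting the whole gradient from a single backward pass. The objective function here is an arbitrary stand-in; the point is the cost gap, not the particular function.

```python
import torch

w = torch.randn(1000, dtype=torch.float64, requires_grad=True)   # 1,000 "synapses"

def loss_fn(weights):
    return (weights ** 2).sum()          # arbitrary stand-in objective

# Zeroth-order / perturbation estimate: one evaluation per parameter.
eps = 1e-6
grad_est = torch.zeros_like(w)
with torch.no_grad():
    base = loss_fn(w)
    for i in range(w.numel()):           # 1,000 separate forward passes
        bumped = w.clone()
        bumped[i] += eps
        grad_est[i] = (loss_fn(bumped) - base) / eps

# Gradient-based estimate: one backward pass, cost comparable to inference itself.
loss_fn(w).backward()

# Same gradient, obtained from 1 backward pass instead of 1,000 forward passes.
print(torch.allclose(grad_est, w.grad, atol=1e-3))
```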
Is it possible that some kind of logic-based reasoning
emerges in pockets as a useful, like you said,
if the brain is optimizing an objective function,
maybe it's a mechanism for creating objective functions.
It's a mechanism for creating knowledge bases, for example,
that can then be queried.
Like, maybe it's an efficient representation of knowledge
that's learned in a gradient-based way or something like that.
Well, so I think there is a lot of different types of intelligence.
So first of all, I think the type of logical reasoning that we think about,
that is maybe stemming from classical AI of the 1970s and 80s,
I think humans use that relatively rarely and are not particularly good at it.
But we judge each other based on our ability to solve those rare problems called IQ tests.
Think so.
I'm not very good at chess.
Yes, I'm judging you this whole time.
Because, well, we actually...
With your heritage, I'm sure you're good at chess.
No, stereotypes.
Not all stereotypes that you're...
Well, I'm terrible at chess.
So, you know, but I think perhaps another type of utilitions
that I have is this ability of building models
to the world from reasoning obviously, but also data.
And those models generally are more
kind of analogical.
So it's reasoning by simulation and by analogy, where you use one model to apply to a new situation,
even though you've never seen that situation, you can sort of connect it to a situation you've encountered before.
And your reasoning is more akin to some sort of internal simulation.
So you're kind of simulating what's happening when you're building, I don't know, a box out of wood or something, right? You can imagine in advance what would be the result of, you know, cutting the wood in this particular way, whether you're going to use, you know, screws or nails or whatever.
When you are interacting with someone, you also have a model of that person, and you sort of interact with that person, you know, having this model in mind, to kind of tell the person
what you think is useful to them.
So I think this ability to construct models
of the world is basically the essence of intelligence
and the ability to use it then to plan actions
that will fulfill a particular criterion, of course, is necessary as well.
So I'm going to ask you a series of impossible questions, as we keep asking, as I have been doing. So if that's the fundamental sort of dark matter of intelligence, the ability to form a background model, what's your intuition about how much knowledge is required? You know, with dark matter, you can put a percentage on the composition of the universe, how much of it is dark matter, how much of it is dark energy. How much information do you think is required to be a house cat? So you have to be able to, when you see a box, go in it; when you see a human, compute the most evil action; if there's a thing that's near an edge, you knock it off; all of that, plus the extra stuff you mentioned, which is a great self-awareness of the physics of your body and the world. How much knowledge is required, do you think, to do all of it?
I don't even know how to measure an answer to that question.
I'm not sure how to measure it, but whatever it is, it fits in about 800 million neurons, you know.
The representation does.
Everything, all knowledge, everything, right? It's less than a billion.
A dog is two billion, but a cat is less than one billion.
And so multiply that by a thousand and you get the number of synapses. And I think almost all of it is learned through this, you know, sort of self-supervised learning. Although, you know, I think a tiny sliver is learned through reinforcement learning, and certainly very little through, you know, classical supervised learning, although it's not even clear how supervised learning actually works in the biological world. So I think almost all of it is self-supervised learning, but it's
driven by the sort of ingrained objective functions that a cat or a human have at the base
of their brain, which kind of drives their behavior.
So, nature tells us, you're hungry, it doesn't tell us how to feed ourselves.
That's something that the rest of our brain has to figure out.
What's interesting is there might be, like, deeper objective functions underlying the whole thing. So hunger may be some kind of, now you go to, like, neurobiology, it might be just the brain trying to maintain homeostasis.
So hunger is just one of the human perceivable symptoms
of the brain being unhappy with the way things are currently.
Because it could be just like one really dumb objective function at the core.
But that's how behavior is driven.
The fact that, you know, our basal ganglia drives us to do things that are different from, say, you know, an orangutan, or certainly a cat, is what makes human nature versus orangutan nature versus cat nature.
So for example, our basal ganglia drives us to seek the company of other humans.
And that's because nature has figured out that we need to be social animals for our species to survive and it's true of many primates.
It's not true of orangutans. Orangutans are solitary animals. They don't seek the
company of others. In fact, they avoid them. In fact, they scream at them when they come too close
because they are territorial. Because for their survival, evolution has figured out that's the bad thing.
I mean, they are occasionally social, of course, for reproduction and stuff like that.
But they're mostly solitary.
So all of those behaviors are not part of intelligence.
People say, oh, you're never going to have intelligent machines because human intelligence
is social.
But then you look at orangutans, you look at the octopus. An octopus never knows its parents. They barely interact with any others. And they get to be really smart in less than a year, like half a year; you know, in a year or thereabouts, and then they're dead.
So there are things that we think, as humans, are intrinsically linked with intelligence,
like social interaction, like language.
We think, I think we give way too much importance to language as a substrate of
intelligence as humans, because we think our reasoning is so linked with language.
So, to solve the house cat intelligence problem, you think you could do it on a desert island? You could pretty much just have a cat sitting there looking at the
waves, at the ocean waves, and figure a lot of it out.
It needs to have the right set of drives to get it to do the thing and learn the appropriate
things, right?
For example, baby humans are driven to learn to stand up and walk. This desire is kind of hardwired. How to do it precisely is not, that's learned. But the desire to walk, move around, and stand up,
That's sort of.
Probably hardwired.
It's very simple to hardwire this kind of stuff.
Oh, like the desire to, well, that's interesting. You're hardwired to want to walk. There's got to be a deeper need for walking. I think it was probably socially imposed by society that you need to walk, all the others being bipedal.
No, no, a lot of simple animals, you know, probably walk without ever watching any other members of the species.
It seems like a scary thing to have to do, because you suck at bipedal walking at first. It seems crawling is much safer, much more, like, why are you in a hurry?
Well because you have this thing that drives you to do it, you know, which is sort of part
of the sort of human development.
Is that understood, actually, what that is?
Not entirely.
No, what is the reason to get on two feet?
It's really hard.
Like most animals don't get on two feet.
Why not?
Well, they get on four feet.
You know, many mammals get on four feet.
Yeah, and very quickly. Some of them extremely quickly.
But, you know, like, from the last time I've interacted with a table, that's much more stable than a thing with two legs. It's just a really hard problem.
Yeah, I mean, birds have figured it out with two feet.
Well, technically, we can go into ontology.
They have four, I guess they have two feet.
They have two feet, chickens.
You know, dinosaurs have two feet, many of them.
Allegedly.
I'm just now learning that T-Rex was eating grass, not other animals.
T-Rex might have been a friendly pet.
What do you think about, I don't know if you looked at the test for general intelligence
that François Chollet put together?
I don't know if you got a chance to look at that kind of thing.
What's your intuition about how to solve an IQ type of test?
I don't know.
I think it's so outside of my radar screen
that it's not really relevant, I think, in the short term.
Well, I guess one way to ask, another way perhaps closer to what you work on, is, like, how do you solve MNIST with very little example data?
That's right. And the answer to this problem is self-supervised learning: just learn to represent images, and then learning, you know, to recognize handwritten digits on top of this will only require a few samples. And we observe
this in humans, right? You show a young child a picture book with a couple of pictures
of an elephant, and that's it. The child knows what an elephant is. I mean, we see this today with practical systems that we
you know, we train image recognition systems with enormous amounts of images,
either completely self-supervised or very weakly supervised. For example, you can train a neural net to predict whatever hashtags people type on Instagram. Then you can do this with billions of images, because there are billions per day showing up.
So the amount of training data there is essentially unlimited.
Then you take the output representation,
a couple of layers down from the output of what the system learned, and feed this as input to a classifier for any object in the world that you want, and it works pretty well. So that's transfer learning, okay, or weakly supervised transfer learning.
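A minimal sketch of that weakly supervised transfer recipe: pre-train a backbone on some proxy task (here a hypothetical hashtag predictor stands in for the Instagram setup), then freeze it and train only a small classifier on top of its representation with the labeled data you actually care about. The toy backbone, sizes, and data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stage 1 (assumed already done): a backbone pre-trained to predict hashtags
# from images. A toy convnet stands in for that pre-trained network here.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # -> 32-dim representation
)

# Stage 2: freeze the representation, train only a classifier on top
# with a small labeled set for the downstream task.
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 10                                  # e.g. a small downstream task
linear_probe = nn.Linear(32, num_classes)
optimizer = torch.optim.SGD(linear_probe.parameters(), lr=0.1)

images = torch.rand(16, 3, 64, 64)                # stand-in for labeled data
labels = torch.randint(0, num_classes, (16,))

features = backbone(images)                       # frozen transfer features
loss = nn.functional.cross_entropy(linear_probe(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```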
People are making very, very fast progress using self-supervised learning with this kind of scenario
as well. And, you know, my guess is that that's going to be the future of self-supervised learning.
How much cleaning do you think is needed for filtering malicious signal, or what's the better term? Like, a lot of people use hashtags on Instagram to get, like, good SEO that doesn't fully represent the contents of the image.
Like they'll put a picture of a cat and hashtag it with like science, awesome, fun.
I don't know, all kinds of things. Why would you put science?
Well, so that's not very good SEO.
The way my colleagues who worked on this project at Facebook, now Meta, Meta AI, a few years ago dealt with this is that they only selected something like 17,000 tags that correspond to kind of physical things or situations, like, you know, that have some visual content.
So, you know, you wouldn't have, like, hashtag TBT or anything like that.
So they keep a very select set of hashtags.
Yeah.
But it's still, you know, on the order of 10 to 20,000.
So it's fairly large.
Okay.
Can you tell me about data augmentation, what the heck is data augmentation, and how is it
used, maybe in contrastive learning, for video?
What are some cool ideas here?
Right.
So data augmentation, I mean, first data augmentation,
is the idea of artificially increasing the size of your
training set by distorting the images that you have
in ways that don't change the nature of the image.
So you take, say, MNIST, you can do data augmentation on MNIST, and people have done this since the 1990s. So you take an MNIST digit and you shift it a little bit, or you change the size, or rotate it, skew it, you know, et cetera.
Add noise.
Add noise, et cetera. And it works better. If you train a supervised classifier with augmented data, you're going to get better results.
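A minimal sketch of those label-preserving distortions for an MNIST-style digit, using torchvision's standard transforms (assuming a recent torchvision where these transforms accept tensors); the exact ranges are arbitrary assumptions, the point is that the digit's identity is unchanged.

```python
import torch
from torchvision import transforms

# Label-preserving distortions for an MNIST-style digit: small shift, rotation,
# rescale, plus a bit of noise. The digit's identity stays the same.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=10,            # slight rotation / skew
                            translate=(0.1, 0.1),  # shift by up to 10%
                            scale=(0.9, 1.1)),     # small size change
    transforms.Lambda(lambda x: x + 0.05 * torch.randn_like(x)),  # add noise
])

digit = torch.rand(1, 28, 28)            # stand-in for a real MNIST digit tensor
augmented_copies = [augment(digit) for _ in range(8)]   # 8 extra training samples
```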
Now, it's become really interesting over the last couple of years, because a lot of self-supervised learning techniques to pre-train vision systems are based on data augmentation. And the basic technique is originally inspired by techniques that I worked on in the early 90s, and Geoff Hinton worked on also in the early 90s. They were sort of parallel works. I used to call this a Siamese network.
So basically you take two identical copies
of the same network, they share the same weights.
And you show two different views of the same object.
Either those two different views may have been obtained
by data augmentation, or maybe they're
two different views of the same scene
from a camera that you moved, or at different times,
or something like that, or two pictures of the same person, things like that.
And then you train this neural net, those two identical copies of the
neural net to produce an output representation, a vector, in such a way
that the representation for those two images are as close to each other as
possible as identical to each other as possible, right?
Because you want the system to basically learn a function
that will be invariant, that will not change,
whose output will not change when you transform
those inputs in those particular ways, right?
So that's easy to do.
What's complicated is how do you make sure
that when you show two images that are different,
the system will produce different things.
Because if you don't have a specific provision for this, the system will just ignore the inputs
when you train it.
They will end up ignoring the input and just produce a constant vector that is the same for
every input.
That's called a collapse.
Now how do you avoid collapse?
So there's two ideas.
One idea that I proposed in the early 90s, with my colleagues at Bell Labs, Jane Bromley and a couple of other people,
which we now call contrastive learning,
which is to have negative examples.
You have pairs of images that you know are different,
and you show them to the network and those two copies,
and then you push the two output vectors away from each other.
And they will eventually guarantee that
things that are semantically similar
produce similar representations and things that are different
produce different representations.
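A minimal sketch of that Siamese contrastive setup: one shared-weight encoder applied to both inputs, and a loss that pulls positive pairs together and pushes negative pairs apart up to a margin. The encoder and the margin value are illustrative assumptions, not the original Bell Labs model.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128))   # shared weights

def contrastive_loss(x1, x2, same, margin=1.0):
    """same = 1 for positive pairs (same object), 0 for negative pairs."""
    z1, z2 = encoder(x1), encoder(x2)           # two identical copies, same network
    dist = (z1 - z2).norm(dim=1)
    positive = same * dist ** 2                 # pull similar pairs together
    negative = (1 - same) * torch.clamp(margin - dist, min=0) ** 2   # push apart
    return (positive + negative).mean()

x1, x2 = torch.rand(32, 1, 28, 28), torch.rand(32, 1, 28, 28)
same = torch.randint(0, 2, (32,)).float()
loss = contrastive_loss(x1, x2, same)
loss.backward()
```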
We actually came up with this idea for a project of doing signature verification. So we would collect signatures, like multiple signatures from the same person, and then train a neural net to produce the same representation for those, and then, you know, force the system to produce different representations for different signatures.
The problem was actually proposed by people from what was a subsidiary of AT&T at the time, called NCR. And they were interested in storing a representation of the signature in the 80 bytes of the
magnetic strip of a credit card. So we came up with this idea of having a neural net with 80 outputs,
you know, that we quantized on bytes so that we could encode the...
And that encoding was then used to compare whether the signature matches or not.
That's right. So then you would, you know...
Interesting.
You'd sign, it would run through the neural net, and then you would compare the output vector to whatever is stored on your card.
It actually worked.
It worked, but they ended up not using it.
Because nobody cares, actually.
I mean, the American financial payment system
is incredibly lagged in that respect compared to Europe.
Oh, the signatures.
What's the purpose of signatures anyway?
This is very...
Nobody looks at them, nobody cares. It's, yeah.
Yeah, no.
So that's contrastive learning, right?
So you need positive and negative pairs.
And the problem with that is that,
even though I had the original paper on this,
I'm actually not very positive about it
because it doesn't work in high dimension.
If your representation is high dimensional,
there's just too many ways for two things to be different.
And so you would need lots and lots and lots of negative pairs.
So there is a particular implementation of this, which is relatively recent, from actually the Google Toronto group, Geoff Hinton is the senior member there, and it's called SimCLR. And it's basically a particular way of implementing this idea of contrastive learning with a particular objective function.
Now, what I'm much more enthusiastic about these days is non-contrastive methods.
So other ways to guarantee that the representations will be different for different inputs.
It's actually based on an idea that
Geoff Hinton proposed in the early 90s with
his student at the time, Sue Becker.
It's based on the idea of maximizing
the mutual information between the outputs of the two systems.
You only show positive pairs,
you only show pairs of images that you know are somewhat similar,
and you train the two networks to be informative,
but also to be as informative of each other as possible.
So basically, one representation
has to be predictable from the other, essentially.
He proposed that idea, had a couple of papers
in the early 90s, and then nothing was done about it
for decades.
And I kind of revived this idea together
with my postdocs at Fair,
particularly a postdoc called Stéphane Deny,
who is now a junior professor in Finland,
at Aalto University.
We came up with something that we call Barlow Twins,
and it's a particular way of maximizing
the information content of a vector, you know,
using some hypotheses. And we have another, more recent version of it called
VICReg, V-I-C-R-E-G, which means variance-invariance-covariance regularization. And it's the thing I'm
the most excited about in machine learning in the last 15 years. I mean, I'm really, really excited about this.
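A rough sketch of the three terms behind the VICReg idea (invariance, variance, covariance); the loss weights below are illustrative placeholders rather than the paper's exact settings:

    import torch
    import torch.nn.functional as F

    def vicreg_style_loss(z1, z2, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
        # z1, z2: [N, D] embeddings of two views of the same images
        n, d = z1.shape
        invariance = F.mse_loss(z1, z2)                     # the two views should match

        def variance_term(z):                               # keep each dimension informative
            std = torch.sqrt(z.var(dim=0) + eps)
            return F.relu(1.0 - std).mean()                 # hinge: std should stay above 1

        def covariance_term(z):                             # decorrelate the dimensions
            zc = z - z.mean(dim=0)
            cov = (zc.t() @ zc) / (n - 1)
            off_diag = cov - torch.diag(torch.diag(cov))
            return off_diag.pow(2).sum() / d

        variance = variance_term(z1) + variance_term(z2)
        covariance = covariance_term(z1) + covariance_term(z2)
        return inv_w * invariance + var_w * variance + cov_w * covariance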
What kind of data augmentation is useful for that non-contrastive learning method?
Are we talking about, does that not matter that much?
Or it seems like a very important part of the step.
Yeah.
Are you generating images that are similar, but sufficiently different?
Yeah, that's right.
It's an important step, and it's also an annoying step because you need to have the knowledge of what
data augmentations you can do that do not
change the nature of the object.
And so the standard scenario, which a lot of people
working in this area are using, is to use a set of distortions.
So basically, you do geometric distortions.
One basically just shifts the image a little bit.
It's called cropping.
Another one kind of changes the scale a little bit.
Another one kind of rotates it.
Another one changes the colors.
You can do a shift in color balance
or something like that.
Saturation, another one sort of blurs it,
another one adds noise.
So you have like a catalog of kind of standard things
and people try to use the same ones for different algorithms so that they can compare.
But some algorithms, some self-supervised algorithm actually can deal with much bigger, like more aggressive data augmentation and some don't.
So that kind of makes a whole thing difficult.
But that's the kind of distortions we're talking about.
And so you train with those distortions, and then you chop off the last layer
or couple layers of the network, and you use the representation as input to a classifier,
you train the classifier on ImageNet, let's say, or whatever, and measure the performance.
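Concretely, the catalog of standard distortions and the linear-evaluation protocol described here might look roughly like this with torchvision; the exact parameters vary from paper to paper, and the backbone is whatever network was pretrained.

    import torch
    from torchvision import transforms

    # Two random "views" of the same image, drawn from the standard catalog of distortions
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),                  # crop / shift / rescale
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),         # color balance, saturation
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=23),            # blur
        transforms.ToTensor(),
    ])

    def two_views(pil_image):
        return augment(pil_image), augment(pil_image)

    # Linear evaluation: freeze the pretrained backbone, train only a classifier on top
    def linear_probe(backbone, feat_dim, num_classes=1000):
        for p in backbone.parameters():
            p.requires_grad = False
        return torch.nn.Sequential(backbone, torch.nn.Linear(feat_dim, num_classes))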
And interestingly enough, the methods that are really good
at eliminating the information that is irrelevant,
which is the distortions between those images,
do a good job at eliminating it.
And as a consequence, you cannot use the representations
in those systems for things like object detection
and localization because that information is gone.
So the type of data augmentation you need to do depends on the task you eventually want
the system to solve, and the standard data augmentations that we use today are
only appropriate for object recognition or image classification.
They're not appropriate for things like...
Can you help me understand why localization doesn't work here?
So, you're saying it's just not good at the negative, like classifying the negative, so that's why it can't be used for localization?
No, it's just that you train the system, you know, you give it an image, and then you
give it the same image shifted and scaled, and you tell it that's the same image.
So the system basically is trained to eliminate the information about position and size.
So now, and now you want to use that,
like where an object is and what size is it?
Like a bounding box. Okay, so it can still find
an object in the image, it's just not very good at finding the exact boundaries of that object.
Interesting. Which, you know, that's an interesting sort of philosophical question:
how important is object localization anyway? We're, like, obsessed by measuring image
segmentation, obsessed by measuring perfectly knowing the boundaries of objects when arguably
that's not that essential to understanding what are the contents of the scene.
On the other hand, I think evolutionarily,
the first vision systems in animals
were basically all about localization,
very little about recognition.
And in the human brain, you have two separate pathways
for recognizing the nature of a scene on an object
and localizing objects.
So you use the first pathway, called
the ventral pathway, for telling what you're looking at.
The other pathway, the dorsal pathway,
is used for navigation, for grasping, for everything else.
And basically a lot of the things you need for survival
are localization and detection.
Is similarity learning or contrastive learning or these non-contrastive methods the same as understanding something?
Just because you know the distorted cat is the same as a non-distorted cat, does that mean you understand what it means to be a cat?
To some extent, I mean it's a superficial understanding obviously.
But like what is the ceiling of this method, do you think? Is this just one trick on the path to doing self-supervised learning?
Can we go really, really far?
I think we can go really far.
So if we figure out how to use techniques of that type,
perhaps very different, but the same nature,
to train a system from video to do video prediction essentially. I think we'll have
a path, you know, towards, I wouldn't say unlimited, but a path towards some level of, you know,
physical common sense in machines. And I also think that that ability to learn how the world works from a sort of high-throughput
channel like vision is a necessary step towards real artificial intelligence.
In other words, I believe in grounded intelligence. I don't think we can train a machine to be
intelligent purely from text because I think the amount of
information about the world that's contained in text is tiny compared to
what we need to know. So for example, and people have attempted to do this for thirty years,
right, the Cyc project and things like that, right, of basically kind of writing down all the
facts that are known and hoping that some sort of common sense would emerge.
I think it's basically hopeless.
Let me take an example. You take an object. I describe a situation to you. I take an object, I put it on the table, and I push the table.
It's completely obvious to you that the object will be pushed with the table, because it's sitting on it.
There is no text in the world I believe that explains this.
And so if you train a machine, as powerful as it could be, you know, your GPT-5000
or whatever it is, it's never going to learn about this. That information is just not present
in any text. Well, the question, like with the Cyc project, the dream I think is to have, like, 10 million,
say, facts like that, that give you a head start, like a parent guiding you.
Now we humans don't need a parent to tell us that the table will move, sorry, the smartphone
will move with the table.
But we get a lot of guidance
in other ways. So it's possible that we can give it a quick shortcut.
What about cats, again? How did they get that?
No, but they evolved. So no, they learned, like us.
The, sorry, the physics of stuff.
Yeah.
Well, yeah, so you're saying it's, you're putting a lot of intelligence onto the nurture
side, not the nature.
Yeah.
Because we seem to have, you know, there's a very inefficient, arguably, process of evolution
that got us from bacteria to who we are today.
Started at the bottom now we're here.
So the question is how,
okay, see, the question is how fundamental is that the nature of the whole hardware?
And then is there any way to shortcut it if it's fundamental?
If it's not, if it's most of intelligence, most of the cool stuff we've been talking about
is mostly nurture, mostly trained, we figured out by observing the world,
we can form that big beautiful
sexy background model that you're talking about just by sitting there.
Then okay, then you need to, then like maybe it is all self-supervised learning all the way
down.
It's all self-supervised learning.
Whatever it is that makes human intelligence different from other animals, which a lot
of people think is language and logical reasoning and this kind of stuff, it cannot be that
complicated because it only popped up in the last million years.
It only involves less than 1% of our genome, which is the difference between the human genome
and chimps or whatever.
So it can't be that complicated, you know, it can't be that fundamental. I mean, most of the really complicated stuff already exists in cats and dogs and, you know, certainly primates, non-human
primates. Yeah, that little thing with humans might be just something about social interaction and ability to maintain ideas
across a collective of people.
It sounds very dramatic and very impressive, but it probably isn't mechanistically speaking.
It is, but we're not there yet.
We have, I mean, this is number 634 in the list of problems we have to solve.
So basic physics of the world is number one. What do you, just a quick tangent on data augmentation?
So a lot of it is hard coded versus learned.
Do you have any intuition that maybe there could be some weird data augmentation,
like a generative type of
data augmentation, like doing something weird to images which then
improves the similarity learning process? So not just kind of dumb,
simple distortions. But you're shaking your head, just saying that even simple distortions are enough. I think, no, I think data augmentation is a temporary, necessary evil.
So what people are working on now is two things.
One is the type of self-supervised learning,
like trying to translate the type of self-supervised learning people
use in language over to images,
which is basically the denoising autoencoder method.
So you take an image, you block, you mask some parts of it,
and then you train some giant neural net
to reconstruct the parts that you are missing.
And until very recently, there was no working methods for that.
All the autoencoder type methods for images
weren't producing very good representation,
but there's a paper now coming out of the FAIR group at Menlo Park that actually works very well.
So that doesn't require data augmentation, it requires only masking.
Only masking for images.
Okay.
Right, so you mask part of the image and you train a system, which, you know, in this case is a transformer, because
the transformer represents the image as
non-overlapping patches, so it's easy to mask patches and things like that.
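A minimal sketch of the random patch masking such a masked-autoencoder setup relies on; the patch size and masking ratio are illustrative, and this is not the specific FAIR implementation being referred to.

    import torch

    def random_patch_mask(images, patch=16, mask_ratio=0.75):
        # images: [N, C, H, W]; split into non-overlapping patches, keep a random subset
        n, c, h, w = images.shape
        patches = images.unfold(2, patch, patch).unfold(3, patch, patch)   # [N, C, H/p, W/p, p, p]
        patches = patches.contiguous().view(n, c, -1, patch, patch)        # flatten the patch grid
        num_patches = patches.shape[2]
        keep = int(num_patches * (1 - mask_ratio))
        idx = torch.rand(n, num_patches).argsort(dim=1)[:, :keep]          # random subset per image
        visible = torch.gather(
            patches, 2, idx[:, None, :, None, None].expand(-1, c, -1, patch, patch))
        # the encoder sees only `visible`; a decoder is trained to reconstruct the masked patches
        return visible, idx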
Okay, then my question transfers to that problem then: masking. Like, why should the mask be a square or rectangle?
So it doesn't matter. Like, you know, I think we're gonna come up, probably in the future, with sort of,
you know, ways to mask that are, you know, kind of
random, essentially.
Well, I mean, they are random already, but, no, no, but like something that's challenging,
like
optimally challenging. So, I mean, maybe it's a metaphor that doesn't apply, but it seems like with
data augmentation or masking, there's an interactive element to it. Like you're almost playing with an image, and it's like the way we play with an image in our minds.
It's like dropout. It's like Boltzmann machine training.
You know, every time you see a percept,
you can perturb it in some way.
And then the principle of the training procedure
is to minimize the difference of the output
or the representation between the clean version
and the corrupted version, essentially, right?
And you can do this in real time, right?
So, you know, Boltzmann machines work like this, right? You show a percept, and you tell the machine
that's a good combination of activities of your input neurons. And then you either let
them go their merry way without clamping them to values, or you only do this with a subset.
And what you're doing is you're training the system so that the stable state of the entire network
is the same regardless of whether it sees the entire input or whether it sees only part of it.
You know, the denoising autoencoder method is basically the same thing, right?
You're training a system to reproduce the complete input, filling in the blanks
regardless of which parts are missing, and that's really the underlying principle.
And you could imagine even in the brain some sort of neural principle where neurons
can oscillate, right?
So they detect their activity, and then temporarily they kind of shut off to force the rest of
the system to basically reconstruct the input without their help.
You could imagine more or less biologically plausible processes.
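A toy sketch of the "clean versus corrupted percept" principle described here, with dropout-style perturbation; note this is only the bare idea, and in practice one needs an extra mechanism (contrastive pairs, VICReg-style regularization, or a momentum target) to keep the representation from collapsing.

    import torch
    import torch.nn.functional as F

    def perturb(x, drop_prob=0.3):
        # corrupt a percept by zeroing out a random subset of its inputs
        return x * (torch.rand_like(x) > drop_prob).float()

    def clean_vs_corrupted_loss(encoder, x):
        with torch.no_grad():
            target = encoder(x)            # representation of the clean percept
        pred = encoder(perturb(x))         # representation of the corrupted percept
        return F.mse_loss(pred, target)    # make them agree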
I guess with this denoising autoencoder and masking and data augmentation, you don't have
to worry about being super efficient.
You can just do as much as you want and get better over time.
Because I was thinking like you might want to be clever about the way you do all these
procedures, but that's only if it's somehow costly to do every iteration, but it's not
really.
Not really, maybe.
And then there is, you know, data augmentation without a specific data augmentation; there is data augmentation
by waiting, which is the sort of video prediction setting.
You're observing a video clip, and observing the continuation
of that video clip,
and you try to train a representation using the joint embedding
architectures in such a way that the representation
of the future clip is easily predictable from the representation of the observed clip.
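Sketched in code, that "data augmentation by waiting" setup could look roughly like the following, where the point is to predict the representation of the future clip rather than its pixels; encoder and predictor here are placeholder modules.

    import torch
    import torch.nn.functional as F

    def video_prediction_loss(encoder, predictor, past_clip, future_clip):
        z_past = encoder(past_clip)            # representation of the observed clip
        with torch.no_grad():
            z_future = encoder(future_clip)    # target representation (stop-gradient)
        return F.mse_loss(predictor(z_past), z_future)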
Do you think YouTube has enough raw data from which to learn how to be a cat?
I think so. So the amount of data is not the constraint.
No, it would require some selection, I think.
Some selection.
Some selection of, you know, maybe the right type of data.
You need some.
So to pull it out of the rabbit hole of just cat videos, you might need to watch some
lectures or something.
No, you wouldn't.
How meta would that be if it, like, watches lectures about intelligence and then learns, watches
your lectures,
and learns from that how to be intelligent.
That would be enough.
Do you find multimodal learning interesting? The thing
we've been talking about, vision and language,
like combining those together, maybe audio,
all those kinds of things?
There's a lot of things that I find interesting
in the short term, but are not addressing the important
problem that I think are really kind of the big challenges. So I think, you know, things like
multitask learning, continual learning, you know, adversarial issues. I mean, those have,
you know, great practical interests in the relatively short term possibly, but I don't think they're
fundamental, you know, active learning, even to some extent,
reinforcement learning.
I think those things will become either obsolete or useless or easy
once we figured out how to do
self-supervised representation learning or
learning predictive world models.
And so I think that's what, you know,
the entire community should be focusing
on. At least people who are interested in sort of fundamental questions or, you know,
really kind of pushing the envelope of AI towards the next stage. But of course, there is
like a huge amount of, you know, very interesting work to do in sort of practical questions that
have, you know, short-term impact. Well, you know, it's difficult to talk about the temporal
scale because all of human
civilization will eventually be destroyed, because the sun will die out. And even
if Elon Musk is successful with multi-planetary colonization across the galaxy, eventually
the entirety of it would just become giant black holes. And that's going to take a
while, though. So, but what I'm saying is that that logic can be used to say it's all meaningless.
I'm saying all that to say that multitask learning, you're calling it practical or
pragmatic or whatever,
might be the thing that achieves something very akin to intelligence while we're trying to solve
the more general problem of self-supervised learning of background knowledge.
So that's the reason I bring it up; maybe that's a way to ask the question.
I've been very impressed by what the Tesla Autopilot team is doing.
I don't know if you got any chance to glance at this particular example of multitask
learning, where they're literally taking the problem,
like, I don't know, Charles Darwin studying animals,
they're studying the problem of driving and asking,
okay, what are all the things you have to perceive?
And the way they're solving it is, one,
there's an ontology where you're bringing that to the table,
so you're formulating it as many different tasks,
like over 100 tasks or something like that, that are involved in
driving. And then they're deploying it and then getting data back from people that run
into trouble, and they're trying to figure out, do we add tasks? Do we focus on each
individual task separately?
Sure. In fact, I would say I'll classify Andrej
Karpathy's talk in two ways. So one was about doors and the other one about
how much image net sucks.
He'll go back and forth on those two topics,
with ImageNet sucks meaning you can't just use
a single benchmark.
You have to have a giant suite
of benchmarks to understand how well
your system actually works.
Right, we agree, I mean,
he's a very sensible guy.
Now, OK, it's very clear that if you're
faced with an engineering problem that you need to solve
in a relatively short time, particularly if you have Elon Musk
breathing down your neck, you're going to have to take shortcuts.
You might think about the fact that the right thing to do,
the long-term solution, involves some fancy
self-supervised learning, but you have Musk breathing
down your neck, and this involves human lives.
And so you have to basically just do
the systematic engineering and fine-tuning
and refinements and trial and error and all that stuff.
There's nothing wrong with that.
That's called engineering.
That's called putting technology out in the world.
You have to kind of ironclad it before you do this. You know, so much for, you know, grand ideas and principles.
But you know, I'm placing myself sort of, you know,
quite a bit upstream of this.
You're like Plato, you think about platonic forms.
Yeah, because eventually I want this stuff to get used, but it's
okay if it takes five or 10 years for the community to realize this is the right thing to
do. I've done this before.
It's been the case before that, you know, I've made that case.
I mean, if you look back in the mid-2000, for example, and you ask yourself the question,
okay, I want to recognize cars or faces or whatever.
You know, I can use convolutional nets, or I can use more conventional
kind of computer vision techniques, you know, using, you know, feature detectors or SIFT,
density features, and, you know, sticking an SVM on top.
At that time, the data sets were so small that those methods that used more
manual engineering worked better than ConvNets.
There was just not enough data for ConvNets.
And ConvNets were a little slow
with the kind of hardware that was available at the time.
And there was a sea change when, basically,
data sets became bigger and GPUs became available.
Those were the two main factors that basically made people change their
mind. And you can look at the history of all subbranches of AI or pattern recognition.
And there is a similar trajectory followed by techniques where people start by engineering the hell out of it.
Be it optical character recognition, speech recognition,
computer vision, like image recognition in general,
natural language understanding, translation,
things like that, right?
You start to engineer the hell out of it.
You start by incorporating all the knowledge,
the prior knowledge you have about image formation,
about the shape of characters, about morphological operations,
about feature extraction, Fourier transforms,
Zernike moments, whatever, right?
People have come up with thousands of ways
of representing images so that they could be easily
classified afterwards, same for speech recognition, right?
It took, you know, two decades for people to figure out
a good front end to preprocess speech signals
so that, you know, the information about what is being said
is preserved, but most of the information about the identity
of the speaker is gone, you know,
cepstral coefficients or whatever, right?
And same for text, right?
You do named entity recognition, and then you parse,
and you do tagging of the parts of speech,
and you know, you do this sort of tree representation
of clauses and all that stuff, right, before,
you can do anything.
So that's where it starts, right?
Just engineer the hell out of it.
And then you start having data
and maybe you have more powerful computers,
maybe you know something about statistical learning.
So you start using machine learning
and it's usually a small sliver on top of your
kind of handcrafted system where you extract features by hand.
Okay, and now, you know, nowadays,
the standard way of doing this
is that you train the entire thing
end to end with a deep learning system,
and it learns its own features.
And speech recognition systems nowadays,
or OCR systems, are completely end to end.
It's some giant neural net that takes raw waveforms,
and produces a sequence of characters coming out.
And it's just a huge neural net, right?
There is no, you know, Markov model,
there's no language model that is explicit,
other than, you know, something that's ingrained
in the sort of neural language model,
if you want, same for translation, same for all kinds of stuff.
So you see this continuous evolution
from, you know, less and less hand crafting
and more and more learning.
And I think it's true in biology as well.
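As an illustration of what "end to end, raw waveform in, characters out" means in practice, here is a toy sketch; the layer sizes are arbitrary and real systems are vastly larger.

    import torch

    class TinySpeechNet(torch.nn.Module):
        def __init__(self, num_chars=29):                     # e.g. a-z, space, apostrophe, blank
            super().__init__()
            # a learned front end over the raw waveform, instead of hand-crafted cepstral features
            self.frontend = torch.nn.Conv1d(1, 64, kernel_size=400, stride=160)
            self.rnn = torch.nn.GRU(64, 128, batch_first=True, bidirectional=True)
            self.head = torch.nn.Linear(256, num_chars)

        def forward(self, waveform):                          # waveform: [N, 1, samples]
            feats = torch.relu(self.frontend(waveform)).transpose(1, 2)   # [N, T, 64]
            out, _ = self.rnn(feats)
            return self.head(out)                             # [N, T, num_chars], e.g. for a CTC loss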
So I mean, we might disagree about this, maybe not this one little piece at the end, you
mentioned active learning.
It feels like active learning, which is the selection of data and also the entire activity
needs to be part of this giant neural network.
You cannot just be an observer to do self-supervised learning.
You have to, well, self-supervised learning is just the word, but I would, whatever this
giant stack of a neural network that's automatically learning, it feels, my intuition is that you have to have a system, whether
it's a physical robot or a digital robot that's interacting with the world and doing so in
a flawed way and improving over time in order to form the self-supervised learning, well,
you can't just give it a giant sea of data.
Okay, I agree and I disagree.
I agree in the sense that I agree in two ways.
The first way I agree is that if you want,
and you certainly need a causal model of the world
that allows you to predict the consequences of your actions,
to train that model, you need to take actions, right?
You need to be able to act in a world and see the effect
for you to learn causal models of the world.
So that's not obvious because you can observe others.
You can observe others.
And you can infer that they're similar to you,
and then you can learn from that.
Yeah, but then you have to kind of hardwire
that part, mirror neurons and all that stuff.
So, and it's not clear to me how
you would do this in a machine. So, I think the action part would be necessary for having
causal models of the world. The second reason it may be necessary, or at least more efficient,
is that active learning basically goes for the jugular of what you don't know, right?
There are obviously areas of uncertainty about your world and about how the world behaves, and
you can resolve this uncertainty by systematic exploration of the part that you don't know.
And if you know that you don't know, then it makes you curious.
You kind of look into situations like that.
And across the animal world,
the different species are different levels of curiosity,
right?
Depending on how they're built, right?
So, you know, cats and rats are incredibly curious.
Dogs not so much, I mean, less.
Yeah.
So it could be useful to have that kind of curiosity.
So it'd be useful.
But curiosity just makes the process faster.
It doesn't make the process exist.
Yeah.
So what process, what learning process
is it that active learning makes more efficient?
And I'm asking that first question.
You know, we haven't answered that question yet.
So, you know, I'll worry about active learning later; this question is the more fundamental question
to ask.
And if active learning or interaction increases the efficiency of the learning, see sometimes
it becomes very different if the increase is several orders of magnitude, right?
That's true.
But fundamentally it's still the same thing.
And building up the intuition about how to, in a self-supervised way to construct background models, efficient or inefficient is the core problem.
What do you think about Yoshua Bengio talking about consciousness and all of these kinds of concepts?
Okay, I don't know what consciousness is, but it's a good opener.
And to some extent a lot of the things that are said about consciousness remind me of the questions
people were asking themselves in the 18th century or 17th century, when they discovered, you know, how the eye works
and the fact that the image at the back of the eye
was upside down, right, because you have a lens.
And so on your retina, the image that forms
is an image of the world, but it's upside down.
How is it that you see right side up?
And, you know, with what we know today in science,
you know, we realize this question doesn't make any sense.
Or it's kind of ridiculous in some way, right?
So I think a lot of what is said about consciousness is of that nature.
Now that said, there is a lot of really smart people that for whom I have a lot of respect
who are talking about this topic, people like David Chalmers, who is a colleague of mine at NYU.
I have kind of an unorthodox, folk, speculative hypothesis about consciousness.
So we were talking about this idea of a world model.
And I think our entire pre-frontal cortex basically is the engine for a world model.
But when we are attending at a particular situation, we're focused on
that situation. We basically cannot attend to anything else. And that seems to suggest
that we basically have only one world model engine, you know, in our prefrontal cortex.
That engine is configurable to the situation at hand. So whether we are building a box out of wood, or we are driving down the highway, or playing chess,
We basically have a single model of the world that we configure into the situation at hand,
which is why we can only attend to one task at a time.
Now, if there is a task that we do repeatedly,
it goes from the sort of deliberate reasoning
using model of the world in prediction and perhaps something like model predictive control,
which I was talking about earlier,
to something that is more subconscious, that becomes automatic.
So I don't know if you've ever played against a chess grandmaster.
I get wiped out in 10 plies, right?
And I have to think about my move for 15 minutes.
And the person in front of me, the Grandmaster,
would just react within seconds, right?
He doesn't need to think about it.
That's become part of subconscious
because it's basically just pattern recognition
at this point.
Same, the first few hours you drive a car, you are really attentive, you can't do anything
else and then after 20, 30 hours of practice, 50 hours, subconscious, you can talk to the
person next to you, things like that, right?
Unless the situation becomes unpredictable, and then you have to stop talking.
So that suggests you only have one model in your head.
And it might suggest the idea that consciousness, basically,
is the module that configures this world model of yours.
You need to have some sort of executive overseer
that configures your world model for the situation at hand.
And that leads to the kind of really curious concept that consciousness
is not a consequence of the power of our minds, but of the limitation of our brains. Because we
have only one world model, we have to be conscious. If we had as many world models as
situations we encounter, then we could handle all of them simultaneously and we wouldn't need this
sort of executive control that we call consciousness.
Yeah, interesting.
And somehow maybe that executive controller,
I mean, the hard problem of consciousness,
there's some kind of chemicals and biology
that's creating a feeling,
like it feels to experience some of these things.
That's kind of like the hard question is,
what the heck is that and why is that useful?
Maybe the more pragmatic question.
Why is it useful to feel like this is really you experiencing this versus just like information
being processed?
It could be just a very nice side effect of the way we evolved. That's just very useful to feel a sense of ownership
to the decisions you make, to the perceptions you make,
to the model you're trying to maintain,
like you own this thing, and this is the only one you got,
and if you lose it, it was going to really suck.
And so you should really send the brain some signals about it. What ideas do you believe might be true, that most or at least many people disagree with?
Let's say in the space machine learning.
Well, it depends who you talk about, but I think so certainly there is a bunch of people
who are nativist, who think that a lot of the basic things about the world
are kind of hardwired in our minds.
Things like, the world is three-dimensional, for example, is that hardwired?
Things like, you know, object permanence is something that we learn, you know, before
the age of three months or so, or are we born with it?
And there is, you know, wide disagreement among cognitive scientists about this.
I think those things are actually very simple to learn.
Is it the case that the oriented edge detectors in V1
are learned or are they hardwired?
I think they are learned.
They might be learned before birth,
because it's really easy to generate signals from the retina
that actually will train edge detectors.
So, and again, those are things that can be learned
within minutes of opening your eyes,
or I mean, since the 1990s,
we have algorithms that can learn oriented edge detectors,
completely unsupervised, with the equivalent
of a few minutes of real time.
So those things have to be learned.
And there's also those MIT experiments
where you kind of plug the optic nerve into the auditory cortex of a baby ferret, and that auditory
cortex becomes a visual cortex, essentially. So clearly, there's learning taking place there.
So I think a lot of the things people think are so basic that they need to be hardwired,
I think a lot of those things are actually learned,
because they are easy to learn.
So you put a lot of value in the power of learning.
What kind of things do you suspect might not be learned?
Is there something that could not be learned?
So your intrinsic drives are not learned.
Those are the things that make humans human, or make cats different from dogs,
right? It's the basic drives that are kind of hardwired in our basal ganglia. I mean,
there are people who are working on this kind of stuff. This is called intrinsic motivation
in the context of reinforcement learning. So these are objective functions where the reward doesn't
come from the external world; it's computed by your own brain.
Your own brain computes whether you're happy or not, right?
It measures your degree of comfort or discomfort.
And because it's your brain computing this,
presumably it also knows how to estimate gradients.
So that's right.
So it's easier to learn when your objective is intrinsic.
So that has to be hardwired. The critic that makes long-term prediction of the outcome,
which is the eventual result of this, that's learned. And perception is learned,
and your model of the world is learned. But let me take an example
of how the critic may be learned, right?
If I come to you,
I reach across the table and I pinch your arm, right?
Complete surprise for you.
You would not have expected this,
I was expecting that the whole time, but yes.
Let's say for the sake of the story, yes.
Okay, your basal ganglia is going to light up
because it's going to hurt.
And now your model of the world includes the fact
that I may pinch you if I approach my.
My don't trust humans.
Right.
My hand to your arm.
So if I try again, you're going to recoil.
And that's your critic.
Your predictor, your
predictor of your ultimate pain system, predicts that something is going to happen
and you recoil to avoid it.
So even that can be learned.
That is learned, definitely.
This is also what allows you to define subgoals. So the fact that, as a schoolchild,
you wake up in the morning and you go to school,
and it's not because you necessarily like waking up early
and going to school, but you know that there is a long-term
objective you're trying to optimize.
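One standard way such a critic could be learned, as an illustration only (the algorithm is not specified in the conversation), is a temporal-difference update toward the hardwired intrinsic reward:

    import torch
    import torch.nn.functional as F

    def critic_update(critic, state, next_state, intrinsic_reward, gamma=0.99):
        # `intrinsic_reward` comes from the hardwired objective (pain/comfort);
        # the critic is the learned part that predicts long-term outcomes.
        with torch.no_grad():
            target = intrinsic_reward + gamma * critic(next_state)
        return F.mse_loss(critic(state), target)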
So Ernest Becker, I'm not sure if you're familiar
with the philosopher, he wrote the book The Denial of Death.
And his idea is that one of the core motivations of human beings is our terror of death, our fear of death. That's what makes us unique
from cats. Cats are just surviving. They do not have a deep, like cognizance, introspection that
over the horizon is the end. And he says that, I mean, there's terror management theory and all these psychological
experiments that show basically this idea that all of human civilization, everything we
create, is kind of trying to forget, even for a brief moment, that we're going to die.
When do you think humans understand that they're
going to die? Is it learned early on also?
I don't know what point, I mean, it's a question like, you know, at what point do you realize
that, you know, what that really is? And I think most people don't actually realize what
that is, right? I mean, most people believe that you go to heaven or something. Right. So the push back on that, what Ernest Becker says and Sheldon Solomon, all of those folks,
and I find those ideas a little bit compelling is that there is moments in life, early in life.
A lot of this happens early in life, when you do deeply experience the terror
of this realization, and all the things you think about, about religion, all those kinds of things, that we kind of associate more with teenage years and later.
We're talking about way earlier.
No, it's like seven or eight years, something like that.
You realize Holy crap.
This is like the mystery, the terror. Like, it's almost like you're a little prey,
a little baby deer sitting in the darkness
of the jungle of the woods,
looking all around you, the darkness full of terror.
I mean, that's, that realization says,
okay, I'm gonna go back in the comfort of my mind
where there is a deep meaning,
where maybe I pretend I'm immortal,
and in whatever way, whatever kind of idea I can construct to help me believe
that I'm immortal. Religion helps with that. You can delude yourself in all kinds of ways.
Like, lose yourself in the busyness of each day, have little goals in mind, all those kinds of
things to think that it's going to go on forever. And you kind of know you're going to die, yeah.
And it's going to be sad, but you don't really understand that you're going to die.
And so that's their idea. And I find that compelling because it does seem to be a core unique aspect of human nature that we were able to think that we're going...
We're able to really understand that this life is finite. That seems important.
There's a bunch of different things there. So first of all, I don't think there is that much of a difference between us and
cats in this regard.
I think the difference is that we just have a better ability to predict in the
long term.
And so we have a better understanding of how the world works.
So we have a better understanding of the finiteness of life and things like that.
So we have a better planning engine than cats?
Yeah.
But what's the motivation for planning that?
Well, I think it's just a side effect to the fact that we have just a better planning
engine because it makes us, as I said, you know, the essence of intelligence is the
ability to predict.
And so because we're smarter, as a side effect, we also have this ability to kind of make predictions about our own
future existence, or lack thereof. You said religion helps with that. I think religion hurts, actually.
It makes people worry about, like, you know, what's going to happen after their death, etc.
If you believe that, you know, you just don't exist after death, then, you know, it completely solves the problem.
Wait, so you're saying if you don't believe in God, you don't worry about what happens
after death. Yeah. I don't know. You worry about this life because that's the only one you have.
I think it's, well, I don't know. If I were to say what Ernest Becker says, and I'd say
I agree with him more than not, it's that you do deeply worry.
If you believe there's no God, there's still a deep worry of the mystery of it all.
Like how does that make any sense that it just ends?
I don't think we can truly understand that it just ends, right? I mean, so much of our life, our
consciousness, our ego, is invested
in this being.
And then science keeps bringing humanity down from its pedestal.
And that's another example of it.
That's wonderful.
But for us individual humans, we don't like to be brought down from our pedestal.
You're saying like, but see, you're fine with it because, well, so what Ernest Becker would
say is you're fine with it because that's just a more peaceful existence for you, but
you're not really fine.
You're hiding from it.
In fact, some of the people that experienced the deepest trauma earlier in life will
often, before they seek extensive therapy, say that they're fine.
It's like when you talk to people who are truly angry,
how are you doing? I'm fine. The question is, what's going on? I had a near-death experience.
I had a very bad motorbike accident when I was 17. But that didn't have any impact on my
reflection on that topic. So I'm basically just playing a bit of devil's advocate,
giving pushback and wondering,
is it truly possible to accept death?
And the flip side, which is more interesting, I think,
for AI and robotics, is how important is it to have
this as one of the suite of motivations,
is to not just avoid falling off the roof
or something like that, but ponder the end of the ride. If you listen to the stoics, it's a great motivator.
It adds a sense of urgency.
So to truly fear death, or be cognizant of it, might give a deeper meaning and urgency to the moment,
to live fully.
Well, maybe I don't disagree with that.
I mean, I think what motivates me here is,
you know, knowing more about human nature.
I mean, I think human nature and human intelligence
is a big mystery. It's a scientific mystery, in addition to, you know, a philosophical one, etc. But, you
know, I'm a true believer in science. And I do have kind of a belief that for complex
systems like the brain or the mind, the way to understand them is to try to reproduce them with artifacts
that you build, because you know what's essential to it when you try to build it.
You know, the same way I've used this analogy before with you, I believe, the same way we
only started to understand aerodynamics when we started building airplanes, and that
helped us understand how birds fly. So I think there's kind of a similar process here where we don't have a theory of
full theory of intelligence, but building intelligent artifacts will help us perhaps develop some
underlying theory that encompasses not just artificial implementations, but also
human and biological intelligence in general.
So you're an interesting person to ask this question about all kinds of different other
intelligent entities or intelligences.
What are your thoughts about the Turing test or the Chinese room question?
If we create an AI system that exhibits a lot of properties of intelligence and consciousness,
how comfortable are you thinking of that entity as intelligent or conscious?
So you're trying to build now systems that have intelligence and there's metrics about
their performance, but that metric is external.
Okay, so are you okay calling a thing intelligent, or are you
going to be like most humans and be, once again, unhappy to be brought down from a pedestal
of consciousness, slash intelligence? No, I'll be very happy to understand more about human nature, human mind and human intelligence through the construction
of machines that have similar abilities.
And if a consequence of this is to bring humanity one notch down from its already
low pedestal,
I'm just fine with it.
That's just a reality of life.
So I'm fine with that.
Now you were asking me about things
that opinions I have that a lot of people
may disagree with.
I think if we think about the design
of an autonomous intelligence system,
so assuming that we are somewhat successful
at some level at getting machines to learn models
of the world, predictive models of the world,
we build intrinsic motivation objective functions
to drive the behavior of that system.
The system also has perception modules
that allows it to estimate the state of the world
and then have some way of figuring out
the sequence of actions that, you know,
to optimize a particular objective.
If it has a critic of the type I was describing before,
the thing that makes you recoil your arm the second time I try to pinch you,
then an intelligent autonomous machine will have emotions.
I think emotions are an integral part of autonomous intelligence.
If you have an intelligence system that is driven by intrinsic motivation by objectives.
If it has a critic that allows it to predict in advance
whether the outcome of a situation is gonna be good or bad,
it's going to have emotions, it's gonna have fear.
Yes.
When it predicts that the outcome is gonna be bad
and something to avoid, it's gonna have elation
when it predicts it's gonna be good.
If it has drives to relate with humans, in some ways, the way humans have,
it's going to be social. So it's going to have emotions about attachment and things of that type. So I think the sci-fi thing where you see commando data like having an emotion chip that you can turn off right, I think that's pretty good.
So, here's the difficult philosophical social question. Do you think there will be a time
like a civil rights movement for robots where, okay,
forget the movement, but a discussion, like the Supreme Court, that particular kinds of
robots, you know, particular kinds of systems deserve the same rights as humans because
they can suffer just as humans can,
all those kinds of things. Well, perhaps, perhaps not.
Like imagine that humans,
that you could, you know, die and be restored.
Like, you know, you could be sort of, you know,
3D reprinted and, you know, your brain could be reconstructed
in its finest details.
Arguably, rights would change in that case.
If you can always just, there's always a backup, you could always restore.
Maybe the importance of murder will go down one notch.
That's right.
But also your desire to do dangerous things like, you know, skydiving or, you know,
risky driving, you know, car racing or that kind of stuff, you know, would probably increase,
or, you know, airplane aerobatics or that kind of stuff, right?
I mean, we'd be fine doing a lot of those things, or exploring, you know, dangerous areas and things
like that. It would kind of change your relationship.
So now, it's very likely that robots would be like that, because, you know,
they'll be based on, perhaps, technology that is somewhat similar to
this technology, and you can always have a backup.
So it's possible. I don't know if you're into video games, but there's a game called Diablo, and my sons are huge fans of it.
Yes.
In fact, they made a game that's inspired by it.
Awesome.
Like, built a game.
My three sons have a game design studio between them.
That's awesome.
They came out with a game.
They just came out last year.
No, this was last year, early last year.
About a year ago.
That's awesome.
But in Diablo, there's something called hardcore mode, where if you die,
you're gone. Right, that's it. And so it's possible with AI systems,
for them to be able to operate successfully, and for us to treat them in a certain way, because they have to be integrated in human society,
they have to be able to die. No copies allowed. In fact, copying is illegal.
It's possible with humans as well, like cloning will be illegal, even once possible.
But cloning is not copying, right? I mean, you don't reproduce the mind of the person.
The experience. Right. It's just a delayed twin.
But then as we were talking about with computers that you will be able to copy, you will be able to perfectly save pickle
the mind state. And it's possible that that will be illegal because that goes against
that will destroy the motivation of the system.
Okay, so let's say you have a domestic robot, okay, sometime in the future. Yes. And domestic robot, you know, comes to you kind of somewhat
pre-trained, you know, can do a bunch of things.
Yes.
But it has a particular personality that makes it slightly
different from the other robots because that makes them
more interesting.
And then because it's, you know, it's lived with you for
five years, you've grown some attachment to it and
vice versa.
And it's learned a lot about you
or maybe it's not hardware, but maybe it's a virtual assistant that lives in your augmented
reality glasses or whatever, right, you know, the "Her" movie type of thing, right. And that system, to
some extent the intelligence in that system is a bit like your child or maybe
your PhD student, in a sense that there's a lot of you in that machine now, right? And so
if it were a living thing, you would do this for free, if you want, right? If it's your child,
your child can, you know, then live his or her own life, and the fact that they learn stuff from you doesn't mean that you have
any ownership of it, right? But if it's a robot that you've trained, perhaps you have some
intellectual property claim.
About intellectual property, or I thought you meant more like permanence value, in the sense that it's part of you. Well, there's the permanence value,
right? So you would lose a lot if that robot were to be destroyed.
And you had no backup, you would lose a lot.
You would have a lot of investment.
You know, kind of like a person dying, you know,
that a friend of yours dying or a coworker or something like that.
But also, you have like intellectual property rights in the sense that system is fine-tuned
to your particular existence.
So that's now a very unique instantiation of that original background model, whatever
it was, that arrived.
And then there are issues of privacy, right?
Because now, imagine that robot has its own kind of volition and decides to work for
someone else, or thinks life with you is sort of untenable or whatever.
Right. Now, all the things that that system learned from you,
you know, how can you like, you know, delete all the personal information that that system knows about you?
Yeah. I mean, that would be kind of an ethical question. Like, can you erase the mind of an intelligent robot
to protect your privacy?
You can't do this with humans.
You can ask them to shut up, but that you don't have complete power over them.
You can't erase humans. It's the problem with relationships.
If you break up, you can't erase the other human with robots.
I think it will have to be the same thing with robots, that there has to be some
risk to our interactions to truly experience them deeply.
It feels like you have to be able to lose your robot friend, and have that robot friend go
tweeting about how much of an asshole you were.
But then are you allowed to murder the robot to protect your private information if
you're not allowed to leave?
I can imagine a situation where, for certain robots, it's almost like regulation.
If you declare your robot to be, let's call it sentient or something like that, a robot
that is designed for human interaction,
Then you're not allowed to murder these robots.
It's the same as murdering of the humans.
Well, but what about if you do a backup of the robot
that you preserve on a hard drive
or the equivalent in the future?
That might be illegal.
It's like piracy.
Piracy is illegal.
No, it's your own robot, right?
But you can't...
You don't...
But then you can wipe out
its brain.
So this robot doesn't know anything about you anymore.
But you still have, technically,
you still exist on it because you backed it up.
And then there'll be these great speeches
of the Supreme Court by saying, oh, sure,
you can erase the mind of the robot,
just like you can erase the mind of a human.
We both can suffer.
There'll be some epic, like, Obama-type
character with a speech that, like, the robots and the humans
are the same.
We can both suffer.
We can both hope.
We can both all of those, all those kinds of things,
raise families, all that kind of stuff.
It's interesting, just like you said,
emotions seem to be a fascinating,
powerful aspect of human interaction, human-robot interaction, and if they're
able to exhibit emotions, at the end of the day, that's probably going to have us
deeply consider human rights, like what we value in humans, what we value in
other animals. That's why robots and AI are great. They make us ask
many good questions. They're good questions. Yeah. But you asked about the Chinese
room type argument. Is it real if it looks real? I think the Chinese room argument is a
ridiculous one. So for people who don't know, the Chinese room is, I don't even know how
to formulate it well, but basically you can mimic the behavior of an intelligent system by just following a giant
algorithmic codebook that tells you exactly how to respond in each case.
But is that really intelligent? It's like a giant lookup table: when this person says this, you answer this; when this person says this, you answer this.
If you understand how that works,
you have this giant, nearly infinite lookup table, is that really intelligence? Because intelligence
seems to be a mechanism that's much more interesting and complex than this lookup table.
I don't think so. So, I mean, the real question comes down to, do you think,
you know, you can mechanize intelligence in some way, even if that involves learning.
And the answer is, of course, yes, there's no question.
There is a second question then, which is assuming you can reproduce intelligence in
different hardware than biological hardware, computers, can you match human intelligence in all the domains in
which humans are intelligent? Is it possible? So, the hypothesis of strong AI. The answer
to this, in my opinion, is an unqualified yes; this will, for sure, happen at some point. There's
no question that machines at some point
will become more intelligent than humans
in all domains where humans are intelligent.
This is not for tomorrow, it's gonna take a long time,
regardless of what, you know, Elon and others
have claimed or believed.
This is a lot harder than many of those guys think it is.
And many of those guys who thought it was simpler five years ago
now think it's hard, because it's been five years and they realize it's going to take a lot longer.
That includes a bunch of people at DeepMind, for example.
But I don't know, I haven't actually touched base with the DeepMind folks, but some of it, Elon or
Demis Hassabis, I mean, sometimes in your role, you have to kind of create deadlines that are nearer than farther away to kind
of create an urgency, because you know, you have to believe the impossible is possible
in order to accomplish it.
And there's of course a flip sides of that coin, but it's a weird, you can't be too cynical
if you want to get something done.
Absolutely.
I agree with that.
But, I mean, you have to inspire people to work on sort of ambitious things.
So it's certainly a lot harder than we believe, but there's no question in my mind that this
will happen.
And now, people are kind of worried about what does that mean for humans. They are going to be brought down from their pedestal, you know, a bunch of notches, with that, and, uh,
you know, is that going to be good or bad? I mean, it's just going to give us more power, right? It's an amplifier for human intelligence, really.
So speaking of doing cool, ambitious things, fair, the Facebook AI research group has recently celebrated its eighth
birthday, or maybe you can correct me on that.
Looking back, what has been the successes, the failures, the lessons learned from the
eight years of fair, and maybe you can also give context of where does the newly minted
meta AI fit into how does it relate to fair?
Right, so let me tell you a bit about the organization of all this.
Yeah, FAIR was created almost exactly eight years ago.
It wasn't called fair yet.
It took that name a few months later.
And at the time I joined Facebook,
There was a group called the AI group that had about 12 engineers and a few scientists, like, you
know, 10 engineers and two scientists or something like that. I ran it for three and a half years as
a director, you know, hired the first few scientists and kind of set up the culture and organized
it, you know, explained to the Facebook leadership what fundamental research was about and how it can work within industry and how it needs to be open and everything.
And I think it's been an unqualified success, in the sense that FAIR has simultaneously produced, you know, top level research and advanced the science and the technology, provided tools, open source
tools like PyTorch and many others.
But at the same time, it had a direct or mostly indirect impact on Facebook at the time,
now Meta, in the sense that a lot of the systems that Meta has built now are based on research projects that started at FAIR.
So if you were to take out deep learning out of Facebook services now and meta more generally,
I mean the company would literally crumble, I mean it's completely built around AI these days,
and it's really essential to the operations.
So what happened after three and a half years is that I changed role. I became chief scientist, so I'm not doing day-to-day management of FAIR anymore. I'm more of a kind of, you know,
thinking about strategy and things like that. And I conduct my own research. I have,
you know, my own kind of research group working on self-supervised learning and things
like that, which I didn't have time to do when I was director.
So now, FAIR is run by Joelle Pineau and Antoine Bordes together, because FAIR is kind of split
in two now. There's something called FAIR Labs, which is sort of bottom-up, science-driven research,
and FAIR Accel, which is slightly more organized
for bigger projects that require a little more kind of focus
and more engineering support and things like that.
So Joelle leads FAIR Labs and Antoine Bordes leads FAIR Accel.
Where are they located?
It's de-localized all over.
So there's no question that the leadership of the company
believes that this was a very worthwhile investment.
And what that means is that it's there for the long run.
So there is, if you want to talk in these terms, which I don't like, a business model, if you want, where FAIR, despite being a very fundamental research lab, brings a lot of value to the company, mostly indirectly, through other groups.
Now, what happened three and a half years ago when I stepped down was also the creation of Facebook AI, which was basically a larger organization that covers FAIR, so FAIR is included in it, but it also has other organizations that are focused on applied research or advanced development of AI technology that is more focused on the products of the company, with less emphasis on fundamental research.
Less fundamental, but it's still research. I mean, there are a lot of papers coming out of those organizations, and the people are awesome and wonderful to interact with.
But it serves as a way to kind of scale up, if you want, AI technology, which may be very experimental lab prototypes, into things that are usable.
So FAIR is a subset of Meta AI. Is FAIR going to become like KFC? It'll just keep the F, and nobody cares what the F stands for.
Well, we'll know soon enough, probably by the end of 2021. It's not a giant change: MAIR instead of FAIR. Well, MAIR doesn't sound too good, but the brand people are deciding on this. And they've been hesitating for a while now, and they tell us they're going to come up with an answer as to whether FAIR is going to change its name, or whether we're going to change the meaning of the F.
That's a good call. I would keep FAIR and change the meaning of the F. That would be my preference. You know, I would turn the F into Fundamental.
Oh, that's good. Fundamental AI Research. Oh, that's really good.
Yeah.
Within meta AI.
So this would be really sort of meta-fair.
Yeah.
But people would call it fair, right?
Yeah, exactly.
I like it.
And now Meta AI is part of Reality Labs. So Meta now, the new Facebook, it's called Meta, and it's kind of divided into, you know, Facebook, Instagram, WhatsApp, and Reality Labs. Reality Labs is about, you know, AR, VR, telepresence, communication technology, and stuff like that. You can think of it as the sort of new-products-and-technology part of Meta.
Is that where the touch sensing for robots is?
I saw you were posting about that.
But the touch sensing for robots is part of FAIR, actually.
Oh, it is?
It is.
Okay. But there's also the haptic glove, right?
Yes. That's more Reality Labs. That's Reality Labs Research.
By the way, the touch sensing is super interesting; like, integrating that modality into the whole sensing suite is very interesting.
So what do you think about the metaverse? What do you think about this whole kind of expansion of the view of the role of Facebook and Meta in the world?
Well, I mean, the metaverse really should be thought of as the next step of the internet, right?
So trying to, you know, make the experience more compelling of being connected either with other people
or with content.
We have evolved, and are trained, to live in 3D environments where we can see other people, we can talk to them when we're near them, and other people who are far away can't hear us, you know, things like that, right? So there are a lot of social conventions that exist in the real world that we can try to transpose. Now, the question is, eventually, how compelling is it going to be? Like, is it going to be the case that people are going to be willing to do this if they have to wear a huge pair of goggles all day?
Maybe not.
But then again, if the experience is sufficiently compelling, maybe so.
Or if the device you have to wear is basically a pair of glasses, if technology makes sufficient progress for that.
AR is a much easier concept to grasp that you're going to have augmented reality glasses
that basically contain some sort of virtual assistant that can help you in your daily lives.
But at the same time with the AR, you have to contend with the reality.
With VR, you can completely detach yourself from reality, so it gives you freedom.
It might be easier to design worlds in VR.
Yeah, but you can imagine, you know, the metaverse being a mix, right? Or, like, you can have objects that exist in the
metaverse that, you know, pop up on top of the real world or only exist in virtual
reality. Okay, let me ask the hard question. Because all of this was easy. This was easy.
The Facebook, now meta, the social network has been painted
by the media as net negative for society,
even destructive and evil at times.
You've pushed back against this, defending Facebook.
Can you explain your defense?
Yeah, so the description, the company that is being
described in some media is not the company we know when we work inside.
It could be claimed that a lot of employees are uninformed about what really goes on in the company.
But I'm a vice president. I mean, I have a pretty good vision of what goes on.
I don't know everything, obviously; I'm not involved in everything, certainly not in decisions about content moderation or anything like that.
But I have some decent vision of what goes on, and this evil that is being described, I just don't see it. And then I think there is an easy story to buy, which is that for all the bad things in the world and the reason your friend believes crazy stuff, there's an easy scapegoat in social media in general, and Facebook in particular.
We have to look at the data.
Like, is it the case that Facebook, for example,
polarizes people politically?
Are there academic studies that show this?
Is it the case that teenagers think less of themselves if they use Instagram more? Is it the case that people get more riled up against opposite sides in a debate or a political opinion if they are more on Facebook, or if they are less?
And study after study shows that none of this is true. These are independent studies by academics; they're not funded by Facebook or Meta. Studies by Stanford, by some of my colleagues at NYU, actually, with whom I have no connection.
You know, there's a recent study where they paid people, I think it was in the former Yugoslavia, I'm not exactly sure what part, to not use Facebook for a while in the period before the anniversary of the Srebrenica massacre. So people get riled up about, like, should we have a commemoration, I mean a memorial kind of commemoration, for it or not.
So they paid a bunch of people
to not use Facebook for a few weeks.
And it turns out that those people
ended up being more polarized
than they were at the beginning
and the people
who were more on Facebook were less polarized.
There's a study from economists at Stanford that tried to identify the causes of increasing polarization in the US.
And it's been going on continuously for 40 years, since before Mark Zuckerberg was born. And so if there is a cause, it's not Facebook or social media. You could say maybe social media just accelerated it, but no, I mean, it's basically a continuous evolution, by some measure of polarization, in the US.
And then you compare this with other countries, like the western half of Germany, because you can't go back 40 years in the east, or Denmark, or other countries. And they use Facebook just as much, and they're not getting more polarized; they're getting less polarized.
So if you want to look for a causal relationship there,
you can find a scapegoat, but you can't find a cause.
Now, if you want to fix the problem,
you have to find the right cause. And what drives me crazy is that people now are accusing Facebook of bad deeds that are done by others, and we're not doing anything about those others. And by the way, those others include the owner of the Wall Street Journal, in which all of those papers were published.
So I should mention I'm talking to Schrep, Mike Schroepfer, on this podcast, and also Mark Zuckerberg, and probably these are conversations I can have with them.
Because it's very interesting.
To me, even if Facebook has some measurable negative effect, you can't just consider that in isolation. You have to consider all the positive ways it connects us. So, like every technology, it's a cost-benefit question. You can't just say, like, there's an increase in division. Yes, probably Google's search engine has created an increase in division too, but we have to consider how much information it has brought to the world. I'm sure Wikipedia created more division, if you just look at the division; we have to look at the full context and the net effect it has on the world.
I mean, the printing press created more division.
Exactly.
I mean, so when the printing press was invented, the first books that were printed were things
like the Bible, and that allowed people to read the Bible by themselves,
not get the message uniquely from priests in Europe, and that created the Protestant movement and 200 years of religious persecution and wars. So that's a bad side effect of the printing press. Social networks aren't anywhere near as bad as the printing press in that respect, but nobody would say the printing press was a bad idea.
Yeah, a lot of this is perception, and there are a lot of different incentives operating here. Maybe a quick comment, since you're one of the top leaders at Facebook, at Meta, sorry, in the tech space: I'm sure Facebook involves a lot of incredible technological challenges that need to be solved. A lot of it probably is in the compute infrastructure, the hardware; I mean, it's just a huge amount. Maybe you can give me context about how much of Schrep's life is AI, how much of it is low-level compute, how much of it is flying around doing business stuff, and in the same way with Zuckerberg, Mark Zuckerberg. Do they really focus on AI?
I mean, certainly in the run-up to the creation of FAIR, and for at least a year after that, if not more, Mark was very, very much focused on AI and was spending quite a lot of effort on it. And that's his style. When he gets interested in something, he reads everything about it. You know, he read some of my papers, for example, before I joined. And so he learned a lot about it. He takes notes. Right. And Schrep was really into it also.
Schrep really has kind of, something I've tried to preserve also, despite my not-so-young age, which is a sense of wonder about science and technology.
And he certainly has that.
He's also a wonderful person.
I mean, in terms of like as a manager,
like dealing with people and everything,
Mark also actually.
I mean, they're very like, you know, very human people.
In the case of Mark, shockingly human, you know, given his trajectory. The picture of him that is painted in the press is completely wrong.
But you have to know how to play the press.
I put some of that responsibility on him too.
It's like being the conductor of an orchestra: you have to play the press and the public in a certain kind of way, where you convey your true self to them, if there's a depth there, and that kind of stuff. And he's probably not the best at it. So, yeah. Yeah, there's stuff to learn.
And it's sad to see, I haven't talked to him about it, but Schrep is stepping down. It's always sad to see folks be there for a long time and then slowly move on. I guess the passage of time is sad. I think he's done the thing he set out to do, and he's got family priorities and stuff like that. I understand, after 13 years or something. It's been a good run.
Which in Silicon Valley is basically a lifetime.
Yeah.
It's dog years.
So the NeurIPS conference just wrapped up. Let me just go back to something else. You posted that a paper you co-authored was rejected from NeurIPS. As you said, proudly, in quotes, "rejected." Can you talk about it? Yeah, I know.
Can you describe this paper and like what was the idea in it?
And also maybe this is a good opportunity to ask what are the pros and cons, what works and what doesn't about the review process?
Yeah, let me talk about the paper first.
I'll talk about the review process afterwards.
The paper is called VICReg. So this is, I mentioned it before, variance-invariance-covariance regularization. And it's a non-contrastive learning technique for what I call joint embedding architectures.
So Siamese nets are an example of a joint embedding architecture. So, a joint embedding architecture is, let me back up a little bit, right? If you want to do self-supervised learning, you can do it by prediction. So, let's say you want to train a system to predict video, right? You show it a video clip and you train the system to predict the next, the continuation of that video clip.
Now, because you need to handle uncertainty,
because there are many, you know,
many continuations that are plausible,
you need to have, you need to handle this in some way.
You need to have a way for the system
to be able to produce multiple predictions.
And the only way I know to do this is through what's called a latent variable. So you have some sort of hidden vector, a variable that you can vary over a set or draw from a distribution. And as you vary this vector over a set, the output, the prediction, varies over a set of plausible predictions. Okay, so that's called a generative latent variable model. Got it.
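As a rough illustration of the idea (a minimal sketch; the network sizes and names here are hypothetical, not from any specific paper), a latent-variable predictor might look like this: sampling different latent vectors z for the same context yields different plausible continuations.

```python
import torch
import torch.nn as nn

class LatentVariablePredictor(nn.Module):
    def __init__(self, context_dim=128, latent_dim=16, output_dim=128):
        super().__init__()
        # Decoder maps (context encoding, latent vector) -> predicted continuation.
        self.decoder = nn.Sequential(
            nn.Linear(context_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, output_dim),
        )

    def forward(self, context, z):
        # context: encoding of the observed segment; z: the latent variable.
        return self.decoder(torch.cat([context, z], dim=-1))

model = LatentVariablePredictor()
context = torch.randn(1, 128)  # stand-in for an encoded video segment
# Varying z sweeps the prediction over a set of plausible futures.
futures = [model(context, torch.randn(1, 16)) for _ in range(5)]
```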
Okay. Now, there is an alternative way to handle uncertainty: instead of directly predicting the next frames of the clip, you also run those through another neural net. So you now have two neural nets, one that looks at the initial segment of the video clip, another one that looks at the continuation, during training.
And what you're trying to do is learn a representation of those two video clips that is
maximally informative about the video clips themselves,
but is such that you can predict the representation
of the second video clip from the representation
of the first one, easily.
Okay.
And you can sort of formalize this in terms
of maximizing mutual information,
some stuff like that, but it doesn't matter.
What you want is informative representations
of the two video clips that are mutually predictable.
What that means is that there are a lot of details in the second video clip that are irrelevant. Let's say the video clip consists of a camera panning a room; there's going to be a piece of that room that is going to be revealed. And I can somewhat predict what that room is going to look like, but I may not be able to predict the details of the texture of the ground and where the tiles are ending and stuff like that, right? So those are irrelevant details that perhaps
my representation will eliminate. And so what I need is to train this second neural net in such a way that
whenever the continuation video clip varies over all the plausible continuations,
the representation doesn't change. Got it. So over the space of representations, you're doing the same kind of thing as you do with similarity learning.
Right.
Yeah.
So these are two ways to handle
multi-modality in the prediction.
In the first way,
you parameterize the prediction with a latent variable,
but you predict pixels, essentially.
In the second one,
you don't predict pixels,
you predict an abstract representation of pixels,
and you guarantee that this abstract representation has as much information as possible about the input, but sort of,
you know, drops all the stuff that you really can't predict, essentially.
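To make that second approach concrete, here's a rough sketch of a VICReg-style loss in PyTorch, assuming the two branches have already produced embeddings z_a and z_b; the coefficients and other details are illustrative placeholders, not the exact settings from the paper.

```python
import torch
import torch.nn.functional as F

def vicreg_style_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Variance-Invariance-Covariance regularization, sketched.

    z_a, z_b: (batch, dim) embeddings of the two views / two clips.
    """
    # Invariance: the two embeddings should predict each other.
    sim_loss = F.mse_loss(z_a, z_b)

    # Variance: keep each dimension's std above 1 to prevent collapse.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))

    # Covariance: decorrelate dimensions so they carry non-redundant information.
    def cov_term(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    cov_loss = cov_term(z_a) + cov_term(z_b)
    return sim_w * sim_loss + var_w * var_loss + cov_w * cov_loss
```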
I used to be a big fan of the first approach, and in fact, in that blog post with Ishan Misra, the one about the dark matter of intelligence, I was kind of advocating for it. And in the last year and a half, I've completely changed my mind. I'm now a big fan of the second one. And it's because of a small collection of algorithms that have been proposed over the last year and a half or two years to do this, including VICReg, its predecessor called Barlow Twins, which I mentioned, and a method from our friends at DeepMind called BYOL.
And there's a bunch of others now that work similarly. So they're all based on this idea of joint embedding. Some of them have an explicit criterion that is an approximation of mutual information. Some others, like BYOL, work, but we don't really know why. And there have been lots of theoretical papers about why BYOL works: no, it's not that, because we take it out and it still works, and, you know, blah, blah, blah. I mean, so there's a big debate.
But the important point is that we now
have a collection of non-contrastive joint embedding methods,
which I think is the best thing since sliced bread.
So I'm super excited about this because I think it's
the best shot for techniques that would allow us to kind of build predictive world models.
And at the same time, learn hierarchical representations
of the world where what matters about the world is preserved
and what is irrelevant is eliminated.
And are the representations, the before and after, in the space of a sequence of images, or is it for single images?
It could be either, for a single image or for a sequence.
It doesn't have to be images.
This could be applied to text.
This could be applied to just about any signal.
I'm looking for methods that are generally applicable
that are not specific to one particular modality.
It could be audio or whatever.
Got it.
So what's the story behind this paper? The paper describes one such method? It's the VICReg method. So this is co-authored; the first author is a student called Adrien Bardes, who is a resident PhD student at FAIR Paris, co-advised by me and Jean Ponce, who's a professor at the École Normale Supérieure and also a research director at Inria. So this is a wonderful program in France where PhD students can basically do their PhD in industry, and that's kind of what's happening here.
And this paper is a follow-up on the Barlow Twins paper by my former postdoc, Stéphane Deny, with Li Jing, Jure Zbontar, and a bunch of other people from FAIR. And one of the main criticisms from reviewers is that VICReg is not different enough from Barlow Twins. But, you know, my impression is that it's Barlow Twins with a few bugs fixed, essentially, and in the end this is what people will use. Right. But you're used to stuff being rejected. So it might be rejected and actually end up being cited, because people use it.
Well, it's already cited a bunch of times.
So then there's the deeper question about peer review and conferences. I mean, computer science as a field is kind of unique in that the conference is highly prized. That's one. Right. And it's interesting because the peer review process there is similar, I suppose, to journals, but it's accelerated significantly, or not significantly, but it goes faster. And it's a nice way to get stuff out quickly, to peer review it quickly, to present it quickly to the community. So, not quickly, but quicker. Yeah. But nevertheless, it has many of the same flaws of peer review, because a limited number of people look at it, there's bias, and the following: if you have new ideas, you're going to get pushback. There are self-interested people who can kind of infer who submitted it and be cranky about it, all that kind of stuff.
Yeah, I mean, there are a lot of social phenomena there. There's one social phenomenon, which is that because the field has been growing exponentially, the vast majority of people in the field are extremely junior. And that's just a consequence of the field growing, right? So as the size of the field starts saturating, you will have less of that problem of reviewers being very inexperienced. A consequence of this is that, you know, young reviewers, I mean, there's a phenomenon which is that reviewers try to make their life easy, and the way to make your life easy when reviewing a paper is very simple: you just have to find a flaw in the paper, right? So basically they see the task as finding flaws in papers, and most papers have flaws, even the good ones.
So it's easy to do that. Your job is easier as a reviewer if you just focus on this.
But what's important is: is there a new idea in that paper that is likely to have influence? It doesn't matter if the experiments are not that great, if the protocol is imperfect, things like that. As long as there is a worthy idea in it that will influence the way people think about the problem, even if others make it better eventually, I think that's really what makes a paper useful.
And so this combination of social phenomena creates a disease that has plagued other fields in the past, like speech recognition, where basically people chase numbers on benchmarks, and it's much easier to get a paper accepted if it brings an incremental improvement on a mainstream, well-accepted method or problem. And those are, to me, boring papers. I mean, they're not useless, right? Because industry thrives on that kind of progress, but they're not the ones that I'm interested in, in terms of new concepts and new ideas. So papers that are really trying to strike out toward new advances generally don't make it.
Now, thankfully, we have arXiv.
ArXiv, exactly.
And then there are open-review type situations, and, I mean, Twitter is a kind of open review. I'm a huge believer that review should be done by thousands of people, not two people. I agree. And so I see a future, for a lot of really strong papers it's already the present, but a growing future, where it'll just be arXiv. And you're presenting at an ongoing, continuous conference called Twitter, slash the Internet, slash arXiv Sanity, which Andrej just released a new version of. So just not, you know, not being so elitist about this particular gating.
It's not a question of being elitist or not. It's basically a question of recommendation, and seals of approval, for people who don't see themselves as having the ability to evaluate papers by themselves. So it saves time. If you rely on other people's opinions, and you trust those people or those groups to evaluate a paper for you, that saves you time, because you don't have to scrutinize the paper as much, and it's brought to your attention. I mean, there's the whole idea of a collective recommender system. So I actually thought about this a lot, you know, about 10, 15 years ago,
because there were discussions at NIPS, and we were about to create ICLR with Yoshua Bengio.
And so I wrote a document kind of describing a reviewing system, which basically was: you post your paper on some repository, let's say arXiv, or now it could be OpenReview.
And then anyone can form a reviewing entity, which is the equivalent of a reviewing board, you know, the editorial board of a journal or the program committee of a conference. You have to list the members.
And then that group reviewing entity can choose to review a particular paper,
spontaneously or not.
There is no exclusive relationship anymore
between a paper and a venue or reviewing entity.
Any reviewing entity can review any paper,
or may choose not to.
Then it gives an evaluation. It's not "published"; it's just an evaluation and a comment, which would be public and signed by the reviewing entity. And if it's signed by the reviewing entity, you know it's one of the members of that reviewing entity. So if the reviewing entity is, you know, "Lex Fridman's preferred papers," right, you know it's Lex Fridman writing the review.
Yes. So for me, that's a beautiful system, I think, but in addition to that, it feels like there should be a reputation system for the reviewers. For the reviewing entities. Not the reviewers individually? The reviewing entities, sure, but even within that, the reviewers too, because there's another thing here. It's not just reputation, it's an incentive for an individual person to do great work. Right now, in the academic setting, the incentive is kind of internal, just wanting to do a good job, but honestly, that's not a strong enough incentive to do a really good job of reading a paper and finding the beautiful amidst the mistakes and the flaws and all that kind of stuff. Like, if you're the person that first discovered a powerful paper and you get to be proud of that discovery, then that gives you a huge incentive.
That's a big part of my proposal actually, you know, I described that as, you know, if
your evaluation of papers is predictive of future success, then your reputation should
go up as a reviewing entity. So yeah, exactly.
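As a rough sketch of the kind of data model such a system might use (everything here, names and fields alike, is a hypothetical illustration, not an existing OpenReview API), it could look something like this:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Paper:
    arxiv_id: str            # the paper lives on a public repository, e.g. arXiv / OpenReview
    title: str

@dataclass
class Review:
    paper: Paper
    score: float             # public evaluation, not an accept/reject gate
    comment: str             # public, signed comment

@dataclass
class ReviewingEntity:
    name: str                # e.g. a self-organized board with listed members
    members: List[str]
    reviews: List[Review] = field(default_factory=list)
    reputation: float = 0.0

    def review(self, paper: Paper, score: float, comment: str) -> Review:
        # Any entity may review any paper; no exclusive paper-venue relationship.
        r = Review(paper, score, comment)
        self.reviews.append(r)
        return r

    def update_reputation(self, citation_counts: Dict[str, int]) -> None:
        # Reputation grows when past evaluations predict future success
        # (crudely proxied here by citations; the real signal is an open question).
        for r in self.reviews:
            predicted_high = r.score > 0.5
            turned_out_high = citation_counts.get(r.paper.arxiv_id, 0) > 100
            self.reputation += 1.0 if predicted_high == turned_out_high else -1.0
```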
I mean, I even had a master's student, in library science and computer science, actually work out exactly how that should work, formalizing everything.
But in terms of implementation, do you think that's something that's doable? I mean, I've been talking about this to various people, like Andrew McCallum, who started OpenReview. And the reason we picked OpenReview for ICLR initially, even though it was very early for them, is that my hope was that ICLR would eventually integrate this type of system. So ICLR kept the idea of open reviews, where the reviews are published with the paper, which I think is very useful. But in many ways it has kind of reverted to a more conventional type of conference for everything else. And, I mean, I don't run ICLR, I'm just the president of the foundation, but the people who run it should make decisions about how to run it, and I'm not going to tell them, because they are volunteers and I'm really thankful that they do that. But I'm saddened by the fact that we're not being innovative enough. Yeah, me too. I hope
that changes. Yeah, because the communication science broadly, but the communication computer science ideas is,
is how you make those ideas have impacted.
Yeah, and I think a lot of this is because people have in their mind an objective, which is, you know, fairness for authors, and the ability to count points, basically, and give credit accurately. But that comes at the expense of the progress of science. So to some extent, we're slowing down the progress of science. And are we actually achieving fairness? We're not achieving fairness. We still have biases. We're doing double-blind review, but the biases are still there, just different kinds of biases.
You write that the phenomenon of emergence, collective behavior exhibited by a large collection of simple elements in interaction, is one of the things that got you into neural nets in the first place. I love cellular automata. I love simple interacting elements and the things that emerge from them.
Do you think we understand how complex systems can emerge from such simple components that
interact simply?
No, we don't.
It's a big mystery, also it's a mystery for physicists, a mystery for biologists.
How is it that the universe around us seems to be increasing in complexity and not decreasing?
I mean, it is a kind of curious property of physics that, despite the second law of thermodynamics, evolution and learning, et cetera, seem, at least locally, to increase complexity, not decrease it. So perhaps the ultimate purpose of the universe is just to get more complex.
To have these, I mean, small pockets of beautiful complexity. Do cellular automata and these kinds of emergent complex systems give you some intuition, or guide your understanding of machine learning systems, neural networks, and so on? Or are these, for you, right now, disparate concepts?
Well, that's what got me into it. You know, I discovered the existence of the perceptron when I was a college student, by reading a book. It was a debate between Chomsky and Piaget, and Seymour Papert from MIT was kind of singing the praises of the perceptron in that book. That was the first time I heard about a learning machine. I started digging into the literature and I found those books, which were basically transcriptions of workshops or conferences from the 50s and 60s about self-organizing systems.
So there was a series of conferences on self-organizing systems, and these books about them. Some of them you can actually get at the Internet Archive, the digital versions. And there are fascinating articles in there by this guy whose name has been largely forgotten, Heinz von Foerster. He was an Austrian-born physicist who immigrated to the US and worked on self-organizing systems in the 50s. In the 60s, at the University of Illinois Urbana-Champaign, he created the Biological Computer Laboratory, the BCL, which was all about neural nets. Unfortunately, that was kind of towards the end of the popularity of neural nets, so that lab never thrived very much. But he wrote a bunch of papers about self-organization and the mystery of self-organization.
An example he has is: imagine you are in space, there is no gravity, and you have a big box with magnets in it, rectangular magnets with the north pole on one end and the south pole on the other. You shake the box gently, and the magnets will kind of stick to each other and probably form a complex structure spontaneously.
That could be an example of self-organization.
But there are lots of examples. Neural nets are an example of self-organization too, in many respects. And it's a bit of a mystery what is possible with this: pattern formation in physical systems, in chaotic systems, things like that, you know, the emergence of life, things like that.
So, you know, how does that happen? It's a big puzzle for physicists as well. It feels like understanding the mathematics of emergence in some constrained situations
might help us create intelligence.
Like, help us add a little spice to the systems, because in complex systems with emergence, you seem to be able to get a lot from a little. And so that seems like a shortcut to get big leaps in performance. But there's a missing piece, a concept that we don't have.
Yeah. And it's something also I've been fascinated by since my undergrad days.
And it's how you measure complexity.
So we don't actually have good ways of measuring it, or at least we don't have good ways of interpreting the measures that we have at our disposal. Like, how do you measure the complexity of something?
So there are all those things, like Kolmogorov, Chaitin, Solomonoff complexity: the length of the shortest program that will generate a bit string can be thought of as the complexity of that bit string. I've been fascinated by that concept. The problem with that is that this complexity is defined up to a constant, which can be very large.
Right.
There are similar concepts that are derived from Bayesian probability theory, where the complexity of something is the negative log of its probability, essentially, right? And you have a complete equivalence between the two things. And then you would think the probability is something that's well defined mathematically, which means complexity is well defined. But it's not true. You need to have a model of the distribution. You may need to have a prior, if you're doing Bayesian inference, and the prior plays the same role as the choice of the computer with which you measure your Kolmogorov complexity. And so every measure of complexity we have has some arbitrariness to it, you know, an additive constant, which can be arbitrarily large.
And so, how can we come up with a good theory
of how things become more complex
if we don't have a good measure of complexity?
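For reference, the two facts being equated here can be written out explicitly; these are standard textbook statements, not anything specific to this conversation:

```latex
% Invariance theorem: Kolmogorov complexity depends on the reference machine U
% only up to an additive constant, which can be arbitrarily large.
K_U(x) \;\le\; K_{U'}(x) + c_{U,U'}

% Coding / Bayesian correspondence: complexity as negative log-probability
% under a model p; the choice of p plays the same role as the choice of U.
K_p(x) \;=\; -\log_2 p(x)
```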
Yeah, and we need that, for instance, for one way that people study this in the space of biology: the people that study the origin of life, or try to recreate life in the laboratory. And the more interesting one, the alien one, is when we go to other planets: how would we recognize life? Because we associate complexity, maybe some level of mobility, with life. We have to have concrete algorithms for measuring the level of complexity we see to know the difference between life and non-life.
And the problem is that complexity is in the eye of the beholder.
So let me give you an example.
If I give you images of the MNIST digits, right, and I flip through the MNIST digits, there is obviously some structure to them, because there's local structure: neighboring pixels are correlated across the entire dataset. Now, imagine that I apply a random permutation to all the pixels, a fixed random permutation, and I show you those images. They will look really disorganized to you, more complex. In fact, they're not more complex; in absolute terms, they're exactly the same as originally, right?
And if you knew what the permutation was, you know,
you could undo the permutation.
Now, imagine I give you special glasses that undo the permutation. All of a sudden, what looked complicated becomes simple. Right.
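A tiny sketch of this thought experiment in Python; the image shapes and the permutation are arbitrary stand-ins, and any fixed permutation works:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))          # stand-in for MNIST-like images

# Fixed random permutation of the 784 pixel positions ("scrambling glasses").
perm = rng.permutation(28 * 28)
inverse_perm = np.argsort(perm)             # the "special glasses" that undo it

scrambled = images.reshape(100, -1)[:, perm].reshape(100, 28, 28)
restored = scrambled.reshape(100, -1)[:, inverse_perm].reshape(100, 28, 28)

# The scrambled images look like noise, yet no information was lost:
assert np.allclose(restored, images)
```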
So if you have humans on one end, and then another race of aliens that sees the universe with permutation glasses...
Yeah, with the permutation glasses. What we perceive as simple, to them, is hopelessly complicated, it's basically heat.
Yeah, heat, yeah.
Okay, and what they perceive as simple, to us, is random fluctuation, it's heat.
Yeah.
So it's truly in the eye of the beholder; it depends what kind of glasses you're wearing, right?
Right.
It depends what kind of algorithm you're running in your perception system. So I don't think we'll have a theory of intelligence, self-organization, evolution, things like this, until we have a good handle on a notion of complexity, which we know is in the eye of the beholder.
Yeah, it's sad to think that we might not be able to detect or interact with alien species,
because we're wearing different glasses.
Because the notion of locality might be different from ours.
Yeah. This actually connects with fascinating questions in physics at the moment, modern physics, quantum physics, like questions about whether we can recover the information that's lost in a black hole, and things like that. And that relies on notions of complexity, which I find fascinating.
Can you describe your personal quest to build an expressive electronic wind instrument, an EWI? What is it, and what does it take to build it?
Well, I'm a tinkerer, I like building things.
I like building things with combinations of electronics
and mechanical stuff.
You know, I have a bunch of different hobbies,
but, you know, probably my first one, when I was little, was building model airplanes and stuff like that, and I still do that to some extent.
But also electronics, I taught myself electronics before I studied it.
And the reason I taught myself electronics is because of music.
My cousin was an aspiring electronic musician and he had an analog synthesizer and I was
basically modifying it for him and building sequencers and stuff like that right for him.
I was in high school when I was doing this.
And you're interested in, like, progressive rock, like 80s. What's the greatest band of all time, according to Yann LeCun? Oh, there are too many of them, but it's a combination of Mahavishnu Orchestra, Weather Report, Yes, Genesis, you know, Peter Gabriel-era Genesis, Gentle Giant, things like that.
Great. Okay, so this, this love of electronics
and this love of music combined together.
Right, so I was actually trying to play Baroque and Renaissance music, and I played in an orchestra when I was in high school and my first year of college. I played the recorder, crumhorn, a little bit of oboe, things like that.
So I'm a wind instrument player, but I always wanted to play improvised music, even though I don't know much about it. The only way I figured, short of learning to play saxophone, was to play electronic wind instruments. The fingering is similar to a saxophone, but you have a wide variety of sounds because you control a synthesizer with it. So I had a bunch of those, going back to the late 80s, from either Yamaha or Akai. They're both kind of the main manufacturers of those, classically, going back several decades.
But I've never been completely satisfied with them because of the lack of expressivity. Those things are somewhat expressive; I mean, they measure the breath pressure, they measure the lip pressure, and there are various parameters you can vary with your fingers, but they are not really as expressive as an acoustic instrument. You hear John Coltrane play two notes and you know it's John Coltrane; it's got a unique sound. Or Miles Davis, you can hear it's Miles Davis playing the trumpet, because the sound reflects their physiognomy, basically; the shape of the vocal tract kind of shapes the sound.
So how do you do this with an electronic instrument? Many years ago I met a guy called David Wessel; he was a professor at Berkeley and created the center for music technology there, and he was interested in that question.
And so I kept kind of thinking about this for many years, and finally, because of COVID, I was in my workshop. My workshop also serves as my kind of Zoom room and home office. And this is in New Jersey. In New Jersey. And I started really getting serious about building my own EWI instrument. What else is going on in the New Jersey workshop? Is there some crazy stuff you built that just got left behind on the workshop floor? A lot of crazy stuff: electronics built with microcontrollers of various kinds, and, you know, weird flying contraptions.
So you still love flying? It's a family disease. My dad got me into it when I was a kid; he was building model airplanes when he was a kid, and he was a mechanical engineer. He taught himself electronics also, so he built his own radio control systems in the late 60s, early 70s. And so that's what got me into, I mean, he got me into engineering and science and technology.
You also have an interest in and appreciation of flight in other forms, like with drones, quadcopters? Yeah, I built a few of those myself. When it became kind of a standard thing you could buy, it was boring, you know, I stopped doing it. It was not fun anymore.
Yeah.
You were doing it before it was cool.
Yeah.
What advice would you give to a young person today, in high school or college, who dreams of doing something big like Yann LeCun? Let's talk in the space of intelligence: someone who dreams of having a chance to solve some fundamental problem in the space of intelligence, both for their career and just in life, being somebody who was a part of creating something special.
So try to get interested by big questions, things like, you know, what is intelligence,
what is the universe made of, what's life all about, things like that.
Like even crazy big questions, like what is time? Nobody knows what time is.
And then learn basic things, basic methods, either from math, from physics, or from engineering. Things that have a long shelf life. Like, if you have a choice between learning mobile programming on the iPhone or quantum mechanics, take quantum mechanics. Because you're going to learn things that you had no idea existed. And you may never be a quantum physicist, but you will learn about path integrals.
And path integrals are used everywhere. It's the same formula that you use for Bayesian integration and stuff like that. So the idea is that the little ideas within quantum mechanics, or within some of these more solidified fields, will have a longer shelf life. They will somehow be used, indirectly, in your work.
Learn classical mechanics, where you learn about Lagrangians, for example, which is a hugely useful concept for all kinds of different things. Learn statistical physics, because a lot of the math used in machine learning basically comes out of what was worked out by statistical physicists in the late 19th century and early 20th century. And some of it, actually, more recently, by people like Giorgio Parisi, who just got the Nobel Prize for the replica method, among other things; it's used for a lot of different things. Variational inference: that math comes from statistical physics.
So a lot of those kinds of basic courses. You know, if you do electrical engineering, you take signal processing, you learn about Fourier transforms. Again, something super useful. It's at the basis of things like graph neural nets, which is an entirely new subarea of AI, machine learning, deep learning, which I think is super promising for all kinds of applications.
Something very promising, if you're more interested in applications, is the application of AI, machine learning, and deep learning to science, to the kind of science that can help solve big problems in the world.
I have colleagues at Meta, at FAIR, and we started this project called Open Catalyst. It's an open, collaborative project, and the idea is to use deep learning to help design new chemical compounds or materials that would facilitate the separation of hydrogen from oxygen. If you can efficiently separate hydrogen from oxygen with electricity, you solve climate change. It's as simple as that, because you cover some random desert with solar panels, you have them work all day producing hydrogen, and then you ship the hydrogen wherever it's needed. You don't need anything else. You have controllable power that can be transported anywhere.
So, if we have large scale, efficient energy storage technology like producing hydrogen, we solve climate change.
Here's another way to solve climate change: figuring out how to make fusion work. Now, the problem with fusion is that you make a super-hot plasma, and the plasma is unstable and you can't control it. Maybe with deep learning you can find controllers that will stabilize the plasma and make practical fusion reactors. I mean, that's very speculative, but it's worth trying because the payoff is huge. There's a group at Google working on this, led by John Platt.
So, convert as many problems in science, in physics, biology, and chemistry, into learnable problems and see if a machine can learn them.
Right, I mean, there are properties of complex materials that we don't understand from first principles, for example. Right. So if we could design new materials, we could make more efficient batteries. We could maybe make faster electronics. There are a lot of things we can imagine doing: lighter materials for cars or airplanes, things like that, maybe better fuel cells. I mean, there's all kinds of stuff we can imagine. If we had good fuel cells, hydrogen fuel cells, we could use them to power airplanes or cars, and we wouldn't have CO2 emission problems for air transportation anymore.
So there's a lot of those things I think where AI can be used.
And this is not even talking about all the medicine and biology and everything like that, right? Like protein folding, or figuring out how you could design a protein that sticks to another protein at particular sites, because that's how you design drugs in the end. So deep learning would be useful there; it would be sort of enormous progress if we could use it for that. Here's an example, from recent materials physics: you take a monatomic layer of graphene, right, so it's just carbon on a hexagonal mesh, a single atom thick. You put another one on top, and you twist them by some magic number of degrees, around 1.1 degrees or something, and it becomes a superconductor. Nobody has any idea why. I want to know how that was discovered. But is that the kind of thing machine learning could actually discover, these kinds of things?
Well, maybe not, but there is a hint, perhaps, that with machine learning we could train a system to basically be a phenomenological model of some complex emergent phenomenon, which superconductivity is one of, where the collective phenomenon is too difficult to describe from first principles with the current, the usual, reductionist-type methods, but we could have deep learning systems that predict the properties of a system from a description of it, after being trained with sufficiently many samples.
This guy Pascal Fua, at EPFL, has a startup company where he basically trained a convolutional net, essentially, to predict the aerodynamic properties of solids. And you can generate as much data as you want by just running computational fluid dynamics, right? So you give it a wing, an airfoil, or a shape of some kind, you run computational fluid dynamics, and you get as a result the drag and lift and all that stuff. You can generate lots of data, train a neural net to make those predictions, and now what you have is a differentiable model of, let's say, drag and lift as a function of the shape of that solid. And so you can do gradient descent. You can optimize the shape so you get the properties you want.
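A rough sketch of that idea, with a toy, hypothetical surrogate standing in for a real CFD-trained network; the shape parameterization and the objective are made up for illustration:

```python
import torch
import torch.nn as nn

# Pretend this net was already trained on (shape_params -> [drag, lift]) pairs
# generated by computational fluid dynamics; here it's just a random stand-in.
surrogate = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
surrogate.requires_grad_(False)             # freeze the trained surrogate

shape = torch.randn(8, requires_grad=True)  # 8 parameters describing the airfoil
opt = torch.optim.Adam([shape], lr=1e-2)

for step in range(500):
    drag, lift = surrogate(shape)
    loss = drag - 0.1 * lift                # minimize drag, reward lift (toy objective)
    opt.zero_grad()
    loss.backward()                         # gradients flow through the differentiable model
    opt.step()
```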
Yeah, that's incredible. And on top of all that, you should probably read a little bit of literature and a little bit of history, for inspiration and for wisdom, because, after all, all of these technologies will have to work in the human world. Yes, and the human world is complicated. Yeah. This has been an amazing conversation. I'm really honored that you talked with me today. Thank you for all the amazing work you're doing at FAIR, at Meta, and thank you for being so passionate, after all these years, about everything that's going on. You're a beacon of hope for the machine learning community. Thank you so much for spending your valuable time with me today. That was awesome. Thanks for having me on; it was a pleasure.
Thanks for listening to this conversation with Yann LeCun. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Isaac Asimov: "Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in." Thank you for listening, and hope to see you next time.