Lex Fridman Podcast - Jeremy Howard: fast.ai Deep Learning Courses and Research
Episode Date: August 27, 2019. Jeremy Howard is the founder of fast.ai, a research institute dedicated to making deep learning more accessible. He is also a Distinguished Research Scientist at the University of San Francisco, a former president of Kaggle as well as a top-ranking competitor there, and in general, he's a successful entrepreneur, educator, researcher, and an inspiring personality in the AI community. This conversation is part of the Artificial Intelligence podcast. If you would like to get more information about this podcast go to https://lexfridman.com/ai or connect with @lexfridman on Twitter, LinkedIn, Facebook, Medium, or YouTube where you can watch the video versions of these conversations. If you enjoy the podcast, please rate it 5 stars on iTunes or support it on Patreon.
Transcript
The following is a conversation with Jeremy Howard.
He's the founder of fast.ai, a research institute dedicated to making deep learning more accessible.
He's also a distinguished research scientist at the University of San Francisco,
a former president of Kaggle, as well as a top-ranking competitor there.
And in general, he's a successful entrepreneur, educator, researcher,
and an inspiring personality in the AI community.
When someone asks me, how do I get started with deep learning?
fast.ai is one of the top places I point them to.
It's free, it's easy to get started, it's insightful, and accessible, and if I may
say so, it has very little BS that can sometimes dilute the value of educational content
on popular topics like deep learning.
fast.ai has a focus on practical application of deep learning and hands-on exploration of the
cutting edge that is both incredibly accessible to beginners and useful to experts. This is the
Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, give it 5 stars on iTunes, support it on Patreon,
or simply connect with me on Twitter.
Lex Fridman, spelled F-R-I-D-M-A-N.
And now, here's my conversation with Jeremy Howard. What's the first program you've ever written?
First program I wrote that I remember would be at high school.
I did an assignment where I decided to try to find out if there were some better musical scales
than the 12-interval scale.
So I wrote a program on my Commodore 64 in BASIC
to search through other scale sizes
to see if you could find one where there were
more accurate, you know, harmonies. Like, ideally you want an exact three-to-two ratio,
whereas with a 12-interval scale it's not exactly three to two, for example.
So that's, as they say, in BASIC on the Commodore 64.
Yeah.
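As a brief aside for readers, here is a hedged Python sketch of the kind of search being described; the range of scale sizes and the error metric are my own assumptions, not a reconstruction of the original BASIC program.

```python
# Sketch: search equal-tempered scales for one whose steps best
# approximate a pure 3:2 harmony (a just fifth). Scale sizes 5-60
# and the "closest step" error metric are illustrative assumptions.

TARGET = 3 / 2

def best_fifth_error(divisions: int) -> float:
    """Smallest relative error between any step of an equal-tempered
    scale with `divisions` notes per octave and a pure 3:2 ratio."""
    ratios = (2 ** (k / divisions) for k in range(1, divisions + 1))
    return min(abs(r - TARGET) / TARGET for r in ratios)

for n in sorted(range(5, 61), key=best_fifth_error)[:5]:
    print(f"{n:2d}-tone scale: fifth error = {best_fifth_error(n):.5f}")

# 12 divisions already does well (2 ** (7 / 12) is about 1.4983), but
# larger scales such as 41 or 53 divisions get even closer to 3:2.
```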
Where did the interest in music come from?
Or is it just, like, music throughout your life?
So I played saxophone and clarinet and piano and guitar and drums and whatever.
So how does that thread go through your life?
Where's music today?
It's not where I wish it was. For various reasons, I couldn't really keep it
going, particularly because I had a lot of problems with RSI with my fingers, and so I had to
kind of cut back anything that used hands and fingers. I hope one day I'll be able
to get back to it, health-wise. So there's a love for music underlying it all. Yeah. What's your favorite instrument?
Saxophone. Baritone saxophone. Well, probably bass saxophone, but they're awkward.
Well, I always love it when music is coupled with programming. There's something about a brain that
utilizes both that emerges with creative ideas.
So you've used and studied quite a few programming languages.
Can you give an overview of what you've used, what are the pros and cons of each?
My favorite programming environment almost certainly was Microsoft Access back in like the
earliest days.
So that was Visual Basic for Applications,
which is not a good programming language,
but the programming environment is fantastic.
It's like the ability to create,
you know, user interfaces,
and tie data and actions to them and create reports
and all that. I've never seen anything
as good. There are things nowadays like Airtable,
which are like small subsets of that,
which people love for good reason,
but unfortunately nobody's ever achieved anything like that.
What is that, if you could pause on that for a second?
Oh, access.
Is it fundamentally a database?
It was a database program that Microsoft produced as part of Office, and that kind of withered,
you know, but basically it let you in a totally graphical way, create tables and relationships
and queries and tie them to forms and set up, you know, event handlers and calculations.
And it was a very complete, powerful system designed for not massive
scalable things, but for useful little applications that I loved.
So what was the connection between Excel and Access?
So very close. So Access kind of was the relational database equivalent,
if you like.
So people still do a lot of that stuff
that should be in access in Excel.
Excel's great as well.
So, but it's just not as rich a programming model
as VBA combined with a relational database.
And so I've always loved relational databases,
but today
programming on top of a relational database is just a lot more of a headache. You know, you generally either need something that runs on a
kind of database server, unless you use SQLite, which has its own issues. Then,
often, if you want to get a nice programming model, you'll need to create an ORM on top. And then there are all these pieces
to tie together. And it's just a lot more awkward than it should be. There are people that are
trying to make it easier so that, in particular, I think of F#, you know, Don Syme, who
him and his team have done a great job of making something like a database appear in
the type system, so you actually get like tab completion for fields and tables and stuff
like that.
Anyway, so that whole VBA Office thing, I guess, was a
starting point for me.
And I got into standard Visual Basic, which...
Well, that's interesting, just to pause on that for a second. It's interesting that you're connecting
programming languages to the ease of management of data. Yeah. So in your use of programming languages,
you always had a love and a connection with data. I've always been interested in doing useful things for myself and for others, which generally
means getting some data and doing something with it and putting it out there again.
So that's been my interest throughout.
So I also did a lot of stuff with AppleScript back in the early days.
So it's kind of nice being able to get the computer and computers to talk to each other
and to do things for you.
And then I think the programming language I most loved then would have been
Delphi, which was Object Pascal, created by Anders Hejlsberg, who previously did Turbo
Pascal and then went on to create .NET and then
went on to create TypeScript. Delphi was amazing because it was like a compiled, fast language that
was as easy to use as Visual Basic. Delphi, what is it similar to in more modern languages?
Visual Basic.
Visual Basic?
Yeah, but a compiled, fast version.
So I'm not sure there's anything quite like it anymore.
If you took like C# or Java and got rid of the virtual machine
and replaced it with something that could compile to a small, tight binary.
I feel like it's where Swift could get to
with the new SwiftUI and the cross-platform development going on. Like that's one of my
dreams is that we'll hopefully get back to where Delphi was. There is actually a free Pascal
project nowadays called Lazarus, which is also attempting to
kind of recreate Delphi, so they're making good progress.
So, okay, Delphi, that's one of your favorite programming languages.
Well, programming environments.
Again, Pascal's not a nice language.
If you wanted to know specifically about what languages I like,
I would definitely pick J as being an amazingly wonderful
language.
What's J?
J. Are you aware of APL?
I am not,
other than from doing a little research into the work you've done.
Okay, so it's not at all surprising you're not familiar with it, because it's not that well known,
but it's actually one of the main families of programming languages going back to the late 50s,
early 60s. So there was a couple of major directions. One was the kind of lambda calculus,
Alonzo Church direction, which I guess kind of Lisp and Scheme and whatever,
which has a history going back to the early days of computing.
The second was the kind of imperative slash OO, you know, ALGOL, Simula, going on to
C, C++, and so forth.
There was a third, which is called array-oriented languages, which started with a paper
by a guy called Ken Iverson, which was actually a math theory paper, not a programming paper.
It was called "Notation as a Tool for Thought."
And it was the development of a new type of math notation.
And the idea is that this math notation was much more flexible, expressive,
and also well-defined than traditional math notation,
which is none of those things,
math notation is awful.
And so he actually turned that into a programming language.
And because this was the early 50s,
sorry, late 50s,
all the names were available.
So he called his language "a programming language," or APL.
APL?
So APL is an implementation of "Notation as a Tool for Thought," by which he means math notation.
And Ken and his son went on to do many things,
but eventually they actually produced a new language that was built on top of all the learnings of APL.
That was called J. And J is the most expressive, composable, beautifully designed language I've ever seen.
Does it have object-oriented components, that kind of thing?
Not really. It's an array-oriented language. It's the third path.
Are you saying array?
Array-oriented, yeah.
So what does it mean to be array-oriented?
So array-oriented means that you generally don't use any loops, but the whole thing is
done with kind of an extreme version of broadcasting, if you're familiar with that NumPy-slash-Python concept.
You do a lot with one line of code. It looks a lot like math notation.
So it's basically highly compact. And the idea is that, because you can do so much
with one line of code, you very rarely need more than a single screen of code to express your program. So you can keep it all in your head,
and you can clearly communicate it. It's interesting. APL created two main branches, K and J.
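For readers unfamiliar with the NumPy concept being referenced, here is a minimal broadcasting sketch (my own illustration in Python, not J or APL code) of the loop-free, array-oriented style:

```python
import numpy as np

# Array-oriented style: no explicit loops; whole-array operations
# plus broadcasting express the computation in a line or two.
prices = np.array([3.0, 5.0, 7.0, 11.0])   # shape (4,)
quantities = np.array([[1], [2], [3]])      # shape (3, 1)

# Broadcasting stretches (3, 1) against (4,) into (3, 4): every
# price-times-quantity combination, with no for-loop in sight.
totals = quantities * prices                # shape (3, 4)

# The same idea scales: a full pairwise distance matrix in one line.
points = np.random.rand(100, 2)
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(totals.shape, dists.shape)            # (3, 4) (100, 100)
```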
J is this open source niche community of crazy,
enthusiasts like me.
And then the other path, K, was fascinating.
It's an astonishingly expensive programming language,
which many of the world's most ludicrously rich hedge funds
use.
So the entire K machine is so small,
it sits inside level three cache on your CPU,
and it easily wins every benchmark
I've ever seen in terms of data processing speed,
but you don't come across it very much,
because it's like $100,000 per CPU to run it.
But it's like this path of programming languages,
it's just so much, I don't know, so much more powerful
in every way than the ones that almost anybody uses every day.
So it's all about computation.
It's really focusing.
It's pretty heavily focused on computation.
I mean, so much of programming is data processing
by definition.
And so there's a lot of things you can do with it.
But yeah, there's not much work being done on making,
like, user interface toolkits for it, or whatever.
I mean, there's some, but they're not great.
At the same time, you've done a lot of stuff
with Perl and Python.
Yeah.
So where does that fit into the picture of J and K and APL?
Well, it's just much more pragmatic.
Like in the end, you kind of have to end up
where the libraries are,
like because to me, my focus is on productivity.
I just want to get stuff done and solve problems.
So, Perl was great.
I created an email company called FastMail,
and Perl was great, because back in the late 90s, early 2000s, it just had a lot of stuff
it could do. I still had to write my own monitoring system and my own web framework and whatever,
because none of that stuff existed. But it was a super flexible
language to do that in.
And you used Perl for FastMail, you used it as a backend? So everything was written in Perl?
Yeah. Yeah, everything. Everything was Perl.
Why do you think Perl hasn't succeeded, or hasn't dominated the market, where Python
really takes over a lot of the tasks?
Yeah. Well, I mean, it did dominate.
It did, for a time,
everything, everywhere.
But then the guy that ran Perl, Larry Wall, kind of just didn't put the time in anymore.
And no project can be successful if,
you know, particularly one that started
with a strong leader, that loses that strong leadership.
So then Python has kind of replaced it.
Python is a lot less elegant language in nearly every way,
but it has the data science libraries and a lot of them are pretty great.
So I kind of use it because it's the best we have, but it's definitely not good enough.
What do you think the future programming looks like? What do you hope the future programming looks like?
If we zoom in on the computational fields on data science,
on machine learning, I hope Swift is successful.
Because the goal of Swift,
the way Chris Lattner describes it, is to be infinitely hackable.
And that's what I want. I want something where me and the people I do research
with and my students can look at and change everything from top
to bottom.
There's nothing mysterious and magical and inaccessible.
Unfortunately, with Python, it's the opposite of that
because Python's so slow, it's extremely unhackable.
You get to a point where it's like, okay,
from here on down, it's C.
So your debugger doesn't work in the same way,
your profiler doesn't work in the same way,
your build system doesn't work in the same way.
It's really not very hackable at all.
What's the part you would like to be hackable?
Is it for the objective of optimizing training
of neural networks, inference in neural networks?
Is it performance of the system,
or is there some non-performance-related reason,
just creating a new environment?
It's everything.
In the end, I want to be productive as a practitioner.
That means that at the moment, our understanding of deep learning is incredibly primitive.
There's very little we understand.
Most things don't work very well, even though it works better than anything else out there.
There's so many opportunities to make it better.
So you look at any domain area, like, I don't know, speech recognition
with deep learning or natural language processing classification with deep
learning or whatever. Every time I look at an area with deep learning,
I always see like, oh, it's terrible.
There's lots and lots of obviously stupid ways to do things
that need to be fixed.
So then I want to be able to jump in there
and quickly experiment and make them better.
You think the programming language has a role in that?
Huge role.
Yeah.
So currently, Python has a big gap in terms of our ability
to innovate particularly around recurrent neural networks
and natural language processing because it's so slow, the actual loop where we actually
loop through words, we have to do that whole thing in CUDA C. So we actually can't innovate with the kernel, the heart of the most important algorithm.
And it's just a huge problem. And this happens all over the place. So we hit,
you know, research limitations. Another example: convolutional neural networks, which
are actually the most popular architecture for lots of things, maybe most things in deep learning. We almost certainly should be using sparse convolutional neural networks, but
only like two people are, because to do it, you have to rewrite all of that CUDA C-level
stuff. And yeah, researchers and practitioners just don't. So like, there's just
big gaps in like what people actually research on, what people
actually implement because of the programming language problem.
So you think it's just too difficult to write in CUDA C, and that a higher-level
programming language like Swift should enable the easier
fooling around, creative stuff with RNNs
or with sparse convolutional neural networks?
Kind of.
Whose fault is it?
Who's in charge of making it easy
for a researcher to play around?
I mean, it's no one's fault.
Just nobody's got around to it yet.
Or it's just, it's hard, right?
And I mean, part of the fault is that we ignored that whole APL kind of direction, nearly
everybody did, for 50 or 60 years.
But recently, people have been starting to reinvent pieces of that and kind of create some
interesting new directions in the compiler technology. So the place where that's particularly happening right now is something called MLIR, which
is something that Chris Lattner, the Swift guy, is leading.
And yeah, because it's actually not going to be Swift on its own that solves this problem,
because the problem is that currently writing an acceptably fast GPU program is too complicated, regardless
of what language you use.
And that's just because you have to deal with the fact that I've got 10,000 threads and
I have to synchronize between them all and I have to put my thing into grid blocks and
think about warps and all this stuff. It's just so much boilerplate that to do
that well, you have to be a specialist at that, and it's going to be a year's
work to, you know, optimize that algorithm in that way. But with things like
Tensor Comprehensions and Tile and MLIR and TVM, there's all these various projects which are all about saying,
let people create like domain specific languages
for tensor computations.
These are the kinds of things we do,
generally, on the GPU for deep learning,
and then have a compiler which can optimize
that tensor computation.
A lot of this work is actually sitting on top of a project called Halide,
which is a mind-blowing project where they came up with such a domain-specific language.
In fact, two: one domain-specific language for expressing,
this is what my tensor computation is.
Another domain-specific language for expressing,
this is the way I want you to structure
the compilation of that.
Like do it block by block and do these bits in parallel.
And they were able to show how you can compress the amount
of code by 10X compared to optimized GPU code
and get the same performance.
So these are the things that are sitting on top of that
kind of research and MLIR is pulling a lot of those best practices together.
And now we're starting to see work done on making all of that directly accessible through
Swift, so that I could use Swift to write those domain-specific languages, and hopefully we'll then get Swift CUDA kernels
written in a very expressive and concise way
and then Swift layers on top of that
and then a Swift UI on top of that
and you know, it'll be so nice if we can get to that point.
Now, does it all eventually boil down to CUDA
and NVIDIA GPUs?
Unfortunately, at the moment it does,
but one of the nice things about MLIR,
if AMD ever gets their act together,
which they probably won't,
is that they or others could write MLIR backends
for other GPUs or other,
or other tensor computation devices, of which today there are an increasing
number, like Graphcore or Vertex AI or whatever. So yeah, being able to target lots of
backends would be another benefit of this. And the market really needs competition. NVIDIA at the moment really is massively overcharging
for their enterprise-class cards,
because there is no serious competition,
because nobody else is doing the software properly.
In the cloud, there's some competition, right?
But not really.
Other than TPUs, perhaps,
TPUs are almost unprogrammable at the moment.
So you can't, the TPU has the same problem, that you can't...?
It's even worse.
So with TPUs, Google actually made an explicit decision
to make them almost entirely unprogrammable,
because they felt there was too much IP in there,
and if they gave people direct access to program them,
people would learn their secrets.
So you can't actually directly program the memory
in a TPU. You can't even directly create code
that runs on, and that you look at, on the machine
that has the TPU; it all goes through a virtual machine.
So all you can really do is this kind of cookie cutter
thing of like plug-in, high-level stuff together, which is just super tedious and annoying
and totally unnecessary.
So tell me, if you could, the origin story of fast.ai.
What is the motivation, its mission, its dream?
So I guess the founding story is heavily tied to my previous startup,
which is a company called Enlitic, which was the first company to focus on deep learning for medicine.
And I created that because I saw that there was a huge opportunity to...
There's about a 10x shortage of the number of doctors in the world, in the developing world, that we need.
I expected it would take about 300 years to train enough doctors to meet that gap, but I guessed that maybe if we used deep learning for some of the analytics, we could maybe make it so you don't need as highly trained doctors for diagnosis and treatment planning.
Where's the biggest benefit, just before we get to fast.ai?
Where's the biggest benefit of AI in medicine that you see today?
Not much happening today in terms of like stuff that's actually out there.
It's very early, but in terms of the opportunity, it's to take markets like India and China and Indonesia,
which have big populations, Africa, small numbers of doctors, and provide diagnostic, particularly
treatment planning and triage, kind of on-device, so that if you do a, you know, test for malaria or tuberculosis or
whatever, you immediately get something such that even a healthcare worker that's had a month of training
can get a very high quality assessment of whether the patient might be at risk, and tell, you know,
okay, we'll send them off to a hospital. So for example, in Africa,
outside of South Africa, there's only five pediatric radiologists for the entire continent. So most
countries don't have any. So if your kid is sick and they need something diagnosed,
with medical imaging, even if you're able to get medical imaging done, the person
that looks at it will be a nurse at best.
But actually in India, for example, and China, almost no X-ray is read by anybody, by
any trained professional, because they don't have enough.
So if instead we had an algorithm that could take the most likely high risk 5%
and say triage basically say, okay,
someone needs to look at this.
It would massively change
what's possible with medicine in the developing world.
And remember, increasingly,
they have money. They're the developing world,
they're not the poor world, they're the developing world.
So they have the money, so they're building the hospitals, they're getting the diagnostic
equipment, but there's no way for a very long time will they be able to have the expertise.
Shortage of expertise, okay. And that's where the deep learning systems can step in and
magnify the expertise they do have.
Exactly.
Yeah.
So, just to linger on this a little bit longer.
Sure.
In the interaction, do you still see the human experts
at the core of the system?
Yeah.
Absolutely.
Is there something in medicine that
can be automated almost completely?
I don't see the point of even thinking about that,
because we have such a shortage of people. Why would we not, why would we want to find a way not to use them? Like, we have people.
So the idea, even from an economic point of view, if you can make them 10x more productive,
getting rid of the person doesn't impact your unit economics at all. And it totally ignores
the fact that there are things people do better than machines.
So it's just to me that's not a useful way of framing the problem.
I guess, just to clarify, I guess I meant there may be some problems where you can avoid even
going to the expert ever, sort of maybe preventative care or some basic stuff,
the low-hanging fruit, allowing the expert to focus on the things
that are really hard.
Well, that's what the triage would do, right?
So the triage would say, okay, it's 99% sure there's nothing here.
Right.
So, you know, that can be done on device,
and they can just say, okay, go home.
Yeah.
So the experts are being used to look at the stuff
which has some chance it's worth looking at,
which most things is, it's not, you know, it's fine.
Why do you think we haven't quite made progress on that yet,
in terms of the scale of how much AI is applied in the medical field?
Oh, there's a lot of reasons.
I mean, one is it's pretty new.
I only started Enlitic in like 2014.
And before that, it's hard to express to what degree
the medical world was not aware of the opportunities here.
So I went to RSNA, which is the world's largest
radiology conference.
And I told everybody I could, you know,
like I'm doing this thing with deep learning,
please come and check it out,
and no one had any idea what I was talking about,
and no one had any interest in it.
So like we've come from absolute zero, which is hard.
And then the whole regulatory framework, education system, everything is just set up to
think of doctoring in a very different way.
So today there is a small number of people who are
deep learning practitioners and doctors at the same time.
And we started to see the first ones come out of their PhD programs. So Zak Kohane over in Boston, Cambridge, has a number of students now who are data science
experts, deep learning experts, and actual medical doctors. Quite a few doctors have completed our fast.ai course now
and are publishing papers and creating journal reading
groups in the American College of Radiology.
And I guess it's just starting out.
But it's going to be a long process.
The regulators have to learn how to regulate this.
They have to build guidelines,
and then the lawyers at hospitals have to develop a new way of understanding that
sometimes it makes sense for data to be looked at in raw form, in large quantities,
in order to create world-changing results. Yeah, the regulation around data, all of that, it sounds like probably the hardest problem,
but sounds reminiscent of autonomous vehicles as well.
Many of the same regulatory challenges, many of the same data challenges.
Yeah, I mean, funnily enough, the problem is less the regulation and more the interpretation
of that regulation by lawyers in hospitals.
So HIPAA was designed, the P in HIPAA does not stand for privacy, it stands for portability.
It's actually meant to be a way that data can be used.
And it was created with lots of gray areas because the idea is that would be more practical and it would help people to use this legislation to actually share data in a more thoughtful
way.
Unfortunately, it's done the opposite, because when a lawyer sees a gray area, they see,
oh, if we don't know we won't get sued, then we can't do it.
So HIPAA is not exactly the problem. The problem is more that hospital lawyers are not
incented to make bold decisions about data portability.
Or even to embrace technology that saves lives.
Right.
They more want to not get in trouble for embracing that.
Right.
Also, it only saves lives in a very abstract way, which is like,
oh, we've been able to release these 100,000 anonymous records. I can't point at the specific
person whose life that saved. I can say, like, oh, we've ended up with this paper, which found
this result, which, you know, diagnosed 1,000 more people than we would have otherwise, but it's
like, which ones were helped? It's very abstract. And on the other side of that, you may be able to point to a life that was
taken because of something that was, yeah, or a person whose privacy was
violated, like, oh, this specific person, you know, was
re-identified. So it's just a fascinating topic. We're jumping around.
We'll get back to the fast AI, but on the question of privacy, data is the fuel for so
much innovation in deep learning.
What's your sense and privacy, whether we're talking about Twitter or Facebook, YouTube,
just the technologies like in the medical field that rely on people's data in order
to create impact.
How do we get that right, respecting people's privacy and yet creating technology that
is learned from data?
One of my areas of focus is on doing more with less data, because most vendors,
unfortunately, are strongly incented to find ways to require more data and more
computation. So Google and IBM being the most obvious.
IBM Watson?
Yeah, so Watson. You know, so Google and IBM both strongly push the idea that you have
to be, you know, that they have more data and more computation and more intelligent people
than anybody else.
And so you have to trust them to do things because nobody else can do it.
And Google's very upfront about this. Like, Jeff Dean has gone out there and given talks
and said, our goal is to require a thousand times more computation, but less people.
Our goal is to use the people that you have better and the data you have better and the
computation you have better.
So one of the things that we've discovered is, or at least highlighted, is that you very,
very, very often don't need much data at all.
And so the data you already have in your organization will be enough to get state-of-the-art results.
So like my starting point would be to kind of say around privacy is a lot of people are
looking for ways to share data and aggregate data, but I think often that's unnecessary.
They assume that they need more data than they do
because they're not familiar with the basics
of transfer learning, which is this critical technique
for needing orders of magnitude less data.
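To make the transfer learning point concrete, here is a hedged sketch in generic PyTorch/torchvision rather than fast.ai's own library; the dataset path and the 10-class head are placeholders, and the weights argument assumes a recent torchvision version.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Reuse an ImageNet-pretrained backbone; only the new head is trained,
# so a modest labeled dataset can still give strong results.
model = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)  # newer torchvision API
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # placeholder: 10 classes

tfms = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = datasets.ImageFolder("data/train", transform=tfms)  # hypothetical path
loader = torch.utils.data.DataLoader(train_ds, batch_size=64, shuffle=True)

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
model.train()
for xb, yb in loader:
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()
```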
Is your sense, one reason you might wanna collect data
from everyone is like in the recommender system
context where your individual, Jeremy Howard's individual data, is the most useful for providing
a product that's impactful for you.
So for giving you advertisements, for recommending to you movies, for doing medical diagnosis. Is your sense we can build with a small amount of data, general models that will have a huge
impact for most people that we don't need to have data from each individual. On the whole,
let's say yes. I mean, there are things like, you know, recommender systems have this cold start problem where, you know,
Jeremy is a new customer. We haven't seen him before, so we can't recommend him things based
on what else he's bought and liked with us. And there's various workarounds to that. I can,
a lot of music programs will start out by saying, which of these artists do you like,
which of these albums do you like, which of these songs do you like?
Netflix used to do that. Nowadays, they tend not to, people kind of don't like that because they
think, oh, we don't want to bother the user. So you could work around that by having some kind
of data sharing where you get my marketing record from Acxiom or whatever and try to guess. And to me, the benefit to me and to society of saving me
five minutes on answering some questions versus the negative
externalities of the privacy issue doesn't add up.
So I think like a lot of the time the places where people are invading our privacy in order to provide convenience is really about just
trying to make them more money and and they move these negative externalities
to places that they don't have to pay for them. So when you actually see
regulations appear that actually cause the
companies that create these negative externalities to have to pay for it themselves, they say
well we can't do it anymore. So the cost is actually too high. But for something like medicine,
yeah, I mean the hospital has my, you know, medical imaging, my pathology studies, my medical records.
And also I own my medical data.
So I help a startup called doc.ai.
One of the things doc.ai does is that there's an app
you can connect to, you know,
Sutter Health and LabCorp and Walgreens,
and download your medical data to your phone,
and then upload it again, at your discretion, to share it as you wish. So with that kind of approach,
we can share our medical information with the people we want to.
Yeah, so control. I mean, really being able to control who you share it with and so on.
Yeah.
So that's a beautiful, interesting tangent, but to return back to the origin story
of fast.ai.
Right.
So before I started fast.ai, I spent a year researching where are the biggest opportunities
for deep learning,
because I knew from my time at Kaggle in particular
that deep learning had kind of hit this threshold point
where it was rapidly becoming the state of the art approach
in every area that adopted it.
And I'd been working with neural nets for over 20 years.
I knew that, from a theoretical point of view,
once it hit that point,
it would do that in kind of just about every domain.
And so I kind of spent a year researching
what are the domains that are going to have
the biggest low-hanging fruit in the shortest time period.
I picked medicine, but there were so many I could have picked.
And so there was a kind of level of frustration for me
of like, okay, I'm really glad we've
opened up the medical deep learning world and today it's huge as you know, but we can't
do, you know, I can't do everything.
I don't even know, like, in medicine, it took me a really long time to even get a
sense of, like, what kind of problems do medical practitioners solve, what kind of data do
they have, who has that data?
So I kind of felt like I need to approach this differently if I want to maximize the positive impact of deep learning.
Rather than me picking an area
and trying to become good at it and building something,
I should let people who are already domain experts
in those areas and who already have the data
do it themselves.
So that was the reason for fast.ai: to basically try and figure out how to get deep
learning into the hands of people who could benefit from it and help them to do so in
as quick and easy and effective a way as possible. Got it. So it's all sort of empower the domain experts.
Yeah. And partly it's because, unlike most people in this field,
my background is very applied and industrial. My first job
was at McKinsey and Company, and I spent 10 years in management consulting.
I spent a lot of time with domain experts.
So I respect them and appreciate them, and I know that's where the value generation
in society is.
And so I also know how most of them can't code, and most of them don't have the time
to invest three years in a graduate degree or whatever.
So it's like, how do I upskill those domain experts? I think that would be a super powerful thing,
you know, the biggest societal impact I could have. So yeah, that was the thinking.
So much of fast.ai students and researchers and the things you teach are
pragmatically minded, right, practically minded,
figuring out ways how to solve real problems, and fast. Right. So from your experience, what's the difference between theory and practice of deep learning?
Well, most of the research in the deep learning world is a total waste of time.
That's what I was getting at.
It's a problem in science in general.
Scientists need to be published, which means they need to work on things that their peers are extremely familiar with
and can recognize and advance in that area.
So that means that they all need to work on the same thing.
And so, really, there's nothing to encourage them to work on things
that are practically useful.
So you get just a whole lot of research, which is minor advances in stuff that's been
very highly studied and has no significant practical impact.
Whereas the things that really make a difference, like I mentioned transfer learning,
like if we can do better at transfer learning, then it's this like world-changing thing.
We're suddenly like lots more people can do world-class work with less resources and less data,
but almost nobody works on that.
Or another example, active learning, which is the study of like how do we get more out of the human beings in the loop?
That's my favorite topic.
Yeah, so active learning is great, but it's almost nobody working on it because it's just not a trendy thing right now.
You know, sorry to interrupt. You were saying that nobody is publishing
on active learning, but there's people inside companies, anybody who actually has to solve a problem,
they're going to innovate on active learning. Yeah, everybody kind of reinvents active learning
when they actually have to work in practice because they start labeling things and they think,
gosh, this is taking a long time
and it's very expensive.
And then they start thinking, well, why am I labeling everything?
I'm only, the machines only making mistakes
on those two classes.
They're the hard ones.
Maybe I'll just start labeling those two classes
and then you start thinking, well, why did I do that manually?
Why can't I just get the system to tell me
which things are going to be hardest?
It's an obvious thing to do, but yeah, it's
just like transfer learning. It's understudied, and the academic world just has no reason
to care about practical results. The funny thing is, I've only really ever written one
paper. I hate writing papers. And I didn't even write it. It was my colleague, Sebastian
Ruder, who actually wrote it. I just did the research for it.
But it was basically introducing transfer learning,
successful transfer learning to NLP for the first time.
And the algorithm is called ULMFiT.
And I actually wrote it for the course,
for the fast.ai course.
I wanted to teach people NLP.
And I thought, I only want to teach people practical stuff.
And I think the only practical stuff is transfer learning, and I couldn't find any examples of transfer learning in NLP.
So I just did it and I was shocked to find that as soon as I did it, it was, you know, the basic prototype took a couple of days, smashed the state of the art on one of the most important data sets in a field that I knew nothing about. And I just thought, well, this is ridiculous.
And so I spoke to Sebastian about it, and he kindly offered to write up the results.
And so it ended up being published in ACL, which is the top computational
linguistics conference.
So like people do actually care once you do it,
but I guess it's difficult for maybe like junior researchers
or like, I don't care whether I get
citations or papers or whatever.
I don't, there's nothing in my life
that makes that important, which is why I've never
actually bothered to write a paper myself.
But for people who do, I guess they have to pick
the kind of safe option, which is like,
yeah, make a slight improvement on something that everybody's already working on.
Yeah, nobody does anything interesting or succeeds in life with the safe option.
Well, I mean, the nice thing is, nowadays everybody is now working on this kind of transfer learning, because since that time we've had GPT and GPT-2 and BERT, and you know, it's like, yeah, once you
show that something's possible, everybody jumps in, I guess.
So I hope to be a part of it, and I hope to see more innovation in active learning in the same way. I think, yeah,
transfer learning and active learning are fascinating, publicly open work. I actually helped start a startup called Platform.ai,
which is really all about active learning.
And yeah, it's been interesting trying to kind of see what
research is out there and make the most of it.
And there's basically none.
So we've had to do all our own research.
Once again, just as you described.
Can you tell the story of the Stanford competition,
DAWNBench, and
fast.ai's achievement on it?
Sure.
So something which I really enjoy is that I basically teach two courses a year.
The practical deep learning for coders, which is kind of the introductory course and
then cutting edge deep learning for coders, which is the kind of research level course. And while I teach those courses, I have a, I basically have a big office at the University
of San Francisco.
It would be enough for like 30 people and I invite anybody, any student who wants to come
and hang out with me while I build the course.
And so generally it's full.
And so we have 20 or 30 people
in a big office with nothing to do but study deep learning. So it was during one of these times
that somebody in the group said, oh, there's a thing called DAWNBench that looks interesting.
And I was like, what the hell is that? And it turns out it's some competition to see how quickly
you can train a model. Seems kind of not exactly relevant to what we're doing, but it sounds like the kind of thing
which you might be interested in. And I checked it out and was like, oh crap, there's only 10 days
till it's over, it's pretty much too late. And we're kind of busy trying to teach this course.
But we thought it would make an interesting case study for the course. Like, it's all the stuff we're
already doing.
Why don't we just put together our current best practices and ideas?
So me and I guess about four students just decided to give it a go.
And we focused on the smaller one, called CIFAR-10, which is little 32 by 32 pixel images.
Can you say what DAWNBench is?
Yeah, so it's a competition to train a model as fast as possible.
It was run by Stanford. And as cheap as possible, too.
That was also another one, for as cheap as possible.
And there was a couple of categories, ImageNet and CIFAR-10.
So ImageNet is this big 1.3 million image thing that took a couple of days to train.
Remember a friend of mine, Pete Warden, who's now at Google.
I remember he told me how he trained ImageNet a few years ago,
and he basically had this little granny flat out the back
that he turned into his ImageNet training center.
And after a year of work, he figured out how to train it
in like 10 days or something.
It's like, that was a big job.
Well, CIFAR-10, at that time, you could train in a few hours,
it was much smaller and easier.
So we thought we'd try CIFAR-10.
And yeah, I'd really never done that before.
Like, things like using more than one GPU
at a time was something I
tried to avoid, because to me it's very against the whole idea of accessibility. You should be able to do things with one GPU.
I mean, have you asked in the past, before,
after having accomplished something, how do I do this faster, much faster? Oh, always. But it's always, for me,
it's always how do I get it much faster on a single GPU that a normal person could afford in their day-to-day life. It's not how could I do it
faster by, you know, having a huge data center, because to me, it's all about, like, as many
people should be able to use something as possible without fussing around with infrastructure.
So anyway, in this case, it's like, well, we can use a GPU server just by renting an AWS machine.
So we thought we'd try that.
And yeah, basically using the stuff we were already doing, we were able to get, you know, the speed, you know, within a few days,
we had the speed down to, I don't know, a very small number of minutes.
I can't remember exactly how many minutes it was,
but it might have been like 10 minutes or something.
And so yeah, we found ourselves at the top of the leaderboard
easily for both time and money, which really shocked me
because the other people competing in this were like Google
and Intel and stuff were like, no,
a lot more about this stuff than I think we do.
So then we emboldened, we thought,
let's try the ImageNet one, too. I mean, it seemed way out of our league,
but our goal was to get under 12 hours. And we did, which was really exciting, but we didn't
put anything up on the leaderboard, but we were down to like 10 hours. But then Google put in
like five hours or something, and we were just like, oh, we're so screwed. But we kind of thought, well, keep trying. You know, if Google could
do it in five... I mean, Google did it in five hours on like a TPU pod or something, like a
lot of hardware. But we kind of had a bunch of ideas to try. Like a really simple thing was, why are we using these big images?
They're like 224 by 224 or 256 by 256 pixels.
You know, why don't we try small ones?
And just to elaborate, there's a constraint on the accuracy that your train model is supposed to achieve.
Yeah, you've got to achieve 93%,
I think it was, for ImageNet.
Exactly. Which is very tough.
Yeah, 93%. Like, they picked a good threshold. It was
a little bit higher than what the most commonly used ResNet-50 model could achieve at that time.
So yeah, so it's quite a difficult problem to solve.
But yeah, we realized if we actually just used 64 by 64 images,
we could train a pretty good model,
and then we could take that same model and just give it a couple of epochs to learn
224 by 224 images, and it was basically already trained.
It makes a lot of sense, like if you teach somebody, like,
here's what a dog looks like
and you show them low res versions
and then you say here's a really clear picture of a dog.
They already know what a dog looks like.
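Here is a hedged PyTorch sketch of that progressive-resizing idea; the dataset path, epoch counts, and learning rates are placeholders rather than the actual DAWNBench setup.

```python
import torch
from torch import nn
from torchvision import datasets, models, transforms

def make_loader(size: int):
    # Same (hypothetical) image folder, just decoded at a different resolution.
    tfms = transforms.Compose([transforms.Resize((size, size)), transforms.ToTensor()])
    ds = datasets.ImageFolder("data/train", transform=tfms)  # placeholder path
    return torch.utils.data.DataLoader(ds, batch_size=128, shuffle=True)

def train(model, loader, epochs, lr):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()

# ResNets pool adaptively, so the same weights accept 64px and 224px inputs.
model = models.resnet50(num_classes=1000)
train(model, make_loader(64), epochs=10, lr=0.1)   # cheap epochs on small images
train(model, make_loader(224), epochs=2, lr=0.01)  # brief fine-tune at full size
```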
So that, like, just, we jumped to the front
and we ended up winning parts of that competition.
We actually ended up doing a distributed version over multiple
machines a couple of months later and ended up at the top of the leaderboard, we had 18 minutes.
For ImageNet.
Yeah. And people have just kept on blasting through again and again
since then. So what's your view on multi-GPU or multiple-machine training in general, as a way to speed things up?
I think it's largely a waste of time.
Both multi-GPU on a single machine.
Yeah, particularly multi-machines, because it's just clunky.
Multi-GPUs is less clunky than it used to be, but to me anything that slows down your iteration speed
is a waste of time.
So you could maybe do your very last, you know, perfecting of the model on multi-GPUs
if you need to. But, so for example, I think doing stuff on ImageNet is generally
a waste of time.
Why test things on 1.3 million images?
Most of us don't use 1.3 million images.
We've also done research that shows that
doing things on a smaller subset of images
gives you the same relative answers anyway.
So from a research point of view,
why waste that time?
So actually, I released a couple of new datasets recently.
One is called Imagenette,
the French ImageNet, which is a small subset of ImageNet,
which is designed to be easy to classify.
How do you spell Imagenette?
It's got an extra T and E at the end,
because it's very French.
Ah, fine, okay.
And then another one called ImageWoof,
which is a subset of ImageNet
that only contains dog breeds.
And that's a hard one, right?
That's a hard one.
And I've discovered that if you just look at these two subsets, you can train things on
a single GPU in 10 minutes.
And the results you get are directly transferable to ImageNet nearly all the time.
And so now I'm starting to see some researchers start to use these smaller datasets.
I so deeply love the way you think,
because I think you might have written a blog post
saying that sort of going to these big datasets
is encouraging people to not think creatively.
Absolutely.
So it sort of constrains you to train
on large resources and because you have these resources,
you think more resources will be better,
and then you start, so like somehow you kill the creativity.
Yeah, and even worse than that,
like I keep hearing from people who say,
I decided not to get into deep learning
because I don't believe it's accessible to people
outside of Google to do useful work.
So like, I see a lot of people make an explicit decision to not learn this incredibly valuable tool
because they've drunk the Google Kool-Aid, which is that only Google's big enough and
smart enough to do it. And I just find that so disappointing, and it's so wrong. And I think all the major breakthroughs in AI in the next 20 years will be doable on a
single GPU. Like, I would say, my sense is, all the big sort of... let's put it this way.
None of the big breakthroughs of the last 20 years have required multiple GPUs.
Right. So like batch norm, ReLU, dropout.
To demonstrate that there's something to that,
every one of them, none of them has required multiple GPUs.
GANs, the original GANs, didn't require multiple GPUs.
Well, and we've actually recently shown that you don't even need GANs.
So we've developed GAN-level outcomes without needing GANs, and we can now do it, again,
by using transfer learning, we can do it in a couple of hours.
On a single GPU.
And you're using a generative model without the adversarial part?
Yeah, so we've found loss functions that work super well
without the adversarial part.
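Here is a hedged sketch of what such a non-adversarial "feature loss" can look like in PyTorch: compare outputs and targets in the feature space of a pretrained network rather than playing a GAN game. It is a simplified illustration, not the exact loss used by fast.ai or DeOldify.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class FeatureLoss(nn.Module):
    """Pixel L1 plus an L1 penalty in frozen VGG feature space.
    The single-layer cut at index 16 is an illustrative choice."""

    def __init__(self, layer_index: int = 16):
        super().__init__()
        vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:layer_index].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg, self.l1 = vgg, nn.L1Loss()

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Pixel term keeps colors roughly right; the feature term keeps
        # textures and structure perceptually plausible, no discriminator needed.
        return self.l1(pred, target) + self.l1(self.vgg(pred), self.vgg(target))

# Usage sketch: loss = FeatureLoss()(generator(low_res_batch), high_res_batch)
```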
And then one of our students, a guy called Jason Antic,
has created a system called DeOldify,
which uses this technique to colorize old black and white movies.
You can do it on a single GPU, colorize a whole movie
in a couple of hours.
And one of the things that Jason and I did together
was we figured out how to add a little bit of GAN
at the very end, which, it turns out, for colorization,
makes it just a bit brighter and nicer.
And then Jason did masses of experiments to figure out
exactly how much to do, but it's still all done on his home machine, on a single GPU, in his lounge room.
And like if you think about like colorizing Hollywood movies, that sounds like something a huge
studio would have to do, but he has the world's best results on this.
There's this problem of microphones. We're just talking to microphones now.
Yeah.
It's such a pain in the ass to have these microphones
to get good quality audio.
And I tried to see if it's possible to plop down
a bunch of cheap sensors and reconstruct higher quality audio
from multiple sources.
Because right now, I haven't seen work on,
okay, short of using expensive mics,
automatically combining
audio from multiple sources to improve the combined audio.
People haven't done that and that feels like a learning problem.
So hopefully somebody can...
Well, I mean, it's evidently doable and it should have been done by now.
I felt the same way about computational photography four years ago.
That's right.
Why are we investing in big lenses when three cheap lenses,
plus actually a little bit of intentional movement.
so like, hold, you know, like, take a few frames,
gives you enough information to get excellent sub-pixel
resolution, which, particularly with deep learning,
you would know exactly what it was meant to be looking at.
We can totally do the same thing with audio.
I think it's madness that it hasn't been done yet.
Has there been progress on the photography side?
Yeah, in photography it's basically standard now. So the Google Pixel Night Sight, I don't
know if you've ever tried it, but it's astonishing. You take a picture in almost pitch black,
and you get back a very high quality
image. And it's not because of the lens. Same stuff with like adding the bokeh to the,
you know, the background blurring, it's done computationally.
This is the pixel right here.
Yeah, basically, everybody now is doing most of the fanciest stuff on their phones with
computational photography, and also increasingly, people are putting more than one lens on the back of the camera. So,
the same will happen for audio, for sure. And there's applications in the audio side. If you look
at an Alexa-type device, most people I've seen, especially, I worked at Google before, when you
look at background noise removal, you don't think of multiple sources of audio. You don't play with that
as much as I would hope people would. But I mean, you can still do it even with one. Like again,
it's not much work's been done in this area. So we're actually going to be releasing an audio
library soon, which hopefully will encourage development of this because it's so underused.
The basic approach we used for our super resolution, which Jason uses for de-altify of generating
high-quality images,
the exact same approach would work for audio.
No one's done it yet,
but it would be a couple of months' work.
Okay. Also, learning rate, in terms of DAWNBench.
There's some magic on learning rate that you played around with.
It's interesting.
Yeah. So this is all work that came from a guy called Leslie Smith.
Leslie's a researcher who, like us, cares a lot about
just the practicalities of training,
neural networks quickly and accurately,
which you would think is what everybody should care about,
but almost nobody does.
And he discovered something very interesting, which he calls super
convergence, which is there are certain networks that with certain settings of
hyperparameters, could suddenly be trained 10 times faster by using a 10 times higher
learning rate. Now, no one published that paper, because it's not an area of kind of active research in the academic world. No academics
recognized that this was important. And also, deep learning in academia is not considered an
experimental science. So unlike in physics where you could say like, I just saw a subatomic
particle do something which the theory doesn't explain, you could publish
that without an explanation.
And then the next 60 years, people can try to work out how to explain it.
We don't allow this in the deep learning world.
So, it's literally impossible for Leslie to publish a paper that says, I've just seen something
amazing happen.
This thing trained 10 times faster than it should have.
I don't know why.
And so, the reviewers were like, we can't publish that because you don't know why. So anyway,
that's important to pause on because there's so many discoveries that would need to start
like that. Every other scientific field I know of works
that way. I don't know why ours is uniquely disinterested in publishing unexplained
experimental results, but there it is. So it wasn't published.
Having said that, I read a lot more unpublished papers and published papers because that's where
you find the interesting insights. So I absolutely read this paper and I was just like, this is
astonishingly mind-blowing and weird and awesome.
And like, why isn't everybody only talking about this?
Because like, if you can train these things 10 times faster,
they also generalize better because you're doing less epochs,
which means you look at the data less, you get better accuracy.
So I've been kind of studying that ever since.
And eventually, Leslie kind of figured out a lot of how to
get this done, and we added minor tweaks. And a big part of the trick is starting at a very
low learning rate, very gradually increasing it. So as you're training your model, you take
very small steps at the start, and you gradually make them bigger and bigger, until eventually
you're taking much bigger steps than anybody thought was possible.
There's a few other little tricks to make it work,
but basically, we can reliably get super convergence.
And so for the dawn bench thing,
we were using just much higher learning rates
than people expected to work.
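A hedged sketch of that kind of schedule using PyTorch's built-in OneCycleLR; the model, data, and max_lr here are stand-ins. The point is the ramp up to a large learning rate followed by an anneal back down.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Linear(100, 10)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 5, 200
sched = OneCycleLR(opt, max_lr=1.0, epochs=epochs, steps_per_epoch=steps_per_epoch)

loss_fn = nn.CrossEntropyLoss()
for _ in range(epochs):
    for _ in range(steps_per_epoch):
        x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))  # stand-in batch
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        sched.step()  # per batch: LR ramps up early on, then decays to near zero
```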
What do you think the future of,
I mean, it makes so much sense for that
to be a critical hyperparameter learning rate
that you vary.
What do you think the future of learning rate magic looks like?
Well, there's been a lot of great work in the last 12 months in this area.
And people are increasingly realizing that, with optimizers, like, we just have no idea really
how optimizers work.
And the combination of weight decay, which is how we regularize optimizers, and the learning rate,
and then other things like the Epsilon we use in the atom optimizer,
they all work together in weird ways.
And different parts of the model,
this is another thing we've done a lot of work on,
is research into how different parts of the model should be trained at different rates in different ways.
So we do something we call discriminative learning rates,
which is really important, particularly for transfer learning.
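A hedged sketch of discriminative learning rates expressed as PyTorch parameter groups; the two-way split and the specific rates are illustrative assumptions rather than fast.ai's defaults.

```python
import torch
from torch import nn
from torchvision import models

# Give the pretrained body a much smaller learning rate than the new head.
model = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)   # placeholder: 10 classes

body = [p for name, p in model.named_parameters() if not name.startswith("fc")]
opt = torch.optim.AdamW(
    [
        {"params": body, "lr": 1e-5},                   # pretrained layers: tiny steps
        {"params": model.fc.parameters(), "lr": 1e-3},  # fresh head: larger steps
    ],
    weight_decay=1e-2,
)
```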
So really, I think in the last 12 months,
a lot of people have realized that all this stuff is important.
There's been a lot of great work coming out.
And we're starting to see algorithms appear,
which have very, very few dials, if any, that you have to touch.
So I think what's going to happen is the idea of a learning rate.
Well, it almost already has disappeared in the latest research.
And instead, it's just like, you know, we know enough about how to interpret the
gradients and the change of gradients we see to know how to set every parameter the way we want it.
So do you see the future of deep learning,
where is the input of a human expert needed?
Well, hopefully the input of the human expert
will be almost entirely unneeded
from the deep learning point of view.
So again, like Google's approach to this
is to try and use thousands of times more compute
to run lots and lots of models
at the same time and hope that one of them is good, kind of neural architecture search, AutoML kind of stuff, which I think is insane. When you better understand the mechanics of how models
learn, you don't have to try a thousand different models to find which one happens to work the best.
You can just jump straight to the best one,
which means that it's more accessible in terms of compute,
cheaper, and also, with fewer hyperparameters to set, it means you don't need deep learning experts to train your deep learning model for you,
which means that domain experts can do more of the work, which means that now you can focus the human time on the kind of interpretation,
the data gathering, identifying model errors and stuff like that.
Yeah, the data side.
How often do you work with data these days in terms of the cleaning, looking at it, like
Darwin looked at different species while traveling about, do you look at data?
Given your roots in Kaggle.
Always look at the data.
Yeah, I mean, it's a key part of our course.
It's like before we train a model in the course,
we see how to look at the data.
And then, after we train our first model, which is fine-tuning an ImageNet model for five minutes, the thing we immediately do is learn how to analyze
the results of the model by looking at examples of misclassified images,
looking at a confusion matrix,
and then doing research on Google
to learn about the kinds of things that it's misclassifying.
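That workflow looks roughly like this in fastai. The names follow a recent fastai version and may differ from the release used in the course, so treat it as a sketch rather than the course's exact code:

```python
from fastai.vision.all import *

# Fine-tune an ImageNet-pretrained model on a standard pet-breeds dataset.
path = untar_data(URLs.PETS) / "images"
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path), pat=r"(.+)_\d+\.jpg$", item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

# Then immediately look at what the model gets wrong.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(8, 8))
interp.plot_top_losses(9)        # the most confidently misclassified images
```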
So to me, one of the really cool things
about machine learning models in general
is that when you interpret them,
they tell you about things like,
what are the most important features,
which groups you're misclassifying, and they help you become a domain expert more quickly, because you can focus your time on the bits that the model is telling you are important.
So it lets you deal with things like data leakage. For example, if it says, oh, the main feature I'm looking at is customer ID, you're like, hang on, customer ID shouldn't be predictive. And then you can talk to the people that manage customer
IDs, and they'll tell you, oh, yes.
As soon as a customer's application is accepted,
we add a one on the end of their customer ID.
Or something like that.
So yeah, looking at the data, particularly from the lens of which parts of the data the model says are important, is super important.
Yeah, and using the model to almost debug the data, to learn more about this.
Exactly.
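A small, entirely synthetic illustration of the customer-ID anecdote: fit a quick model, look at feature importances, and the leaky column jumps out. The column names and the leakage rule are invented here just to mirror the story.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "customer_id": np.arange(1000),
    "income": rng.normal(size=1000),
    "age": rng.integers(18, 80, size=1000),
})
# Simulated leakage: "accepted" applications had a digit appended to their ID,
# so the label is recoverable from the ID alone.
y = (df["customer_id"] % 10 == 1).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, y)
for name, imp in sorted(zip(df.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:12s} {imp:.3f}")   # customer_id dominating is the red flag
```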
What are the different cloud options for training neural networks? A last question related to DAWNBench.
Well, it's part of a lot of the work you do, but from a perspective of performance, I
think you've written
this in a blog post. There's AWS, there's the TPU from Google. What's your sense of what the future holds? What would you recommend now?
Right. So from a hardware point of view, Google's TPUs and the best Nvidia GPUs are similar. I mean, maybe the TPUs are 30% faster, but they're also much harder to program.
There isn't a clear leader in terms of hardware right now, although, much more importantly, Nvidia's GPUs are much more programmable. They've got much more software written for them, so that's the clear leader for me and where
I would spend my time as a researcher and practitioner.
But in terms of the platform, I mean, we're super lucky now with stuff like GCP, Google Cloud, and AWS, that you can access a GPU pretty quickly and easily. But I mean, for AWS, it's still too hard: you have to find an AMI, get the instance running, then install the software you want, and blah, blah, blah. GCP is
currently the best way to get started on a full server environment because they have
a fantastic fastai and PyTorch ready-to-go instance,
which has all the courses pre-installed.
It has Jupyter Notebook pre-running.
Jupyter Notebook is this wonderful interactive computing system
which everybody basically should be using
for any kind of data-driven research.
But then even better than that,
there are platforms like Salamander, which we own, and Paperspace,
where literally you click a single button and it pops up a Jupyter notebook straight away, without any kind of installation or anything, and all the course notebooks are pre-installed.
So, like, for me,
this is one of the things we spent a lot of time
kind of curating and working on.
Because when we first started our courses,
the biggest problem was people dropped out of lesson one
because they couldn't get an AWS instance running.
So things are so much better now.
And we actually have, if you go to course.fast.ai, the first thing it says is,
here's how to get started with your GPU.
And it's like, you just click on the link
and you click start and you're going.
And it'll, you know, take you to GCP.
I have to confess, I've never used Google GCP.
Yeah, GCP gives you $300 of compute for free,
which is really nice.
Though, as I say, Salamander and Paperspace are even easier still.
Okay.
Okay.
So from the perspective of deep learning frameworks,
you work with fastai, the framework, and PyTorch intensively. What are the strengths of each platform, from your perspective?
Sure.
So in terms of what we've done our research on and taught in our course, we started with
Theano and Keras, and then we switched to TensorFlow and Keras, and then we switched to PyTorch, and then we switched to PyTorch and fastai.
And that kind of reflects a growth and development of the ecosystem of deep learning libraries.
Theano and TensorFlow were great, but were much harder to teach and do research and development on because they define what's called a computational graph up front, a static graph, where you basically have to say, here are all the things that I'm going to eventually do
in my model, and then later on you say,
okay, do those things with this data.
And you can't debug them, you can't do them step-by-step,
you can't program them interactively
in a Jupyter Notebook and so forth.
PyTorch was not the first, but PyTorch was certainly
the strongest entrant to come along and say, let's not do it that way, let's just use normal
Python. And everything you know about in Python is just going to work, and we'll
figure out how to make that run on the GPU as and when necessary.
That turned out to be a huge leap in terms of what we could do with
our research and what we could do with our teaching.
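Concretely, "just use normal Python" means every operation runs immediately, so you can print intermediate tensors, drop into a debugger, and use ordinary control flow. A minimal PyTorch sketch of that define-by-run style:

```python
import torch

x = torch.randn(4, 3, requires_grad=True)
w = torch.randn(3, 2, requires_grad=True)

h = x @ w                        # executes right now, no graph declared up front
if h.mean() > 0:                 # plain, data-dependent Python control flow
    h = torch.relu(h)
print(h.shape, h.mean().item())  # inspect intermediate values as you go

loss = (h ** 2).mean()
loss.backward()                  # gradients appear on the leaf tensors
print(w.grad.norm().item())
```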
Because it was limiting.
Yeah, I mean, it was critical for us for something like dawn bench to be able to rapidly
try things.
It's just so much harder to be a researcher and practitioner when you have to do everything
up front and you can't inspect it.
The problem with PyTorch is that it's not at all accessible to newcomers, because you
have to like write your own training loop and manage the gradients and all
this stuff. And it's also like not great for researchers because you're
spending your time dealing with all this boilerplate and overhead rather than
thinking about your algorithm. So we ended up writing this very multi-layered
API that at the top level you
can train a state-of-the-art neural network in three lines of code, and which kind of talks to an API, which talks to an API, which talks to an API, which you can dive into at any level and get progressively closer to the machine levels of control. And this is the fastai library. That's been critical for us and for our students
and for lots of people that have won
big deep learning competitions with it
and written academic papers with it.
It's made a big difference.
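For contrast, this is the kind of boilerplate a plain PyTorch training loop involves, which a higher-level library like fastai folds into a call or two such as fit_one_cycle. The data here is a synthetic stand-in just so the loop runs:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data standing in for a real dataset.
ds = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
dl = DataLoader(ds, batch_size=32, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Linear(20, 50), torch.nn.ReLU(), torch.nn.Linear(50, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):                 # the loop you end up rewriting every time
    for xb, yb in dl:
        pred = model(xb)               # forward pass
        loss = loss_fn(pred, yb)       # compute loss
        loss.backward()                # gradients
        opt.step()                     # parameter update
        opt.zero_grad()                # reset gradients
    print(epoch, loss.item())
```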
We're still limited though by Python.
And particularly this problem with things like recurrent neural nets, say, where you just can't change things unless you accept it going so slowly that it's impractical.
So in the latest incarnation of the course and with some of the research we're starting
to do some stuff in Swift.
I think we're three years away from that being super practical,
but I'm in no hurry, I'm very happy to invest the time
to get there.
But with that, we actually already have a nascent version
of the fastai library for vision running on Swift for TensorFlow.
Because Python for TensorFlow is not going to cut it. It's just a disaster.
What they did was they tried to replicate the bits that people were saying they like about PyTorch.
This kind of interactive computation, but they didn't actually change their foundational runtime components. So they kind of added this syntactic sugar they call TF eager, TensorFlow eager, which makes it look a lot like PyTorch, but it's 10 times slower than PyTorch to actually do a step, because they didn't invest the time in retooling the foundations, because their code base is so horribly complex.
Yeah, I think it's probably very difficult to do that kind of retooling.
Yeah, well, particularly the way TensorFlow was written, it was written by a lot of people
very quickly in a very disorganized way. So like when you actually look in the code as I do often,
I'm always just like, oh god, what were they thinking? It's just it's pretty awful.
So I'm really extremely negative about the potential future for Python TensorFlow.
But Swift, for TensorFlow, can be a different beast altogether.
It can basically be a layer on top of MLIR that takes advantage of
all the great compiler stuff that Swift builds on with LLVM.
Yeah, it could be absolutely, I think it will be
absolutely fantastic.
Well, you're inspiring me to try. I haven't truly felt the pain of TensorFlow 2.0 Python; it's been fine for me so far.
Yeah, I mean, it does the job if you're using like predefined
things that somebody's already written.
But if you actually compare, you know, like I've had to do, because I've been having to do a lot of stuff with TensorFlow recently, you actually compare like, okay, I want to write something from scratch, and I just keep finding it's like, oh, it's running 10 times slower than PyTorch.
So is the biggest cost, let's throw running time out the window, how long it takes you to program?
That's not too different now.
Thanks to TensorFlow Eager, that's not too different.
But because so many things take so long to run,
you wouldn't run it at 10 times slower.
Like you're just going, oh, is this taking too long?
And also, there are a lot of things that are just less programmable, like tf.data, which is the way data processing works in TensorFlow; it's just this big mess.
It's incredibly inefficient.
And they kind of had to write it that way because of the TPU problems I described earlier.
So I just, you know, I just feel like they've got this huge technical debt, which they're
not going to solve without starting from scratch.
Here's an interesting question then. If there's a new student starting today, what would
you recommend they use?
Well, I mean, we obviously recommend fastai and PyTorch, because we teach new students,
and that's what we teach with. So we would very strongly recommend that because it will let you get on top of the concepts
much more quickly.
So then you'll become effective sooner.
And you'll also learn the actual state-of-the-art techniques.
So you'll actually get world-class results.
Honestly, it doesn't much matter what library you learn
because switching from Chainer to MXNet to TensorFlow to PyTorch is going to be a couple of days' work, as long as you understand the foundations well.
But do you think Swift will creep in there as a thing that people start using?
Not for a few years, particularly because Swift has no data science community, libraries, or tooling, and the Swift community has a total lack of appreciation and understanding of numeric computing. So they keep on making stupid decisions, you know; for years they've just done dumb things around performance and prioritization.
That's clearly changing now, because the developer of Swift, Chris Lattner, is working at Google on Swift for TensorFlow, so that's a priority.
It'll be interesting to see what happens with Apple because like Apple hasn't
shown any sign of caring about numeric programming in Swift. So, I mean, hopefully they'll get
off their ass and start appreciating this because currently all of their low-level libraries
are not written in Swift. They're not particularly Swifty at all, stuff like Core ML.
They're really pretty rubbish.
So there's a long way to go.
But at least one nice thing is that Swift for TensorFlow can actually directly use Python code
and Python libraries.
Literally, the entire lesson one notebook of fastai runs in Swift right now, in Python mode. So that's a nice intermediate thing.
How long does it take? If you look at the two fast.ai courses, how long does it take to get from point zero to completing both courses?
It varies a lot. Somewhere between two months and two years, generally.
So for two months, how many hours a day?
So, like, somebody who is a very competent coder can do 70 hours per course.
70? That's it? Okay. But a lot of people I know
take a year off to study fast.ai full-time, and say at the end of the year they feel pretty competent, because generally there are a lot of other things you do. Like, generally they'll be entering Kaggle competitions.
They might be reading Ian Goodfellow's book.
They might, you know, they'll be doing a bunch of stuff.
And often, you know, particularly if they are a domain expert, their coding skills might
be a little on the pedestrian side.
So part of it's just doing a lot more writing of code.
What do you find is the bottleneck for people, usually, except getting started and setting stuff up?
I would say coding. Yeah, I would say the people who are strong coders pick it up the best.
Although another bottleneck is people who have a lot of experience of
classic statistics can really struggle, because the intuition is so opposite to what they're used to.
They're very used to trying to reduce the number of parameters
in their model and looking at individual coefficients
and stuff like that.
So I find people who have a lot of coding background
and know nothing about statistics
are generally going to be the best off.
So you taught several courses on deep learning, and as Feynman says, the best way to understand something is to teach it.
What have you learned about deep learning from teaching it?
A lot.
It's a key reason for me to teach the courses.
Obviously it's going to be necessary to achieve our goal of getting domain experts to be familiar with deep learning, but it was also necessary for me to achieve my goal
of being really familiar with deep learning.
I mean, to see so many domain experts from so many different backgrounds, it's definitely,
I wouldn't say taught me,
but convinced me of something that I'd liked to believe was true, which is that anyone can do it.
So there's a lot of kind of snobbishness out there
about only certain people can learn to code.
Only certain people are going to be smart enough to do AI.
That's definitely bullshit.
I've seen so many people from so many different backgrounds
get state-of-the-art results in their domain areas now. It's definitely taught me that the key
differentiator between people that succeed and people that fail is tenacity, that seems to be
basically the only thing that matters. A lot of people give up. But of the ones who don't give up, pretty much everybody
succeeds, you know, even if at first I'm just kind of like thinking like, wow, they really
aren't quite getting it yet, are they? But eventually people get it and they succeed.
So I think they're both things I'd liked to believe were true, but
I don't feel like I really had strong evidence for them to be true, but now I can say I've
seen it again and again.
So what advice do you have for someone who wants to get started in deep learning?
Train lots of models.
That's how you learn it. So I think our courses are very good, but also lots of people independently say they're very good. It recently won the CogX Award for best AI course in the world. So come to our course, course.fast.ai.
The thing I keep on harping on in my lessons is train models, print out the inputs
to the models, print out the outputs of the models, like, study them, you know, change the
inputs a bit, look at how the outputs vary, just run lots of experiments to get a, you
know, an intuitive understanding of what's going on.
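The kind of quick experiment being described can be as simple as nudging one input at a time and watching how the output moves. An untrained toy model here just to show the shape of it; with a real model you would do the same thing on real examples:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
x = torch.randn(1, 10)

base = model(x)
for i in range(10):                    # perturb one feature at a time
    x2 = x.clone()
    x2[0, i] += 0.5
    delta = (model(x2) - base).item()
    print(f"feature {i}: output moved by {delta:+.4f}")
```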
To get hooked, do you think, you mentioned training, do you think just running the models, inference? Like, if we talk about getting started.
No, you've got to fine-tune the models. That's the critical thing, because at that point you now have a model that's in your domain area. There's no point running somebody else's model, because it's not your model. And it only takes five minutes to fine-tune a model for the data you care about.
And in lesson two of the course,
we teach you how to create your own data set
from scratch by scripting Google Image Search.
So, and we show you how to actually create
a web application running online.
So I create one in the course that differentiates
between a teddy bear, a grizzly bear and a brown bear.
And it does it with basically 100% accuracy.
It took me about four minutes to scrape the images from Google Search from the script.
There are little graphical widgets we have in the notebook that help you clean up the dataset.
There's other widgets that help you study the results to see where the errors are happening.
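A sketch of that lesson-two style workflow with fastai, assuming you have already gathered a text file of image URLs per class from some image-search script. The URL files are placeholders, and the exact helper names vary across fastai versions:

```python
from fastai.vision.all import *

path = Path("bears")
for label in ["teddy", "grizzly", "brown"]:
    # each <label>.txt is a file of image URLs gathered by your search script
    download_images(path / label, url_file=f"{label}.txt")

# Drop anything that didn't download as a valid image.
failed = verify_images(get_image_files(path))
failed.map(Path.unlink)

dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)   # a few minutes on a single GPU for a small scraped dataset
```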
And so now we've got over a thousand replies in our share-your-work-here thread of students saying, here's the thing I built. And a lot of them are state of the art. Like, somebody said, oh, I tried looking at Devanagari characters, and I couldn't believe it, the thing that came out was more accurate than the best academic paper, after lesson one.
And then there's others which are just more kind of fun like somebody is doing Trinidad and Tobago hummingbirds. She said that's kind of
their national bird and she's got something that can now classify Trinidad and Tobago hummingbirds.
So yeah, train models, fine-tune models with your dataset and then study their inputs and outputs.
How much is fast.ai, the course?
It's free.
Everything we do is free.
We have no revenue sources of any kind.
It's just a service to the community.
You're a saint.
Okay, once a person understands the basics,
trains a bunch of models,
if we look at the scale of years,
what advice do you have for someone
wanting to eventually
become an expert?
Just train lots of models. Specifically, train lots of models in your domain area.
An expert at what?
We don't need more experts at, like, creating slightly evolutionary research in areas that everybody's
studying. We need experts at using deep learning to diagnose malaria.
We need experts at using deep learning to analyze language to study media bias.
We need experts in analyzing fisheries to identify problem areas in the ocean.
That's what we need.
Become the expert in your passion area.
This is a tool which you can use for just about anything.
You'll be able to do that thing better than other people,
particularly by combining it with your passion and domain expertise.
That's really interesting.
Even if you do want to innovate on transfer learning
or active learning, your thought is, and it's one I certainly share, that you also need to find a domain or a dataset that you actually really care about.
Right.
If you're not working on a real problem
that you understand, how do you know if you're doing it any good? How do you know if your results are good? How do you know if you're getting bad results? Why are you getting bad results?
Is it a problem with the data? How do you know you're doing anything useful? Yeah, to me, the only
really interesting research is not the only, but the vast majority of interesting research is like
try and solve an actual problem and solve it really well. So both understanding sufficient tools on the deep learning side and becoming a domain
expert in a particular domain are really things within reach for anybody.
Yeah, I mean, to me, I would compare it to like studying self-driving cars, having never
looked at a car or been in a car or turned a car on.
Right. You know, which is like the way it is for a lot of people, they'll study some academic
data set where they literally have no idea about that.
Either way, I'm not sure how familiar you are with autonomous vehicles, but that literally describes a large percentage of robotics folks working on self-driving cars: they actually haven't considered driving. They haven't actually looked at what driving looks like. They haven't driven. So it's...
And it's a problem because you know when you've actually driven, you know, like these
are the things that happened to me when I was driving.
There's nothing that beats the real world examples of just experiencing them.
You've created many successful startups.
What does it take to create a successful startup?
Same thing as becoming a successful deep learning practitioner, which is not giving up.
So you can run out of money or run out of time
or run out of something, you know,
but if you keep costs super low and try and save
up some money beforehand, so you can afford to have some time, then just sticking with
it is one important thing. Doing something you understand and care about is important.
By something, I don't mean... The biggest problem I see with deep learning people is that they do a PhD in deep learning and then they try and commercialize their PhD. That is a waste of time, because it doesn't solve an actual problem. You picked your PhD topic because it was an interesting kind of
engineering or math or research exercise. But yeah, if you've actually spent time as a recruiter, and you know that most of your time was spent
sifting through resumes,
and you know that most of the time
you're just looking for certain kinds of things,
and you can try doing that with a model
for a few minutes and see whether that's something
which a model seems to be able to do
as well as you could,
then you're on the right track to creating a startup.
And then I think just being pragmatic
and trying to stay away from venture capital money
as long as possible, preferably forever.
So yeah, on that point, venture capital: were you able to successfully run your startups self-funded?
Yeah. So my first two were self-funded and that was the right way to do it.
That's scary.
No, VC-backed startups are so much more scary, because you have these people on your back who do this all the time and who have done it for years telling you grow, grow, grow, grow. And they don't care if you fail. They only care if you don't grow fast enough. So that's scary. Whereas doing the ones myself,
well, with partners who were friends, it's nice, because we just went along at a pace that made sense, and we were able to build it to something which was big enough that we never had to work again, but it was not big enough that any VC would think it was impressive. And that was enough for
us to be excited, you know. So I think that's a much better way to do things than how most people do it.
Generally speaking, not just for yourself, how do you make money during that process? Do you cut into savings as you go?
So yeah, I started FastMail and Optimal Decisions at the same time in 1999 with two different friends. And for FastMail,
I guess I spent $70 a month on the server. And when the server ran out of space,
I put a payments button on the front page and said
if you want more than 10 megabytes, you have to pay $10 a year.
And so, run lean, keep your costs down.
Yeah, so I kept my costs down, and once I needed to spend more money, I asked people to spend the money for me, and that was basically it; from then on we were making money and I was profitable.
For Optimal Decisions, it was a bit harder because we were trying to sell something that was
more like a one million dollar sale but what we did was we would sell scoping projects, so kind of like prototype projects,
but rather than being free,
we would sell them for $50,000 to $100,000.
So again, we were covering our costs
and also making the client feel like
we were doing something valuable.
So in both cases, we were profitable from six months in.
Ah, nevertheless, it's scary.
I mean, yeah, sure.
I mean, it's scary before you jump in,
and I just, I guess I was comparing it to this
scarediness of VC.
I felt like with VC stuff, it was more scary.
You kind of, much more in somebody else's hands,
you know, will they fund you or not,
and what do they think of what you're doing?
I also found it very difficult, with VC-backed startups, to actually do the thing which I thought was important for the company, rather than doing the thing
which I thought would make the VCs happy.
Now, VCs always tell you not to do the thing
that makes them happy, but then if you don't do the thing
that makes them happy, they get sad.
So.
And do you think optimizing for the, whatever they call it, the exit, is a good thing to optimize for?
I mean, it can be, but not at the VC level because the VC exit needs to be, you know,
a thousand X.
Whereas for the lifestyle exit, if you can sell something for $10 million, then you've made it, right?
So I don't, it depends.
If you want to build something that you're going to be happy to do forever, fine. If you want to build something you want to sell in three years' time, that's fine too.
I mean, they're both perfectly good outcomes.
So, you're learning Swift?
No. In a way; I mean, I haven't had time to.
And I read that you use, at least in some cases, spaced repetition as a mechanism for learning new things.
I use Anki quite a lot. Yeah.
Sure. I actually never talk to anybody about it. I don't know how many people do it, but it works incredibly well for me. Can you talk through your experience? Like, how did you, what do you, but first of all, okay, let's back it up. What is spaced repetition?
So spaced repetition is an idea created by a psychologist named Ebbinghaus. I don't know, must be a couple of hundred years ago or something, 150 years ago. He
did something which sounds pretty damn tedious.
He wrote down random sequences of letters on cards and tested how well he would remember
those random sequences a day later or a week later or whatever.
He discovered that there was this kind of a curve where his probability of remembering one of them would be dramatically smaller the next day and then a little bit smaller the next day and a little bit smaller the next day.
What he discovered is that if he revised those cards after a day, the probabilities would decrease at a smaller rate.
And then if he revised them again a week later, they would decrease at a smaller rate again. And so he basically figured out a roughly optimal equation for when you should revise something you want to remember.
So spaced repetition learning is using this simple algorithm, just something like: revise something after a day, and then three days, and then a week, and then three weeks, and so forth. And so if you use a program like Anki, as you know, it will just do that for you. And it will say, did you remember this? And if you say no, it will reschedule it back to come up again, like, 10 times faster than it otherwise would have.
It's a kind of a way of being guaranteed to learn something,
because by definition, if you're not learning it,
it will be rescheduled to be revised more quickly.
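The scheduling idea is simple enough to sketch in a few lines: grow the review interval each time you remember a card, and collapse it when you forget. Real schedulers like Anki's use per-card ease factors and more nuance, so this is only the rough shape of the algorithm:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class Card:
    prompt: str
    interval_days: int = 1              # current gap between reviews
    due: date = field(default_factory=date.today)

def review(card, remembered, today=None):
    today = today or date.today()
    if remembered:
        # success: push the next review further out (roughly 1 -> 2 -> 5 -> 12 days...)
        card.interval_days = max(2, int(card.interval_days * 2.5))
    else:
        card.interval_days = 1          # forgot: see it again tomorrow
    card.due = today + timedelta(days=card.interval_days)

card = Card("how to pronounce 你好")
review(card, remembered=True)
print(card.due, card.interval_days)     # next review pushed further out
```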
Unfortunately, though, it's also like,
it doesn't let you fool yourself.
If you're not learning something,
you know, your revisions will just keep piling up more and more.
So you have to find ways to learn things
productively and effectively like treat your brain well.
So using like mnemonics and stories and
context and stuff like that.
So yeah, it's a super great technique. It's like learning how to learn is something which
everybody should learn before they actually learn anything.
But almost nobody does.
So it certainly works well for learning new languages, for, I mean, for learning small projects almost. But, you know, I started using it for, there's a whole lot of blog posts about this that inspired me. It might have been you, I'm not sure. I started, when I read papers, taking the concepts and ideas and putting them in.
Was it Michael Nielsen?
It was Michael.
Yeah, so Michael started doing this recently
and he's been writing about it.
So, the kind of today's Ebbinghaus is a guy called Piotr Wozniak,
who developed a system called SuperMemo,
and he's been basically trying to become like
the world's greatest Renaissance man
over the last few decades.
He's basically lived his life
with spaced repetition learning for everything.
And, sort of like,
Michael's only very recently got into this,
but he started really getting
excited about doing it for a lot of different things.
For me personally, I actually don't use it for anything except Chinese.
And the reason for that is that Chinese is specifically a thing I made a conscious decision
that I want to continue to remember, even if I don't get much of a chance to exercise it,
because I'm not often in China, so I don't.
Whereas for something like programming languages or papers,
I have a very different approach,
which is I try not to learn anything from them,
but instead I try to identify the important concepts
and actually ingest them.
So really understand that concept deeply and study it carefully, decide if it really is important, and, if it is, incorporate it into our library, you know, incorporate it into how I do things, or decide it's not worth it.
So I find I then remember the things that I care about
because I'm using it all the time.
So I feel, for the last 25 years,
I've committed to spending at least half of every day
learning or practicing something new,
which is all my colleagues have always hated because
it always looks like I'm not working on what I'm meant to be working on, but that always
means I do everything faster because I've been practicing a lot of stuff. So I kind of
give myself a lot of opportunity to practice new things, and so I find now I don't often find myself wishing I could remember
something because if it's something that's useful, then I've
been using it a lot.
It's easy enough to look it up on Google.
But speaking Chinese, you can't look it up on Google.
Do you have advice for people learning new things?
So what have you learned as a process?
I mean, it all starts with just making the hours in the day available.
Yeah, you've got to stick with it, which is, again, the number one thing that 99% of people don't do.
So, the people I started learning Chinese with, none of them were still doing it 12 months later. I'm still doing it 10 years later. I tried to stay in touch with them, but no one kept it up.
For something like Chinese, like study how human learning works.
So every one of my Chinese flashcards
is associated with a story.
And that story is specifically designed to be memorable.
And we find things memorable,
which are like funny or disgusting or sexy
or related to people that we know or care about.
So I try to make sure all the stories that are in my head
have those
characteristics. Yeah, so you have to, you know, you won't remember things well if they don't have
some context. And yeah, you won't remember them well if you don't regularly practice them,
whether it be just part of your day-to-day life or, for the Chinese in my case, flashcards. I mean, the other thing is, let yourself fail sometimes.
So like, I've had various medical problems
over the last few years,
and basically my flashcards just stopped
for about three years.
And then there have been other times
I've stopped for a few months,
and it's so hard because you get back to it,
and it's like, you have 18,000 cards due.
It's like, and so you just have to go, all right, well, I can either stop and give up everything
or just decide to do this every day for the next two years until I get back to it.
The amazing thing has been that even after three years, you know, the Chinese were still
in there.
Like, yeah, it was so much faster to relearn than it was to learn the first time.
Yeah, absolutely. It's in there. I have the same with guitar, with music and so on.
It's sad because work sometimes takes you away, and then you won't play for a year.
But really, if you then just get back to it every day, you're right there again.
What do you think is the next big breakthrough
in artificial intelligence?
What are your hopes in deep learning or beyond
that people should be working on
or you hope there'll be breakthroughs?
I don't think it's possible to predict.
I think what we already have is an incredibly powerful
platform to solve lots of
societally important problems that are currently unsolved. So I just hope that lots of people will learn this toolkit and try to use it.
I don't think we need a lot of
new technological breakthroughs to do a lot of great work right now and
Do you think we're going to create a human-level intelligence system? Do you think, I don't know, how hard is it? How far away are we?
Don't know. Don't know. I have no way to know. I don't know, like, I don't know why people make
predictions about this, because there's no data and nothing to go on. And it's just like,
there are so many societally important problems to solve right now. I just don't find it a really interesting question to even answer.
So in terms of societally important problems, what's a problem that is within reach? For example, there are problems that AI creates, right? So most specifically,
labor force displacement is going to be huge and people keep making this
fearless econometric argument of being like, oh, there have been other things that aren't AI that have come along before and haven't created massive labor force displacement, therefore AI won't.
So that's a serious concern for you?
Oh, yeah. Andrew Yang is running on it. Yeah, I'm desperately concerned. And you see already that the changing workplace has led to a hollowing out of the middle class. You're seeing that students coming out of school today have a less rosy financial future ahead of them than their parents did, which has never happened in the last 300 years.
You know, we've always had progress before.
And you see this turning into anxiety and despair and even violence.
So I very much worry about that.
You've written quite a bit about ethics too. I do think that every data scientist working with deep learning needs to recognize they have
an incredibly high leverage tool that they're using that can influence society in lots of ways.
And if they're doing research, that research is going to be used by people doing this
kind of work.
And they have a responsibility to consider the consequences and to think about things like
how will humans be in the loop here? How do we avoid runaway feedback loops? How do we ensure an
appeals process for humans that are impacted by my algorithm? How do I ensure that the constraints
of my algorithm are adequately explained to the people that end up using them.
There are all kinds of human issues which only data scientists are actually in the right place to educate people about, but data scientists tend to think of themselves as just engineers and that they don't need to be part of that process.
Yeah, which is wrong.
Well, you're in the perfect position to educate them
better, to read literature, to read history,
to learn from history.
Well, Jeremy, thank you so much for everything you do, for inspiring a huge number of people, getting them into deep learning, and having the ripple effects, the flap of a butterfly's wings, that'll probably change the world. So thank you very much.
Cheers.