Lex Fridman Podcast - #206 – Ishan Misra: Self-Supervised Deep Learning in Computer Vision
Episode Date: July 31, 2021

Ishan Misra is a research scientist at FAIR working on self-supervised visual learning. Please support this podcast by checking out our sponsors:
- Onnit: https://lexfridman.com/onnit to get up to 10% off
- The Information: https://theinformation.com/lex to get 75% off first month
- Grammarly: https://grammarly.com/lex to get 20% off premium
- Athletic Greens: https://athleticgreens.com/lex and use code LEX to get 1 month of fish oil

EPISODE LINKS:
Ishan's twitter: https://twitter.com/imisra_
Ishan's website: https://imisra.github.io
Ishan's FAIR page: https://ai.facebook.com/people/ishan-misra/

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
YouTube Full Episodes: https://youtube.com/lexfridman
YouTube Clips: https://youtube.com/lexclips

SUPPORT & CONNECT:
- Check out the sponsors above, it's the best way to support this podcast
- Support on Patreon: https://www.patreon.com/lexfridman
- Twitter: https://twitter.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Medium: https://medium.com/@lexfridman

OUTLINE:
Here's the timestamps for the episode. On some podcast players you should be able to click the timestamp to jump to that time.
(00:00) - Introduction
(07:49) - Self-supervised learning
(16:24) - Self-supervised learning is the dark matter of intelligence
(20:17) - Categorization
(28:50) - Is computer vision still really hard?
(32:35) - Understanding Language
(42:14) - Harder to solve: vision or language
(48:59) - Contrastive learning & energy-based models
(52:59) - Data augmentation
(57:19) - Fixed audio spike by lowering sound with pen tool
(1:05:33) - Real data vs. augmented data
(1:09:16) - Non-contrastive learning energy based self supervised learning methods
(1:12:54) - Unsupervised learning (SwAV)
(1:15:37) - Self-supervised Pretraining (SEER)
(1:20:44) - Self-supervised learning (SSL) architectures
(1:26:43) - VISSL pytorch-based SSL library
(1:29:38) - Multi-modal
(1:37:06) - Active learning
(1:42:45) - Autonomous driving
(1:54:12) - Limits of deep learning
(1:58:19) - Difference between learning and reasoning
(2:03:26) - Building super-human AI
(2:11:14) - Most beautiful idea in self-supervised learning
(2:15:02) - Simulation for training AI
(2:18:27) - Video games replacing reality
(2:19:40) - How to write a good research paper
(2:24:08) - Best programming language for beginners
(2:25:01) - PyTorch vs TensorFlow
(2:28:26) - Advice for getting into machine learning
(2:30:31) - Advice for young people
(2:32:58) - Meaning of life
Transcript
The following is a conversation with Ishan Misra,
research scientist at Facebook AI Research,
who works on self-supervised machine learning
in the domain of computer vision.
Or, in other words, making AI systems
understand the visual world with minimal help from us humans.
Transformers and self-attention
have been successfully used by OpenAI's GPT-3
and other language models to do self-supervised learning in the domain of language.
Ishan, together with Yann LeCun and others, is trying to achieve the same success in the domain of images and video.
The goal is to leave a robot watching YouTube videos all night, and in the morning come back to a much smarter robot. I read the blog post, Self-Supervised Learning:
The Dark Matter of Intelligence, by Ishan and Yann LeCun,
and then listened to Ishan's appearance on the excellent
Machine Learning Street Talk podcast,
and I knew I had to talk to him.
By the way, if you're interested in machine learning
and AI, I cannot recommend the ML Street Talk podcast highly enough.
Those guys are great.
Quick mention of our sponsors,
Onnit, The Information,
Grammarly, and Athletic Greens.
Check them out in the description to support this podcast.
As a side note, let me say that for those of you
who may have been listening for quite a while,
this podcast used to be called the Artificial Intelligence Podcast, because my life passion has always been, and will always
be, artificial intelligence, both narrowly and broadly defined.
My goal with this podcast is still to have many conversations with world-class researchers
in AI, math, physics, biology, and all the other sciences.
But I also want to talk to historians, musicians, athletes, and of course occasionally comedians.
In fact, I'm trying out doing this podcast three times a week now to give me more freedom
with guest selection.
It may be a good chance to have a bit more fun.
Speaking of fun, in this conversation, I challenge the listener to count the number of times the word banana is mentioned. Ishan and I use the word banana as the canonical
example at the core of the hard problem of computer vision and maybe the hard problem
of consciousness. As usual I'll do a few minutes of ads now, no ads in the middle, I try
to make these interesting but I give you time stamps, so if you skip,
please check out the sponsors by clicking the links in the description.
It's the best way to support this podcast.
This episode is sponsored by Onnit, a nutrition supplement and fitness company.
They make Alpha Brain, which is a nootropic that helps support memory, mental speed, and focus.
I use it as a kind of super boost.
When I'm preparing for a deep work session,
when I know I'm gonna have to sit for two, three hours,
four hours thinking about a specific problem.
And when I know it's going to be something
that requires depth versus breadth.
So by depth, I mean, you're thinking about
a very narrow specific problem
and just thinking through it with a sheet of paper.
As opposed to sort of doing a lot of programming,
like jumping from one task to another
within a particular programming project.
So when you're thinking deep,
I gotta give myself that extra boost
of taking Alpha Brain, it's almost like a trigger
that now we're going to do some deep thinking.
It's like that movie Over the Top
with Sylvester Stallone.
I think he turns the cap.
And when he turns the cap,
he goes into that extra intense mode. It's an arm wrestling movie. That's how I
think about Alpha Brain. It's turning the cap for that extra level of intensity.
Anyway, go to lexfridman.com slash onnit to get up to 10% off Alpha Brain. That's lexfridman.com
slash onnit. This show is also sponsored by The Information. They do in-depth,
data-driven, investigative journalism in the world of technology. The Information is the first
place that made me realize that good journalism costs money. For most of my life, I was broke
until very recently. And I remember when I was broke, I mean, this was, like four or five years ago
when I first heard about the information. I remember even though I couldn't really afford it. I remember signing up anyway because I just love the depth
of the articles. I think the one that first pulled me in was probably related to Google
or Tesla, like a very in-depth study of some particular aspect. But I remember thinking
that this is the place to really explore a difficult topic and trust that the person is doing a really good job
It's not necessarily that you agree. It's they're going to really do thorough in depth journalism
And you can also trust that some of the biggest names in tech are also reading the information
So from that perspective, it's definitely beneficial to include the information as one of the things you read
Definitely worth it. Definitely a good example of why good journalism costs money.
Anyway, get 75% off your first month if you sign up at theinformation.com slash lex. That's theinformation.com slash lex.
I see it as a good way of supporting in-depth journalism. I hope you do as well.
This show is also sponsored by Grammarly, a writing assistant tool that checks spelling,
grammar, sentence structure, and readability. Grammarly Premium, the version you pay for,
offers a bunch of extra features. My favorite is the clarity check, which helps you detect
rambling chaos that many of us can descend into, and I certainly descend into as I try to ask a
question on the podcast for 10 minutes when the question could have been asked in a single sentence.
You should strive for that kind of clarity in your speech and in your writing, and I
definitely should as well.
In writing, in science, in mathematics, even in art, in life, I think simplicity is beautiful.
And in writing, I think simplicity is a skill that can be developed.
It's not just sort of an art.
It's also a science.
It's a skill that can be developed through rigorous practice of cutting.
Remove the things that are not necessary.
That process, I think, is really beneficial for people to improve their writing, but also
to improve their thinking.
I think those two are coupled.
And in general, for me, simplicity is a good guide.
Big words for me get in the way.
Anyway, Grammarly is available on basically any platform
on major sites and apps like Gmail and Twitter and so on.
Get 20% off Grammarly Premium by signing up
at Grammarly.com slash Lex.
That's 20% off at Grammarly.com slash Lex. This show is also sponsored by Athletic
Greens, the all-in-one daily drink to support better health and peak performance. It replaced
the multivitamin for me and went far beyond that with 75 vitamins and minerals. It's the
first thing I drink every morning. Now I drink two of them a day. I really enjoy the
taste. I really enjoy the fact that it provides a nutritional base for all the crazy dietary things I do. I'm still eating carnivore these days. It's good to know that I'm getting all the
vitamins that I need. Athletic Greens makes that super easy. They
keep iterating on different versions. They specialize in this one thing, so you can trust that it's
going to keep improving with the latest science. You don't have to think, you just drink the thing and you know that you're getting everything
you need.
Aside from Athletic Greens, I take electrolytes, basically salt, magnesium, potassium,
and I also take fish oil.
And actually, you'll get one month supply of fish oil when you sign up to athleticgreens.com
slash Lex.
So, you get everything.
That's athleticgreens.com slash Lex. This is
the Lex Fridman Podcast, and here is my conversation with Ishan Misra.
What is self-supervised learning? And maybe even give the bigger basics of what is supervised and semi-supervised learning,
and maybe why is self-supervised learning a better term than unsupervised learning?
Let's start with supervised learning.
So typically for machine learning systems, the way they're trained is you get a bunch of humans.
The humans point out particular concepts. So in the case of images, you want the humans
to come and tell you what is present in the image, draw boxes around them,
draw masks over the pixels which are of particular categories or not.
For NLP, again, there are lots of these particular tasks,
say about sentiment analysis, about entailment and so on.
So typically for supervised learning,
we get a big corpus of such annotated or label data.
Then we feed that to a system and the system is really trying to mimic.
So it's taking this input of the data and then trying to mimic the output.
So it looks at an image and the human has
tagged that this image contains a banana, and now the system is basically trying to mimic that. So that's its learning signal.
And so for supervised learning, we try to gather lots of such data and we train these
machine learning models to imitate this input-output mapping. And the hope is basically that by doing
so now on unseen or like new kinds of data, this model can automatically learn to predict
these concepts. So this is a standard sort of supervised setting.
For semi-supervised setting, the idea typically is that you have, of course,
all of the supervised data, but you have lots of other data which is unsupervised
or which is not labeled.
Now the problem basically with supervised learning and why you actually have all of these alternate
sort of learning paradigms is supervised learning just does not scale.
So if you look at computer vision, one of the largest and most
popular data sets is ImageNet.
Right. So the entire ImageNet data set has about 22,000 concepts and about 14
million images. So these concepts are basically just nouns and they're annotated
on images. And this entire data set was a mammoth data collection effort.
It actually gave rise to a lot of powerful learning algorithms.
It's credited with the rise of deep learning as well.
But this data set took about 22 human years to collect to annotate.
It's not even that many concepts.
It's not even that many images.
14 million is nothing.
You have about 400 million images or so,
or even more than that, uploaded to most of
the popular social media websites
today.
So now supervised learning just doesn't scale.
If I want to now annotate more concepts,
if I want to have these various types of fine-grained concepts,
then it won't really scale.
So now you come up to these sort of different learning
paradigms, for example, semi-supervised learning,
where the idea is, of course, you have this annotated
corpus of supervised data, and you have lots
of these unlabeled images.
And the idea is that the algorithm should basically try to measure some kind of consistency,
or really try to measure some kind of signal on this sort of unlabeled data
to make itself more confident about what it's really trying to predict.
So by getting access to lots of unlabeled data,
the idea is that the algorithm actually learns to be more confident and actually gets better
at predicting these concepts.
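As a concrete illustration of this "measure some consistency on unlabeled data" idea, here is a minimal sketch of one common semi-supervised recipe, consistency regularization. Ishan does not name a specific method here, so the model, data batches, and augment function below are assumed placeholders rather than any particular paper's algorithm.

```python
# A minimal sketch of consistency-based semi-supervised training (illustrative only).
# `model`, `labeled_batch`, `unlabeled_batch`, `augment`, and `optimizer` are assumed to exist.
import torch
import torch.nn.functional as F

def semi_supervised_step(model, labeled_batch, unlabeled_batch, augment, optimizer):
    images, labels = labeled_batch
    # Standard supervised loss on the small labeled set.
    supervised_loss = F.cross_entropy(model(images), labels)

    # Consistency loss on unlabeled data: two augmented views of the same images
    # should produce similar predictions -- this is the "signal" measured on unlabeled data.
    view_1, view_2 = augment(unlabeled_batch), augment(unlabeled_batch)
    log_p1 = F.log_softmax(model(view_1), dim=1)
    p2 = F.softmax(model(view_2), dim=1).detach()
    consistency_loss = F.kl_div(log_p1, p2, reduction="batchmean")

    loss = supervised_loss + consistency_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```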
And now we come to the other extreme, which is like self-supervised learning.
The idea basically is that the machine or the algorithm should really discover
concepts or discover things about the world or learn representations about the
world which are useful without access to explicit human supervision.
So the word supervision is still in the term self-supervised. So what is the supervision
signal? And maybe that perhaps is why Yann LeCun and you argue that unsupervised is the
incorrect terminology here. So what is the supervision signal when the humans aren't
part of the picture, or not a big part of the picture?
Right.
So self-supervised, the reason it has the term
supervised in itself is because you're using the data
itself as supervision.
So because the data serves as its own source of supervision,
it's self-supervised in that way.
Now, the reason a lot of people, I mean,
we did it in that blog post with Yann,
but a lot of other people have also
argued for using this term self-supervised.
So starting from like '94, from Virginia de Sa's group, I think she's now at UCSD.
Jitendra Malik has said this a bunch of times as well. So you have supervised. And
then unsupervised basically means everything which is not supervised, but that includes
stuff like semi supervised that includes other like transductive learning lots of other
sort of settings.
So, that's the reason, like, now people are preferring this term self-supervised, because it
explicitly says what's happening. The data itself is the source of supervision, and any sort of
learning algorithm which tries to extract these sort of supervision signals from the data itself
is a self-supervised learning algorithm. But there is within the data a set of tricks
which unlock the supervision.
So can you give me some examples?
And there's innovation and ingenuity
required to unlock that supervision.
The data doesn't just speak to you some ground truth.
You have to do some kind of trick.
So I don't know what your favorite domain is.
So you specifically specialize in visual learning,
but are there favorite examples, maybe in language
or other domains?
Perhaps the most successful applications
have been in NLP, language processing.
So the idea basically being that you can train models
where you have a sentence and you mask out certain words.
And now these models learn to predict the masked-out words.
So if you have the cat jumped over the dog,
so you can basically mask out cat.
And now you are essentially asking the model to predict
what was missing, what did I mask out?
So the model is going to predict basically
a distribution over all the possible words that it knows.
And probably, like if it's a well-trained model,
it has a sort of higher probability density for this word cat.
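To make the masking trick concrete, here is a minimal sketch of masked-word prediction on the "the cat jumped over the dog" example; the tiny vocabulary and the averaging model are toy simplifications for illustration, not how BERT or any real language model is built.

```python
# Minimal sketch of masked-word prediction: the supervision comes from the data itself.
import torch
import torch.nn as nn

vocab = ["<mask>", "the", "cat", "dog", "jumped", "over"]
word_to_id = {w: i for i, w in enumerate(vocab)}

sentence = ["the", "cat", "jumped", "over", "the", "dog"]
masked_pos = 1                                   # hide the word "cat"
target = torch.tensor([word_to_id["cat"]])

tokens = [word_to_id[w] for w in sentence]
tokens[masked_pos] = word_to_id["<mask>"]
tokens = torch.tensor(tokens).unsqueeze(0)       # shape: (1, sequence_length)

# Tiny model: embed each token, average the context, predict a distribution over the vocab.
embed = nn.Embedding(len(vocab), 16)
head = nn.Linear(16, len(vocab))
optimizer = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-2)

for step in range(100):
    context = embed(tokens).mean(dim=1)          # crude representation of the context
    logits = head(context)                       # scores over the whole vocabulary
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

After training, a well-trained model of this kind would put higher probability density on "cat" (and related words that fit the context) at the masked position.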
For vision, I would say the easier example, which is not as widely used these days, is basically, say, for example, video prediction. So video is, again, a sequence of things.
So you can ask the model, so if you have a video of, say, 10 seconds, you can feed in the first
9 seconds through a model and then ask it, hey, what happens basically in the 10 second?
Can you predict what's going to happen?
And the idea basically is because the model is predicting something about the data itself.
Of course, you didn't need any human to tell you what was happening, because the 10-second
video was naturally captured. And because the model is predicting what's happening there,
it's going to automatically learn something about the structure of the world, how objects
move, object permanence, and these kinds of things. So like, if I have something at the
edge of the table, it will fall down. Things like these which you really don't have to
sit and annotate. In a supervised learning setting, I would have to sit and annotate.
This is a cup. Now I move this cup. This is still a cup. Now I move this cup. It's still
a cup. And then it falls down. And this is a fallen-down cup. So I won't have to
annotate all of these things in a self-supervised setting. Isn't that kind of a brilliant little trick
of taking a series of data that is consistent
and removing one element in that series
and then teaching the algorithm to predict that element?
Isn't that, first of all, that's quite brilliant?
It seems to be applicable in anything
that has the constraint of being a sequence
that is consistent with the physical reality.
The question is, are there other tricks like this?
that can generate the self-supervision signal.
So sequences, possibly the most widely used one in NLP,
for vision, the one that is actually used for images,
which is very popular these days,
is basically taking an image and now taking different crops of that image.
So you can basically decide to crop, say, the top left corner,
and you crop, say, the bottom right corner,
and asking a network, basically presenting it with a choice, saying that,
okay, now you have this image, you have this image, are these the
same or not? And so the idea basically is that, like, in an image, different parts of
the image are going to be related. So for example, if you have a chair and a table, basically these
things are going to be close by. Versus if you have, like, a zoomed-in picture of a
chair, if you take different crops, it's going to be different parts of the chair. So the idea basically is that different crops of the image are related, and so the features or the representations that you get from these different crops should also be related.
So this is possibly the most widely used trick these days for self-supervised learning in computer vision.
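To make the crops trick concrete, here is a minimal sketch of taking two random crops of one image and measuring how related their features are; the encoder, crop size, and cosine similarity below are illustrative assumptions rather than a specific published recipe.

```python
# A minimal sketch of the two-crops idea: features of two crops from the same image
# should be close. `encoder` is any image network (e.g., a ResNet) returning features.
import torch
import torch.nn.functional as F
from torchvision import transforms

crop = transforms.Compose([
    transforms.RandomResizedCrop(224),      # take a random crop, resized to a fixed size
    transforms.ToTensor(),
])

def two_crop_similarity(encoder, pil_image):
    view_1 = crop(pil_image).unsqueeze(0)   # first random crop of the image
    view_2 = crop(pil_image).unsqueeze(0)   # second random crop of the same image

    z1 = F.normalize(encoder(view_1), dim=1)
    z2 = F.normalize(encoder(view_2), dim=1)

    # Training would push this similarity up, i.e., pull the two features together.
    return F.cosine_similarity(z1, z2).mean()
```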
So again, using the consistency that's inherent to physical reality in the visual domain.
That's, you know, parts of an image are consistent.
And then in the language domain, or anything that has sequences,
like language or something that's like a time series, you can chop off parts
in time.
Right.
Similar to the story of RNNs and convnets.
You and Yann LeCun wrote the blog post
in March 2021 titled, Self-Supervised Learning,
The Dark Matter of Intelligence.
Can you summarize this blog post
and maybe explain the main idea or set of ideas?
The blog post was mainly about sort of distilling this idea.
I mean, this is really an accepted fact,
I would say, for a lot of people now, that self-supervised learning is something that is going to play an important
role for machine learning algorithms that come in the future, and even now. Oh, let me just comment that
we don't yet have a good understanding of what dark matter is.
The metaphor doesn't exactly transfer, but maybe it actually perfectly
transfers, in that we don't know.
We have an inkling that it'll be a big part of whatever solving intelligence looks
like.
Right.
So I think self-supervised learning the way it's done right now is I would say like the
first step towards what it probably should end up learning or what it should enable us
to do.
Yeah.
So the idea for that particular piece was,
self-supervised learning is going to be a very powerful way
to learn common sense about the world,
or like stuff that is really hard to label.
For example, is this piece over here heavier than the cup?
Now, for all these kinds of things,
you'll have to sit and label these things.
So supervised learning is clearly not going to scale.
So what is the thing that's actually going to scale?
It's probably going to be an agent
that can either actually interact with it,
so lift it up or observe me doing it.
So if I am basically lifting these things up,
it can probably reason about,
hey, if this is taking more time to lift up,
or the velocity is different,
whereas the velocity for this is different,
probably this one is heavier.
So essentially by observations of the data,
you should be able to infer a lot of things about the world without someone explicitly telling
you, this is heavy, this is not, this is something that can pour, this is something that
cannot pour, this is somewhere that you can sit, this is not somewhere that you can sit.
But you've just mentioned the ability to interact with the world. There are so many questions
that are still open, which is: how do you select a set of data over which the self-supervised learning process works?
How much interactivity, like in the active learning or the machine teaching context,
is there? What are the reward signals? Like, how much actual interaction is there with the physical world,
that kind of thing.
Right.
So that's a, that could be a huge
question. And then on top of that, which I have a million questions about, which we don't know the
answers to, but it's worth talking about, is this how much reasoning is involved, how much
accumulation of knowledge versus something that's more akin to learning or whether that's the same thing.
But so it is truly dark matter.
We don't know how exactly to do it.
But a lot of us are convinced
that it's going to be a major thing in machine learning.
Let me reframe it then,
that human supervision cannot be at large scale
the source of the solution to intelligence.
Right.
So the machines have to discover the supervision in the natural signal of the world.
Right.
I mean, the other thing is also that humans are not particularly good labelers.
They're not very consistent.
For example, like, what's the difference between a dining table and a table?
Is it just the fact that one, like if you just look at a particular table,
what makes us say one is dining table
and the other is not?
Humans are not particularly consistent,
they're not like very good sources of supervision
for a lot of these kind of edge cases.
So it may also be the fact that if we want
an algorithm or want a machine
to solve a particular task for us,
we can maybe just specify the end goal.
And like the stuff in between, we really probably should not be specifying, because we're
maybe going to confuse it a lot, actually. Well, humans can't even answer the meaning of life.
So I'm not sure if we're good supervisors of the end goal either. So let me ask you about categories.
Humans are not very good at telling the difference between what is and isn't a table, like you mentioned.
Do you think it's possible?
Let me ask you, like, pretend you're Plato.
Is it possible to create a pretty good taxonomy of objects in the world?
It seems like a lot of approaches in machine learning kind of assume a hopeful vision that
it's possible to construct
the perfect taxonomy.
Or it exists, perhaps out of our reach, but we can always get closer and closer to it.
Or is that a hopeless pursuit?
I think it's hopeless in some way.
So the thing is for any particular categorization that you create, if you have a discrete sort
of categorization, I can always take the nearest two concepts or I can take a third concept
and I can blend it in and I can create a new category. So if you were to enumerate N categories,
I will always find an N plus one category for you. That's not going to be in the N categories.
And I can actually create not just N plus one, I can very easily create far more than N categories.
The thing is, a lot of things we talk about are actually compositional. So it's really hard for us
to come and sit and enumerate all of these out. And they compose in various weird ways,
right? Like you have like a croissant and a donut come together to form a
cronut. So if you were to like enumerate all the foods up until, I don't
know, whenever the cronut was invented, about 10 years ago or 15 years ago,
then this entire thing called a cronut would not exist.
Yeah, remember there is the most awesome video of a cat wearing a monkey costume.
Yeah.
Yes.
Yes.
People should look it up.
It's great.
So is that a monkey or is that a cat?
It's a very difficult philosophical question.
So there is a concept of similarity between objects.
So you think that can take us very far,
just kind of getting a good function, a good way to tell
which parts of things are similar and which parts of things are very different.
I think so, yeah. So, you don't necessarily need to name everything or assign a name to
everything to be able to use it, right? So, there are like lots of...
Shakespeare said that, what's in a name? What's in a name? Yeah, okay.
And I mean, lots of, for example, animals, right? They don't necessarily have a well-formed, like, syntactic language,
but they're able to go about their day perfectly well.
The same thing happens for us so I mean we probably look at things and we figure out oh this is similar to something else that I've seen before and then I can
probably learn how to use it.
So I haven't seen all the possible door knobs in the world. But if you show me, like I was able to
get into this particular place fairly easily, I've never seen that particular door knob. So, I
of course, related it to all the door knobs that I've seen, and I don't know exactly how it's going to open,
but I have a pretty good idea of how it's going to open. And I think this kind of translation
between experiences only happens because of similarity, because I'm able to relate it to a door now. If I relate it to a hairdryer,
I would probably be stuck still outside, not able to get in. Again, a bit of a philosophical question,
but can similarity take us all the way to understanding a thing? Can having a good function
that compares objects get us to understand something profound
about singular objects?
I think I'd ask you a question back.
What does it mean to understand objects?
Well, let me tell you what that's similar to.
No.
So there's an idea of sort of reasoning by analogy kind of thing.
I think understanding is the process of placing that thing
in some kind of network of knowledge that you have.
That it perhaps is fundamentally related to other concepts.
So it's like understanding is fundamentally captured
by composition of the concepts,
and maybe in relation to other concepts.
And maybe deeper and deeper understanding
is maybe just adding more edges to that graph somehow.
So maybe it is a composition of similarities.
I mean, ultimately, I suppose it is a kind of embedding
in that wisdom space. Yeah.
Okay, wisdom space is good.
I think I do think, right?
So similarity does get you very, very far.
Is it the answer to everything?
I mean, I don't even know what everything is, but it's going to take us really far.
And I think the thing is, things are similar in very different contexts, right?
So an elephant is similar to, I don't know,
another sort of wild animal.
Let's just pick lion in a different way
because they're both four-legged creatures.
They're also wild animals.
But of course, they're very different
in a lot of different ways.
So elephants are, like, herbivores, lions are not.
So similarity, and particularly
dissimilarity, also actually helps us understand a lot about things.
And so that's actually why I think
discrete categorization is very hard.
Just like forming this particular category of elephant
and a particular category of lion.
Maybe it's good for just like taxonomy,
biological taxonomies.
But when it comes to other things which are not as maybe,
for example, like grilled cheese, right?
I have a grilled cheese, I dip it in tomato,
and I keep it outside.
Now, is that still a grilled cheese,
or is that something else?
Right, so categorization is still very useful
for solving problems.
But is your intuition then that the self-supervised part
should be, to borrow Yann LeCun's terminology,
should be the cake, and then categorization, the classification,
maybe the supervised layer, should be just like the thing on top, the cherry or the icing or
whatever. So if you make it the cake, it gets in the way of learning?
If you make it the cake,
then we won't be able to sit and annotate everything. That's as simple as it is. Like,
that's my very practical view on it. It's just, I mean, in my PhD, I sat down and annotated
like a bunch of cars for one of my projects.
And very quickly, I was just like,
it was in a video and I was basically drawing boxes
on all these cars.
And I think I spent about a week doing all of that
and I barely got anything done.
And basically, this was, I think, my first year
of my PhD, or like the second year of my master's.
And then by the end of it, I'm like, okay, this is just hopeless, I can't keep doing it. And when I'd done that, someone came up to me and they basically told me, oh, this is a pickup truck,
this is not a car. And that's like, aha, this actually makes sense because a pickup truck is not
really like, what was I annotating? Was I annotating anything that is mobile? Or was I annotating
particular sedans, or was I annotating SUVs? What was I doing? By the way, the annotation was
bounding boxes? There are so many deep, profound questions here that you're almost
cheating your way out of by doing self-supervised learning, by the way, which is,
like, what makes for an object. I suppose to solve intelligence, maybe you don't
ever need to answer that question. I mean, this is the question that anyone that's ever done annotation, because it's so painful,
gets to ask, like, why am I drawing a very careful line around this object?
Like, what is the value?
I remember when I first saw semantic segmentation, where you have, like, instance segmentation, where
you have a very exact line around the object in a 2D plane of a fundamentally 3D object projected
onto a 2D plane.
So you're drawing a line around a car that might be occluded, there might be another thing
in front of it, but you're still drawing the line of the part of the car that you see.
How is that the car?
Why is that the car?
Like I had like an existential crisis every time.
Like, how is that going to help us understand or solve computer vision?
I'm not sure I have a good answer to what's better.
And I'm not sure I share the confidence that you have that self-supervised learning can
take us far. I think I'm more and more convinced that it's a very important component,
but I still feel like we need to understand what makes, like, this dream of maybe what it's called, like, symbolic AI, arrive: once you have this common sense base,
be able to play with these concepts and build graphs
or hierarchies of concepts on top,
in order to then form a deep sense of this three-dimensional world,
or four-dimensional world,
and be able to reason,
and then project that onto a 2D plane
in order to interpret the 2D image.
Let me ask you just an out there question.
I remember, I think Andrej Karpathy had a blog post
about computer vision, like, being really hard.
I forgot what the title was, but it was many, many years ago,
and he had, I think, President Obama
is stepping on a scale and there's humor
and there's a bunch of people laughing and whatever.
And there's a lot of interesting things about that image.
And I think Andrej highlighted a bunch of things
about the image that us humans are able to immediately
understand, like the idea, I think, of gravity
and that you can, you have the concept of a weight.
You immediately project because of our knowledge of pose and how human bodies are constructed,
you understand how the forces are being applied with the human body. They're really interesting.
Other things that you're able to understand is multiple people looking at each other in the image.
You're able to have a mental model of what the people are thinking about.
You're able to infer, like,
oh, this person probably
is laughing at how humorous the situation is.
And this person is confused about what the situation is
because they're looking this way.
We're able to infer all of that.
So that's human vision.
How difficult is computer vision?
Like in order to achieve that level of understanding.
And maybe how big of a part can self-supervised learning
play in that, do you think?
And do you still, you know, back, that was like over a decade ago,
I think Andrej and I think a lot of people agreed that computer
vision is really hard.
Do you still think computer vision is really hard?
I think it is.
Yes.
And getting to that kind of understanding, I mean, it's really out there.
So if you ask me to solve just that particular problem, I can do it the supervised learning
route.
I can always construct a dataset and basically predict, oh, is there humor in this or not? And of course I can do it.
Actually, that's a good question.
Do you think you can, okay, okay.
Do you think you can do human supervised annotation of humor?
To some extent, yes.
I'm sure it'll work.
I mean, it won't be as bad as, like, random
guessing. I'm sure it can still predict whether it's
humorous or not in some way.
Yeah, maybe like Reddit upvotes is the signal.
I don't know.
I mean, it won't do a great job, but it'll do something. It may actually, like, it may find certain things which are not humorous to be humorous as well,
which is going to be bad for us, but I mean, it'll do it, it won't be random. Yeah, kind of like my sense of humor.
Okay, so fine, so you can solve that particular problem. Yes. But the general problem, you're saying, is hard. The general problem is hard.
And I mean self-supervised learning is not the answer to everything.
Of course, it's not.
I think if you have machines that are going
to communicate with humans at the end of it,
you want to understand what the algorithm is doing.
You want it to be able to produce an output that you can decipher,
that you can understand.
Or it's actually useful for something else, which again,
is a human.
So at some point in this sort of entire loop, a human steps in.
And now this human needs to understand what's going on.
So at that point, this entire notion of language
or semantic really comes in.
If the machine just spits out something
and if we can't understand it,
then it's not really that useful for us.
So self-supervised learning is probably going to be useful
for a lot of the things before that part.
Before the machine really needs to communicate a particular kind of output with a human.
Because I mean, otherwise, how is it going to do that without language?
Or some kind of communication, but you're saying that it's possible to build a big base
of understanding or whatever of concepts, the common sense concepts. Self-supervised learning in the context of computer vision
is something you've focused on,
but that's a really hard domain
and it's kind of the cutting edge
of what we're as a community working on today.
Can we take a little bit of a step back
and look at language?
Can you summarize the history of success
of self-supervised learning
in natural language processing, language modeling,
what are transformers, what is the masking, the sentence completion that you mentioned before,
how does it lead us to understand anything, semantic meaning of words, syntactic role of
words and sentences?
So, I'm of course not the expert in NLP. I kind of follow it a little bit from the sides.
So the main reason why all of this masking stuff works is I think it's called the
distributional hypothesis in NLP. The idea basically being that words that occur in the same context
should have similar meaning. So if you have the blank jumped over the blank, it basically,
whatever is like in the first blank is basically an
object that can actually jump is going to be something that can jump. So a cat or a
dog or I don't know sheep, something, all of these things can basically be in that particular
context. And now so essentially the idea is that if you have words that are in the same
context and you predict them, you're going to learn lots of useful things about how words
are related because you're predicting by looking at their context what the word is going to be. So in this particular case,
the blank jumped over the fence. So now if it's a sheep, the sheep jumped over the fence,
the dog jumped over the fence. So essentially the algorithm or the representation basically
puts these two concepts together. So it says, okay, dogs are going to be kind
of related to sheep, because both of them occur in the same context.
Of course, now you can decide depending on your particular application downstream. You can say that dogs are absolutely not related to sheep because well, I don't, I really care about,
you know, dog food, for example. I'm a dog food person and I really want to give this dog food to this particular animal.
So depending on what your downstream application is, of course, this notion of similarity or this notion or this common sense that you've learned may not be applicable.
But the point is basically that this, just predicting what the blanks are, is going to
take you really, really far.
So there's a nice feature of language that the number of words in a particular language
is very large, but it's finite and it's actually not that large in the grand scheme of things.
I still got it because we take it for granted. So first of all, when you say masking,
you're talking about this very process of the blank, of removing words from a sentence,
and then having the knowledge of what word went there in the initial data set, that's the ground
truth that you're training on, and then you're asking
the neural network to predict what goes there.
That's like a little trick.
It's a really powerful trick.
The question is how far that takes us and the other question is, is there other tricks?
Because to me, it's very possible there's other very fascinating tricks.
I'll give you an example.
In autonomous driving, there's a bunch of tricks that give you the self-supervised signal
back. For example, very similar to sentences, but not really, which is you have signals from
humans driving the car because a lot of us drive cars to places.
And so you can ask the neural network to predict
what's going to happen the next two seconds
for a safe navigation through the environment.
And the signal comes from the fact
that you also have knowledge of what happened
in the next two seconds because you have video of the data.
The question in autonomous driving,
as it is in language,
can we learn how to drive autonomously
based on that kind of self-supervision?
Probably the answer is no.
The question is, how good can we get?
And the same with language, how good can we get?
And are there other tricks?
Like we get sometimes super excited by this trick
that works really well, but I wonder,
it's almost like mining for gold.
I wonder how many signals there are in the data
that could be leveraged that are like there.
Is that, I just wanted to kind of linger on that
because sometimes it's easy to think that maybe
this masking process is
self-supervised learning. No, it's only one method. So there could be many, many other methods,
many tricky methods, maybe interesting ways to leverage human computation in very interesting ways
that might actually border on semi-supervised learning, something like that. Obviously, the internet is generated by humans at the end of the day.
So all that to say is, what's your sense in this particular context of language,
how far can that masking process take us?
So it has to do the test of time, right?
I mean, so word2vec, the initial sort of NLP technique that was using this,
to now, for example, like all the
BERT and all these big models that we get, BERT and RoBERTa, for example. All of them
are still sort of based on the same principle of masking. It's taken us really far. I mean,
you can actually do things like, oh, these two sentences are similar or not, whether this
particular sentence follows this other sentence in terms of logic, so entailment. You can
do a lot of these things with just this masking trick. So I'm not sure if I can predict how far it can take us because
when it first came out, when word2vec was out, I don't think a lot of us would have imagined
that this would actually help us do some kind of entailment problems and really that well.
And so just the fact that by just scaling up the amount of data that we're training on and like using better and more powerful neural network architectures has taken us
from that to this, is just showing you how poor predictors we are, like, as humans, how poor
we are at predicting how successful a particular technique is going to be. So I think I can say something
now, but like 10 years from now, I'll look completely stupid basically predicting this.
In language domain, is there something in your work that you find useful and insightful
and transferable to computer vision, but also just, I don't know, beautiful and profound
that I think carries through to the vision domain?
I mean, the idea of masking has been very powerful.
It has been used in vision as well for predicting, like you say, if you have
a sequence of frames, you predict what's going to happen in the next frame.
So that's been very powerful.
In terms of modeling, in terms of architecture,
I think you had asked about transformers a while back.
That has really become, like, it has become super exciting for computer vision now.
Like in the past, I would say, year and a half, it's become really powerful.
What's the transformer?
Right.
I mean, the core part of a transformer is something called the self-attention model.
So it came out of Google.
And the idea basically is that if you have n elements, what you're creating is a way for
all of these n elements to talk to each other.
So the idea basically is that you are paying attention.
Each element is paying attention to each of the other element.
And basically by doing this, it's really trying to figure out,
you're basically getting a much better view of the data.
So for example, if you have a sentence of four words,
the point is if you get a representation or a feature for this entire sentence,
it's constructed in a way such that each word has paid attention to everything
else. Now, the reason it's different from, say, what you would do in a convnet is basically
that in the convnet, you would only pay attention to a local window. So each word would only
pay attention to its next neighbor, or like one neighbor after that. And the same thing
goes for images. In images, you would basically pay attention to pixels in a three cross three
or a seven cross seven neighborhood. And that's it. Whereas with the transformer, the self-attention, the sort of idea is that each element needs to pay attention to every other
element. And when you say attention, maybe another way to phrase that is you're considering a
context
Wide context in terms of the wide context of the sentence in understanding the meaning of a particular word and a computer vision that's understanding a large context to understand the local pattern of a particular local part of an image.
Right. So basically if you have say again a banana in the image, you're looking at the full image first. So whether it's like, you know, you're looking at all the pixels that are of a kitchen, of a dining table, and so on. And then you're basically looking at the banana also.
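To make the "every element attends to every other element" idea concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation inside a transformer; the sequence length, dimensions, and random weights are toy values for illustration only.

```python
# A minimal sketch of self-attention: every element looks at every other element.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (sequence_length, dim) -- e.g., word embeddings or image patch features.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every element scores its compatibility with every other element.
    scores = q @ k.t() / math.sqrt(k.shape[-1])    # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)        # attention over the full context
    return weights @ v                             # each output mixes all elements

seq_len, dim = 4, 8                                # e.g., a four-word sentence
x = torch.randn(seq_len, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)             # (4, 8): each word has seen all four words
```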
Yeah. By the way, in terms of if we were to train the funny classifier, there's something funny
about the word banana. Just to anticipate, I am wearing a banana shirt. So yeah.
Are there bananas on it? Okay. So masking has worked for the vision context as well.
And so this transformer idea has worked as well.
So basically looking at all the elements to understand a particular element.
It has been really powerful in vision.
The reason is, like, a lot of things are ambiguous when you're looking at them in isolation.
So if you look at just a blob of pixels, so Antonio Torralba at MIT used to have this like really famous image,
which I looked at when I was a PhD student, where he would basically have a blob of pixels and he would ask you, hey, what is this? And it looked
basically like a shoe or like it could look like a TV remote, it could look like anything
and it turns out it was a beer bottle. But I'm not sure it was one of these three things,
but basically he showed you the full picture and then it was very obvious what it was. But
the point is just by looking at that particular local window, you couldn't figure out. Because
of resolution, because of other things,
it's just not easy always to just figure out
by looking at just the neighborhood of pixels,
what these pixels are.
Yeah.
And the same thing happens for language as well.
For the parameters that have to learn something about the data,
you need to give it the capacity to learn these central things.
Like if it's not actually able to receive the signal at all,
it's not going to be able to learn that signal.
And to understand languages, to understand language,
you have to be able to see words in their full context.
OK, what is harder to solve vision or language?
Visual intelligence or linguistic intelligence?
So I'm going to say computer vision is harder.
My reason for this is basically that language, of course,
has a big structure to it because we developed it.
Whereas vision is something that is common in a lot of animals.
Everyone is able to get by. A lot of these animals on
Earth are actually able to get by without language.
And a lot of these animals, we also deem to be intelligent.
So clearly intelligence does have a visual component to it.
And yes, of course, in the case of humans, it of course also has a linguistic component.
But it means that there is something far more fundamental about vision than there is about language.
And I'm sorry to anyone who disagrees, but yes, this is what I feel.
So is that being a little bit reflected in the challenges that have to do with the progress of self-supervised learning, would you say?
Or is that just the peculiar accidents of the progress of the AI community
that we focused on?
Or we discovered self-attention and transformers in the context of language first?
So, like, the self-supervised learning success for vision actually
has not much to do with the transformers part.
I would say it's actually been independent a little bit.
I think it's just that the signal was a little bit different for vision than there
was for, like, NLP, and probably NLP folks discovered it before. So for vision, the main success has
basically been this like crops so far, like taking different crops of images. Whereas for NLP,
it was this masking thing. But also the level of success is still much higher for language.
It has. So that has a lot to do with.
I mean, I can get into a lot of details for this particular question.
Let's go for it. Okay.
So the first thing is language is very structured.
So you are going to produce a distribution over a finite vocabulary.
English has a finite number of words.
It's actually not that large.
And you need to produce basically when you're doing this masking thing, all you need to do is basically tell me which one of these like 50,000 words it is
Yeah, that's it
Now for vision, let's imagine doing the same thing
Okay, we're basically going to blank out a particular part of the image and we asked the network or this neural network to predict
What is present in this missing patch
It's combinatorially large, right? You have 256 pixel values
If you're even producing basically a 7 cross 7 or
a 14 cross 14 window of pixels,
at each of these 196 locations,
you have 256 values to predict.
Yeah.
So it's really, really large.
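As a rough back-of-the-envelope illustration of the sizes being compared here (assuming grayscale pixels and the roughly 50,000-word vocabulary mentioned above):

```python
# A rough comparison of the two prediction problems: one word out of a finite vocabulary
# versus one patch out of all possible pixel patterns (grayscale assumed for simplicity).
import math

vocab_size = 50_000                    # masked-word prediction: pick one of ~50,000 words
pixel_values = 256                     # possible intensities per pixel
patch_locations = 14 * 14              # a 14 x 14 window of pixels = 196 locations

# Number of distinct patches a model would have to distribute probability over.
log10_patches = patch_locations * math.log10(pixel_values)
print(f"words to choose from: {vocab_size:,}")
print(f"possible 14x14 patches: about 10^{log10_patches:.0f}")   # about 10^472
```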
Very quickly, the kind of prediction problems that we are setting up,
are going to be extremely intractable for us. So the thing is for NLP, it has been really successful
because we are very good at predicting like doing this like distribution over a
finite set. And the problem is when this set becomes really large, we are
going to become really, really bad at making these predictions. And at solving
basically this particular set of problems. So if you were to do it exactly in the same way
as NLP for vision, there is very limited success.
The way stuff is working right now
is actually not by predicting these masks.
It's basically by saying that you take these two crops
from the image, you get a feature representation from it
and just saying that these two features,
so they're vectors, just saying that the distance
between these vectors should be small.
And so it's a very different way of learning
from the visual signal than there is from NLP.
The other reason is the distribution hypothesis
that we talked about for NLP, right?
So a word given its context, basically the context actually
supplies a lot of meaning to the word.
Now, because there are just finite number of words,
and there is a finite way in which we compose them.
Of course, the same thing holds for pixels, but in language, there's a lot of structure.
I always say, whatever, the blank jumped over the fence, for example.
There are lots of these sentences that you'll get.
And from this, you can actually look at this particular sentence might occur in a lot of different contexts as well.
This exact same sentence might occur in a different context. So the sheep jumped over the fence,
the cat jumped over the fence, the dog jumped over the fence. So you immediately get a lot of these
words, which are because this particular token itself has so much meaning, you get a lot of these
tokens or these words which are actually going to have sort of this related meaning across,
given this context. Whereas for vision, it's much harder.
Because just by the way we capture images,
lighting can be different.
There might be different noise in the sensor.
So the thing is you're capturing a physical phenomenon
and then you're basically going through a very complicated
pipeline of image processing and then you're
translating that into some kind of digital signal.
Whereas with language, you write it down and you transfer
it to a digital signal almost like it's a lossless like transfer. And each of these tokens
are very, very well defined. There could be a little bit of an argument there, because language
as written down is a projection of thought. This is one of the open questions:
if you perfectly can solve language,
are you getting close to being able to,
easily, with flying colors, pass the Turing test kind of thing?
So that's, it's similar, but different.
The computer vision problem, in the 2D plane,
is a projection of a three-dimensional world. So perhaps there are similar problems there.
Maybe this is a good thing.
I think what I'm saying is NLP is not easy. Of course, don't get me
wrong. Like abstract thought expressed in knowledge, or knowledge
basically expressed in language, is really hard to understand.
I mean, we've been communicating with language for so long, and it's,
it is of course a very complicated concept. The thing is, at least being able to solve
some kind of reasonable tasks with language,
I would say, is slightly easier than it is with computer vision.
Yeah, I would say, yeah, so that's well put.
I would say getting impressive performance on language
is easier.
I feel like for both language and computer vision,
there's going to be this wall of like,
like this hump you have to overcome
to achieve super human level performance
or human level performance.
And I feel like for language,
that wall is farther away.
So you can get pretty nice,
you can do a lot of tricks.
You can show really impressive performance. You can even fool people that your tweeting or
your blog post writing or your question answering has intelligence behind it. But to truly
demonstrate understanding of dialogue, of continuous long-form dialogue, that would
require perhaps big breakthroughs.
In the same way in computer vision, I think the big breakthroughs need to happen earlier
to achieve impressive performance.
This may be a good place to, you already mentioned it, but what is contrastive learning
and what are energy-based models?
Contrastive learning is sort of a paradigm of learning where the idea is that you are
learning this embedding space, or so you're learning this sort of vector space of all your
concepts. And the way you learn that is basically by contrasting. So the idea is that you have
a sample, you have another sample that's related to it. So that's called the positive.
And you have another sample that's not related to it.
So that's negative.
So for example, let's just take NLP, or
a simple example in computer vision.
So you have an image of a cat,
you have an image of a dog,
and for whatever application that you're doing,
say you're trying to figure out what pets are,
you're saying that these two images are related.
So an image of a cat and a dog are related.
But now you have another third image of a banana, because you like that
word. So now you basically have this banana. Thank you for speaking to the crowd. And so you take
both of these images, and you take the image from the cat, the image from the dog, you get a feature
from both of them. And now what you're training the network to do is basically pull both of these
features together, while pushing them away from
the feature of a banana. So this is the contrastive part. So you're contrasting against the banana.
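As a rough illustration of this pull-together, push-apart idea, here is a minimal sketch of a generic InfoNCE-style contrastive loss; the feature dimension, temperature, and the stand-in cat, dog, and banana features are illustrative assumptions, not a specific published method.

```python
# A minimal contrastive loss: pull the anchor toward the positive (cat toward dog),
# push it away from the negative (banana).
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negative, temperature=0.1):
    # anchor/positive/negative: feature vectors of shape (dim,), e.g., from an image encoder.
    anchor, positive, negative = (F.normalize(v, dim=0) for v in (anchor, positive, negative))
    pos_sim = anchor @ positive / temperature   # similarity to the related sample (dog)
    neg_sim = anchor @ negative / temperature   # similarity to the unrelated sample (banana)
    # Cross-entropy over [positive, negative]: maximize the positive similarity,
    # minimize the negative one.
    logits = torch.stack([pos_sim, neg_sim]).unsqueeze(0)
    return F.cross_entropy(logits, torch.tensor([0]))

cat, dog, banana = torch.randn(3, 128)          # stand-in features for the three images
loss = contrastive_loss(cat, dog, banana)
```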
So there's always this notion of a negative and a positive. Now, energy-based models are
one way that Yann sort of explains a lot of these methods. So Yann basically, I think a couple
of years ago or more than that, like when I joined Facebook,
Yann used to keep mentioning this word,
energy-based models.
And of course, I had no idea what he was talking about.
So then one day I caught him in one of the conference rooms
and I'm like, can you please tell me what this is?
So then, like, very patiently,
he sat down with like a marker and a whiteboard.
And his idea basically is that rather than talking
about probability distributions,
you can talk about energies of models.
So a model is trying to minimize certain energies in a certain space, or it's trying to maximize a certain kind of energy.
And the idea basically is that you can explain a lot of the contrastive models, GANs, for example, which are generative adversarial networks, a lot of these modern learning methods, or VAEs, which are variational autoencoders. You can really explain them very nicely in terms of an energy function that they're trying to minimize or maximize. And so by putting this common sort of language on all of these models, what looks very different in machine learning, that VAEs are very different from what GANs are, are very different from what contrastive models are, you actually get a sense of, oh, these are actually very, very related.
It's just that the way or the mechanism in which they're
sort of maximizing or minimizing this energy function
is slightly different.
So revealing the commonalities between all these approaches
and putting a sexy word on top of it, like energy.
And so similarity, two things that are similar
have low energy, like the low energy signifying similarity.
Right.
Exactly.
So basically, the idea is that if you were to imagine
like the embedding as a manifold, a 2D manifold, you would get a hill or like a high sort of peak
in the energy manifold, wherever two things are not related. And basically you would have like a
dip where two things are related. So you would get a dip in the manifold. And in the self-supervised
context, how do you know two things are related and two things
are not related?
Right.
So this is where all the sort of ingenuity or tricks comes in, right?
So for example, you can take the fill-in-the-blank problem, or you can take the context problem. And what you can say is two words that are in the same context are related. Two words that are in different contexts are not related.
For images, basically two crops from the same image are related,
and whereas a third image is not related at all.
For a video, it can be that two frames from that video are related, because they're likely to contain the same sort of concepts in them, whereas a third frame from a different video is not related.
So it basically is, it's a very general term. Contrastive learning has nothing really to do with self-supervised learning. It actually is very popular in, for example, any kind of metric learning or any kind of embedding learning. So it's also used in supervised learning. And the thing is, because we are not really using labels to get these positive or negative pairs, it can basically also be used for self-supervised learning.
So you mentioned one of the ideas in the vision context that works is to have different crops.
So you could think of that as a way to sort of manipulating the data to generate examples that are
similar. Obviously, there's a bunch of other techniques. You mentioned lighting as a very,
you know, in images lighting is something that varies a lot, and you can artificially
change those kinds of things. There's the whole broad field of data augmentation,
which manipulates images in order to increase arbitrarily the size of the data set.
First of all, what is data augmentation? And second of all, what's the role of data augmentation
in self-supervised learning and contrastive learning? So data augmentation is just a way, like you said,
it's basically a way to augment the data.
So you have say n samples,
and what you do is you basically define some kind of transforms for the sample.
So you take your say image,
and then you define a transform where you can just increase the colors,
like the colors or the brightness of the image,
or increase or decrease the contrast of the image, for example,
or take different crops of it. So data augmentation is just a process to basically
perturb the data or augment the data. And so it has played a fundamental role for computer vision
for self-supervised learning especially. The way most of the current methods work, contrastive or otherwise, in the case of images, is by taking an image and then computing basically two perturbations of it. So these can be two different crops of the image with different types of lighting, or different contrast, or different colors. So you jitter the colors a little bit, and so on.
And now the idea is basically because it's the same object, or because it's related concepts in both of these perturbations,
you want the features from both of these perturbations to be similar.
So now you can use a variety of different ways to enforce this constraint,
like these features being similar.
You can do this by contrastive learning. So basically both of these things are positives, and a third sort of image is a negative.
You can do this basically by like clustering.
For example, you can say that both of these images
should, the features from both of these images
should belong in the same cluster,
because they're related.
Whereas another image should belong
to a different cluster.
So there's a variety of different ways
to basically enforce this particular constraint.
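A minimal sketch of the two-perturbation pipeline described above, assuming torchvision-style transforms and a simple cosine-similarity objective as one way to say "the features from both perturbations should be similar"; the encoder is a placeholder, not the actual architecture discussed.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# One random "perturbation" recipe: crop, color jitter, blur.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_view_loss(encoder, pil_image):
    """Encode two augmentations of the same image and pull their features together."""
    view1 = augment(pil_image).unsqueeze(0)   # first perturbation
    view2 = augment(pil_image).unsqueeze(0)   # second, different random perturbation
    z1, z2 = encoder(view1), encoder(view2)
    # One simple way to enforce "these features should be similar".
    return -F.cosine_similarity(z1, z2, dim=-1).mean()
```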
By the way, when you say features, it means there's a very large neural network that's extracting patterns from the image, and the kind of patterns that it extracts should be either identical or very similar.
Right.
That's what that means.
So the neural network basically takes in the image and then outputs a set of like basically a vector of like numbers.
And that's the feature.
And you want this feature for both of these different crops that you computed to be similar.
So you want this vector to be identical in its entries, for example.
Be literally close in this multidimensional space.
Right.
And like you said, close can mean part of the same cluster, something like that in this large space.
First of all, I wonder if there is connection to the way humans learn to this,
almost like maybe subconsciously,
in order to understand a thing,
you kind of have to see it from two,
three multiple angles.
I wonder, I have a lot of friends who are on the neuroscience side, and maybe the cognitive science side; I wonder if that's in there somewhere. Like in order for us to place a concept in its proper place, we have to basically crop it in all kinds of ways, do basic data augmentation on it in whatever very clever ways that the brain likes
to do, like spinning around in our mind somehow that that is very effective.
So I think for some of them, we do need to do it. So babies, for example, pick up objects, move them around, and whatnot.
Yeah.
But for certain other things, actually, we are good at imagining it as well.
Right.
So, for example, I have never seen an elephant from the top; I've never basically looked at it from top down.
Yeah.
But if you showed me a picture of it, I could very well tell you that that's an elephant.
So I think some of it will just naturally be built in, or transferred from other objects that we've seen, to imagine what it's going to look like.
Has anyone done that with the augmentation?
Like imagine all the possible things that are occluded
or not there, but not just like normal things, like wild things, but they're
nevertheless physically consistent.
So, I mean, people do kind of do occlusion-based augmentation as well. So you place in a random gray box to sort of mask out a certain part of the image. And the thing is, basically, you're kind of occluding it. For example, you place it on half of a person's face, so basically their nose is occluded, because it's grayed out.
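A small sketch of this kind of occlusion-style augmentation, assuming torchvision's RandomErasing as one off-the-shelf way to gray out a random box in an image tensor; the parameters are illustrative.

```python
import torch
from torchvision import transforms

# Mask out a random rectangular region of the image tensor with a constant gray-ish value,
# so the network has to cope with partially occluded objects.
occlude = transforms.RandomErasing(p=1.0, scale=(0.1, 0.3), value=0.5)

image = torch.rand(3, 224, 224)   # a dummy image tensor in [0, 1]
occluded = occlude(image)         # same image with one random box grayed out
```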
So no, I meant like you have like, what is it?
A table and you can't see behind the table.
And you imagine there's a bunch of elves
with bananas behind the table.
Like, I wonder if it's useful to have a wild imagination for the network, because that's possible. Well, maybe not elves, but puppies and kittens or something like that.
Just have a wild imagination and constantly be generating that wild imagination.
Because in terms of data augmentation that's currently applied, it's super ultra, very boring.
It's very basic data augmentation.
I wonder if there's a benefit to being wildly imaginative while trying to be consistent
with physical reality.
I think it's a kind of a chicken and egg problem, right? Because to have like amazing data
augmentation, you need to understand what the scene is. Right. And we're trying to do data augmentation to learn what a scene is anyway. So it basically just keeps going on.
Before you understand it, just put elves with bananas until you know it not to be true.
Just like children have a wild imagination until the adults ruin it all.
Okay.
So what are the different kinds of data augmentation that you've seen to be effective
in visual intelligence?
For like vision, it's a lot of these image filtering operations.
So blurring the image, all the kind of Instagram filters
that you can think of.
So, like, arbitrarily make the red super red, make the green super green, saturate the image.
Rotation cropping.
Rotation cropping, exactly.
All of these kinds of things.
I said, lighting is a really interesting one to me. That feels really complicated to do.
So, I mean, the augmentations that we work on aren't that involved; they're not going to be physically realistic versions of lighting.
It's not that you're assuming that there's a light source up
and then you're moving it to the right and then what does
the thing look like?
It's really more about like brightness of the image, overall
brightness of the image or overall contrast of the image
and so on.
But this is a really important point to me.
I always thought that data augmentation holds an important key to big improvements in
machine learning.
And it seems that it is an important aspect of self-supervised learning.
So I wonder if there's big improvements to be achieved on much more
intelligent kinds of data augmentation. For example, currently, maybe you can correct me if I'm wrong,
data augmentation is not parametrized. Yeah. You're not learning. You're not learning.
To me, it seems like data augmentation potentially should involve more learning than the learning process itself. You're almost thinking of a generative kind of thing; it's the elves with bananas. You're trying to, it's like a very active imagination of messing with the world, and teaching that mechanism for messing with the world to be realistic.
Because that feels like, I mean, imagination, it's just as you said, it feels like us humans are able to, maybe sometimes subconsciously, imagine before we see the thing, imagine what we're expecting to see. Like, maybe several options. And especially, we probably forgot, but when we're younger, probably the possibilities were wilder and more numerous. And then as we get older, we come to understand the world, and the possibilities of what we might see become less and less and less.
So I wonder if you think there's a lot of breakthroughs yet to be had in data augmentation.
And maybe also can you just comment on the stuff we have?
Is that a big part of self-supervised learning?
Yes. So data augmentation is key to self-supervised learning. The kind of augmentation that we're using, and basically the fact that we're trying to learn these neural networks that are predicting features from images that are robust to data augmentation, has been the key for visual self-supervised learning. And it plays a fairly fundamental role in it. Now, the irony of all of this is that deep learning purists will say the entire point of deep learning
is that you feed in the pixels to the neural network
and it should figure out the patterns on its own.
So if it really wants to look at edges,
it should look at edges.
You shouldn't really go and handcraft these features. You shouldn't go tell it to look at edges.
So data augmentation should basically be in the same category,
right?
Why should we tell the network or tell this entire learning paradigm
what kind of data augmentation that we are looking for?
We are encoding a very sort of human specific bias there
that we know things are, like, if you change the contrast
of the image, it should still be an apple.
Or it should still see an apple, not a banana.
Thank you.
Basically, if we change colors, it should still be the same concept.
Yes.
Of course, this is not.
One, this doesn't feel like super satisfactory because a lot of our human knowledge or
our human supervision is actually going into the data augmentation.
So although we are calling itself supervised learning, a lot of the human knowledge is actually
being encoded in the data augmentation process.
So it's really like we've kind of sneaked away the supervision at the input and we're like really designing these nice list of data augmentations that are working very well.
Of course, the idea is that it's much easier to design a list of data augmentations than it is to collect labeled data.
So humans are nevertheless doing less and less work, and maybe leveraging their creativity more and more.
When we say data augmentation is not parametrized, it means it's not part of the learning process.
Do you think it's possible to integrate some of the data augmentation into the learning process? I think so. In fact, it will be really beneficial for us because a lot of these data augmentation
that we use in vision are very extreme. For example, when you have a certain concept, say again a banana, you take the banana and then basically you change the color of the banana, right? So you make it a purple banana. Now, this data augmentation process is actually independent of the image; it has no notion of what is present in the image. So it can change this color arbitrarily. It can make it a red banana as well. And now what we're doing is we're telling the neural network that a crop of this image which has the red banana, and a crop of this image where I changed the color to a purple banana, should have the same features. Now, bananas aren't red or purple, mostly. So really, the data augmentation process should take into account what is present in the image and what are the
kinds of physical realities that are possible.
It shouldn't be completely independent of the image.
So you might get big gains if you instead of being drastic do subtle augmentation but
realistic augmentation.
Right, realistic.
I'm not sure if it's subtle but like realistic for sure.
If it's realistic then even subtle augmentation will give you big benefits.
Exactly.
And it will be like for particular domains,
you might actually see like,
if for example, now we're doing medical imaging,
there are going to be certain kinds of like geometric
augmentation which are not really going to be very valid
for the human body.
So if you were to like actually loop in data augmentation
into the learning process,
it will actually be much more useful.
Now, this actually does take us to maybe a semi-supervised
kind of a setting because you do want to understand
what is it that you're trying to solve.
So currently self-supervised learning kind of operates
in the wild, right?
So you do the self-supervised learning,
and the purists and all of us basically say that,
okay, this should learn useful representations
and they should be useful for any kind of end task.
No matter whether it's banana recognition or autonomous driving.
Now, it's a tall order. Maybe the first baby step for us should be that, okay, if you're trying to
loop in this data augmentation into the learning process, then we at least need to have some sense
of what we're trying to do. Are we trying to distinguish between different types of bananas,
or are we trying to distinguish between banana and apple, or are we trying to do all of these things at once?
And so some notion of what happens at the end might actually help us do much better at this.
Let me ask you a ridiculous question.
If I were to give you a black box,
like a choice to have an arbitrary large data set
of real natural data versus really good data
augmentation algorithms, which would you like to train in a self-supervised way on?
So natural data from the internet are arbitrary large, so unlimited data.
Or it's like more controlled good data augmentation on the finite dataset.
The thing is, because learning algorithms for vision right now really rely on data augmentation,
even if you were to give me an infinite source of image data,
I still need a good data augmentation.
You need something to tell you that two things are similar.
Right.
And so something, because you've given me
an arbitrarily large dataset,
I still need to use data augmentation to take that image, construct
like these two perturbations of it and then learn from it.
So the thing is our learning paradigm is very primitive right now.
Even if you were to give me lots of images, it's still not really useful.
A good data augmentation algorithm is actually going to be more useful.
So you can reduce down the amount of data that you give me by like 10 times.
But if you were to give me a good data augmentation algorithm, that will probably do better than
giving me like 10 times the size of that data. But me having to rely on like a very primitive
data augmentation algorithm. Like through tagging and all those kinds of things, is there a way to
discover things that are semantically similar on the internet? Obviously there is, but it might be
extremely noisy. Right. And the difference might be farther away than you would be comfortable with.
So, I mean, yes, tagging will help you a lot. It'll actually go a very long way in figuring
out what images are related or not. And then, so, but then the purists would argue that
when you're using human tags, because these tags are like supervision, is it really self-supervised learning now? Because you're using human tags to figure out which images are similar.
Hashtag no filter means a lot of things.
Yes. I mean, there are certain tags which are going to be applicable pretty much to anything.
So they're pretty useless for learning. But, I mean, certain tags are actually, like the Eiffel Tower, for example, or the Taj Mahal, for example, these tags are very indicative of what's going on, and they are, I mean, they are human supervision.
Yeah.
So this is one of the tasks of discovering from human generated data, strong signals that could be leveraged for self-supervision.
Like humans are doing so much work already.
Like many years ago, there was something that was called,
I guess, human computation back in the day.
Humans are doing so much work.
It would be exciting to discover ways to leverage the work
they're doing to teach machines
without any extra effort from them.
An example could be, like we said, driving: humans driving, and machines can learn from the driving.
I always hope that there could be some supervision signal discovered in video games because there's so many people that play video games that it feels like so much effort is put into video games
into playing video games and you can design video games somewhat cheaply, to include whatever signals you want.
It feels like that could be leveraged somehow.
So people are using that.
Like there are actually folks right here in UT Austin; Philipp Krähenbühl is a professor at UT Austin. He's been working on video games as a source of supervision.
I mean, it's really fun,
like as a PhD student getting to play video games all day.
Yeah, but so I do hope that kind of thing scales, and like ultimately we discover some undeniably very good signal. It's like masking in NLP.
But that said, there's non-contrastive methods.
What do non-contrastive energy-based self-supervised learning
methods look like? And why are they promising?
So, like I said about contrastive learning, you have this notion of a positive and a negative.
Now, the thing is, this entire learning paradigm really requires access to a lot of negatives
to learn a good sort of feature space. The idea is if I tell you, okay, so a cat and a dog are similar
and they're very different from a banana.
The thing is, this is a fairly simple analogy, right?
Because, well, bananas look visually very different
from what cats and dogs do.
So very quickly, if this is the only source
of supervision that I'm giving you,
your learning is not going to be like,
after a point, the neural network
is really not going to learn a lot
because the negative that you're getting is going to be so random.
So it can be, okay, a cat and a dog are very similar, but they're very different from a Volkswagen Beetle. Now, this car looks very different from these animals again.
So the thing is in contrast to learning the quality of the negative sample really matters
a lot.
And so what has happened is basically that typically these methods that are contrastive really require access to lots of negatives, which becomes harder and
harder to sort of scale when designing a learning algorithm. So that's been one of the reasons why non-contrastive methods have become popular and why people think that they're going to be more useful. So a non-contrastive method, for example, like clustering, is one non-contrastive method. The idea
basically being that you have two of these samples.
So the cat and dog or two crops of this image, they belong to the same cluster.
And so essentially you're basically doing clustering online when you're learning this network
and which is very different from having access to a lot of negatives explicitly.
The other way which has become really popular is something called self-distillation.
So the idea basically is that you have a teacher network and a student network.
And the teacher network produces a feature. So it takes in the image and it basically the neural network figures out the patterns gets the feature out. And there's another neural network,
which is the student neural network, and that also produces a feature. And now all you're doing is basically saying that the features produced by the teacher network and the student network should be very similar.
That's it.
There is no notion of a negative anymore.
And that's it.
So it's all about similarity maximization
between these two features.
And so all I need to now do is figure out how
to have these two sorts of parallel networks,
a student network and a teacher network.
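A minimal sketch of a self-distillation setup of this kind, assuming (as in methods like BYOL or DINO) that the cheap trick for keeping the two networks related but different is a momentum/EMA teacher; the student encoder itself is a placeholder, and this is an illustration rather than the exact method being described.

```python
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    """The teacher starts as a copy of the student and is never updated by gradients."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Keep the teacher as a slow-moving average of the student, so the two stay related but different."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def self_distillation_loss(student, teacher, view1, view2):
    """Student features on one view should match teacher features on the other view. No negatives."""
    s = F.normalize(student(view1), dim=-1)
    with torch.no_grad():
        t = F.normalize(teacher(view2), dim=-1)
    return -(s * t).sum(dim=-1).mean()
```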
And basically researchers have figured out very cheap methods to do this. So you can actually have, for free really, two types of neural networks; they're kind of related, but they're different enough that you can actually basically have a learning problem set up.
So you can ensure that they always remain different enough? So the thing doesn't collapse into something boring.
Exactly. So the main sort of enemy of self-supervised learning, any kind of similarity maximization technique, is collapse. So collapse means that you learn the same feature representation for all the images in the world, which is completely useless.
Everything is a banana.
Everything is a banana, everything is a cat, everything is a car.
Yeah.
And so all we need to do is basically come up with ways to prevent collapse. Contrastive learning is one way of doing it. And then, for example, clustering or self-distillation are other ways of doing it. We also had a recent paper where we used decorrelation between two sets of features to prevent collapse.
So that's inspired a little bit by like Horace Barlow's
neuroscience principles.
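A rough sketch of a decorrelation objective of this kind: push the cross-correlation matrix between the two sets of features toward the identity, so the two views agree per dimension while the dimensions stay decorrelated. This is a generic Barlow Twins-style illustration, not necessarily the exact objective from the paper mentioned, and the weighting constant is illustrative.

```python
import torch

def decorrelation_loss(z1, z2, off_diag_weight=0.005):
    """Cross-correlation of the two views' features should look like the identity matrix."""
    n, d = z1.shape
    # Standardize each feature dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                                   # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()        # diagonal -> 1: views agree per dimension
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal -> 0: dimensions decorrelated
    return on_diag + off_diag_weight * off_diag
```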
By the way, I should comment that whoever counts the number of times, the word banana, apple,
cat, and dog were using this conversation, wins the internet.
I wish you luck.
What is SwAV, and the main improvement proposed in the paper, Unsupervised Learning of Visual Features by Contrasting Cluster Assignments?
So SwAV basically is a clustering-based technique, which is, again, for the same thing, for self-supervised learning in vision, where we have two crops.
And the idea basically is that you want the features
from these two crops of an image to lie in the same cluster.
And basically crops that are coming from different images to be in different clusters.
Now, typically in a sort of, if you were to do this clustering, you would perform clustering offline.
What that means is you would, if you have a dataset of N examples, you would run over all of these
N examples, get features for them, perform clustering, so basically get some clusters,
and then repeat the process again. So this is offline basically because I need to do one pass through the data to
compute its clusters.
SwAV is basically just a simple way of doing this online.
So as you're going through the data, you're actually computing these clusters online.
And so of course, there is like a lot of tricks involved in how to do this in a robust manner
without collapsing, but this is the sort of key idea to it.
Is there a nice way to say what is the key methodology of the clustering that enables that?
All right, so the idea basically is that when you have n samples, we assume that there are always K clusters in a dataset; K is a fixed number. So for example, K is 3000. And so when you look at any sort of small number of examples, all of them must belong to one of these K clusters. And we impose this equipartition constraint. What this means is that basically your entire set of n samples should be equally partitioned into K clusters.
So all your K clusters are basically equal,
they have equal contribution to these n samples.
And this ensures that we never collapse.
So collapse can be viewed as a way in which all samples belong to one cluster.
So all this, if all features become the same,
then you have basically just one mega cluster.
You don't even have like 10 clusters or 3000 clusters.
So SwAV basically ensures that at each point, all these 3000 clusters are being used in the clustering process.
That's it. Basically, just to figure out how to do this online.
Again, basically just make sure that two crops from the same image belong to the same cluster and others don't.
And the fact that you have a fixed K makes things simpler.
Fixed K makes things simpler.
Our clustering is not really hard clustering,
it's soft clustering.
So basically, you can be point two to cluster number one
and point eight to cluster number two.
So it's not really hard.
So essentially, even though we have like 3000 clusters,
we can actually represent a lot of clusters.
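A rough sketch of the online equipartition idea, assuming Sinkhorn-style alternating normalization as one way to spread soft assignments evenly over K prototypes; this is a simplification for illustration, not the full SwAV recipe, and the numbers are made up.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def equipartition_assignments(scores, n_iters=3):
    """Turn similarity scores (batch x K prototypes) into soft cluster assignments
    whose K columns end up (roughly) equally used -- the equipartition constraint."""
    q = torch.exp(scores / 0.05)            # sharpen the scores
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)  # each cluster gets roughly equal total mass
        q = q / q.sum(dim=1, keepdim=True)  # each sample's soft assignment sums to 1
    return q

# Toy usage: features of a batch of crops scored against K = 3000 prototypes.
features = F.normalize(torch.randn(64, 128), dim=-1)
prototypes = F.normalize(torch.randn(3000, 128), dim=-1)
soft_assignments = equipartition_assignments(features @ prototypes.T)
```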
What is SEER, S-E-E-R, and what are the key results and insights in the paper, Self-supervised Pretraining of Visual Features in the Wild? What is this big beautiful SEER system?
So for SEER, I'll first go to SwAV, because SwAV is actually one of the key components for SEER. So SwAV, when we used SwAV, it was demonstrated on ImageNet.
So typically like self supervised methods,
the way we sort of operate is like in the research community,
we kind of cheat.
So we take ImageNet, which of course I talked about as having lots of labels.
And then we throw away the labels,
like throw away all the hard work that went behind basically the labeling process.
And we pretend that it is self- like unsupervised.
But the problem here is that we have,
when we collected these images,
the ImageNet dataset has a particular distribution
of concepts.
So these images are very curated.
And what that means is these images, of course,
belong to a certain set of noun concepts.
And also ImageNet has this bias that all images
contain an object, which is very big and it's typically in the center.
So when you're talking about a dog, it's a well-framed dog.
It's towards the center of the image.
So a lot of the data augmentation,
a lot of the hidden assumptions in self-supervised learning,
actually really exploit this bias of ImageNet.
And so a lot of my work, a lot of work from other people,
always uses ImageNet as the benchmark to show the success of self-supervised learning.
So you're implying that there's particular limitations to this kind of data set?
Yes, I mean, it's basically because the data augmentations that we designed, like all the data augmentations that we designed for self-supervised learning in vision, are kind of overfit to ImageNet.
But you're saying it's a little bit hard-coded, like the cropping.
Exactly. The cropping parameters, the kind of lighting that we're using, the kind of blurring that we're using.
Yeah, but for a more in-the-wild dataset, you would need to be clever or more careful in setting the range of parameters and those kinds of things.
So for sure, our main goal was twofold: one, basically to move away from ImageNet for training.
So the images that we used were like uncurated images.
Now, there's a lot of debate whether they're actually
curated or not, but I'll talk about that later.
But the idea was basically these are going
to be random internet images that we're not
going to filter out based on like particular categories.
So we did not say that images that belong to dogs
and cats should be the only images that come in this dataset.
Banana.
And basically other images should be thrown out.
So we didn't do any of that.
So these are random internet images.
And of course, it also goes back to the problem of scale
that you talked about.
So these were basically about a billion or so images.
And for context, the ImageNet version that we used earlier was one million images. So this is basically going three orders of magnitude more.
The idea was basically to see if we can train a very large convolutional model in a self
supervised way on this uncurated but really large set of images.
And how well would this model do?
So is self supervised learning really overfit to image net or can it actually work in the
wild? And it was also out of curiosity what kind of things will this model learn? Will it actually
be able to still figure out different types of objects and so on? Would there be particular kinds
of tasks it would actually do better on than an ImageNet-trained model? And so for SEER, one of our main
findings was that we can actually train very large models
in a completely self-supervised way on lots of internet images without really necessarily
filtering them out, which was in itself a good thing because it's a fairly simple process,
right?
So you get images which are uploaded and you basically can immediately use them to train
a model in a non-supervised way.
You don't really need to sit and filter them out.
These images can be cartoons, these can be memes, these can be actual pictures uploaded by people, and you don't really care about what these images are,
you don't even care about what concepts they contain. So this was a very simple setup.
What image selection mechanism, would you say, is inherent in some aspect of the process? So you're kind of implying that there's almost none, but what is there, would you say, if you were to introspect?
Right. So it's not completely uncurated. One way of imagining uncurated is basically you have cameras that can take pictures at random viewpoints. When people upload pictures to the internet, they are typically going to care about the framing of it. They're not going to upload, say, the picture of a zoomed-in wall, for example.
Well, we'll see internet doing me in social networks.
Yes.
Okay.
So these are not going to be like pictures of like a zoomed in a table or a zoomed in wall.
So it's not really completely uncurated, because people do have their photographer's bias, where they do want to keep things towards the center a little bit, or really have, you know, nice-looking things and so on in the picture.
So that's the kind of bias that it typically exists in this dataset. And also the user base,
right? You're not going to get lots of pictures from different parts of the world because there are
certain parts of the world where people may not actually be uploading a lot of pictures to the
internet or may not even have access to a lot of internet. So this is a giant dataset and a giant
neural network. I don't think we've talked about what architectures work well
for SSL, for self-supervised learning.
For SEER and for SwAV, we were using convolutional networks.
But recently in a work called Dino,
we've basically started using transformers for vision.
Both seem to work really well,
convnets and transformers,
and depending on what you want to do,
you might choose to use a particular formulation.
So for SEER, it was a convnet. It was particularly a RegNet model, which was also work from Facebook. RegNets are really good when it comes to compute versus accuracy. So because it was a very efficient model, compute- and memory-wise efficient, and it worked really well in terms of scaling, we used a very large RegNet model and trained it on the billion images.
Can you maybe quickly comment on what RegNets are? It comes from this paper, Designing Network Design Spaces. This is a super interesting concept that emphasizes how to create efficient neural networks.
Right. Large neural networks.
So one of the key takeaways from this paper, which the author, like whenever you hear them present this work,
they keep saying is a lot of neural networks are characterized
in terms of flops, right?
Flops basically being the floating point operations.
And people really love to use flops to say this model
is like really computationally heavy,
or like our model is computationally cheap and so on.
Now it turns out that flops are really not a good indicator of how efficient a particular network really is. A better indicator is the activations, or the memory, that is being used by this particular model. And so one of the key findings
from this paper was basically that you need to design network families or neural network architectures
that are actually very efficient in the memory space as well, not just in terms of pure flops.
So regnet is basically a network architecture family that came out of this paper that
is particularly good at both flops and the sort of memory required for it.
And of course, it builds upon like earlier work like ResNet being like the sort of more
popular inspiration for it where you have residual connections.
But one of the things in this work is basically they also use like squeeze
excitation blocks. So it's a lot of nice sort of technical innovation in all of
this from prior work and a lot of the
ingenuity of these particular authors and how to combine these multiple
building blocks. But the key constraint was
optimized for both flops and memory when you're basically doing this, don't just look at flops.
And that allows you to have very large networks that, through this process, are optimized for efficiency, for low memory.
And also, in just in terms of pure hardware,
they fit very well on GPU memory.
So they can be really powerful neural network architectures
with lots of parameters, lots of Flops,
but also, because they're efficient in terms of the amount of memory that they're using, you can actually fit a very large model on a single GPU, for example.
Would you say that the choice of architecture matters more than the choice of maybe data
augmentation techniques?
Is there a possibility to say what matters more?
You kind of imply that you can probably go really far with just using basic convnets.
All right, I think data and data augmentation, the algorithm being used for the self-supervised
training matters a lot more than the particular kind of architecture. With different types of
architecture, you will get different properties in the resulting sort of representation. But really,
I mean, the secret sauce is in the data augmentation and the algorithm being used to train them.
The architectures, at this point, a lot of them perform very similarly,
depending on the particular task that you care about.
They have certain advantages and disadvantages.
Is there something interesting to be said about what it takes with Sears to train a giant neural network?
You're talking about a huge amount of data, a huge neural network.
Is there something interesting to be said of how to effectively train something like that
fast?
Lots of GPUs.
Okay.
So this.
I mean, so the model was like a billion parameters.
Yeah.
And it was trained on the billion images.
Yeah.
So basically the same number of parameters as the number of images. And it took a while; I don't remember the exact number, it's in
the paper, but it took a while. I guess I'm trying to get at is when you're thinking of
scaling this kind of thing. I mean, one of the exciting possibilities of self-supervised
learning is the several orders of magnitude scaling of everything, both in your own network
and the size of the data. And so the question is, do you think there are some interesting
tricks to do large scale distributed compute, or is that really outside of even deep learning?
That's more about like hardware engineering.
I think more and more there is like this,
a lot of like systems are designed basically taking
into account the machine learning needs, right?
So because whenever you're doing this kind of distributed training,
there is a lot of intercommunication between nodes.
So like gradients or the model parameters are being passed.
So you really want to minimize communication costs when you really want to scale these models up.
You basically want to be able to do as limited an amount of communication as possible. So currently a dominant paradigm is synchronized training. So essentially, after every gradient step, you basically have a synchronization step between all the compute chips that you're running on. I think asynchronous training was popular, but it doesn't
seem to perform as well. But in general, I think that's sort of the, I guess it's outside my scope as well.
But the main thing is, like, minimize the amount of synchronization steps that you have.
That has been the key takeaway, at least in my experience.
The others I have no idea about how to design the chip.
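A minimal sketch of the synchronous data-parallel pattern being described, using PyTorch's DistributedDataParallel, which all-reduces gradients across workers during the backward pass; the loss is a stand-in, and the launch and process-group setup are only hinted at in comments.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model, batch, optimizer):
    """One synchronous step: every worker computes gradients locally, then they are
    all-reduced (averaged) across workers before the optimizer update."""
    optimizer.zero_grad()
    loss = model(batch).mean()     # stand-in for the real self-supervised loss
    loss.backward()                # DDP overlaps the gradient all-reduce with the backward pass
    optimizer.step()
    return loss

# Sketch of setup (assumes the processes are launched with torchrun or similar):
# dist.init_process_group("nccl")
# model = DDP(my_model.cuda(), device_ids=[local_rank])
```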
There are very few things that I see Jim Keller's eyes light up about as much as talking about giant computers doing that fast communication that you're talking about when they're training machine learning systems.
What is VISSL, the PyTorch-based SSL library?
What are the use cases that you might have?
VISSL basically was born out of a lot of us at Facebook
doing the self-supervised learning research.
So it's a common framework in which we have a lot of self-supervised learning methods implemented for vision.
It's also, it has in itself a benchmark of tasks that you can evaluate the
self-supervised representations on.
So the use case for it is basically for anyone who's either trying to evaluate
their self-supervised model or train their self-supervised model,
or a researcher who's trying to build a new self-supervised technique.
So it's basically supposed to be all of these things.
So as a researcher, before VISSL, for example,
or like when we started doing this work fairly seriously
at Facebook, it was very hard for us to go and implement
every self-supervised learning model, test it out
in a consistent manner.
The experimental setup was very different across different groups.
Even when someone said that they were reporting ImageNet accuracy, it could mean lots of different things. So with VISSL, we tried to really sort of standardize that as much as possible.
And it was a paper like we did in 2019, just about benchmarking.
And so VISSL basically builds up on a lot of this kind of work that we did,
about like benchmarking.
And then every time we come up with a self-supervised learning method, a lot of us try to push that into VISSL as well, just so that it basically is the central piece where a lot of these methods can reside.
Just out of curiosity, for people, certainly outside of Facebook, but just researchers, or even people that know how to program in Python and how to use PyTorch, what would be the use case, or what would be a fun thing to play around with VISSL on? Like, what's a fun thing to play around
with self-supervised learning on, would you say?
Is there a good, hello world program?
Like, is it always about big size that's important
to have or is there fun little smaller case
playgrounds to play around with?
So we're trying to push something towards that. I think there are
a few setups out there, but nothing like super standard on the smaller scale. I mean,
image net itself is actually pretty big also. So that is not something which is feasible for a lot of
people. But we are trying to push for smaller sorts of use cases. The thing is, at a smaller scale, a lot of the observations, or a lot of the algorithms that work, don't necessarily translate into the medium
or the larger scale.
So it's really tricky to come up with a good small scale setup
where a lot of your empirical observations
will really translate to the other setup.
So it's been really challenging.
I've been trying to do that for a little bit as well
because it does take time to train stuff on ImageNet, and it does take time to train on more images.
But pretty much every time I've tried to do that,
it's been unsuccessful because all the observations
that are off from my set of experiments
on a smaller dataset don't translate into ImageNet,
or like don't translate to another sort of dataset.
So it's been hard for us to figure this one out,
but it's an important problem.
So this is a really interesting idea
of learning across multiple modalities.
You have a CVPR 2021 best paper candidate titled
Audio-Visual Instance Discrimination with Cross-Modal Agreement. What are the key results and insights
in this paper and what can you say in general about the promise and power of multimodal learning?
For this paper, it actually came as a little bit of a shock to me at how well it worked,
so I can describe what the problem setup was.
So it's been used in the past by lots of folks, like, for example, Andrew Owens from MIT, Alyosha Efros from Berkeley, Andrew Zisserman from Oxford.
So a lot of these people have been showing results in this.
Of course, I was aware of this result, but I wasn't really sure how well it would work in practice
for other downstream tasks.
So the results kept getting better.
And I wasn't sure if a lot of our insights
from self supervised learning would translate
into this multi-modal learning problem.
So multi-modal learning is when you have multiple modalities.
That's not your problem.
Yeah, absolutely.
OK, so the particular modalities
that we worked on in this work were audio and video.
So the idea was basically if you have a video, you have it corresponding audio track.
And you want to use both of these signals, the audio signal and the video signal to learn
a good representation for video and good representation for audio.
Like this podcast.
Like this podcast, exactly.
So what we did in this work was basically trained two different neural networks, one on the
video signal, one on the audio signal.
And what we wanted is basically the features that we get from both of these neural networks
should be similar.
So it should basically be able to produce the same kinds of features from the video and
the same kinds of features from the audio.
Now why is this useful?
Well, for a lot of these objects that we have, there is a characteristic sound, right?
So trains when they go by, they make a particular kind of sound, boats make a particular kind
of sound. People when they're jumping around will shout,
whatever, bananas don't make a sound.
So where you can't learn anything about bananas there.
Or when humans mention bananas.
Well, yes, when they say the word banana, then...
So you can't basically trust anything that comes out of a human's mouth as a source of audio.
So the typical use case is basically like,
for example, someone's playing a musical instrument.
So guitars have a particular kind of sound and so on.
So because a lot of these things are correlated,
the idea and multi-modal learning
is to take these two kinds of modalities, video and audio,
and learn a common embedding space,
a common feature space where both of these related modalities
can basically be close together.
And again, you use contrastive learning for this.
So in contrastive learning, basically the video and the corresponding audio
or positives, and you can take any other video or any other audio and that
becomes a negative. And so basically that's it. It's just a simple application
of contrastive learning. The main sort of finding from this work for us was
basically that you can actually learn very, very powerful feature representations,
very, very powerful video representations.
So you can learn the video network that we ended up learning can actually be used for downstream,
for example, recognizing human actions, or recognizing different types of sounds, for example.
So this was sort of the key finding.
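A rough sketch of that cross-modal contrastive setup: a clip's video features and its own audio features form the positive pair, and the other clips in the batch supply the negatives. The encoders and dimensions are placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_feats, audio_feats, temperature=0.1):
    """video_feats[i] and audio_feats[i] come from the same clip (positives);
    every other pairing in the batch is treated as a negative."""
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    logits = v @ a.T / temperature                 # batch x batch similarity matrix
    targets = torch.arange(v.shape[0])             # the diagonal entries are the true pairs
    # Symmetric loss: match each video to its audio and each audio to its video.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage with made-up encoders producing 256-d features for a batch of 8 clips.
video_feats = torch.randn(8, 256, requires_grad=True)
audio_feats = torch.randn(8, 256, requires_grad=True)
loss = cross_modal_contrastive_loss(video_feats, audio_feats)
```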
Can you give a kind of example of a human action or
like just so we can build up intuition of what kind of thing.
Right, so there is this dataset called kinetics, for example, which is like 400 different types of
human actions. So people jumping, people, you know, doing different kinds of sports or different
types of swimming, so like different strokes in swimming, golf, and so on. So they're just different types of actions right there. And the point is, this kind of video network that you learn in a self-supervised way can be used very easily to kind of recognize
these different types of actions.
It can also be used for recognizing
different types of objects.
And what we did is we tried to visualize
whether the network can figure out
where the sound is coming from.
So basically, give it a video of, say, a person just strumming a guitar, but of course there's no audio in this. And now you give it this sound of a guitar, and you ask it to basically try to visualize where the network thinks the sound is coming from. And it can kind of do that. For example, for certain people's voices, like famous celebrities' voices, it can actually figure out where their mouth is.
So it can actually distinguish different people's voices,
for example, a little bit as well.
Without that ever being annotated in any way.
Right.
So this is all what it had discovered.
We never like, we never pointed out that this is a guitar
and this is the kind of sound it produces.
It can actually naturally figure that out
because it's seen so many correlations of this sound coming
with this kind of like an object
that it basically learns to associate this sound
with this kind of an object.
Yeah, that's really fascinating, right?
That's really interesting.
So the idea with this kind of network
is then you then fine tune it for a particular task.
So this is forming like a really good knowledge base
within a neural network, based on which you could then train a little bit more to accomplish a specific task.
Well, so you don't need a lot of annotated videos of humans doing actions. You can just use a few of them to basically get your model going.
How much insight do you draw from the fact that you can figure out
where the sound is coming from?
I'm trying to see. So that's kind of very, it's very CVPR beautiful.
Right?
So it's a cool insight.
I wonder how profound that is.
You know, does it speak to the idea that multiple modalities are somehow
much bigger than the sum of their parts, or is it really,
really useful to have multiple modalities? Or is it just that cool thing that there's parts of
our world that can be revealed effectively through multiple modalities, but most of it is really
all about vision or about one of the modalities.
I would say a little tending more towards the second part, so most of it can be sort of figured out
with one modality, but having an extra modality always helps you. So in this case, for example,
like one thing is when you're, if you observe someone cutting something and you don't have any sort
of sound there, whether it's an apple or whether it's an onion, it's very hard to figure that out.
But if you hear someone cutting it, it's very easy to figure it out, because apples and onions make a very different kind of characteristic sound when they're being cut.
So you can really figure this out based on audio; it's much easier. So your life becomes much easier when you have access to different kinds of modalities. And the other thing is, I like to relate it to, and this may be completely wrong, the distributional hypothesis in NLP, right, where context basically gives kind of meaning to a word. Sound kind of does that too, right? So if you
have the same sound, so that's the same context across different videos, you're very likely
to be observing the same kind of concept.
So that's the kind of reason why it figures out the guitar thing, right?
It observed the same sound across multiple different videos and it figures out maybe this
is the common factor that's actually doing it.
I wonder, I've had this argument with my dad a bunch, for creating general intelligence, whether smell is important, like if that's important sensory information. Mostly we're talking about falling in love with an AI system. And for him, smell and touch are important. And I was arguing that it's not at all that important. It's nice and everything, but you can fall in love with just language really, but
voice is very powerful and vision is next and smell is not that important.
Can I ask you about this process of active learning, you mentioned interactivity.
Is there some value within the self-supervised learning context to select parts of the data
in intelligent ways such that they would most benefit the learning
process.
So I think so.
I think, I mean, I know I'm talking to an active learning fan here, so of course, you know the answer.
First you were talking bananas and now you're talking about active learning. I love it.
I think Yann LeCun told me that active learning is not that interesting.
I think I was back then,
I didn't wanna argue with him too much,
but when we talk again,
we're gonna spend three hours arguing
about active learning.
My sense was you can go extremely far
with active learning,
perhaps farther than anything else.
Like to me, there's this kind of intuition
that similar to data augmentation,
you can get a lot from the data, from intelligent, optimized usage of the data.
I'm trying to speak generally in such a way that includes data augmentation and active learning,
that there's something about maybe interactive exploration of the data that's at least part of the solution to intelligence, like an important part. I don't know what your thoughts are
on active learning in general.
I actually really like active learning. So back in the day, we did this largely ignored
CVPR paper called learning by asking questions. So the idea was basically you would train an
agent that would ask a question about the image, it would get an answer. And basically
then it would update itself, it would see the next image, it would decide
what's the next hardest question that I can ask to learn the most. And the idea was basically
because it was being smart about the kinds of questions it was asking, it would learn
in fewer samples, it would be more efficient at using data. And we did find to some extent
that it was actually better than randomly asking questions. A kind of weird thing about active learning is it's also a chicken-and-egg problem, because when you look at an image, to ask a good question about the image,
you need to understand something about the image.
Right.
You can't ask a completely arbitrarily random question.
It may not even apply to that particular image.
So there is some amount of understanding or knowledge that basically keeps getting built
when you're doing active learning.
So I think active learning in by itself is really good.
And the main thing we need to figure out is basically how do we come up with a
technique to first model what the model knows.
And also model what the model does not know.
I think that's the sort of beauty of it, right?
Because when you know that there are certain things that you don't know anything about, asking a question about those concepts is actually going to bring you the most value. And I think that's the sort of key challenge.
like selecting data for it and so on, that's actually really useful. But I think that's a very
narrow view of looking at active learning, right? If you look at it more broadly, it is basically
about if the model has a knowledge about end concepts, and it is weak basically about certain things,
so it needs to ask questions either to discover new concepts,
or to basically increase its knowledge about these end concepts.
So at that level, it's a very powerful technique.
I actually do think it's going to be really useful.
Even in simple things such as data labeling, it's super useful.
So here is one simple way that you can
use active learning. For example, you have your self-supervised model, which is very good at predicting similarities and dissimilarities between things. And so if you label a picture as basically, say,
a banana, now you know that all the images that are very similar to this image are also likely to
contain bananas. So probably when you want to understand what else is a banana,
you're not going to use these other images.
You're actually going to use an image that is not completely dissimilar,
but somewhere in between, which is not super similar to this image,
but not super dissimilar either.
And that's going to tell you a lot more about what this concept of a banana is.
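A toy sketch of the selection heuristic being described: rank unlabeled images by similarity to the labeled example under the self-supervised features, and ask for labels on the ones in the middle of the range rather than near-duplicates or totally unrelated images. The ranking scheme here is just one illustrative choice.

```python
import torch
import torch.nn.functional as F

def pick_images_to_label(labeled_feat, unlabeled_feats, k=10):
    """Return indices of unlabeled images whose similarity to the labeled example is
    neither very high (near-duplicates, little new information) nor very low (unrelated)."""
    sims = F.cosine_similarity(unlabeled_feats, labeled_feat.unsqueeze(0), dim=-1)
    # Rank by distance from the middle of the similarity range; closest to the middle first.
    mid = 0.5 * (sims.max() + sims.min())
    return torch.argsort((sims - mid).abs())[:k]

# Toy usage with self-supervised features (e.g., 128-d) for one labeled banana and 1000 unlabeled images.
banana_feat = torch.randn(128)
unlabeled = torch.randn(1000, 128)
to_label = pick_images_to_label(banana_feat, unlabeled)
```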
So that's kind of a heuristic. I wonder if it's possible to also learn ways to discover the most beneficial image. So, not just looking at a thing that's somewhat similar to a banana, but not exactly similar, but having some kind of more complicated learning system, like a learned discovering mechanism that tells you what image to look for.
Yeah, like actually in a self supervised way learning strictly a function that says is
this image going to be very useful to me given what I currently know. I think there is a lot of synergy there.
It's just I think, yeah, it's going to be explored.
I think very much related to that.
I kind of think of what Tesla Autopilot is doing currently as kind of active learning. There's something that Andrej Karpathy and their team are calling the data engine.
Yes.
So you're basically deploying a bunch of instantiations
of a neural network into the wild,
and they're collecting a bunch of edge cases
that are then sent back for annotation. In particular, edge cases, defined as near failures or some weirdness on a particular task, are then sent back. It's that not-exactly-a-banana, but almost-a-banana case, sent back for annotation, and then there's this loop that keeps
going and you keep retraining and retraining.
And the active learning step there, or whatever you want to call it, is the cars themselves that are sending you back the data, like, what the hell happened here? This was weird.
What are your thoughts about that sort of deployment of neural networks in the wild? Another way to ask the question: first, what are your thoughts, and maybe, if you want to comment, are there applications of self-supervised learning in the context of computer vision-based autonomous driving?
So I think so. I think for self-supervised learning to be used in autonomous driving,
there are lots of opportunities. I mean, just pure consistency in predictions is one way, right?
Because you have this nice sequence of data that is coming in, a video stream,
associated of course with the actions that, say, the car took,
you can form a very nice predictive model of what's happening.
So for example, one way, possibly, in which they're figuring out
what data to get labeled is basically through prediction uncertainty, right?
So you predict that the car was going to turn right,
so this was the action that was going to happen, say in shadow mode,
and now the driver turned left.
And this is a really big surprise.
So basically by forming these good predictive models,
you are, I mean, these are kind of self-supervised models.
Predictive models are basically being trained just by looking
at what happens next and asking them to predict
what's going to happen next.
So I would say this is really one use
of self-supervised learning.
It's a predictive model, and you're learning a predictive model basically just by looking at what data you have.
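(A minimal sketch of the prediction-uncertainty idea being described, assuming some model that maps camera frames to a distribution over discrete maneuvers; the names, shapes, and threshold are illustrative, and this is not a description of Tesla's actual pipeline.)

```python
import torch

def surprise_score(policy, frames, human_action):
    """Shadow-mode surprise: how unlikely was the human's action under the model?

    `policy` is any model mapping a stack of camera frames to logits over
    discrete maneuvers (e.g. left / straight / right). A high score means the
    driver did something the predictive model did not expect.
    """
    with torch.no_grad():
        logits = policy(frames)                      # (num_actions,)
        log_probs = torch.log_softmax(logits, dim=-1)
    return -log_probs[human_action].item()           # negative log-likelihood

def maybe_flag_for_annotation(policy, frames, human_action, threshold=3.0):
    # threshold is a made-up operating point; in practice it would be tuned.
    return surprise_score(policy, frames, human_action) > threshold
```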
Is there something about that active learning context that you find insights from?
Like that kind of deployment of the system seeing cases where it doesn't perform as you expected
and then retraining the system based on that?
I think that really resonates with me.
It's super smart to do it that way.
Because the thing is, with any kind of practical system like autonomous driving,
there are those edge cases that are actually the problem.
Highway driving or freeway driving, there has been a lot of success in that particular part
of autonomous driving for a long time, I would say since the 80s or something.
Now, the point is, all these failure cases are the sort of
reason why autonomous driving hasn't become super mainstream
and available in every possible car right now.
And so basically by really scaling this problem out,
by really trying to get all of these edge cases
as quickly as possible, and then just using those to improve your model, that's super smart. And prediction
uncertainty to do that is one really nice way of doing it.
Let me put you on the spot. So we mentioned offline Jitendra. He thinks that the Tesla
computer vision approach, or really any approach for autonomous driving, is very far away. How many years away, if you had to bet all your money on it,
are we from solving autonomous driving with this kind of computer-vision-only,
machine-learning-based approach?
Okay, so what does solving autonomous driving mean?
Does it mean solving it in the US?
Does it mean solving it in India?
Because I can tell you there are very different types of driving out there.
Not India, not Russia.
In the United States. So what solving means
is: when the car says it has control, it is fully liable.
You can go to sleep, it's driving by itself.
So this is highway and city driving,
not everywhere, but mostly everywhere.
And it's, let's say, significantly better.
Like, say, five times fewer accidents than humans.
It's sufficiently safer, such that the public feels like that transition is enticing,
beneficial both for our safety and financially and all those kinds of things.
Okay.
So first, disclaimer, I'm not an expert in autonomous driving, so let me put that out there. I would say
at least five to ten years. This would be my guess from now. And yeah, I'm actually very impressed.
When I sat in a friend's Tesla recently, of course, looking at the screen, it basically
shows all the detections and everything the car is doing as you're driving by.
And that's super distracting for me as a person, because all I keep looking at is the
bounding boxes and the cars it's tracking. And it's really impressive, especially when
it's raining and it's able to do that. That was the most impressive part for me,
that it's actually able to get through rain and do that.
And one of the reasons why a lot of us believed, and I would put myself in
that category, that LiDAR-based technology was the key driver for autonomous driving, right?
Waymo was using it for the longest time. And Tesla then decided to go this completely other route:
oh, we're not even going to use LiDAR. So their initial system, I think, was camera and radar based,
and now they're actually moving to a completely vision-based system. And so that just sounded completely crazy.
LiDAR is very useful in cases where you have low visibility.
Of course, it comes with its own set of complications.
But now to see that happen live, Tesla basically just proves everyone wrong,
I would say, in a way.
And it's just working really well.
I think there were also a lot of advancements in camera technology.
I know at CMU, when I was there, there was a particular kind of camera
that had been developed that was really good at basically low-visibility settings,
so lots of snow and lots of rain.
It could actually still have very reasonable visibility.
And I think there are lots of these kinds of innovations that will happen on the sensor
side itself, which is actually going to make this easier in the future. And so maybe that's actually why I'm
more optimistic about vision-based autonomous driving. I was going to call it self-supervised driving,
but vision-based autonomous driving, that's the reason I'm quite optimistic about it, because I
think there are going to be lots of these advances on the sensor side itself. So acquiring this data,
we're actually going to get much better at it.
And then of course, once we're
able to scale out and get all of these edge cases in,
as Andrej described, I think that's
going to take us very far.
Yeah, so it's funny.
I'm very much with you on the five to ten years, maybe ten years.
But I'm not sure how you meant that to sound.
For some people, that might seem really far away, and then for other people
it might seem very close.
There are a lot of fundamental questions about how much game theory is in this whole thing.
Like, how much of this is simply a collision avoidance problem, and how much of it is that you're still interacting
with other humans in the scene,
and you're trying to create an experience
that's compelling. So you want to get from point A
to point B quickly, you want to navigate the scene
in a safe way, but you also want to show some level
of aggression, because, well, certainly this is why
you're screwed in India, because you have to show
aggression. Or Jersey, or New York, or basically any major city. But I think
it's probably Elon that I talked to most about this, which is, I'm surprised at the level to which they're
not considering human beings as a huge problem in this, as a source of problems.
Like, driving is treated fundamentally as a robot-versus-environment problem, versus,
you know, you can just consider humans not part of the problem.
I used to think humans almost certainly have to be modeled really well.
Pedestrians and cyclists and humans inside of the cars,
you have to have mental models for them.
You cannot just see them as objects.
But more and more, it's the same kind of intuition-breaking thing
that self-supervised learning does,
which is, well, maybe through the learning
you'll get all the human information you need.
Maybe you'll get it with enough data.
You don't need to have explicit good models of human behavior.
Maybe you get it through the data.
My skepticism, also just knowing a lot of automotive companies and how difficult it is to be
innovative, I was skeptical that they would be able, at scale, to convert the driving scene
across the world into digital form such that you can create this data engine at scale.
And the fact that Tesla is at least getting there, or is already there, makes me think that it's now starting to be
coupled to the self-supervised learning vision, which is:
if that's going to work, if through purely this process
you can get really far, then maybe you can solve driving that way.
I don't know.
I tend to believe we don't give enough credit to how amazing humans are, both at driving and at supervising
autonomous systems.
And also, I wish there was much more driver sensing inside Teslas and much deeper
consideration of human factors, like understanding psychology and drowsiness and
all those kinds of things. When the car does more and more of the work, how do you keep utilizing
the little human supervision that is needed to keep this whole thing safe?
I mean, it's a fascinating dance of human-robot interaction.
To me, autonomous driving for a long time is a human-robot interaction problem.
It is not a robotics problem or a computer vision problem. You have to have the human
in the loop, which is why I think it's ten years plus. But I do think there will be a bunch
of cities and contexts where, you know, geo-restricted, it will work really, really damn well.
Yeah. So I think for me, it's five if I'm being optimistic, and it's going to be five
for a lot of cases. And ten plus, I agree with you, ten plus basically if we want to
cover most of, say, the contiguous United States or something.
Oh, interesting. So my optimistic is five and pessimistic is 30.
30.
I have a long tail on this one. I've watched enough driving videos,
I've watched enough pedestrians, to think that there's a small part of me,
actually not a small, a pretty big part of me, that thinks we will have to build
AGI to solve driving.
Oh, wow.
Because humans are part of the picture, deeply part of
the picture, and also human society is part of the picture, in that human life is at stake.
Anytime a robot kills a human, it's not clear to me that that's not a problem that machine
learning will also have to solve.
Yeah.
You have to integrate that into the whole thing, just like Facebook or
social networks.
One thing is to say how to make a really good recommender system.
The other thing is to integrate into that recommender system all the journalists that
will write articles about that recommender system.
You have to consider the society within which the AI system operates.
And politicians too; there's the regulatory stuff for autonomous driving.
It's kind of fascinating that the more successful your AI system becomes, the more it gets integrated into society,
and the more the politicians and the public and the clickbait journalists, all the different fascinating forces of our society,
start acting on it. And then it's no longer how good you are at doing the
initial task. It's also how good you are at navigating human nature, which is a fascinating space.
What do you think are the limits of deep learning? If you allow me, we'll zoom out a little bit
into the big question of artificial intelligence. You said self-supervised learning is the dark matter of intelligence, but there could be more.
What do you think the limits of self-supervised learning, and of learning in general, of deep learning,
are?
I think for deep learning in particular, because self-supervised learning is, I would say,
a little bit more vague right now, and for something that's so vague,
it's hard to predict what its limits are going to be.
But like I said, I think anywhere you want to interact with humans,
self-supervised learning kind of hits a boundary very quickly,
because you need to have an interface
to be able to communicate with the human.
So really, if you have just vacuous concepts,
or nebulous concepts, discovered by a network,
it's very hard to communicate those to a human without
inserting some kind of human knowledge
or some kind of human bias there.
In general, I think for deep learning the biggest challenge is just data efficiency.
Even with self-supervised learning, even with anything else, if you just see a single
concept once, one image of a concept, whatever you want to call it,
it's really hard for these methods to generalize
by looking at just one or two samples of things.
And that has been a real challenge.
And I think that's actually why these edge cases, for example for Tesla,
are that important.
Because if you see just one instance of the car failing,
and you just annotate that and get it into your dataset,
you have a very limited guarantee that it's not going to happen again, and that you're actually
going to be able to recognize this kind of instance in a very different scenario. So
it was snowing, so you got that thing labeled when it was snowing, but now when
it's raining you're actually not able to get it. Or you basically have the same scenario
in a different part of the world, so the lighting was different, and so on. So it's just
really hard for these models, deep learning especially, to do that.
What's your intuition? How do we solve the handwritten digit recognition problem when we only have one example for each number?
It feels like humans are using something like transfer learning.
Right, I think we are good at transferring knowledge a little bit.
For a lot of these problems where we are
generalizing from a single sample, recognizing from a single sample, we are using a lot of our own
domain knowledge and a lot of our inductive bias on that one sample to generalize from it.
So I've never seen you write the number nine, for example. And if you were to write it, I would
still get it. And if you were to write a character from a different kind of alphabet, and write it in two
different ways, I would still probably be able to figure out that these are the same two characters.
It's just that I have been very used to seeing
handwritten digits in my life.
The other sort of problem with any deep learning system,
or any kind of machine learning system,
is its guarantees, right?
There are no guarantees for it.
Now, you can argue that humans also don't have any guarantees.
There is no guarantee that I can recognize a cat
in every scenario.
I'm sure there are going to be lots of cats that I don't recognize, lots of scenarios in which I don't recognize cats
in general. But I think from just a sort of application perspective, you do need guarantees,
right? We call these things algorithms. Now, traditional CS algorithms have guarantees.
Sorting is a guarantee: if you were to call sort on a particular array of numbers,
you are guaranteed that it's going to be sorted.
Otherwise, it's a bug.
Now, for machine learning, it's very hard to characterize this.
We know for a fact that a cat recognition model
is not going to recognize every cat in the world
in every circumstance.
I think most people would agree with that statement.
But we are still okay with it.
We still don't call this a bug.
Whereas in traditional computer science, or traditional science,
if you have this kind of failure case existing,
then you think of it as something being wrong.
I think there is this sort of notion of nebulous correctness
for machine learning, and that's something we just need to be very comfortable with.
And for deep learning, or for a lot of these machine learning algorithms, it's not clear how we characterize this notion of correctness.
I think it's a limitation in our understanding, or at least a limitation in our phrasing of this.
And if we were to come up with better ways to understand this limitation, then it would actually
help us a lot.
Do you think there's a distinction between the concept of learning and the concept of
reasoning? Do you think it's possible for neural networks to reason?
So I think of it slightly differently. For me, learning is whenever I can make
a snap judgment. So if you show me a picture of a dog, I can immediately say it's a dog.
But if you give me a puzzle, you know, like a Rube Goldberg machine where
this thing is going to happen, then I have to reason,
because it's a very complicated setup.
I've never seen that particular setup,
and I really need to, you know, draw and
imagine in my head what's going to happen
to figure it out.
So I think, yes, neural networks are really good at recognition,
but they're not very good at reasoning.
Because if they have seen something before, or seen something similar before,
they're very good at making those sort of snap judgments.
But if you were to give them a very complicated thing that they've not seen before,
they have very limited ability right now to compose different things,
like, oh, I've seen this particular part before, I've seen this particular part before,
and now probably this is how they're going to work in tandem.
It's very hard for them to come up with these kinds of things.
Well, there's a certain aspect to reasoning that you can maybe convert into the process of programming.
And so there's the whole field of program synthesis, and people have been
applying machine learning to the problem of program synthesis. And the question is, you know,
the step of composition, why can't that be learned?
The step of building things on top of each other,
can that be learnable?
What's your intuition there?
Or, I guess, do you think similar sorts of techniques would be applicable?
So I think it is of course learnable, because we are prime examples of machines,
or individuals, that have learned this, right?
Humans have learned this.
So it is of course a technique that can be learned.
I think where we are kind of hitting a wall with current machine learning
is the fact that when the network
learns all of this information, we basically are not able to figure out how well it's
going to generalize to an unseen thing.
And we have, a priori, no way of characterizing that.
And I think that's basically telling us a lot about the fact that
we really don't know what this model has learned,
because we don't know how well it's going to transfer.
There's also a sense in which it feels like we humans may not be aware of
how good our background model is, how much knowledge we just have
slowly built on top of each other. It feels like neural networks are constantly throwing stuff out.
You'll do some incredible thing
where you're learning a particular task in computer vision,
you celebrate your state-of-the-art successes,
and you throw that out.
It feels like you're never using stuff
you've learned for your future successes in other domains.
And humans are obviously doing that exceptionally well,
still throwing stuff away in their mind,
but keeping certain kernels of truth.
Right, so continual learning is sort of the paradigm
for this in machine learning.
And I don't think it's a very well-explored paradigm.
Yeah.
We have things in deep learning, for example, right?
Catastrophic forgetting is one of the standard things.
The thing basically being that if you teach a network to recognize dogs, and now you teach that same network to recognize cats,
it basically forgets how to recognize dogs. It forgets very quickly. Whereas
a human, if you were to teach someone to recognize dogs and then to recognize cats, they don't forget
immediately how to recognize those dogs. I think that's basically what you're trying to get to.
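(A toy sketch of catastrophic forgetting on synthetic data: the two made-up "tasks" are just different random linear mappings. Train on task A, then on task B, and accuracy on A typically collapses. The architecture and hyperparameters are arbitrary illustrations.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(num_classes=5, dim=32, n=2000):
    # A synthetic classification task: labels come from a random linear map.
    w = torch.randn(dim, num_classes)
    x = torch.randn(n, dim)
    return x, (x @ w).argmax(dim=1)

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def train(model, x, y, epochs=300, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 5))
xa, ya = make_task()   # "dogs"
xb, yb = make_task()   # "cats"

train(model, xa, ya)
print("task A accuracy after training on A:", accuracy(model, xa, ya))
train(model, xb, yb)   # sequential training on the second task only
print("task A accuracy after training on B:", accuracy(model, xa, ya))  # usually drops sharply
```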
Yeah, I just wonder if the long-term memory mechanisms, or the mechanisms that store
not just memories but concepts that allow you to reason and compose concepts,
whether those things will look very different from neural networks, or if you can do that within
a single neural network with some particular architecture quirks. That seems to
be a really open problem.
And of course, I go up and down on that, because there's something so compelling about symbolic
AI, or the ideas of logic-based expert systems.
You have human-interpretable facts
that build on top of each other.
It's really annoying with self-supervised learning
that the AI is not very explainable.
You can't understand
all the beautiful things it has learned.
You can't ask it questions.
But then again, maybe that's a stupid thing
for us humans to want.
I think whenever we try to understand it,
we're putting our own subjective human bias into it.
I think that's the problem.
With self-supervised learning, the goal is that it should learn naturally from the data.
Now, if you try to understand it, you are using your own preconceived notions
of what this model has learned.
That's the problem.
High-level question: what do you think it takes to build a system with, let's say, human-level or superhuman-level general intelligence?
We've already kind of started talking about this, but what's your intuition?
Does this thing have to have a body?
Does it have to interact with the world? Does it have to have some more human elements,
like self-awareness?
I think emotion. Emotion is something which is not really typically attributed to
standard machine learning; it's not something we think about.
There is NLP, there is vision,
there is no emotion.
Emotion is never a part of all of this.
And that just seems a little bit weird to me.
I think the reason basically being that surprise is one of the ways emotion arises:
there is a mismatch between what happens and what you expect to happen, right?
And so that gives rise to, I can either be surprised, or I can be saddened,
or I can be happy, and all of this.
And this basically indicates
that I already have a predictive model in my head,
something that I predicted or thought was likely to happen,
and then there was something that I observed that happened,
and there was a disconnect between these two things.
And that is maybe one of the reasons
why you have a lot of emotions.
Yeah, I talk to people a lot about this, like Lisa Feldman Barrett.
I think that's an interesting concept of emotion, but I have a sense that emotion,
primarily in the way we think about it, which is the display of emotion, is a
communication mechanism between humans.
So it's a part of basically human-human interaction, an important part, but just a part.
So I would throw it into the full mix of communication.
And to me, communication can be done with objects that don't look at all like humans.
Okay.
I've seen our ability to anthropomorphize, our ability to connect with things that look like a Roomba,
our ability to connect, first of all, let's talk about other biological systems like dogs,
our ability to love things that are very different from humans.
But they do display emotions, right?
I mean, dogs do display emotions.
So they don't have to be anthropomorphic for them
to display the kind of emotions that we do.
Exactly. But then the word emotion starts to lose meaning.
So then we have to be specific, I guess.
But yeah, rich, flavorful communication.
Communication, yeah.
Yes, it's full of emotion, it's full of wit and humor and
moods and all those kinds of things.
Yeah, so you're talking about flavor.
So there's content and then there is flavor, and I'm talking about the flavor. Do you think you need to have a body?
Do you think, to interact with the physical world, do you think you can understand the physical world without
being able to directly interact with it?
I don't think so. Yeah. I think at some point,
we will need to bite the bullet and actually interact with the physical world as much as
I like working on like passive computer vision, where I just like sit in my armchair and look
at videos and learn. I do think that we will need to have some kind of embodiment or some kind
of interaction to figure out things about the world. What about consciousness? Do you think?
How often do you think about consciousness when you think about your work? You could think of
it as the simpler thing of self-awareness, of being aware that you are a perceiving, sensing, acting thing in this world. Or you can
think about the bigger version of that, which is consciousness: having it feel
like something to be that entity, the subjective experience of being in this world.
So I think of self-awareness a little bit more than the broader goal of it, because
I think self-awareness is pretty critical for any kind of AGI, or whatever you want to
call it, that we build, because it needs to contextualize what it is and what role
it's playing with respect to all the other things that exist around it.
I think that requires self-awareness.
It needs to understand that it's an autonomous car, right? And what does that mean? What are its limitations? What are the things that
it is supposed to do, and so on? What is its role in some way? These
are the kinds of things that we expect from it, I would say. And so that's the level
of self-awareness that's basically required, at least, if not more than that.
Yeah, on the emotion side, I tend to believe that it has to be able to display consciousness.
Display consciousness? What do you mean by that?
Meaning, for us humans to connect with each other, or to connect with other living entities, in order for us to truly feel like there's another being there,
we have to believe that they're conscious.
And so we won't ever connect with something that doesn't have elements of consciousness.
Now, I tend to think that's easier to achieve than it may sound, because we anthropomorphize
stuff so hard.
You have a mug that just has wheels and rotates every once in a while and makes a sound;
I think a couple of days in, especially if you don't hang out with humans, you might
start to believe that mug on wheels is conscious.
We anthropomorphize pretty effectively as human beings.
But I do think that it's in the same bucket that we'll call emotion.
I think of consciousness as the capacity to suffer.
And if you're an entity that's able to feel things in the world and to communicate that
to others, I think that's a really powerful way to interact with humans.
And in order to create an AGI system, I believe you should be able to richly interact
with humans.
Humans would need to want to interact with you. Like the self-supervised learning idea,
the robot shouldn't have to pay you to interact with it.
It should be a natural, fun thing,
and then you're going to scale up significantly
how much interaction it gets.
There's the Alexa Prize; they're trying to get me to be a judge on their
contest. I'll see if I want to do that. But their challenge is to talk to, make the
judge, make the human sufficiently interested that the human keeps talking for 20 minutes.
To Alexa?
To Alexa, yeah. And right now they're not even close to that, because it just
gets so boring when the intelligence is not there;
it gets very uninteresting to talk to it.
And so the robot needs to be interesting, and one of the ways it can be interesting is
to display the capacity to love, to suffer. And I would say that
essentially means the capacity to display consciousness. Like, it is an entity
much like a human being. Of course, what that really means,
I don't know. If that's fundamentally a robotics problem,
or some kind of problem that we're not yet even aware of,
like if it is truly the hard problem of consciousness,
I tend to maybe optimistically think we can pretty effectively fake it till we make it. So we can display a lot of human-like
elements for a while, and that will be sufficient to form really close connections with humans.
What to you is the most beautiful idea in self-supervised learning? Like when you sit back with, I don't know, a glass of wine, in an armchair, at a fireplace,
just thinking how beautiful this world that you get to explore is,
what do you think is the especially beautiful idea?
The fact that at the object level, some notion of objectness emerges
from these models just through self-supervised learning.
So for example, in the DINO paper
that I was a part of at Facebook, the object boundaries
emerge from these representations.
So if you have a dog running in a field,
the network is basically
able to figure out what the boundaries
of this dog are automatically. And it was never trained to do that.
No one taught it that this is a dog and these pixels belong to a dog. It's able to
group these things together automatically. So that's one. I think in general, the entire
notion that this dumb idea, that you take two crops of an image and then you
say the features should be similar, has resulted in something like this:
the model is able to figure out what the dog pixels are and so on.
That just seems so surprising.
I don't think a lot of us even understand how that is happening, really.
It's something we are taking for granted, maybe a lot, in terms of how we're setting up these algorithms. But it's just a very beautiful and powerful idea.
It's really fundamentally telling us that there is so much signal in the pixels that we can be super dumb about how we are setting up the self-supervised learning problem,
and despite being super dumb about it, we'll actually get something that is able to do very surprising things.
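(A bare-bones sketch of that two-crops idea, assuming some `encoder` that maps an image tensor to a feature vector; real methods like SimCLR, SwAV, or DINO add negatives, clustering, or teacher-student machinery to keep the features from collapsing to a constant, which this sketch omits.)

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Two random augmented crops of the same image, pushed to have similar features.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def two_crop_loss(encoder, pil_image):
    v1 = augment(pil_image).unsqueeze(0)      # crop 1, shape (1, 3, 224, 224)
    v2 = augment(pil_image).unsqueeze(0)      # crop 2 of the same image
    z1 = F.normalize(encoder(v1), dim=1)
    z2 = F.normalize(encoder(v2), dim=1)
    return -(z1 * z2).sum(dim=1).mean()       # maximize cosine similarity
```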
I wonder if there are other concepts like objectness that can emerge.
I don't know if you follow François Chollet; he created a competition for intelligence
that's basically kind of like an IQ test, but for machines.
And for that IQ test you have to have a few core concepts
that you want to apply.
One of them is objectness.
I wonder if those concepts can emerge
through self-supervised learning on billions of images.
I think something like object permanence
can definitely emerge, right?
That's a fundamental concept which we have.
Maybe not through images, but through video;
that's another concept that should be
emerging from it, because it's not something we teach humans.
This concept of object permanence actually emerges.
And the same thing for animals like dogs:
I think object permanence is something that comes to them automatically,
something they are basically born with.
So I think it should emerge from the data.
It should emerge basically very quickly.
I wonder if ideas like symmetry, rotation, these kinds of things might emerge.
So rotation, I think, probably yes.
Yeah.
Rotation, yes. I mean, there are some constraints in the architecture itself.
Right. But it's interesting if all of them could. Counting was another one:
being able to understand that there are multiple objects of the same kind in the image and be able to count them.
I wonder if all of that could, if constructed correctly, emerge,
because then you can transfer those concepts to then interpret images at a deeper level.
Right. Counting, I do believe, should be possible.
I don't know yet, but I do think it's not that far
in the realm of possibility.
Yeah, that'd be interesting,
if using self-supervised learning on images
can then be applied to solving those kinds of IQ tests,
which currently seem to be kind of impossible.
What do you believe might be true that most people think is
not true or don't agree with you on? Is there something like that?
So this is going to be a little controversial, but okay, sure. I don't believe in simulation,
like actually using simulation to do things, very much.
Just to clarify, because this is a podcast where we often talk about whether we're living in a simulation:
you're referring to using simulation to construct worlds
that you then leverage for machine learning.
Right. Yeah. For example, one example would be to train
an autonomous driving system.
You basically first build a simulator,
which builds the environment of the world,
and then you train your machine learning system in that. So I believe it is possible, but I think it's a
really expensive way of doing things. And at the end of it, you do need the real world. So
I'm not sure. Maybe for certain settings the payout is so large,
like for autonomous driving, that you can actually invest that much money
to build it. But as a general sort of principle,
it does not apply to a lot of concepts. You can't really build simulations of everything,
not only because, one, it's expensive, but second, it's also not possible for a lot of things.
So in general, there is a lot of work on using synthetic data and synthetic simulators. I generally don't believe in that very much.
So you're saying it's very challenging to correctly simulate the visuals, the lighting,
all those kinds of things.
I mean, all these companies, like Pixar and whatever,
a lot of this computer graphics stuff is about accurately trying to figure out how the lighting is,
how things reflect off of surfaces, how sparkly things look, and so on. So it's a very
hard problem. Do we really need to solve that first to be able to do computer vision?
Probably not.
And for me, in the context of autonomous driving, it's very tempting to want to
use simulation, right? Because it's a safety-critical application. But the other
limitation of simulation, that perhaps is a bigger one than the visual limitation,
is the behavior of objects. Because you're ultimately interested in edge cases,
and the question is how well you can generate edge cases
in simulation, especially with human behavior.
I think another problem for autonomous driving is that it's a constantly changing world.
So say, for autonomous driving ten years from now,
there are lots of autonomous cars, but there are still going to be humans.
So now 50% of the agents, say, are humans,
and 50% of the agents are autonomous car-driving agents.
So now the mixture has changed,
and the kinds of behaviors that you actually expect
from the other agents or other cars on the road
are going to be very different.
And as the proportion of autonomous cars
to humans keeps changing, this behavior will actually change a lot.
So if you were to build a simulator based on just right now, to build it today,
you don't have that many autonomous cars on the road, so you'd make all of the other
agents in that simulator behave as humans. But that's not really going to hold 10, 15,
20, 30 years from now.
Do you think we're living in a simulation?
No.
How hard is it, this is why I think it's an interesting question,
how hard is it to build a video game, like a virtual reality game,
where it is so real, forget ultra-realistic
to where you can't tell the difference,
but it's so nice that you just want to stay there? You just want to stay there and you
don't want to come back. Do you think that's doable within our lifetime?
Within our lifetime? Probably, yeah.
Does that make you sad, that there will be a population of kids that basically spend 95% or 99% of their time in a virtual world?
Very, very hard question to answer. For certain people, it might be something that they really
derive a lot of value out of, derive a lot of enjoyment and happiness out of, and maybe
the real world wasn't giving them that; that's why they did it. So maybe it is good for certain people.
Ultimately, if it maximizes happiness, if it's making people happy, maybe it's okay.
Again, this is a very hard question.
So you've been a part of a lot of amazing papers. What advice would you give to somebody
on what it takes to write a good paper?
Grad students writing papers now, are there common things that you've learned along the way,
that you think it takes, both for a good idea and a good paper?
Right, so I think both of these I've picked up from
lots of people I've worked with in the past.
One of them is that picking the right problem to work on in research
is as important as finding the solution to it.
There are multiple reasons for this.
One is that there are certain problems that can actually be solved
in a particular time frame. So say you want to work on finding the meaning of life.
This is a great problem;
I think most people will agree with that.
But do you believe that your talents and the energy that you'll spend on it will make
meaningful progress in your lifetime?
If you are optimistic about it, then go ahead.
That's why I started this podcast. I keep asking people about the meaning of life.
I'm hoping by episode, like, 220, I'll figure it out.
Oh, not too many episodes to go.
Maybe today, I don't know. You're right. So that seems
intractable at the moment.
Right. So it's just the fact that if you're starting a
PhD, for example, what is the one problem that you want to focus on that you think is interesting enough, and that you will be able to make a reasonable
amount of headway into in the time you'll be doing a PhD, so in that kind of a
time frame. So that's one. Of course, there's the second part, which is what excites you
genuinely. You shouldn't just pick problems that you are not excited about, because as a
grad student or as a researcher, you really need to be passionate about it to continue doing it;
there are so many other things that you could be doing in life.
So you really need to believe in it to be able to do it for that long.
In terms of papers, I think the one thing that I've learned is that in the past, whenever
I used to write things, and even now whenever I do, I try to cram a lot of things
into the paper, whereas what really matters is just pushing one simple idea. That's it.
Because the paper is going to be whatever, eight or nine pages.
If you keep cramming in lots of ideas, it's really hard for the single thing that you believe
in to stand out. So, especially in terms of writing,
if you really try to focus on one particular idea
and articulate it in multiple different ways,
it's far more valuable to the reader.
To the reader, of course,
because they know that this particular idea
is associated with this paper.
And also for you, because when you write about a particular idea in different ways,
you think about it more deeply. As a grad student, I used to always wait until
maybe the last week or whatever to write the paper,
because I used to believe that doing the experiments was actually
the bigger part of research than writing.
And my advisor always told me that you should start writing very early on,
and I thought, oh, it doesn't matter, I don't know what he's talking about.
But more and more, I realize that's the case.
Whenever I write down something that I'm doing, I actually think much better about it.
And so if you start writing early on, I think you actually get better ideas, or at least you figure
out holes in your theory, or particular experiments that you should run to plug those holes, and so on.
Yeah, I'm continually surprised how many really good papers throughout history are
quite short and quite simple. And there's a lesson to that: if you want to dream about
writing a paper that changes the world and you want to go by example, they're usually simple.
It's not about cramming; it's focusing on one idea and thinking deeply.
And you're right that the writing process itself reveals the idea. It challenges you
to really think about what is the idea, the thread that ties it all together.
And a lot of famous researchers I know would actually start off, even before the experiments were in,
by writing the introduction of the paper, with zero experiments done.
Yeah.
Because that at least helps them figure out what they're trying to solve and how it fits in the context of things right
now. And that would really guide their entire research. So a lot of them would actually first write
an intro with zero experiments in, and that's
how they would start projects.
Some basic questions for people, maybe the beginners
in this field.
What's the best programming language to learn
if you're interested in machine learning?
I would say Python, just because it's the easiest one
to learn.
And also, a lot of machine learning programming
happens in Python.
So if you don't know any other programming language,
Python is actually going to get you a long way.
Yeah, it seems like sort of a toss-up question,
because Python is so much dominating
the space now.
But I wonder if there's an interesting alternative.
Obviously, there's Swift,
and there are a lot of interesting alternatives popping up,
even JavaScript, or R, more for the data science applications. But it seems like Python more and more
is actually being used to teach introduction to programming at universities.
So it just combines everything very nicely.
Even harder question:
what are the pros and cons of PyTorch versus TensorFlow?
I see. You can go with no comment.
So, a disclaimer to this is that the last time I used TensorFlow
was probably like four years ago, right when it had come out. I started
on deep learning in 2014 or so, and the dominant sort of
framework for us then, for vision, was Caffe, which was out of Berkeley, and we used
Caffe a lot; it was really nice. And then TensorFlow came in, which was basically Python-first.
Caffe was mainly C++ and it had a very loose kind of Python binding, so Python wasn't
really the first language you would use; you would really use either MATLAB or C++
to get stuff done in Caffe.
And then Python of course became popular
a little bit later.
So TensorFlow was basically around that time,
2015, 2016, is when I last used it.
It's been a while.
And then did you use Torch? Or did you jump to PyTorch?
So then I moved to Lua Torch, which was Torch in Lua.
And then in 2017, PyTorch basically pretty much took over completely.
Oh, interesting.
So you went to Lua, cool.
Yeah.
Huh.
So you were there before it was cool.
Yeah.
I mean, Lua Torch was really good because it actually
allowed you to do a lot of different kinds of things,
whereas Caffe was very rigid in terms of its structure.
You would create a neural network once and that's it.
If you wanted very dynamic graphs and so on,
it was very hard to do that, and
Lua Torch was much more friendly for all of these things.
Okay, so in terms of PyTorch and TensorFlow,
my personal bias is PyTorch, just because I've been using it
longer and I'm more familiar with it.
Also, PyTorch is much easier to debug
is what I find, because it's imperative in nature,
compared to TensorFlow, which is not imperative.
But that's telling you a lot: the imperative design is sort of the way
a lot of people are taught programming,
and that's what actually makes debugging easier for them.
I learned programming in C++,
and so for me, the imperative way of programming is more natural.
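(A tiny illustration of what imperative, eager execution buys you when debugging; the model and shapes are arbitrary.)

```python
import torch
import torch.nn as nn

# Because PyTorch runs eagerly, intermediate tensors are ordinary Python objects:
# you can print them, assert on them, or drop into pdb mid-forward-pass, instead
# of inspecting a pre-compiled static graph.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(2, 8)

h = model[0](x)
print("pre-activation mean/std:", h.mean().item(), h.std().item())  # runs immediately
assert not torch.isnan(h).any(), "NaNs already at the first layer"
out = model[2](torch.relu(h))
print("output shape:", out.shape)
```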
Do you think it's good to have kind of these two communities, this kind of competition?
I think PyTorch is kind of more and more becoming dominant in the research community, but
TensorFlow is still very popular in the more sort of application machine learning community.
So do you think it's good to have that kind of split in code bases? Or the benefit there is the competition challenges the library developers to step up their game.
But the downside is there's these code bases that are in different libraries.
Right, so I think the downside is that a lot of research code gets released in
one framework, and if you're using the other one, it's really hard to build on top of it.
But thankfully, the open source community
in machine learning is amazing.
Whenever something pops up in TensorFlow,
you wait a few days and someone who's super sharp
will actually come and translate that particular codebase
into PyTorch, and will basically have figured all the nooks and crannies out.
So the open source community is amazing,
and they really bridge this gap.
In terms of having these two frameworks,
or multiple frameworks, of course
there are different use cases,
so there are going to be benefits to using one
or the other framework.
And like you said, I think competition is just healthy,
because both of these frameworks,
or all of these frameworks, really keep learning
from each other and keep incorporating different things
to just make themselves better and better.
What advice would you have for someone new to machine learning, you know, maybe who just
started or hasn't even started but is curious about it, and who wants to get into the field?
Don't be afraid to get your hands dirty.
I think that's the main thing.
So if something doesn't work, like really drill into why things are not working.
Can you elaborate what getting your hands dirty means?
Right. So for example, if you try to train a network and it's not
converging or whatever, rather than trying to Google the answer or trying to do something quick,
really spend those 5, 8, 10, 15, 20, whatever number of hours really trying to figure
it out yourself. Because in that process you'll actually learn a lot more.
Yeah.
Googling is of course a good way to solve it when you need a quick answer. But I think initially,
especially when you're starting out, it's much nicer to figure things out by yourself.
And I just say that from experience, because when I started out there were not a lot of resources,
so in the lab, a lot of us would look up to the senior students, and the
senior students were of course busy, and they would be like, why don't you go figure it out, because I just don't have
the time, I'm working on my dissertation, I'm a final-year PhD student, or whatever.
And so then we would sit down and just try to figure it out.
And that, I think, has really helped me figure a lot of things out.
I think in general, if I were to generalize that, I feel like
persevering through any kind of struggle on a thing you care
about is good. You made it seem like it's good to, you know,
spend time debugging, but really any kind of struggle, whatever form that takes. It could
be just googling a lot; it's basically just sticking with it and going into the
hard thing. That could take the form of implementing stuff from scratch, it could take the form of re-implementing with different libraries or different programming
languages, it could take a lot of different forms. But struggle is good for the soul.
Like in Pittsburgh, where I did my PhD, the thing was it used to snow a lot,
and when it snowed, you really couldn't do much. So the thing a lot of people said was, snow builds character,
because when it's snowing, you can't do anything else. You focus on work.
Do you have advice in general for people? You're already
exceptionally successful, you're young, but do you have advice for young people starting out in college, or maybe in high school?
You know, advice for their career, advice for their life, how to pave a successful path in career and life?
I would say just be hungry. Always be hungry for what you want.
And I've been inspired by a lot of people who are just driven and who really go for what they want, no matter what.
You shouldn't just want it, you should need it.
So if you need something, you basically go to the ends to make it work.
How do you know when you come across a thing that you need?
I think there's not going to be any single thing that you're going to need.
There are going to be different types of things that you need.
But whenever you need something, you just go push for it.
And of course, you may not get it, or you may find that this was not even the thing
that you were looking for, it might be a different thing.
But the point is that you're pushing through things, and that actually brings you a lot of skills and
a certain kind of attitude, which will probably help you get the other thing,
once you figure out what's really the thing that you want.
Yeah, I think a lot of people are, I've noticed, kind of afraid of that, because one is the fear of commitment,
and two, there are so many amazing things in this world, you almost don't want to miss out on all the other amazing things
by committing to this one thing. So I think a lot of it has to do with just allowing yourself to notice that thing and just going all the way with it.
I mean, there's also failure, right?
I know this is super cheesy, that failure is something that you should be prepared for and so on,
but I do think, especially in research, for example, failure is something that happens
almost every day: experiments failing and not working.
And so you really need to be used to it, you need to have a thick skin.
And it's only basically when you get through it that you find the one thing that's actually working.
Thomas Edison was one person like that. When I was a kid,
I used to really like reading about how he found the light bulb filament. I think his thing was that
he tried 990 things that didn't work, or something of the sort. And then they asked him, so what did you learn,
because all of these were failed experiments. And he says, oh, I now know that these 990 things don't
work. Did you know that? I mean, that's really inspiring.
So you spent a few years on this earth, performing a self-supervised kind of learning process. Have you figured
out the meaning of life yet? I told you I'm doing this podcast to try to get the answer.
I'm hoping you could tell me, what do you think the meaning of it all is?
I don't think I figured this out, no. I have no idea.
Do you think AI will help us figure it out?
Or do you think there's no answer?
The whole point is to keep searching.
I think it's an endless sort of quest for us.
I don't think AI will help us there.
This is like a very hard, hard, hard question
which so many humans have tried to answer.
Well, that's the interesting thing, the difference between AI and humans.
Humans don't seem to know what the hell they're doing,
and AI is almost always operating under well-defined objective functions.
And I wonder whether our lack of ability to define good long-term objective functions,
or to introspect on what the objective function is under which we operate, is a feature or a bug.
I would say it's a feature, because then everyone actually has very different kinds of objective
functions that they're optimizing, and those objective functions evolve and change dramatically
through the course of their life. That's actually what makes us interesting, right?
Otherwise, if everyone was doing the exact same thing, that would be pretty boring.
We do want people with different kinds of perspectives.
Also, people evolve continuously.
That's, I would say, the biggest feature of being human.
And then there are the ones that die
because they do something stupid.
We get to watch that, see it, and learn from it.
And as a species, we take that lesson
and become better and better, because of all the dumb people in the world that died
doing something wild and beautiful.
Ishan, thank you so much for this incredible conversation.
We did a depth-first search through the space
of machine learning, and it was fun and fascinating.
It's really an honor to meet you, and it was a really awesome conversation. Thanks for coming down today and talking with me.
Thanks, Lex. I mean, I've listened to you; it's an honor for me to actually meet you in person, and I'm so
happy to be here. Thank you.
Thanks, man.
Thanks for listening to this conversation with Ishan Misra. And thank you to Onnit, The Information, Grammarly,
and Athletic Greens.
Check them out in the description to support this podcast.
And now, let me leave you some words from Arthur C. Clarke.
Any sufficiently advanced technology is indistinguishable from magic.
Thank you for listening, and hope to see you next time.