Hidden Brain - Ep. 70: Who We Are At 2 A.M.
Episode Date: May 2, 2017
Have you ever googled something that you would never dream of saying out loud to another human being? Many of us turn to Google when we have a deeply personal or embarrassing question. And we're often more honest when we type our questions into search engines than when we answer surveys or talk to friends. Seth Stephens-Davidowitz, a former data scientist at Google, says our online searches provide unprecedented insight into what we truly think, want, and do. This week on Hidden Brain, what big data knows about our deepest thoughts and secrets.
Transcript
A note before we get started: this episode includes a racial epithet and discussions about pornography.
If you have small kids with you, please save this for later.
This is Hidden Brain. I'm Shankar Vedantam.
We start today's show with a personal question.
Have you ever Googled something that you would never dream of saying out loud to another human being?
When we have a question about something embarrassing
or deeply personal, many of us today
don't turn to a parent or to a friend, but to our computers.
Because there's just some things you just
can't ask a real person in real life,
and you need to ask Google.
Because it's completely anonymous,
and there are no judgements attached.
Google knows everything.
I agree with that.
Every time we type into a search box,
we reveal something about ourselves.
As millions of us look for answers to questions
or things to buy or places to meet friends,
our searches produce a map of our collective hopes,
fears, and desires.
You do learn a lot about people that's very, very different
from what they say, and kind of the weirdness at the heart of the human psyche
that doesn't really reveal itself in everyday life or at lunch tables, but does reveal itself at 2 a.m. on Pornhub.
Today on Hidden Brain, what big data knows about our deepest thoughts and secrets.
My guest today is Seth Stephens-Davidowitz. He used to be a data scientist at Google,
and he's the author of the book Everybody Lies:
Big Data, New Data,
and What the Internet Can Tell Us About Who We Really Are.
Seth, welcome to Hidden Brain.
Oh, thanks so much for having me, Shankar.
So Seth, we all know that Google handles billions of searches every day, but one of the insights
you've had is that the reason Google knows a lot about us is not just because of the volume
of search terms, but because people turn to Google as they might turn to a friend or
a confidant.
That's exactly right.
I think there's something very comforting about that little white box, that people feel very comfortable telling it things that they may not tell anybody else about their sexual interests, their health problems, their insecurities. And using this anonymous aggregate data, we can learn a lot more about people.
One way is through these very strange correlations
you find. For example, there's a relationship between the unemployment rate and the kinds of searches people make online.
Yeah, I was looking at what searches correlate most with the unemployment rate,
and I was expecting something like new jobs or unemployment benefits,
but during the time period I looked at, the single search that was most highly correlated with the unemployment rate was Slutload, which is a pornography site.
And you can imagine that if a lot of people are out of work, they have nothing to do during
the day.
They may be more likely to look at porn sites.
Another search that was high on the list was solitaire.
So again, when people are out of work, they're bored.
They do leisure activities, and potentially this measure of how
much leisure there is on the internet may help us know how many people are out of work on
a given day.
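As a rough illustration of the ranking Seth describes, here is a minimal Python sketch that scores search terms by how strongly their volume tracks the unemployment rate. Every number below is invented for illustration, the search terms are placeholders, and the function names `pearson` and `rank_by_correlation` are hypothetical; this is not the method or the data behind the book.

```python
import math

# Hypothetical monthly figures, invented purely for illustration.
unemployment_rate = [4.5, 5.0, 6.1, 7.8, 9.0, 9.6]
search_volume = {
    "new jobs": [50, 52, 51, 55, 54, 53],
    "solitaire": [20, 22, 27, 33, 39, 41],
    "weather": [70, 68, 71, 69, 70, 72],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(target, series_by_term):
    """Return (term, r) pairs sorted from most to least correlated."""
    scores = {term: pearson(target, values)
              for term, values in series_by_term.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_correlation(unemployment_rate, search_volume)
for term, r in ranking:
    print(f"{term}: r = {r:.2f}")
```

With these made-up numbers, the leisure-style search ranks first. A high correlation only says the two series move together; it does not by itself explain why.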
Of course, this helps us reconsider what we think of as data.
When we think about the unemployment rate, as you say, our normal approach is to say
how many people are seeking jobs, let's track down all the jobs.
This is coming at the question entirely differently.
Yeah, I think the traditional way to collect data was to send a survey out to people and
have them answer questions, checkboxes.
There are lots of problems with this approach.
Many people don't answer surveys and many people lie to surveys.
So the new era of data is kind of looking through all the clues that we leave.
Many of them, not as part of questions or as part of surveys,
but just clues we leave as we go through our lives.
One of the important differences between mining this kind
of data and the responses we get on surveys
has to do with how people report their sexual orientation.
I understand that the kind of queries that you see on Google
might reveal something quite different than if you ask people if they're gay.
That's right. If you ask people in surveys today in the United States, about two and a half or three percent of
men say that they're primarily attracted to men, and
this number is far higher in certain states where tolerance of homosexuality is greater.
So there are a lot more gay men,
according to surveys, in California than in Mississippi. But if you look at search data
for gay male pornography, it's a tiny bit higher in California, but not that much higher.
And overall, about 5% of male pornography searches are for gay porn. So almost twice as high as the numbers
you get in surveys.
Your research has important implications for a topic that we've looked
at a lot on Hidden Brain, the topic of implicit bias. People aren't always aware of the biases
they hold, and so scientists have had to find clever ways to unearth these biases. You
think that Google searches can reveal some forms of implicit bias?
That's right.
So, one I look at is the questions that parents have about their children.
If you ask many parents today, they would say that they treat their sons and daughters
equally, that they're equally excited about their intellectual potential, equally concerned
about maybe their weight problems.
But if you aggregate everybody's Google searches, you see large differences by gender. When
parents in the United States ask questions starting "Is my son," they're much more likely
to use words such as "gifted" or "a genius" than they would in a search starting "Is my daughter."
When parents in the United States search "Is my daughter," they're much more likely to complete it with "Is my daughter overweight" or "Is my daughter ugly."
So parents are much more excited about the intellectual potential of their sons and
much more concerned about the physical appearance of their daughters.
You report that in some states after Barack Obama was elected president, there were more Google searches for a certain racist term than searches for first black president.
I think there is a disturbing element to some of this search data where in the United States
today, many people, and maybe this is a good thing, don't feel comfortable sharing that they
have racist thoughts or racist feelings, but on Google, they do make these searches
in strikingly high frequency.
I need to use sordid language for this.
The measure is the percent of Google searches
that include the word nigger.
And these searches are predominantly searches
looking for jokes mocking African-Americans.
I should clarify, this is not searches for rap lyrics,
which tend to use the version of the word ending in A.
But if you look at
the racist search volumes, I think if you had asked me, based on everything I had read
about racism in the United States, I would have thought that racism in the United States
was predominantly concentrated in the South, that really the big divide of the United States
when it comes to racism is South versus North. But the Google data reveals that's not really the case, that racism
is actually very, very high in many places in the North, places like western Pennsylvania,
eastern Ohio, or industrial Michigan, or rural Illinois, or upstate New York.
The real divide these days when it comes to racism is not North versus South, it's East
versus West.
There's much higher racism east of the Mississippi
than west of the Mississippi.
So besides just saying, you know, we know that there are these patterns of racist searches
in different parts of the country, you're actually saying you can do more than that.
You can actually predict how different parts of the country might vote in a presidential
election based on the kind of Google searches you see in different parts of the country.
Yeah, well, the first thing I found is that there was a large correlation between racist
search volume and parts of the country where Obama did worse than other Democratic candidates
had done.
So Barack Obama was the first major party general election nominee who was African-American,
and you see a clear relationship, that Obama lost large numbers of votes in parts of the
country where there are high racist search volumes. And other researchers, such as
Nate Silver at FiveThirtyEight and Nate Cohn at the New York Times, have found that there was a large correlation between
racist search volumes and support for Donald Trump in the Republican primary, that parts of the country that made racist searches in
high numbers were much more likely to support Donald Trump.
And this relationship was much stronger than really any other variable that they tested.
I'm wondering how you try and understand that kind of information.
It's hard not to listen to what you're saying and draw sort of what seems to be a superficial conclusion, which is that racist people vote for Donald Trump.
I'm not sure, is that what you're saying?
That's one of those things where it sounds so offensive to say it that I think everyone
tiptoes around the line.
I will say that the data does show a strong correlation between racist searches and support for Donald Trump that is hard to explain
with any other explanation.
You know, it's, yeah, I mean, yeah,
that kind of is what I'm saying.
I'm not saying that everybody who supported Donald Trump
is racist, by any stretch of the imagination.
There are plenty of people who support Donald Trump
without this racist tendency,
but a significant fraction of his supporters,
I think were motivated by racial animus.
You spend a lot of time in the book talking about sex.
It turns out to be an area where marketers and companies know that what we say about
ourselves is nowhere close to the truth.
Most people report being not interested in pornography, but the website Pornhub reports
that in 2015 alone, viewers watched two and a half billion hours of porn,
which is apparently longer than the entire amount of time
that humans have been on Earth.
What does this say about us, the fact that we either have
very little insight about ourselves
or we're actually lying through our teeth?
Yeah, I'd say we're probably lying through our teeth.
Yeah, I'd say that, I do talk a lot about sex in this book.
One thing I like to say is that big data is so powerful, it turned me into a sex expert
because it wasn't a natural area of expertise for me, but I do talk a lot about sexuality.
And I think you do learn a lot about people that's very, very different from what they say
and kind of the weirdness at the heart of the human psyche that doesn't really reveal itself in everyday life or at lunch tables, but does
reveal itself at 2 a.m. on Pornhub.
One of the things that I was wondering about as I read your book was how much search terms
tell us about what people are actually thinking or actually feeling, and how much they might
just tell us about things
that people are curious about.
So certainly people search for a lot of things
related to sex that would indicate
that there is a large amount of interest in,
say, sadomasochism and fetishes and so forth,
but could some of it just be that people are curious?
People hear a lot about this in the news
or on social media and they Google something because they're just curious about it, not necessarily because they themselves
want to, you know, be part of the BDSM culture.
I think it depends on the particular question
you're looking at. So the reason we can trust that the racism data is meaningful is because
it correlates with voting patterns. With the sex data, there's not really necessarily something to check it against.
On the internet, we do see the videos that people watch,
and I think that is pretty telling about some people's fantasies,
even if it's not definitive,
because some people may just be curious.
Pornography sites aren't the only ones gathering information
about our sexual and romantic preferences.
We now have apps like Tinder and sites like OkCupid that gather tons of data about us.
As a result, these apps and sites know a lot about our romantic preferences.
But for a long time, we've had a human version of big data for romance.
Grandma.
Seth has some personal experience with this big data source.
A couple of years ago, he was having Thanksgiving dinner with his family.
He was 33, didn't have a date with him,
and his family was trying to figure out the qualities
Seth needed in a romantic partner.
My family was going back and forth.
My sister was saying that I need a crazy girl because I'm crazy.
My brother was saying that my sister was crazy,
that I need a normal girl to balance me out and my mom was screaming at my brother and
sister that I'm not crazy and my dad was then screaming at my mom that of course
Seth is crazy. So it's kind of a classic Stephens-Davidowitz family Thanksgiving
where everyone's just yelling at each other for being crazy and we're not really
getting any progress in learning about what I need in my love life. And then my soft-spoken 88-year-old grandma started to speak and everyone went quiet.
And she explained to me that I need a nice girl, not too pretty, very smart, good with
people, social, so you will do things, sense of humor because you have a good sense of
humor.
And I describe why her advice was so much better than everybody else's.
I think one of the reasons is that she's big data, right?
So, grandmas and grandpas throughout history have had access to more data points than anybody else.
And they've been able to correlate larger patterns than anybody else has
because they've been around longer.
And that's why they've been such an important source of wisdom historically.
The problem, of course, as you also point out, is that it's very hard to disentangle your
personal experiences from what actually happens in the world, and in your grandmother's
case, she actually had a very specific piece of relationship advice about the kind of person
you should want, and some of that might not actually be backed up by the empirical evidence.
Yeah, well, my grandma has told me on multiple occasions that it's important
to have a common set of friends with a partner.
So she lived in a small apartment in Queens, New York, with my grandfather, and every evening
they'd go outside and gossip with their neighbors.
And she thought that was a big part in why their relationship worked.
But actually recently, computer scientists have analyzed data from Facebook, and they can actually
look when people are in relationships
and when they're out of relationships
and try to predict what factors of a relationship
make it more likely to last.
One of the things they tested was having
a common group of friends.
Some partners on Facebook share pretty much the same friend
group, and some people have totally isolated friend groups.
And they found, contrary to my grandmother's advice, that having a separate social circle
is actually a positive predictor of a relationship lasting.
And so of course, the risk of trusting the individual is that the individual's intuition
about what worked for his or her life might not work for everyone else.
That's right.
I think we tend to get biased by our own situation.
Data scientists have a phrase called weighting data.
Some data points get extra weight in our models.
And our intuition gives too much weight to our own experience.
And we tend to assume that what worked for us
will work for others as well.
And that's frequently not the case.
Many companies know that we don't really understand ourselves.
When we come back, we look at how companies are using big data
to predict what we're going to do before we know it ourselves.
We'll also ask, if sites like Google can use data to forecast
whether you're going to get a serious illness,
should they give you that information?
Stay with us.
This is Hidden Brain. I'm Shankar Vedantam.
Netflix used to ask users what kind of movies they wanted to watch.
Seth Stephens-Davidowitz says eventually, the company realized
that asking this kind of question was a complete waste of time.
Yeah, initially Netflix would ask people what they want to view in the future.
So they could queue up the movies that they said.
And if you ask people, what are you going to want to watch tomorrow or this weekend?
People are very aspirational.
They want to watch documentaries about World War II or avant-garde French films.
But then when Saturday or Sunday comes around,
they wanna watch the same low-brow comedies
that they've always watched.
So, Netflix realized they had to just ignore
what people told them and use their algorithms
to figure out what they'd actually wanna watch.
So, one of the things that's intriguing
about what you just said, is it's,
I don't think it's actually the case
that people were lying to Netflix when they
said they wanted to watch the avant-garde film. They actually genuinely probably aspire to do that.
It might actually be that big data understands people better than they understand themselves.
Yeah, probably even more common than lying to other people is lying to ourselves.
Particularly when we're trying to predict what we're going to do in two or three days, we tend to assume that we're going to go to the gym more than we actually go to the
gym or eat better than we actually will eat or watch more intellectual stuff than we
actually will watch.
So the algorithms can correct for this overoptimism that we all tend to share.
When you look at a company like Facebook, which has access to these huge amounts of data
about us and what we like and whom we like in our relationships, you have to wonder how
the company is using this data in all kinds of different ways.
I remember Facebook got into some hot water a couple of years ago because they ran an
experiment that seemed to be manipulating how people feel.
Of course, there was a huge outcry about the experiment at the time.
And since then, there hasn't been very much reported about what Facebook is doing.
But I suspect that it might just be because Facebook is no longer telling us what it's
doing, but it's still doing it anyway.
Every major tech company now runs lots and lots of what are called A/B tests, which
are little experiments where you put people
into two different groups, a treatment and a control group, and you show one group one version
of your site, and the other group another version of the site, and you see which version
gets the most clicks or the most views.
This has really exploded in the tech industry.
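The comparison at the heart of an A/B test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any company's actual system: the click and view counts are invented, the function name `ab_test` is hypothetical, and a pooled two-proportion z-test is just one common way to check whether a difference in click rates is likely to be more than noise.

```python
import math


def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Compare click rates of two versions with a pooled two-proportion z-test."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the assumption that both versions perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Relative lift of version B over version A.
    lift = (rate_b - rate_a) / rate_a
    return {"rate_a": rate_a, "rate_b": rate_b, "lift": lift,
            "z": z, "p_value": p_value}


# Hypothetical counts, invented for illustration: each version shown 10,000 times.
result = ab_test(clicks_a=400, views_a=10_000, clicks_b=532, views_b=10_000)
print(f"Version B lift over A: {result['lift']:.0%}")
print(f"p-value: {result['p_value']:.6f}")
```

In practice, visitors would be assigned to the two groups at random so that the groups differ only in which version they saw.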
It's not just the tech industry that uses A/B testing.
Newspapers do too.
Newspapers like the Boston Globe.
A few years ago, the Globe tried out two different headlines for the same story, and then
measured which headline got the most clicks.
The newspaper then used the more effective headline for the rest of the day.
I've been a journalist for about 25 years and spent most
of that time working at newspapers. Seth wanted to test my headline writing expertise. He
read out two versions of a headline for a Boston Globe story and he asked me to guess which
one worked better.
So let's see, Shankar, if you can guess some of these winners. The first headline test, I'll give you headline A first and then headline B second.
Headline A: When the first subway opened in Boston.
Headline B: Cartoons from when the first subway
opened in Boston.
All right, that's gonna be easy.
Cartoons from when the first subway opened in Boston.
No, it's headline A. It got 33% more clicks
for when the first subway opened in Boston.
Oh, no.
You want another one?
Yeah, let's try it.
I know where this is going, but let's try it.
OK, headline A is woman makes bank off rare baseball card.
And headline B is woman makes $179,000
off rare baseball card.
I'm gonna go with the specific dollar amount, so B.
No, it's headline A, 38% more clicks for headline A.
You're 0 for 2.
Is there a third one?
Can I redeem myself?
Yeah, okay, let's do another one.
All right, headline A, hookup contest at heart of St. Paul
rape trial, headline B, no charges in prep school sex scandal.
All right, so I'm going to follow a completely different strategy than I did the last two
times, which is I'm just going to pick a number, and I picked the number before you even
read the headlines out, to prevent myself from being biased.
And I'm going to go with B again, just on the off chance that you couldn't tell me three answers
where all the answers were A.
That's right.
I didn't even realize I was doing that, but headline B is correct.
A hundred eight percent more clicks for headline B. So, good job.
You got one for three, not so bad.
And the interesting thing, of course, is I used sort of an algorithmic solution.
Yeah, you have to.
I guess, right? And I'm not too sure.
Yeah, so I think what this shows is that the reason that A/B testing is so important
is because our intuition can trick us. You've been around journalism for many, many
years, and you have your own ideas of what makes a successful headline, but even someone
like you is frequently wrong. And we can use A/B testing
to correct our faulty intuition, find what actually works, not what we think works.
It's one thing when companies use big data to serve us better. You could argue that a newspaper
that delivers a catchier headline is serving its audience better. But there are many, many instances where companies are now using big data against us.
Banks and other financial institutions are using clues from big data to decide who should get a loan.
I think it's an area of big concern.
So I talk about a study in the book where they studied a peer-to-peer lending site,
and they studied the text that people used in their requests for loans,
and you can figure out just from what people say
in their loans how likely they are to pay back.
And there are some strange correlations.
For example, if you mention the word God,
you're 2.2 times less likely to pay back,
2.2 times more likely to default.
And this does get eerie.
Are you really supposed to be penalized
if you mention God in a loan application?
That would seem to be really wrong, even evil, to penalize somebody for a religious preference.
Basically, everything's correlated with everything.
Just about anything anybody does is going to have some predictive power for other things
they do.
The legal system is really not set up for a world
in which companies potentially can mine correlations
over just about everything anybody does in their life.
I was thinking about an ethical issue.
I'm not sure if necessarily this is a legal issue,
but you mentioned in the book that,
if someone is Googling, "I've been diagnosed
with pancreatic cancer, what should I do?"
it's reasonable to assume that this person has been diagnosed with pancreatic cancer.
But if you collect all the people who are Googling what to do about that diagnosis with pancreatic
cancer and then work backwards to see what they've been searching for in the weeks and months
prior to their diagnosis, you can discover some pretty amazing things.
Yeah, this is a study in which researchers used Microsoft Bing data.
They looked at people who searched for things like "just diagnosed with pancreatic cancer," and then similar
people who never made such a search.
And then they looked at all the health symptom searches they had made in the lead-up to either a diagnosis
or no diagnosis.
And they found that there were very, very clear patterns of symptoms that were far more likely
to suggest a future diagnosis of pancreatic cancer.
For example, they found that searching for indigestion
and then abdominal pain was evidence of pancreatic cancer
while searching for just indigestion
without abdominal pain meant a person was much less likely
to have pancreatic cancer.
And that's a really, really subtle pattern in symptoms, right? Like a time series
of one symptom followed by another symptom is evidence of a potential disease. It really
shows, I think the power of this data where you can really tease out very subtle patterns
in symptoms and figure out which ones are potentially threatening and which ones are benign.
So here's the ethical question. Once you establish that there is this correlation, that you sort of say, I have a universe of search terms that seem to be correlated with people who go on to have the diagnosis, versus
these search terms that do not go on to predict a diagnosis.
So does a company like Microsoft now have an obligation to tell people who are googling
for these combinations of search terms?
Look, you might actually need to get checked out.
You might actually need to go see a doctor because of course, if you can be diagnosed with pancreatic cancer, you know,
four weeks earlier, you have a much better chance of survival than if you have to wait for
a month.
I lean in the direction of yes, some people would not lean that direction. It could be
a little creepy if Google, right below the "I'm Feeling Lucky" button, said, you know, "You
may have pancreatic cancer." It's not exactly the most friendly thing to see on a website. But personally, if I had some sort of symptom
pattern that suggested I may have a disease and there was a chance of curing it if I was
told, I'd want to know that. It's just another example that really the ethical and legal framework that we've set
up is not necessarily prepared for big data.
Seth Stephens-Davidowitz is a former data scientist at Google and the author of the book Everybody
Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.
Seth, thank you for joining me today on Hidden Brain.
Thanks so much for having me, Shankar.
This week's episode was produced by Rhaina Cohen and edited by Tara Boyle. Our staff includes Jenny Schmidt, Maggie Penman, and Renee Klahr.
Our unsung hero this week is Hugo Rojo.
Hugo works on NPR's Media Relations team and he's one of those people who's always willing to be helpful.
Hugo helps us with social media for the show.
He's also our in-house professional photographer.
When we need a producer to record a line of narration in Spanish,
Hugo puts up his hand.
He's had some terrific ideas on how to reach new listeners,
and he's always willing to share those ideas with us.
Thanks, Hugo.
Speaking of reaching listeners,
we're hoping to get a better sense of how you found our
show.
We've put together a quick survey. Your feedback can help us find more listeners for
Hidden Brain.
You can find the survey at n.pr-hiddenbrainsurvey.
That's n.pr-hiddenbrainsurvey.
And thanks.
I'm Shankar Vedantam, and this is NPR.