Hidden Brain - Finding Your Voice
Episode Date: July 16, 2019
At some point in our lives, many of us realize that the way we hear our own voice isn't the way others hear us. This week on Hidden Brain, we look at the relationship between our voices and our identities. Plus, we hear how advances in technology might help people with vocal impairments, and consider the ethical quandaries that arise when we can create personalized, customized voices.
Transcript
This is Hidden Brain. I'm Shankar Vedantam.
At some point in our lives, many of us realize that the way we hear our own voice
isn't the way others hear our voice.
Shay had that realization as a child, helping out at the family business,
a deli in southwest Virginia.
We did a lot of business at the deli, a lot of call-in orders,
and I had to answer the phone with the name of the business. Often, when the phone rang, the same thing happened over and over again. I
vividly remember always being confused for my mother. They would always say, oh, hey, Judy,
you know, and either start into their questions or whatever they were looking to speak to my mother about.
This case of mistaken identity became a running joke.
You know, it was, ha ha, they thought that he was Judy.
She didn't correct the callers.
That's because she didn't mind being mistaken for Judy.
It was just comforting to me because it felt natural.
She was raised as a boy, but now decades later she identifies as a transgender woman. We're not using her legal name at her request because it's a man's name.
Shay's experience at the deli became a template for the rest of her life.
She listened to her voice and she listened to the way others heard her voice.
There was always a gap between the two.
She first tried to sound more masculine,
to fit in with the way the world saw her.
So I would consciously make an effort to try to talk a little deeper.
It was, you know, I practiced it.
I woke up this morning and I got myself a beer. One way she practiced was to sing along with The Doors,
Tool, and Nine Inch Nails.
I tended to sort of go towards heavier music.
You know, it's raspy, deep,
yelling-almost voices. All night long.
Sounding more masculine became second nature, but it wore on her.
My entire life I have been playing the role of a boy.
And it is exhausting.
It truly is.
Years and years pass like this.
A divorce, a second wife, two kids, and a cancer scare
later, she began to reconsider how she wanted to sound.
Instead of trying to sound more masculine,
she now started to try to sound more feminine.
Even prior to accepting that I was trans, before I could put a label on what I was, I consciously
made an effort to not sound as masculine, and that started in my early 30s.
Once again, she used music as a way to practice.
I would always sing in the car alone and I would attempt Britney Spears.
I think I did it again, I would sort of turn my head towards
my driver's side window.
Because it would reflect the sound back to me a little more loudly so that I could hear
the pitch and tone of my voice.
And I would try to make my voice sound at a higher pitch without it sounding like I was
trying. What did you hear? Did you hear the voice that you wanted to hear?
No, no, I've never... I don't know that I have ever actually been able to hear the voice that I hear come from me.
She has spent a lifetime being dissatisfied with the way she sounds.
She viscerally knows something, the rest of us often forget.
Our voices shape who we are.
They shape how other people think of us.
This week on Hidden Brain, we look at the relationship between our voices and our identities.
Voice is about who you are. Our voice signals things about our personality.
Plus, how technology might help people with vocal impairments find voices that reflect who they are.
Once you close your eyes and let your mind relax, it doesn't take long to escape to that beautiful beach.
And the ethical quandaries that arise when we can create personalized, customized voices.
This is huge. They can make us say anything now, really anything.
Jackie Cork used to love the sound of her voice. She spent her 20s in San Francisco.
And like many young people living in the big city,
she enjoyed going out with her friends.
She danced to electronic music at clubs and drank at bars.
She was outgoing. And she was a flirt.
I have to admit I've always enjoyed flirting and so I was quite the flirter. It was sort
of a fun activity to help pass the day when you're doing something mundane. For Jackie,
flirting was also a demonstration of her confidence. It reinforces my own identity, how I felt about myself.
Fun, I'm somebody that people are attracted to, not just physically or sexually, but a person who people like.
Jackie liked to be liked. She liked being someone people wanted to be around. When she
thinks about who she was back then, she refers to that person as voice one Jackie.
Voice one Jackie was really fun-loving, always joking, pretty carefree. For years, Jackie
had been doing backpacking trips with her then boyfriend.
All of our gear on our back, 50 pounds, 60 pounds, etc.
Even carrying bottles of wine, you know, in the backpack.
She'd been doing these backpacking trips for several years.
But during a trip to a national forest in California, Jackie came up short.
I couldn't go one step further. And, you know, being young, you think, you know, what could it be?
There's nothing wrong with me. I'm 20-whatever, 20-something years old.
And I noticed I was really short of breath.
And it was a real struggle.
I finally made it, of course, but it was really slow and a real struggle to make it back.
Hiking became too taxing for her, so she cut back and switched to ballet.
But one day, during a series of relevés, a more aerobically challenging dance move,
Jackie felt light-headed and dizzy.
It was serious. I had a seizure and all of the medical follow-up led me to discover that I had a lung disease.
She was diagnosed with idiopathic pulmonary hypertension.
It's a rare progressive disease where the blood vessels in the lung shrink and oxygen
is not distributed properly.
There's no cure for it except for lung transplant.
In 2008 at 32, Jackie received a double lung transplant. The surgery was successful.
Within weeks, she was out of the hospital and back in the dance studio.
In 2010, she left San Francisco to explore Latin America and Europe. But her body began to reject her new lungs.
Before long, Jackie was back in the hospital, this time in Switzerland, waiting for another
lung transplant.
I had the surgery in January of 2013, and I was still asleep for a good month and a half.
When she woke up, she found her surgery had been successful,
except for one very important thing.
I couldn't speak.
During the operation, the medical staff
had used a ventilator to help her breathe.
So I was intubated.
That means they basically have a tube
that they put inside your mouth
and it goes down your throat and they send that down into your lungs.
And during this intubation, the tube was rigid enough to cause some damage to my vocal cords.
Jackie began speech therapy and within a few weeks she slowly regained the ability to speak.
But the voice that came out of her mouth? It wasn't her voice.
My voice changed. It's raspy, or it's broken a bit.
Ever since, speaking has been hard work.
Yeah, I really have to push my vocal cords. I feel it. It's actually a physical
effort, like I'm actually squeezing the vocal cords as hard as I can to make
the loudest sound possible to be heard, and it's very tiring.
The harder it became to produce sound, the more self-conscious she became about the
way she sounded.
I feel less confident. I'm aware of how people might perceive me. So I'm a little more shy.
I don't approach people like I used to.
Jackie believes the change in her voice has led to a dramatic change in her personality.
For much of her life, her voice was a manifestation of her confidence.
I used to go to clubs quite a bit, you know, but you know, when you have a normal voice, you can still
talk to people in those environments where it's kind of loud and noisy, or bars to meet friends or to flirt,
like I like doing.
But, you know, those places now, I don't really go to anymore.
Jackie, a woman who once described herself as carefree and outgoing, who took pride in her ability
to flirt, became withdrawn, reserved.
You know, I have tons of scars all over my body, and that plays on my confidence as well.
But in public life, people can't see those scars, and I feel like my voice is that, you know, that scar they can hear, you know? They know something's wrong. And they think, oh, maybe she's weak, maybe she's sick.
Just by hearing my voice, it's the signal.
Our voices communicate so much more than mere information.
They communicate our feelings, our temperament, our identity.
When we come back, how scientists are weaving this insight into custom-built voices.
I can't wait for my friends to hear my new voice.
Scientists have been trying for more than two centuries to analyze the human voice, decode its components, and
recreate it.
An early success came from a man named Homer Dudley.
He developed an organ-like machine that he called the Voder.
It worked using special keys and a foot pedal, and was capable of creating about 20 different
electronic buzzes and sounds.
When those sounds were combined, they formed words.
The Voder fascinated people at the 1939 World's Fair in New York.
Well, we've heard the Voder make a word, and by combining words, of course, we get a sentence.
For example, Helen, will you have the Voder say, "She saw me"?
She saw me.
That sounded awfully flat.
How about a little expression?
Say the sentence in answer to these questions.
Who saw you?
She saw me.
Whom did she see?
She saw me.
Well, did she see you, or hear you?
She saw me.
Oh!
The Voder was an early example of electronic speech, but it was cumbersome to operate and
required special training.
Over the next 40 years, speech scientists continued studying the components of the human voice.
They eventually developed methods to mathematically map the acoustic patterns and phonetic properties of natural speech: vowels,
syllable constructions, consonants.
To be or not to be, that is the question. I can read stories that I have never seen.
I do not understand what the words mean when I read them.
You are listening to the voice of a machine.
By the '80s, speech synthesis was no longer the stuff of science demonstrations at shows and fairs.
Text to speech systems are beginning to be applied in many ways, including aid for the handicapped, medical aid, and teaching devices.
The first kind of aid to be considered is a talking aid for the vocally handicapped.
The research of Dennis Klatt at MIT paved the way for the voices we might be familiar
with today, many of them used in assistive communication devices.
I am Beautiful Betty, the standard female voice.
I am Huge Harry, a very large person with a deep voice.
I am the standard male voice, Perfect Paul.
This last one by the way, became famous after Stephen Hawking adopted it.
Speech technology has come a long way in the years since Homer Dudley unveiled the Voder. But in many ways, synthetic voices still sound synthetic.
They didn't convey all the information that's packed into the human voice.
Voice is identity, right? Voice is about who you are. Our voice signals how old we are.
Our voice signals our gender. Our voice signals, you know, things about our personality.
Rupal Patel is a speech scientist at Northeastern University.
Perhaps more than many people, she has thought a lot about the human voice.
When she misses her mother, for instance, Rupal has a special technique to evoke her presence.
That's right. My parents now live in LA and I live here in Boston.
And oftentimes I find myself imitating my mom, you know, say, oh, beta, how are you today?
You know, or something like that.
I'll imitate her the way she might say something.
I might say that the same way to my daughter or something like that.
But what I'm evoking is my mother's voice, primarily,
to feel the closeness of her here.
In 2002, Rupal took these ideas with her to Denmark,
where she was scheduled to speak at a conference
for researchers and patients.
I was presenting some of my early work showing
that individuals with very severe speech disorder
still have the ability to make sound
and those sounds have some communicative content in them, some information that could be used.
After her presentation, she walked out to the Exhibit Hall, and that's where she noticed
something. Lots of people were using devices that produced synthetic voices. What was
odd was that many of the voices didn't seem to match the people using them.
At that point, back in 2002, we had very limited synthetic voice options available.
And so I heard a little girl or a young girl using a device to talk with an adult male voice
and having a conversation with another person, a middle-aged man, who also was using the
same voice.
And so, they're using different devices, but their voices were identical.
She had just presented on the idea that our individual voices carry something unique
about us.
So why was this not reflected in these synthetic voices?
Why are we giving them the same black box to speak through?
There's got to be something that we can do,
that we can harness the quality of the voices that they have,
and imprint those or use that to give them a prosthetic voice
that somehow reflects who they are
and not just the same voice for everyone.
Could a synthetic voice capture the richness
of natural human speech?
Rupert launched a company to answer this question.
It's called VocaliD, and it uses machine learning and other artificial intelligence technologies
to create personalized voices.
So what synthetic speech is, is taking recordings of anyone, and then taking those recordings and building a model of the voice quality, of the
enunciation abilities, right? You aren't necessarily analyzing it from the top down, saying, well, this person has a high-pitched voice,
this person has a low-pitched voice. You're taking the recordings as basically the raw ingredients to feed to a machine learning algorithm, or set
of algorithms, really.
And those are learning the patterns of the clarity of the person's s's,
the, you know, how that sound changes in the different phonetic environments,
the voice quality. Aspects of all of these are learned by the machine.
It's really re-emulating the human voice by machine.
In other words, the idea is to build a model of how a person sounds.
To do so, you use a vast range of examples of that person's speech.
Then you use the model to produce spoken language that incorporates all the idiosyncrasies
and texture of that person's voice.
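The modeling Rupal describes begins by pulling low-level acoustic features out of each raw recording. As a rough, hypothetical illustration of that first step (not VocaliD's actual pipeline; the sample rate and test tone here are invented for the example), here is a toy pitch estimator built on autocorrelation with NumPy:

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame.

    Autocorrelation peaks at the period of a periodic signal; searching
    between plausible human pitch periods gives a crude pitch estimate,
    the kind of low-level feature a voice model is built from.
    """
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)  # shortest plausible pitch period, in samples
    hi = int(sample_rate / fmin)  # longest plausible pitch period
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag

# A synthetic 50 ms "voiced" frame: a 120 Hz tone plus one weak harmonic.
sr = 16000
t = np.arange(int(0.05 * sr)) / sr
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)

print(estimate_pitch(frame, sr))  # close to 120 Hz
```

A real system extracts many such features per frame (pitch, spectral envelope, voice quality) and feeds them to the learning algorithms Rupal mentions; this sketch shows only the flavor of the feature-extraction step.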
One of Rupal's early clients was a young girl named Maeve.
Maeve was born with cerebral palsy. She is in a beautiful family, where she has two
older sisters.
Rupal's goal was to give Maeve her own unique synthetic voice.
One that could express not just her words, but her identity.
The first step was to record her.
Ah, ah, ah. You know, when she's in a classroom with several other kids who also have communication disabilities,
when she makes that sound, you know that's Maeve speaking.
So we harnessed those sounds of Maeve's to create her unique voice for her.
Then Rupal turned to Maeve's older sisters, Erin and Megan, who volunteered to record
their voices so they could be blended with Maeve's.
Ice cream is my guilty pleasure.
That man ran fast.
Erin and Megan read hundreds of sentences and phrases and uploaded them to a website
for Rupert.
Like a painter mixing a palette, Rupal took elements of Maeve's voice and
mixed them with those of her sisters and other vocal donors to create what she calls a bespoke voice.
I can't wait for my friends to hear my new voice. My parents are really happy I'm not
addicted to Fortnite. I want to meet Taylor Swift.
So we're hearing, you know, Maeve at this age in terms of her sound, as well as her siblings' recordings,
being combined and being produced through this speech synthesis.
It's possible that Maeve may decide as she gets older that her voice needs to age with her.
She'll need a new, bespoke voice at that point.
The same technology can also be used to preserve a person's existing voice.
Sometimes, this is done when a person faces the prospect of losing his voice.
These could be individuals who are losing their voice to degenerative conditions
so slowly their voice is changing, such as ALS or Parkinson's disease. And then
the trauma is actually far more pronounced for individuals with something like
head and neck cancer, where they learn that they're going to have their voice box removed
within a couple of weeks.
Lonnie Blanchard confronted this traumatic news in 2018.
Doctors had diagnosed him with cancer and said surgery was the only option.
Lonnie had to have his tongue removed. Here he is speaking to the BBC.
Now that I know I'm going to lose my voice, I've got to get some sayings down on a personal recorder, to get what I would normally say to my wife and kids.
But every time I go to do that, I draw up a blank.
By the time Lonnie started working with Rupal,
he only had a few weeks to bank his voice.
We helped him get set up in terms of the microphone
he would need and things like that.
Rupal worked with Lonnie to build a database of sound samples
before his surgery.
He recorded sentences that gave Rupal and her colleagues
the different kinds of sounds they would need to build a new voice.
I wish we could get acquainted.
I'm going to be a teacher when I grow up.
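Prompt sets like these are typically chosen so that, together, they cover the sound-to-sound transitions a synthesizer will later need to reassemble. A minimal sketch of that coverage idea, using adjacent letter pairs as a crude stand-in for phonetic diphones (real voice-banking tools work from pronunciation lexicons, not spelling; the prompts below are just the two heard above):

```python
def diphone_coverage(prompts):
    """Collect adjacent-letter pairs across a prompt list.

    A crude stand-in for diphone coverage: real systems count
    phoneme-to-phoneme transitions, not letter pairs, and they
    also count transitions within words only, not across spaces
    as this toy does.
    """
    pairs = set()
    for sentence in prompts:
        letters = [c for c in sentence.lower() if c.isalpha()]
        pairs.update(zip(letters, letters[1:]))
    return pairs

prompts = [
    "I wish we could get acquainted.",
    "I'm going to be a teacher when I grow up.",
]
covered = diphone_coverage(prompts)
print(len(covered), "letter pairs covered")
print(("w", "h") in covered)  # True: "when" contributes the w-h pair
```

A recording session would keep adding prompts until the coverage set includes every transition the synthesizer's inventory requires.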
After Lonnie banked his voice, Rupal used the recordings
to create a personalized voice for him.
The difference between his voice and
Maeve's voice is that Rupal didn't need to blend voices from donors. Lonnie was his own donor.
Those voice samples are then cleaned up and annotated, by machine actually, and then
used to feed into the algorithms we have to create the synthetic voice.
Similar to Maeve, Lonnie uses an assistive device, in his case an iPad.
He can type out what he wants to say, and hear his voice speaking to his family.
Once you close your eyes and let your mind relax, it doesn't take long to escape to that beautiful beach.
It's really empowering. It's continued to be a way that he can connect to family members and feel that part of him is not fully lost.
While most of us will never have the experience of losing our voices and having to obtain synthetic voices as replacements, increasingly many of us are coming into contact with these voices.
Hey Siri, how many ounces are in a cup? One cup is eight fluid ounces.
Hey Siri, set a timer.
For how long?
56 minutes.
OK, sure, 56 minutes.
Starting now.
Alexa, can you play music?
Play some jazz.
There's a station you might like.
Synthetic voices are already changing our lives.
And it's likely we're going to become even more reliant on them.
In May 2018, Google revealed a new program it was working on.
CEO Sundar Pichai presented it to an audience of software developers.
The technology is called Google Duplex.
It allows you to make a restaurant reservation through a voice assistant.
Hi, how can I help you?
Hi, I'd like to reserve a table for Wednesday the 7th.
For 7 people?
Um, it's for 4 people.
4 people. When? Tonight?
Next Wednesday at 6 p.m.
Oh, actually, we reserve for, like, up to five people.
For 4 people, you can come.
How long is the wait usually to be seated?
For when? Tomorrow, or weekday?
For next Wednesday, the 7th.
Oh, no, it's not too busy. You can come for 4 people, okay?
Oh, I gotcha.
Thanks.
Bye-bye.
The audience is laughing and applauding because the man making the call isn't a man, but a
machine.
It didn't seem like it was a robotic voice.
The robotic voices we're used to are the voices like when you are in a parking garage and you hear the, you know,
please place your ticket with the stripe facing to the right.
Very, very, uh, canned sort of speech.
This was far more sophisticated, and much more like you and I talk, with hesitations and pauses and
ums and uhs. You think it's a human on the other end.
Now, of course, one of the things about that voice that Google had is that it did seem like a convincing voice.
But if you need to convince me that that voice is not just a human voice,
but a particular human's voice, you need to convince me that this is not just
anyone calling for a restaurant reservation, but it's Barack Obama calling for
a restaurant reservation.
Presumably now the bar is much, much higher.
That's right, it is.
Barack Obama, though, does have a ton of his audio on the internet.
And there's a lot more audio to make his voice than there is for my voice, for example.
And so, yeah, it's absolutely possible.
If you have enough data, you can learn anybody's voice.
It's not hard to see how bad actors could misuse this.
Create havoc in people's lives,
trouble at companies, political misinformation.
Absolutely. I mean, we're seeing deepfakes in video.
We've seen, you know, President Obama's face being manipulated and the audio coming out, you know, people creating these fake media
in video, and you're also seeing it in audio. That's exactly why the security aspects of what we're doing are
trying to detect: is that fake audio, or is that real? Is part of that fake audio, or is part of that real, right?
So this isn't completely sci-fi.
It isn't so far away.
It isn't necessarily, you know, 2028.
It's probably 2020.
So we've got to get our defenses up
in terms of questioning the authenticity of
audio, just as we do video.
In 2017, a Canadian company called Lyrebird showed how audio deepfakes might work in politics.
This is huge. They can make us say anything now, really anything.
The good news is that they will offer the technology to anyone.
This is huge. How does their technology work?
Hey guys, I think that they used deep learning and artificial neural networks.
By 2019, deep fake audio technology had gotten even better.
Shortly after critics panned the final season of Game of Thrones,
a YouTube channel called Eating Things with Famous People
put out this tongue-in-cheek video showing the supposed remorse of the lead character, Jon Snow.
Spoofing a TV show is one thing, but imagine such high-quality deepfakes occurring in a more
high-stake setting.
Voices are increasingly being used
by financial institutions to authenticate
the identities of consumers.
Recently, Rupal worked with a bank
to assess how vulnerable it was to vocal hacking.
We tested their authentication system
by creating synthetic samples or synthetic voices of particular individuals
who are enrolled in their authentication system, and we tried to test those voices against the
system to see if we could get through with the synthetic voices.
And we were not able to do that for every single voice, but we were able to do it for some
voices.
And so it just starts to show that there is a vulnerability in this technology.
So how would you guard against it?
Well there are many ways to guard against it.
One is you can classify the difference between, is the audio signal I'm listening to, is
it synthetic or is it human?
As the synthetic voices become better and better sounding, that will be a more difficult decision to make.
And it is something that if we can proactively solve,
I think, or at least start to address,
we're going to be way ahead of the curve
than if we're trying to clean up our mess after the fact.
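Classifying a clip as human or synthetic, as Rupal describes, means scoring it on acoustic features and thresholding or training a model. As a deliberately simplified, hypothetical illustration (the feature and threshold below are invented for this sketch; no real detector is this simple), one could flag pitch tracks that are unnaturally steady, since natural voices jitter from frame to frame:

```python
import numpy as np

def pitch_jitter(frame_pitches):
    """Frame-to-frame variability of a pitch track, in Hz.

    Natural voices wobble slightly from frame to frame; a crude
    synthesizer can be suspiciously steady.
    """
    return float(np.std(np.diff(frame_pitches)))

def looks_synthetic(frame_pitches, jitter_floor=0.5):
    # Toy rule: flag audio whose pitch track is flatter than any
    # natural voice would be. The 0.5 Hz floor is illustrative only.
    return pitch_jitter(frame_pitches) < jitter_floor

human = [118.2, 121.7, 119.9, 123.1, 120.4, 117.8]  # natural jitter
robot = [120.0, 120.0, 120.1, 120.0, 120.0, 120.0]  # suspiciously steady

print(looks_synthetic(human))  # False
print(looks_synthetic(robot))  # True
```

As the episode notes, better synthesizers now reproduce natural jitter too, which is why real detectors combine many features and must keep evolving alongside the synthesis tools.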
Despite the potential risks of these new technologies, Rupal is also optimistic.
Voice synthesis tools have the potential to allow people to craft the voice they hear
on the outside so that it matches the identity they feel on the inside.
Ideally in the future, these decisions are made by the end user themselves.
Like, oh, I actually want that to sound a little breathier.
I'd love that to sound a little bit more confident.
And I mean, how does that translate to the acoustics?
We don't quite know yet, but that's actually, I think,
where when we can finally give the control
of what the voice sounds like to the individual,
I mean, that's the holy grail.
Hello, I am a reading machine.
You are listening to the voice of a machine,
a talking aid for the vocally handicapped.
I am Beautiful Betty, the standard female voice. I am the standard male voice, Perfect Paul.
I was sad because there was no ice cream in the freezer.
The sky is clear and the stars are twinkling.
I can't wait for my friends to hear my voice.
One cup is eight fluid ounces.
You are listening to a machine.
This is huge.
How does their technology work?
Hey guys, I think that they used deep learning and artificial neural networks.
Hi, I'd like to reserve a table for Wednesday the 7th.
This week's show was produced by Thomas Lu.
It was edited by Tara Boyle and Rhaina Cohen.
Our team includes Parth Shah, Jennifer Schmidt, and Laura Kwerel.
Special thanks to Brent Bachman, Greg Sauer and Kébon Jones.
Our unsung hero this week is Rebecca Raul.
She's part of NPR's team looking at our changing interactions with smart speakers.
She helped us record some of the smart devices you heard in this week's episode.
Thanks Rebecca.
If you like this episode, please share it with a friend.
We're always looking for new people to discover Hidden Brain.
I'm Shankar Vedantam, and this is NPR.