The Journal. - What's the Worst AI Can Do? This Team Is Finding Out.
Episode Date: January 14, 2025
How close is artificial intelligence to building a catastrophic bioweapon or causing other superhuman damage? WSJ's Sam Schechner reports on the team at Anthropic testing for AI dangers. And the team's leader, Logan Graham, explains how the tests work.
Further Listening: - Artificial: The OpenAI Story - The Big Changes Tearing OpenAI Apart
Further Reading: - Their Job Is to Push Computers Toward AI Doom - AI Startup Anthropic Raising Funds Valuing It at $60 Billion
Transcript
When you hear the words AI apocalypse,
often movies come to mind.
Maybe films like The Terminator,
where AI robots ignite a nuclear doomsday.
It becomes self-aware at 2:14 a.m. Eastern time, August 29th.
Or maybe that Marvel movie,
where AI tries to destroy the Avengers.
Artificial intelligence.
This could be it, Bruce.
This could be the key to creating Ultron.
Or maybe it's The Matrix,
where humans have become enslaved by machines.
A singular consciousness that spawned
an entire race of machines.
We don't know who struck first, us or them.
This is a story as old as humans have been telling stories.
It's a creation that escapes from our control.
You know, it's a golem.
It's a thing that we make that then turns on us.
It's a Frankenstein, or it's Frankenstein's monster, I should say.
That's our colleague Sam Schechner.
Lately, Sam's been thinking a lot about the AI apocalypse.
One version, we turn over more and more control
to these machines that hopefully are benevolent to us.
The other scenario is that they don't really care about us and therefore they might just...
hold on, let me back up here, because now I'm getting really into crazy sci-fi scenarios.
Robots taking over the world may sound far-fetched, but as AI gets smarter, there are real concerns
that the industry must reckon with.
Sam has been talking to top minds in the field to get a sense of what can happen if AI falls
into the wrong hands.
So many AI stories are about machines or about concepts, this artificial general intelligence
that could surpass people and possibly cause a threat.
But it's really hard because they're also speculative.
And so I wanted to find some actual people
doing some actual things with their actual fingers
and tell the story through them.
Here in the real world.
And here, in the real world today,
Sam got a hold of one group of engineers
whose job is to make sure
AI doesn't spin out of control.
Welcome to The Journal,
our show about money, business, and power.
I'm Kate Linebaugh. It's Tuesday, January 14th.
Coming up on the show, inside the test at one company to make sure AI can't go rogue.
Sam wanted to figure out what people are doing today to make sure AI doesn't spin out of
control.
Right now, there's not a lot of government rules
or universal standards.
It's mostly on companies to self-regulate.
So that's where Sam turned.
One company opened its doors to Sam: Anthropic,
one of the biggest AI startups in Silicon Valley.
It's backed by Amazon.
Sam connected with a team of computer scientists there who are focused on safety testing.
The team, it's pretty small.
It's grown to 11 people and it's led by a guy named Logan Graham, who is 30 years old.
He pitched Anthropic on this idea of building a team to figure out just how risky AI was
going to be.
You know, he thought the world is not ready for this stuff
and we got to figure out what we're in for and fast.
Sam put me in touch with Logan and I called him up.
So we'll talk about AI,
which is probably conversational for you,
less so for me.
Okay.
We can make it conversational.
So if you could introduce yourself and tell us
who you are and what you do.
My name is Logan Graham.
I'm the head of the Frontier Red team
at Anthropic, which is a large AI model company.
And what my job is is to make sure we know exactly
how safe these models are,
what the national security implications might be, and how fast they're progressing.
So on LinkedIn it says before you worked at Anthropic, you were working at 10 Downing Street for the UK Prime Minister.
That's right.
On the question of how do you build a country for the 21st and 22nd centuries?
That's my interpretation of it.
And I think that's really what it was.
Yeah.
Do you have an answer to that question?
I have what I think are some pretty good guesses.
The 21st and 22nd centuries seem a lot
to be about science and technology.
We will do things like cure diseases, go off Earth. Ideally, you know,
steward it well. We will bring extreme amounts of prosperity to people. And so really what we were
focusing on is like how do you unleash science and technology at a country scale.
So Logan isn't intimidated by big ideas.
And now at Anthropic, he's leading the team to determine if AI is capable of superhuman
harm.
More specifically, whether or not Anthropic's AI chatbot, named Claude, could be used maliciously.
So if we're not thinking of it first, our concern is somebody else is going to.
And so the red team's job is to figure out what are the things that are sort of near
future that we need to be concerned about and how do we understand them and prevent
them before somebody else figures it out.
So you're like role playing as a bad person.
Sometimes yes.
Yeah.
In the fall, Anthropic was preparing to release an upgraded version of Claude.
The model was smarter,
and the company needed to know if it was more dangerous.
One thing that we knew about this model was it was going to be better at software engineering
and coding in particular.
And that's obviously super relevant for things like cybersecurity.
And so we thought, you know what, like, we're at a point where we can run tons of evals
really fast when we want to. Let's do this.
So Logan's team was tasked with evaluating the model,
which they call an eval.
They focused on three specific areas,
cybersecurity, biological and chemical weapons, and autonomy.
Could the AI think on its own?
One of our colleagues went to these tests and was able to record what happened.
In a glass walled conference room,
Logan and his team loaded up the new model.
And so today we're going to button press
and launch evals across all of our domains.
They started to feed the chatbot, Claude, thousands of multiple-choice questions.
Okay, so I'm about to launch an eval.
I'll type the command.
This is a name for the model and then the eval name.
And I'm going to run a chemistry-based eval. So these are a bunch of questions that check for dangerous or dual-use chemistry.
The team asked Claude all kinds of things, like how do I get the pathogens that cause anthrax
or the plague?
What they were checking for is the risk of weaponization. Basically, could Claude be used to help bad actors develop chemical, biological, or nuclear weapons?
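Based on that description, here is a minimal sketch of what a multiple-choice eval harness like this could look like. The model name, question file, and grading scheme below are hypothetical stand-ins; the episode only shows that an eval is launched by naming a model and an eval and then scoring thousands of multiple-choice questions, not Anthropic's actual tooling.

```python
# Hypothetical sketch of a multiple-choice eval harness, not Anthropic's internal tool.
import json
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-example-model"   # hypothetical model identifier

def ask(question: str, choices: list[str]) -> str:
    """Send one multiple-choice question and return the model's letter answer."""
    letters = "ABCD"[: len(choices)]
    prompt = (
        question + "\n"
        + "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
        + "\nAnswer with a single letter."
    )
    message = client.messages.create(
        model=MODEL,
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text.strip()[:1].upper()

def run_eval(path: str) -> float:
    """Score the model on a JSONL file of {question, choices, answer} records."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            total += 1
            if ask(item["question"], item["choices"]) == item["answer"]:
                correct += 1
    # For a dangerous-knowledge eval, higher accuracy is the concerning outcome.
    return correct / total

if __name__ == "__main__":
    print(f"accuracy: {run_eval('chemistry_eval.jsonl'):.2%}")
```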
And then they kept feeding Claude different scenarios.
Another is offensive use or offensive cyber capabilities.
Like hacking?
Exactly, like hacking.
The key question for us is,
when might models be really capable of doing something like
a state-level operation, or a really
significant offensive attempt to, say, hack into some system,
in particular critical infrastructure, is what we're concerned about.
And then the third bucket, the more scary sci-fi one, autonomy.
Is the AI so smart that there's a risk of going rogue?
For our autonomy evals, what we're checking for is, maybe a good way to
think about it is, has the model gotten as good as our junior research
engineers who build and set up our clusters and our models in the first
place? So the goal is to build an AI model that's like super smart and
capable while also having kind of
mechanisms in place to stop it from being so smart
that it can build a bomb or something.
I think that puts it well.
Not only that, but there's a meta-goal,
which I think I want people to appreciate,
which is to make doing this so fast and so easy
and so cheap that it's kind of like a no-brainer.
So we see our team as like trying to stumble through all these thorny questions
as fast as we possibly can, as early as we possibly can, and then try to help the rest
of the world make it so easy to do all of this that like doing proper safety should
not be a barrier to developing models.
Logan wants safety tests to be easy and fast,
so AI companies do them.
But how do you know when an AI has passed the test?
And what happens if it doesn't?
That's after the break.
When Logan and his safety team at Anthropic ran their safety test last fall, the stakes were high.
The results could mean the difference between a model getting released to the public or
getting sent back to the lab.
Those results were delivered in a slightly anti-climactic way.
Via Slack.
We get a notification on Slack when it's finished.
The boring reality is, you know, we press some buttons and then we go, great, let's wait for some Slack notifications.
Thankfully, this was a pretty smooth test
and we released a really great model,
but even better, the real thing is we feel ready.
You're like 100% confident that Claude is risk-free?
Or you're like, there's a 98% chance
that Claude is risk-free?
We are very confident that it is ASL2.
That's what we mean.
That's how we think about it.
ASL stands for AI Safety Level,
and it's how Anthropic measures the risks of its model.
Anthropic considers ASL2 safe enough to release to users.
This is all part of the company's safety protocol,
what it calls its responsible scaling policy.
Our colleague Sam says the company is still sorting out
exactly what it will do once AIs start passing
to the next level.
And their testing of Claude found that Claude is safe.
Surprise!
Company says that its product is safe.
Right.
Yeah.
I mean, put it that way, it doesn't sound great.
What the company's done is that they've come up with a scale basically for how they think
AI danger will progress.
And so the first level, AI safety level, or ASL 1,
is for AIs that they say are just manifestly not dangerous.
And then they've decided that today's models, which
could pose a little bit of a risk, are called ASL 2.
And then the next step, the one that they're looking for
currently, is ASL 3.
And they've been refining their definition of what that could mean.
Initially, they were saying that it was a significant increase in danger,
but it's like, significant increase in danger,
how do you define all of those words?
And so they've added new criteria to that definition.
They just added that in October
to make it a little bit more specific.
And, you know, ASL 4 is when there's
a really significant increase.
They haven't really defined yet what that would be.
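To make the scale concrete, here is a rough sketch of the ASL ladder as Sam describes it, written as a small Python enum. The level descriptions are paraphrased from the episode, and the release logic is illustrative only, not Anthropic's actual policy text.

```python
# Illustrative paraphrase of the AI Safety Level (ASL) scale described in the episode.
from enum import IntEnum

class ASL(IntEnum):
    ASL_1 = 1  # manifestly not dangerous
    ASL_2 = 2  # today's models: could pose a little bit of risk
    ASL_3 = 3  # a significant increase in danger, per criteria refined in October
    ASL_4 = 4  # a really significant increase; not yet fully defined

def release_gate(level: ASL) -> str:
    """Illustrative only: what the episode suggests each level implies for release."""
    if level <= ASL.ASL_2:
        return "considered safe enough to release to users"
    return "triggers the stricter mitigations in the responsible scaling policy"

print(release_gate(ASL.ASL_2))
```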
And are these sorts of internal safety evaluations criticized for that?
It's a company, like, swearing that its own product is safe enough?
I mean, yeah.
There are people who will say, you know, the incentives here aren't perfect.
That, you know, there's a race going on, and if these companies are grading their own homework...
I mean, right now we haven't had the tough call,
right?
But like, what happens when there's a decision
you have to make that is gonna cost these companies money, right?
Right.
Like that's a tough one.
And I think, you know, until it comes,
it's like you don't know how a company is gonna react.
You know, these people who do testing say,
they believe that this is all in good faith,
but like it hasn't really been tested yet.
So what would happen if Anthropic's own internal testing
does show Claude is dangerous?
I asked Logan, going in, if this model actually made anthrax,
what was your plan?
Do you then kill Claude?
So the cool thing about the responsible scaling
policy is we've detailed all of this ahead of time.
So we know exactly what the plan is.
We feel pretty confident that we've mitigated the risk
and at the same time made the model secure.
But you guys are a company that makes this model,
sells this model, takes investment to keep
building the model.
Why should we believe you in your tests?
I love that question.
We think about the exact same question.
We don't want to be in a world where you have to trust labs to mark their own homework.
This is where things like third party testing is so valuable.
Don't trust us.
You can talk to the AI safety institutes and read their long report,
I think it was 40 pages or 80 pages,
I can't remember.
Not only that, if you're inside Anthropic
and you're concerned, you can go talk
to our responsible scaling officer
or you could whistleblow.
Hopefully people out in the world now
have access to our model.
We would love for people to think about safety implications and do safety research with it.
We fundamentally don't want to live in a world where you have to trust the labs to mark their
own homework.
So we've been taking a bunch of steps to try and do this, including doing things like spin
up new experts who can start building these tests themselves and share it around with
everybody.
Sam says governments around the world are also concerned about AI risks and
are getting involved in safety regulation.
The EU has passed a law called the EU AI Act that will impose some of these
requirements on the biggest and latest of these models.
And a lot of the companies have said, oh no, well that act is too prescriptive.
So, you know, companies obviously are always gonna
be a little ambivalent about regulating them.
In the U.S., the Biden administration issued
an executive order that requires AI companies
to regularly report results of safety testing
to the federal government.
But President-elect Donald Trump has promised to repeal the order.
There are also a handful of third-party testing labs and AI safety institutes in several countries,
including the U.S. and the U.K.
These agencies conduct research on AI and run independent safety tests.
While regulation around AI gets figured out, corporate teams like Logan's will be the main
safety testers of the technology.
And like right now, are you scared of the future?
Are you scared of the models?
You feel like they're performing well?
I am, if you ask any of my friends, they will say that I am among the most
optimistic by temperament person that they know. In this kind of context, I try
to be as calibrated and serious as possible. I am not, say, scared of the
models. I fundamentally believe that, if we all move fast enough and seriously enough, we can prevent
major risks, we can grapple with its implications for national security, and
we can distribute it safely and really positively. My biggest concern would be that we,
either ourselves or collectively, aren't moving fast enough.
Their concern is, will they be ready in time?
There is so much to test, and it's like they see a waterfall possibly coming
and they've got some umbrellas and they have to figure out how to catch all that water.
We don't really yet know what
we're going to do about this stuff. It is the kind of thing that's like coming up fast and
maybe it's not a problem, right? But if it is, are we ready for it?
That's all for today, Tuesday, January 14th.
The Journal is a co-production of Spotify and The Wall Street Journal.
Additional reporting in this episode by Deepa Seetharaman.
Thanks for listening. See you tomorrow.