Effectively Wild: A FanGraphs Baseball Podcast - Effectively Wild Episode 1082: Some of These Arms Are Not Like the Others

Episode Date: July 12, 2017

Ben Lindbergh and Jeff Sullivan banter about Aaron Judge’s Home Run Derby performance, a subtle change in the way Rob Manfred talked about the baseball at the All-Star Game, and the Angels’ stats with and without Mike Trout. Then they make it two consecutive episodes of talking to a Glenn by bringing on UC Irvine […]

Transcript
Starting point is 00:00:00 Your eyes are so clear, they seem to speak to me from a gleeful world And they look like singing moons made of black pearl I'm not saying that you're the only one like you that I find But I'm sure you're one of a very few of a kind Yes, you are! Hello and welcome to episode 1082 of Effectively Wild, a baseball podcast from FanGraphs presented by our Patreon supporters. I'm Ben Lindbergh of The Ringer, joined by Jeff Sullivan of FanGraphs.
Starting point is 00:00:32 Hello. Hello. All-star break, not really a break for us in any way. I don't think it has affected our workload whatsoever. But here we are on an all-star break episode, and we're recording this prior to the all-star game, after the home run derby, so we're not going to talk about the all-star game yet because we don't know what happened, will happen, will have happened in that game. We can
Starting point is 00:00:57 talk about that tomorrow if there's anything to talk about. But did you have any home run derby reactions other than, oh my goodness, Aaron Judge? It wasn't even so much, oh my goodness, Aaron Judge, because in a sense that seemed like it was inevitable. I know the home run derby doesn't always go the way you think it's going to go. The favorite doesn't always win. But I'm not sure we've ever had a favorite quite like Aaron Judge, who very genuinely is just a statistical outlier, even among statistical outliers. I mean, you saw presumably the list of the hardest hit home runs in the home run derby, and it was just like 90% Judge at the top. It's just, it's unbelievable. And that's with Sano and Stanton and all these,
Starting point is 00:01:36 like, the best hitters in baseball, best power hitters are in this thing. And yeah, they're just nowhere close to it. You wrote something or chatted something recently, I think, about how maybe we're underrating Aaron Judge, which seems incredible given the first half he just had. And he's probably the MVP favorite at this point, and certainly no shortage of attention paid to Aaron Judge. And yet I think you might be right about that, just based on his performance, but just based on watching that home run derby. And I don't really think of the home run derby typically as a tool for evaluation or being very revealing about player talent. Because sometimes you have guys come in and have a weird fluky derby where they're hitting a bunch of home runs.
Starting point is 00:02:21 Or sometimes the best power hitters just won't look very good in the home run derby. It doesn't really mean a whole lot, usually. But it just seemed so inevitable that Aaron Judge would win at basically every point during that contest. I mean, he hit
Starting point is 00:02:39 balls incredibly far. I think at one point he hit one out to right and then to center and then to left, or maybe it was in the other direction, just one after the other, and it was so easy. It wasn't like he was hitting them only in one place. It was going the other way. Who does that? It's just incredible. I mean, assuming that he continues to participate in home run derbies, like, how could anyone else ever win this thing? Just no one can compare to that power, and unless he's having an extreme off day, it's just hard to imagine anyone showing this kind of essentially batting practice power. Yeah, it's stupid. Although I do wonder, so looking at, uh, we
Starting point is 00:03:38 only have two and a half years, I guess, of Giancarlo Stanton, and I always refer to Stanton because he is the only comparison for Aaron Judge. He's the only one that works. Anyone else's comparison is wrong. It's only Giancarlo Stanton. It's Judge and it's Stanton. That's the only one. That's why I don't need to go into more detail. So Stanton used to be this kind of freak. He was in the same universe, well, that's dumb. He was in the same galaxy, the same solar system. We're all in the same universe, probably. But 2015, Giancarlo Stanton, according to Baseball Savant, had an average exit velocity of 95.9. This is just overall batted balls. 95.9, Stanton.
Starting point is 00:04:13 Last year, 93.9. This year, 90.8. This is a sample size of either one or three, depending, I guess, on how you think about it. But Stanton has lost some batted ball velocity. Maybe that's deliberate because he's also making more contact. He could have just kind of calmed down his swing. But I do wonder, we know that right now Aaron Judge is a freak.
Starting point is 00:04:34 There is nobody like him in baseball. He won the home run derby without breaking a sweat. He just hits home runs on, like, he could bunt. He could bunt a ball 300 feet, I am convinced. But what we don't know is exactly what the aging curve is for something like this, because we can tell that he's just ferociously strong and quick and everything. He has this, like, contact hitter's swing, except then he's, I don't know, swinging a wrecking ball. That doesn't make sense. We don't know how he's going to age. We don't know how fast or how long his body can keep this up, because we have such a limited sphere of peers. I'm just
Starting point is 00:05:11 talking really poorly today, but also rhyming. So I don't know if he's going to age like Stanton, because Stanton didn't become some kind of world destroyer. But we all remember what Stanton did to that Jamie Moyer pitch years ago where he broke the scoreboard. And he doesn't do that so much anymore. Stanton didn't win the home run derby yesterday. Judge clobbered him, you could say. He won it with ease. But Judge is 25, and I don't know what Judge is going to be doing when he's 28 or 29.
Starting point is 00:05:38 I'm not going to say that because of Stanton he's going to slow down. But it's a definite possibility. So the long story short here being that I don't know if it's inevitable that he's going to win every home run derby, even though it does feel inevitable he will win the next one. Yeah, and I wonder what this means in terms of his potential for national stardom even among non-baseball fans because I was having a conversation on Ringer Slack during the home run derby
Starting point is 00:06:03 and someone was saying, like, is Aaron Judge already the most famous baseball player? And I think we all agreed, no, probably not. He has a fraction of the Twitter followers, for one thing, of guys like Harper and Trout, although I think he's also tweeted like three times this year or something, so maybe that has something to do with it. That's probably not the best metric to use, but he does seem to have the potential to break out in a way that we haven't seen a baseball player break out in a while, just because everyone can understand that he is 6'7 and 280 and he hits the ball farther than everyone else. And it's very obvious that he does that. And he kind of came out of nowhere and he plays for the Yankees and on and on and on. It seems like
Starting point is 00:06:50 you can just appreciate what Aaron Judge does as a casual fan much more easily than you could appreciate even someone like Mike Trout, where you would have to watch him regularly to see all of the little things he does and all the ways he contributes. And he's on the Angels, obviously, and he hasn't been making the playoffs and that hurts. But I think he's not necessarily a better player, as good a player, but I think just what he does is so easy to see and so easy to communicate. And given that and the market and everything, it seems like maybe he is the guy who would have the most kind of crossover potential for people who don't care about baseball and more subtle baseball skills. Yeah, I was actually having a very similar conversation over the
Starting point is 00:07:38 weekend. A friend of mine who's a showrunner for a popular late night television program, I don't know why I'm referring to him anonymously, but anyway, he was in town over the weekend. He's based in New York, and he's interacted with Aaron Judge. He's had Aaron Judge on the television show. And, you know, we've all read the articles about how Aaron Judge is just a great dude and he's very level-headed and he's always reaching out to other players for help and he's always receptive to the fans and the coaches and the peers and all that stuff. He's just a good dude who is able to make a bunch of the adjustments that he needed to make in order to become the second best baseball player on the planet. I don't know, I don't know where that is yet, but
Starting point is 00:08:19 what the showrunner was saying was that, in his opinion at least, the only thing that was really holding Judge back was we had to wait to see what he did in the home run derby. I was talking to this friend on Saturday. He was like, if Judge kills it in the home run derby, then that's it. That's just going to launch him towards stardom. Everyone will see what he's able to do, and that's going to be the whole story. Well, he went and he had one of the most amazing home run derby performances
Starting point is 00:08:44 that anyone's ever seen. Won the thing. I guess he didn't necessarily win it that easily, but he kind of did win it easily. Depends on how you define it. So the most watched home run derby in years featured the most freakish player in years doing absolutely freaky things, hitting balls more than 500 feet against pitches that were not moving very fast. This is another useful reminder that pitches do not account for that much of the force that goes into a batted ball. Judge was generating his own force, hitting balls 513 feet off the glass beyond the stands in the outfield. And yeah, he definitely feels like the crossover star to come, which is maybe unfair to Trout, but on the other hand, that's probably exactly how Trout would like it. Yeah, that's right. So I also wanted to mention, just since we're on the subject of home runs, Rob Manfred was asked repeatedly, it seems like, about the baseball in his annual meet with the
Starting point is 00:09:37 baseball writers, state of the game, all-star game kind of thing. And he didn't come out and say anything all that revealing. Certainly didn't say, yeah, the baseballs are different, you got us. But he, I thought, changed the way that he spoke about it subtly. For someone like me, who's read a lot of Rob Manfred comments about baseballs, it seemed to me that he was actually moderating the way that he talked about this in certain ways. He continued to say the standard line about how the balls are within specifications and we haven't intentionally done anything.
Starting point is 00:10:17 But he also seemed to allow more than I've heard him allow before for the possibility that something may have changed inadvertently. And I heard him mention a couple times that the baseballs are hand-stitched and that there is some inevitable amount of variation from season to season. And he also said that they are looking at the parameters. And instead of just saying they're within specifications, which is what they always say, he mentioned that they're looking at possibly revising or narrowing those specifications, because I think the response to that statement is always that, well, yes, no one is disputing that, but the range of allowable baseballs is extremely wide, so that is not all that meaningful. So he did actually change his tune slightly, I thought. And maybe that is because they want to allow for the possibility that something has happened. And again, I don't think they have intentionally changed anything, but he is allowing a little more leeway for something to have changed without an element of intention. And apparently this is something that the Players Association has asked MLB about. Bill Shaikin tweeted something about how Tony Clark said that the Players Association is discussing the allegedly changed baseballs for health and safety reasons.
Starting point is 00:11:43 And I don't know whether that's because of the rise in blisters, which I wrote about at The Ringer this week, or whether it has to do with the speed of batted balls and concern that a pitcher could get hit with one or something like that. But if they're getting pressure from that quarter, as well as from writers repeatedly writing about this and asking about it, that could maybe explain why they are talking about it in a slightly different way. And it sounds as if they are looking into bats also as a possible explanation now, although I don't think that would explain any of the effects that Rob Arthur found about how the ball seems to be carrying differently once it has been put in play. I just don't think a bat would affect anything after the ball has left the bat.
Starting point is 00:12:31 So not sure that that would be an explanation for that, but it's certainly worth looking into at least. So it wasn't like the big grand revelation that everyone has maybe been hoping for. And I think Manfred said that he doesn't think we'll ever entirely understand this change in home run rate, which I actually agree with. But to me, at least, it seemed significant that he was talking about this in a slightly different way, perhaps as a result of some of the information that has come to light. You're never going to be able to give this up. You just want to be done with a baseball story, and then it's just always going to come back. You wonder if this is just sort of like maybe baseball has gotten a lot more popular in Costa Rica,
Starting point is 00:13:14 and the factory workers in Turrialba have decided they like the long ball, and so they've just kind of made some adjustments. But Manfred, when he was talking about how they're investigating the bats right now, he has probably been told by several scientists or science-affiliated people that the ranges allow for baseballs to do crazy things while being legal. So yeah, I agree with you. It's interesting at this point. He definitely was speaking in a different way. I don't know what that's going to mean, because at this point I pretty firmly am willing to believe that Rob Manfred doesn't really know what happened. I don't think that there's a cover-up here, but maybe just an unwillingness on his part to concede that, yeah, actually the baseball changed very little, but it also changed a lot at the same time. And
Starting point is 00:14:01 we didn't intend that. Yeah, right. And I think he also said something to the effect of, he likes homers and fans like homers, and generally I agree with that. And it was sort of striking reading just some of the replies to tweets. Like, you know, Bob Nightengale tweets about Manfred quotes about how they're going to be looking into the range of baseballs and that sort of thing, and most of the replies, just from my skimming, were, why? Like, leave it alone. We like homers.
Starting point is 00:14:29 Homers are fun. Baseball is better with homers. So I don't know if that is the standard, typical fan's opinion on this, but I wouldn't be at all surprised if it were. So I think he is well within his rights not to do anything or change anything necessarily, but it's obviously something that will continue to provoke curiosity. And so he
Starting point is 00:14:52 will continue to be asked about it. And it seems as if his response has changed slightly. I saw a number of tweets when all the writers were tweeting out their Manfred quotes about how he's looking into the ball and the bats. I saw a number of tweet responses from people who were like, Rob Manfred is ruining the game. And I couldn't tell if that was what they were saying because he's looking into it or because there was something to look into in the first place. So somebody is upset out there. And any number of people believe that Rob Manfred is ruining the game.
Starting point is 00:15:19 However, those people might have different opinions on how. Maybe there are not enough home runs or maybe there are too many. Turns out you cannot satisfy all baseball fans. All right. Did you have anything else? So one last thing. Might as well mark the occasion. I know we have one more official podcast before Mike Trout is back in the Angels lineup.
Starting point is 00:15:36 But just so we have a complete summary, Mike Trout is going to come back. He will be in the lineup on, what is it, Friday, Thursday, whenever the Angels play next. He's done with his rehab. And so the Angels have completed 39 games, I believe it was, without Mike Trout. He was injured on May 28th. He was immediately out of the lineup. So I can give you some numbers. I actually wrote a little thing on FanGraphs on Monday about this, and I polled the audience. I looked at the Angels, and I split their numbers into two groups. One was up through the game where Mike Trout got hurt, and then one column of stats was the Angels without Mike Trout. And I'm not going to go down the entire table here
Starting point is 00:16:17 because there's too many numbers, but one of the winning percentages was .491. The other one was .487. The Angels with Mike Trout were a game under .500. The Angels without Mike Trout were a game under .500. Their wRC+, that's offense relative to league average, in both periods was 89, which is identically bad. Identical strikeout rates, identical hard-hit rates, identical home runs per fly ball. They had a higher slugging percentage by nine points without Trout. However, they had a
Starting point is 00:16:45 higher OBP by eight points with Trout. Anyway, things were almost completely identical. The only difference is that after Trout went down, the Angels started to run more, which, I mean, they were playing more Ben Revere and Eric Young Jr., so that probably accounts for that. But I polled the FanGraphs audience on which time period they thought was the one with Trout, which was the one without Trout. And 55% of respondents guessed correctly, which means 45% of respondents couldn't tell when the Angels were without Mike Trout. I think, as we've mentioned before, this should not be read into in any way that belittles Mike Trout or downplays his significance. He is the most valuable player in baseball. He is the best player in baseball, pending Aaron Judge's second half. So he is amazing, Angels need him, all that stuff. No reason to believe the Angels are any good without
Starting point is 00:17:37 Mike Trout. However, they were almost perfectly fine without Mike Trout for a quarter of the season, which is a useful reminder that any kind of player absence over a fraction of a season can be survived, if not thrived during. Yeah, no, I totally agree. And I'm sure that I will be citing this for a long time in the future whenever some superstar gets injured and fans of that superstar's team panic. I will cite the Mike Trout example. And I don't think it's a Ewing theory application where, like, you know, the superstar goes down and everyone else plays better as a response to that. I doubt that has anything to do with it. And it's not as if you would expect this to happen or that you should expect this to happen.
Starting point is 00:18:25 But you're right. It is a good reminder that in baseball, losing one player, gaining one player only helps you so much. Might not help you at all over a certain period of time. So glad that Mike Trout is coming back. Shohei Ohtani also is making his first pitching appearance of the season on Wednesday. So some Effectively Wild favorites returning to the field. I was remembering, I don't know why this one stuck out, but back in 2003, Derek Jeter suffered a pretty significant, I think it was opening day injury,
Starting point is 00:18:54 maybe second day injury that knocked him out for a month and a half. And it was supposed to be a big deal for the Yankees. And when the Yankees got Derek Jeter back, they played .600 baseball. They won, well, you know what a 620 percentage means. They played .600 baseball. However, without Derek Jeter, they played .700 baseball. They played better without Derek Jeter than they did when Jeter came back. So, you know, doesn't mean anything.
Starting point is 00:19:19 But there are several cases like this. I wish they were easier to isolate because it would be useful research. I suspect that if you looked at, well, I guess I could nearly confirm that if you looked at this over a big enough sample, you could confirm the wins above replacement numbers by just identifying that, hey, look, these aren't coming out of nowhere. Teams are actually worse by this much. However, they are not always worse by that much. Sometimes a team just does better without its Hall of Fame center fielder or shortstop. And there's no rhyme or reason for it. Aside from, I don't know, who took over for the Yankees playing shortstop that season.
Starting point is 00:19:50 Let's just go ahead and say, I don't know, Enrique Wilson. I haven't looked this up. Yeah, someone not very good, I'm sure. Almonte. Erick Almonte. Oh, dear. Yeah, no. Not good.
Starting point is 00:20:03 But Yankees didn't care. All right. So that concludes the banter. We probably could have mentioned at some point before this that we have a guest on this episode, but we do. We're talking to Glenn Healey, professor and baseball researcher, about a cool piece published earlier this week at Baseball Prospectus about computing pitcher similarity. So we will be back with Glenn in just a second. That the man that stands in front of you Is not the sum of all his dreams But I'm hoping they've got something in common So earlier this week, Baseball Prospectus published some cool research in an article called Measuring Pitcher Similarity. This was similar to work that Jeff and I have done, but far more complex and better in every way. And so we wanted to bring on one of the authors to talk about it. This was the product of a collaboration between former
Starting point is 00:21:21 podcast guest Dan Brooks, who I believe made a casual suggestion at some point that prompted some of this research, as well as Shiyuan Zhao, and our guest now, Glenn Healey, who is a professor of electrical engineering and computer science. Hey, Glenn. How you doing? Good. So how did you get into baseball research, first of all, because I know there's a long and rich tradition of academics who are probably too brilliant to be wasting their time on baseball, wasting their time on baseball. So how did you come to do this research? Where did the idea come from? And what's your history also when it comes to doing baseball research? Well, I've always loved baseball, so I've been wasting time on baseball since I was a little kid, we could say. The work I've done in the past, it's been concerned with lots of different
Starting point is 00:22:11 kinds of sensor systems. I've done a lot of work with computer vision and remote sensing and have a background in computer science, math, physics, statistics, and signal processing. So it was kind of natural when the data started coming available, when they put all these sensors into ballparks and the data started coming out, that I could have a lot of fun working with that data based on my background and love of baseball and also try and make a contribution. So this particular work, as you said, I was talking to Dan Brooks at Saber Seminar last summer, and he mentioned that he was interested in measuring pitcher similarity.
Starting point is 00:22:51 And I remember being at the Red Sox game that night, and I was explaining to him how we could compare pitch distributions using something called the earth mover's distance. And I got back to California, and I wrote something up, and then I got one of my graduate students, Shiyuan, involved, and she was already working on some pitch sequencing research, so she was familiar with the data, and that evolved into what we've got. So I mentioned that Jeff and I have done similar work, and what I mean by that is just that kind of a back-of-the-envelope method of comparing pitchers based on their stuff. And I think this was a method that the baseball
Starting point is 00:23:27 analyst Joe Sheehan, the one who works for teams, not the one who writes about baseball, pioneered originally. And then Jeff came up with it sometime later independently, I think. And then I just stole it from Jeff. And this was just a way of comparing pitchers just based on the average values of their pitches, so speed and vertical movement and horizontal movement, which is sort of what you're doing here, but I think you're doing it much more rigorously and looking at every pitch rather than just the average values. So is there a way to explain your method here in a way that won't completely lose us because of our inexperience? I think so. Yeah, it's kind of like what you described. We're looking at, you can think of looking at the average values of different pitches.
Starting point is 00:24:16 And then were you comparing across pitch types or the values for the same pitch type? Specific pitch types, I think, is what we've done, and just looking at standard deviations from the average in terms of speed and movement. Okay. So you can think of what we're doing as a bit of a generalization of that, but the same basic idea. So we can take a pitcher's distribution of pitches, and there's some plots in the paper, but I think the example is we had Jon Lester, and you can plot every one of his pitches as a point in a three-dimensional space with the velocity and the horizontal and vertical movement. And then you can do that for every pitcher.
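As a rough illustration of the back-of-the-envelope comparison described here (standard deviations from the average in speed and movement, within a single pitch type), a minimal Python sketch might look like the following. The pitcher labels, the numbers, and the function names are all invented for illustration; this is not the code behind anything discussed on the show.

```python
from statistics import mean, pstdev

# Hypothetical four-seam averages: (velocity in mph, horizontal movement in
# inches, vertical movement in inches). All numbers are invented.
FOUR_SEAM_AVERAGES = {
    "Pitcher A": (93.0, -4.0, 9.0),
    "Pitcher B": (93.5, -4.5, 9.5),
    "Pitcher C": (88.0, -9.0, 4.0),
    "Pitcher D": (96.0, -2.0, 11.0),
}


def to_z_scores(averages):
    """Express each pitcher's averages as standard deviations from the group mean."""
    columns = list(zip(*averages.values()))
    means = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(values, means, spreads))
        for name, values in averages.items()
    }


def similarity_score(z_a, z_b):
    """Sum of absolute z-score gaps; 0 means identical averages, larger means less alike."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b))


z = to_z_scores(FOUR_SEAM_AVERAGES)
for other in ("Pitcher B", "Pitcher C", "Pitcher D"):
    print(other, round(similarity_score(z["Pitcher A"], z[other]), 2))
```

Because this only looks at averages, two pitchers with the same mean fastball but very different spreads would score as identical, which is the gap the every-pitch approach is meant to close.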
Starting point is 00:24:58 So you've got these cubes with points in them that represent pitches. And what we're doing is we're saying how much work, if you have pitcher A and B, how much work would it take to move pitcher A's pitches so that they wind up in the same place as pitcher B's pitches? So how much effort do you have to do to distort one guy's distribution of pitches to the other guy's distribution of pitches? And if the distributions are exactly the same,
Starting point is 00:25:25 then that amount of work is going to be zero. You don't have to do anything. If they're very different, then you have to do a lot of work. So you're going to have a big difference between the pitchers. And then we worry about things like how it's harder to move pitches in certain directions in the space because of the correlations of the variables and things like that. But there's a method, it's called the earth mover's distance, it's been around for a long time, that we're just leveraging. It's basically set up to solve this kind of a problem. So when you went into this and you were analyzing pitchers by their velocity and their pitch
Starting point is 00:26:00 movements, because you had Dan Brooks sort of folded into the mix, how much concern, if any, did you have over, say, proper in-game calibrations or how pitch movements and tracking have changed in 2017? Because I know there are still game-to-game inconsistencies, and while I suppose over larger samples that wouldn't make too much of a difference, for individual pitches it seems like there's enough there that it could sort of undermine the research. So how, if at all, did you try to correct for potential, I guess, calibration errors? Right. Yeah, that's a very good point. The work that we've done has just been for 2015 and 2016. So the 2017 data, there's been a lot on the internet that talks
Starting point is 00:26:42 about how things have changed when they moved from PITCHf/x to Statcast. And we haven't done anything with the 2017 data yet. So I think if we did, we'd probably notice guys changing from 2016 to 2017 just based on differences in the data. But everything in the paper was just 2016 and before. So we're just in the PITCHf/x regime. And can you talk about some of the areas you looked into? And we'll link to the article and people should go check it out. You have a lot of lists and leaderboards and similarity lists in the article too.
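The "how much work does it take to move one pitcher's pitches onto another's" idea from earlier in the conversation can be sketched in a few lines of Python. This is only a toy under stated assumptions: both pitchers have the same number of pitches, every pitch counts equally, and distance is plain Euclidean distance in raw units (mph and inches mixed together). It does not attempt the handling of correlated variables mentioned above, the sample pitches are invented, and `earth_movers_distance` is a hypothetical helper name, not the authors' code. With equal-sized, equally weighted samples, an optimal way of moving the points is a one-to-one matching, so the sketch simply tries every matching, which is only practical for a handful of pitches.

```python
from itertools import permutations
from math import dist


def earth_movers_distance(pitches_a, pitches_b):
    """Average distance each pitch travels under the cheapest one-to-one matching.

    Assumes equal-sized samples with equal weight per pitch. Brute force over
    all matchings, so keep the samples tiny; real tools use an optimization solver.
    """
    if not pitches_a or len(pitches_a) != len(pitches_b):
        raise ValueError("this sketch needs two non-empty, equal-sized samples")
    n = len(pitches_a)
    cheapest_total = min(
        sum(dist(pitches_a[i], pitches_b[j]) for i, j in enumerate(matching))
        for matching in permutations(range(n))
    )
    return cheapest_total / n


# Invented pitches: (velocity in mph, horizontal movement in., vertical movement in.)
pitcher_a = [(93.0, -4.0, 9.0), (94.0, -5.0, 10.0), (85.0, 2.0, -3.0), (78.0, 6.0, -8.0)]
# Same shapes, but every pitch thrown 3 mph harder.
pitcher_b = [(v + 3.0, h, z) for v, h, z in pitcher_a]

print(earth_movers_distance(pitcher_a, pitcher_a))
print(earth_movers_distance(pitcher_a, pitcher_b))
```

Identical distributions need no work at all, and the order in which pitches are listed does not matter, since the matching is chosen to be as cheap as possible.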
Starting point is 00:27:19 But can you talk about the different ways that you approached pitcher similarity? Yeah, we compared pitchers based on a number of different parameters. One of the things we did was just find the most similar pitchers for each pitcher. And there's a link to a table that people can go to if they just want to check their favorite pitcher and see who the most similar other pitcher is. And then we could figure out who the most unique pitchers are. So those are the guys that have the biggest distance between themselves and the nearest other pitcher. And then we looked at things like how much guys change from one year to the next. And we learned different things there. For example, how stable the knuckleballers are, R.A. Dickey and Stephen Wright were from
Starting point is 00:27:59 year to year. And then how some of the biggest changers were guys that, say, moved from the rotation to the bullpen. And then you see, like, how much James Paxton changed from 2015 to 2016. He made that big change in his arm slot. And so things like that, you can learn. We looked at which pitchers are the most different in response to the change in the batter handedness. So what are the platoon differences? Which guys stay the same when facing right and left? And which guys change the most?
Starting point is 00:28:29 And I learned some things there. One of the things I learned was that for the guys that have the big differences, several of them throw more four-seams to the same side and then sinkers to guys on the opposite side. Guys like Jered Weaver, Dustin McGowan, Brad Hand, Danny Duffy. And then there was Kyle Hendricks, who on the list was the one guy that was the opposite. He was throwing more sinkers to right-handed batters
Starting point is 00:28:52 and more four-seams to the left-handed batters. When I watched the game not long after that, I saw him trying to backdoor right-handed batters by running his two-seam back to the outside corner, stuff like that, that I probably never would have noticed before, that pops out after I start creating these leaderboards. And then we also analyzed guys who change the most with two strikes in their distribution. So how different is a single pitcher from before two strikes to after two strikes, and guys like Lance McCullers popped out with just a huge increase in his curveball,
Starting point is 00:29:26 which made him very different across those two cases. And then I was watching a game not long after that. And here's this ridiculous curveball that I might not have appreciated as much before. So, just changes in pitch mix across years, guys like Trevor Bauer and Kelvin Herrera that were written about on FanGraphs from 2015 to 2016. So we generated the leaderboards, and then you watch the games and things like that pop out. So like anyone who is reading an article like this, I end up enraptured by the lists and the leaderboards because you always want to see how the compelling research leads to its results. And one of the first results that you have listed is that you found that at least in 2016, the most similar, at least right-handed pitchers were Matt Harvey and
Starting point is 00:30:15 Shelby Miller. I don't know if they were more similar than the two most similar left-handed pitchers, who were Jon Niese and Chris Rusin. But in any case, I think Harvey and Miller are far more interesting than Niese and Rusin. So we'll stick with them. You found that they were the two most similar right-handed pitchers. But given that, how would you recommend that, I guess, a reader or a listener choose to interpret that? You have this data point. So what do you make of the fact that Harvey and Miller are so alike? Well, I don't necessarily make anything out of it. Just that the data says that if we look at their pitches, they're very, very similar. Both of them have had health issues, which may or may not be related to that. There's a number
Starting point is 00:30:56 of things that we can do with this similarity measure to try and, say, project the future or compare people across levels, that kind of thing that I think is very interesting as far as the applications. That particular point, I would just say they're very similar. That's about all I can say. Would you be willing to declare that this is probably the coolest baseball research that's been published this July? Would you be willing to admit that? I would not say that. All right, I'll say it for you. I eat this stuff up.
Starting point is 00:31:30 Do you have the lists handy? Could you talk about some of the pitchers who are least similar to any other pitcher? Because that's always fun. I've got the least similar pitchers across the whole data set. So among right-handers, it was Brad Ziegler and Marco Estrada who were the two most dissimilar right-handers in 2016. And then for the left-handers, it was Zach Britton and Tommy Milone. And those things both make sense. Brad Ziegler had just extreme sink on just about everything. He had the lowest vertical movement, most negative vertical movement
Starting point is 00:32:05 over all of his pitches. And then Estrada had the highest vertical movement. So Ziegler, all sinkers, Estrada, lots of four-seams that have the rise in them. So that makes sense. And then Zach Britton and Tommy Milone, similar kind of an idea. Zach Britton throws lots of really hard sinkers, at least in 2016, which is what we're looking at. And Milone does not throw hard. And his most frequent pitch is the four-seam fastball, which has the opposite kind of vertical movement. So the dissimilar pitchers made sense. And then we generated the visualizations. And when we look at those, you can also see that Ziegler and Estrada are the most separated on the right-hander plot, and Britton and Milone are the most separated on the left-hander plot.
Starting point is 00:32:49 But we can set these up if you want leaderboards for the most dissimilar pitchers across all the data set. We can generate those if people would be interested in them. But as far as dissimilar, I think it's more curiosity and fun for fans. I don't know if you can necessarily use that for any kind of an application. movement, I'm sure makes a lot of sense. But do we know that those are the three things that you would want to compare in order to establish how similar pitchers are to each other? As you mentioned in the article, there are other factors that determine the characteristics of a pitcher, pitch location, sequencing, deception, release point. There's all sorts of stuff. So do we know that these three are the most telling and
Starting point is 00:33:45 are they weighted equally in the method? Like would two guys who have similar velocity, would that make them more similar than guys who have similar vertical movement, for instance, or are they all kind of on an equal field? Right. Yeah, that's a good question. So the goal here was to try and compare guys just based on their raw stuff. And that's why we selected velocity and movement. And we're not even trying to relate these things directly to performance, just raw stuff. And like you said, location, very important. I wrote another paper. It was at the SABR Analytics Conference back in March, talking about how location and movement and velocity and the count and the platoon configuration all go into the value of a pitch. But things like the location, sequencing, deception,
Starting point is 00:34:38 that's less indicative of raw stuff. And the reason that we liked looking at raw stuff is because one of the applications was to be able to compare pitchers across environments. So if you're, for example, trying to compare guys across majors, minors, amateur, foreign league, maybe you're scouting a foreign pitcher and you can look at them, you can have your scouts look at them. But one thing that would be very useful is to be able to ask who are the most similar major league pitchers to this guy, if you have his pitch data. And so you're not necessarily looking at trying to, I mean, one thing you could do is you could say, what are the performance corrections? If we know this guy's ERA in this other foreign league, we can try and
Starting point is 00:35:19 correct that to predict what his ERA would be. But that's going to be, I think, a lot less indicative of what he's going to look like than if we can just take his raw stuff and say, who are the comparables in MLB? Or similarly, if we're trying to evaluate some college pitchers, do we want to try and correct for performance? Or I think it'd be much better if we could just have a measure of the raw stuff and we can say, all right, over the last six, seven years or however far back we've got data, who are the most similar college pitchers to this guy based on raw stuff or the same thing for minor league pitchers. As far as the variables, what we use, and that's a good question. So how do you weight velocity and movement in the model? And again, it's not directly for performance. The paper I wrote,
Starting point is 00:36:05 The Intrinsic Value of a Pitch, there we're talking about exactly how much you gain from, say, velocity or vertical movement on a given pitch. But here what we're doing is a fairly standard thing that's done when you're looking at multidimensional distributions, which is we have a whitening transform. And what that means is you take the original variables, the velocity and the horizontal and vertical movement, and then you convert that to new variables that are uncorrelated and that have unit variance. So Mitchel Lichtman was asking a question on the BP article about, well, what's going to happen if we decide to measure movement in meters instead of inches or something? Is that going to change the result?
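The whitening step described here can be illustrated with a short sketch. This is not the code from the paper; it is a minimal example that assumes a PCA-style whitening and a plain Euclidean distance between whitened pitch vectors, with made-up pitch data. The function names `whiten` and `pairwise_distance` are hypothetical. It shows why changing movement from inches to meters leaves the distance between two pitches unchanged.

```python
import numpy as np

def whiten(X):
    """Map rows of X to new variables that are uncorrelated with unit variance."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov = V diag(eigvals) V^T
    # Rotate onto the eigenvectors, then rescale each axis to unit variance.
    return centered @ eigvecs / np.sqrt(eigvals)

def pairwise_distance(Z, i, j):
    """Euclidean distance between rows i and j of whitened data."""
    return float(np.linalg.norm(Z[i] - Z[j]))

# Synthetic pitches (illustrative numbers only):
# columns are velocity (mph), horizontal movement (in), vertical movement (in).
rng = np.random.default_rng(0)
pitches_in = rng.normal(loc=[92.0, -4.0, 8.0], scale=[3.0, 5.0, 4.0], size=(500, 3))

# Same pitches with both movement columns converted to meters.
pitches_m = pitches_in * np.array([1.0, 0.0254, 0.0254])

d_inches = pairwise_distance(whiten(pitches_in), 0, 1)
d_meters = pairwise_distance(whiten(pitches_m), 0, 1)
print(d_inches, d_meters)
```

The two printed distances agree up to floating-point error because the distance after whitening depends only on the data and its covariance, not on the units chosen for each axis.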
Starting point is 00:36:49 And the answer is no, because we do this whitening transform that scales the variables. So the result won't depend, the similarity distance won't depend on how we scale the axes. So that's how we treat the different variables. You talk about separating the raw stuff from the performance, which I certainly agree with. theory learned to throw that stuff differently. For example, two pitchers could throw the same stuff, but if one guy uses it better than another guy, well, then maybe one of them should learn to throw like the other one. Anyway, one of the traits of the raw stuff is it's sort of intrinsic. It is part of the pitcher, whereas you can change your performance in theory. But along those lines, did you give much thought to including or folding in release point or pitcher height, pitcher size,
Starting point is 00:37:45 or any of that stuff? Or are you comfortable enough with the fact that most pitchers are generally about the same height and that you can sort of infer arm angle from how pitches move? You can put these other variables into the mix. I've actually gotten a few emails just in the last day or two from people asking about incorporating those kinds of variables. So the algorithm would support that. But you have the usual issue where if you start bringing in variables that are less significant than the variables you already have, then those less significant variables can start to blur the impact of the more significant variables. So there's a study that can be done with that. And what you would have
Starting point is 00:38:26 to do is you would have to have some objective function at the end to try and tell you how you want to start to treat the different variables. So if your goal is, for example, to try and optimize projections for minor league pitchers, what they're going to look like in three years, then you can try and look at those variables and, you know, the height, the weight, the release point, all those kinds of things, and then try to optimize those to figure out how you can find your similarity classes that are going to tell you the most about where this guy is going to be in a few years or however long your projection window is.
Starting point is 00:39:02 So those are things that can be done. We just have to figure out a way to bring those statistics in to optimize whatever we would be trying to optimize. But there's always the danger, not just for this problem, but for any problem. If you start to bring in new variables that don't contribute a lot to what you're trying to discriminate, they can take away from the impact of your good variables. So it's something you want to be careful of and know exactly what you're trying to optimize when you bring those in.
Starting point is 00:39:32 You set a pitch minimum of 1,000, which is pretty reasonable, especially when you're dealing with any sort of research where you'd like a larger sample size, and no one's going to complain about how much information you have when you have 1,000 pitches. But one of the cool things, I think, about the pitch tracking technology is that you get a pretty good idea of what somebody throws after a very short period of time. So had you given much consideration to allowing even a sample as low as 100 pitches, but maybe 250 or 500, or what was it that had you settle on 1,000? Yeah, we could certainly lower the numbers. You get a good idea, like you say, of what this guy's fastball looks like, what this guy's slider looks like.
Starting point is 00:40:12 The thing that you don't get from the small sample is the distribution. So you don't necessarily get a good estimate of what fraction of four-seam fastballs he throws, what fraction of sliders he throws, until you get up to more pitches. The other thing that kind of pushed us away from doing that is if we're trying to generate these similarity tables that would look nice on a website and you type in the name of your favorite pitcher and the top five guys that come up are all guys that threw 25 pitches over the course of the year that nobody's ever heard of, it might not be that interesting for people looking at the measure. But the method would certainly support bringing in people with small numbers of pitches. I guess I would be remiss if I didn't mention that you also had a list
Starting point is 00:40:56 just of the most unique pitchers, left-handed pitchers, the ones with the greatest distance to their similar match in 2016. And Britton, as you mentioned, was at the top, but then Rich Hill and Clayton Kershaw were number two and number three, very popular pitchers on this podcast. So it is nice to see that they are uncommon. They are not really like anyone else, which kind of checks out. I think that is a validation of the results. When you look at a lot of the lists that you include in the article, they make sense. They check out. It makes sense probably that R.A. Dickey is the pitcher who is most similar to himself from one year to the next, which is
Starting point is 00:41:39 not to say that there aren't some surprises here and there. I think it's probably like the old Bill James line about a useful stat will just confirm what you already thought most of the time and surprise you now and then. And that's a good sign that it is something useful to you. Yeah, we were pretty happy with most of the lists that came out that they seemed to make sense. Having Ziegler and Britton at the top of the unique list seemed to make sense. And of course, Rich Hill and Clayton Kershaw, you watch them pitch and the word unique does come to mind along with a bunch of the other guys. So that was good. That made us happy. Yeah. What are some of the other applications or possible applications of this information that we haven't talked about? We've talked about tracking guys
Starting point is 00:42:25 from one year to the next to see how they change. We've talked about looking at, say, amateur pitchers and comparing their stuff to guys with established track records, which at least in theory one would think could possibly improve your projections for those players looking long-term. We haven't really talked about it, but I assume one of the major ways we could use this is projecting batter pitcher performance and looking at pitchers who are similar to each other and how batters have performed against that type of pitcher. And maybe that would be instructive in figuring out how they would be against pitchers of that type in the future. Have you done any work on that?
Starting point is 00:43:06 Are you aware of any other work that's been done along those lines? Yeah, I've done some work on that in the past, and that's something that we want to investigate using this for. So as you've said, kind of the method one that you hear about on broadcasts, the announcer will say, well, Altuve's two-for-eleven career against Scherzer. And that tells you basically nothing, as much research has shown. And then at the other end, the other extreme is the log5 model that Bill James developed in 1981, I think. And that uses all the data for the batter and the pitcher to try and determine the probability of different outcomes.
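For reference, the log5 model mentioned here is usually written, in its batter-versus-pitcher form, as an odds-ratio combination of the batter's rate, the pitcher's rate, and the league rate. The sketch below uses that commonly cited form; the function name `log5` and the input rates are made up for illustration and are not taken from the episode.

```python
def log5(batter_rate, pitcher_rate, league_rate):
    """Estimated probability of an event in a batter-pitcher matchup (log5 form).

    batter_rate:  how often the event happens for this batter overall
    pitcher_rate: how often this pitcher allows the event overall
    league_rate:  how often the event happens league-wide
    """
    num = batter_rate * pitcher_rate / league_rate
    den = num + (1 - batter_rate) * (1 - pitcher_rate) / (1 - league_rate)
    return num / den

# Illustrative numbers only: a .300 hitter facing a pitcher who allows a .250
# average in a league that hits .260.
p = log5(0.300, 0.250, 0.260)
print(round(p, 3))
```

A useful sanity check on the formula is that a league-average pitcher leaves the batter's rate unchanged, and a league-average batter leaves the pitcher's rate unchanged.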
Starting point is 00:43:50 So we could use this similarity measure to give us something in between. So we'd have a bigger sample than just the Altuve versus Scherzer, but we'd have a more focused sample than what you get with log5. So you can say, okay, we can look at Altuve against Scherzer-like pitchers and then try and predict what the performance would be. So I think that'll allow us to do a better job trying to project individual matchups. And then another big application that we're working on is to try and optimize projections. So how can we forecast what a guy is going to do next year or for the next couple of years? The first big step in forecasting is usually trying to estimate a player's talent level or what sometimes people call the true talent for different skills that are represented by statistics. And the estimate's
Starting point is 00:44:37 based on a regression that depends on the reliability or the year-to-year correlation of the statistic and its population mean. So that's the regression to the mean that people talk about. And the big question always is, what population are we going to use to try and get the reliability of the statistic, the correlation from year to year, and to get the mean that we're going to be regressing to? And people often use the population, say, of all major league pitchers or of all major league starting pitchers, and then they do their estimate. But if we have this similarity measure, what we can do is we can generate for a given pitcher, or a position player with something similar, we can generate the population statistics using similar players. We don't have to use the whole league. And then that should give us more accurate ability to estimate the talent level or the true talent. And then as far as projections go, the next step after you've got the talent level is trying to use an aging curve
Starting point is 00:45:37 or health forecasting as part of an aging curve to figure out what that talent level is going to look like next year or for the next few years, depending on how long a contract you want to sign the guy to. And what we can do with these similarity measures, we can generate separate aging curves and separate health risk forecasts for different classes of pitchers. So if we have these similarity classes, then we can talk about how do guys in this particular group age and what's their health risk. And it's the first step, just what the reliability and mean we should use for their projection. So that's, I think, we can predict matchups.
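The regression-to-the-mean step described above can be sketched in a few lines. This is only an illustration of the general idea, assuming the simplest linear form (shrink the observed value toward a population mean by the year-to-year correlation); the function names and the strikeout rates for the "similar pitchers" group are invented for the example, and the pooled two-year mean is one possible choice rather than the speaker's stated method.

```python
import numpy as np

def population_parameters(year1, year2):
    """Mean and year-to-year correlation (reliability) for a comparison group.

    year1[i] and year2[i] are the same player's statistic in consecutive years.
    """
    year1 = np.asarray(year1, dtype=float)
    year2 = np.asarray(year2, dtype=float)
    mean = float(np.concatenate([year1, year2]).mean())
    reliability = float(np.corrcoef(year1, year2)[0, 1])
    return mean, reliability

def estimate_talent(observed, mean, reliability):
    """Regress an observed value toward the population mean."""
    return mean + reliability * (observed - mean)

# Made-up strikeout rates for a hypothetical group of similar pitchers.
similar_y1 = [0.24, 0.27, 0.22, 0.30, 0.26]
similar_y2 = [0.25, 0.26, 0.23, 0.28, 0.27]

mean, reliability = population_parameters(similar_y1, similar_y2)
estimate = estimate_talent(0.31, mean, reliability)
print(mean, reliability, estimate)
```

Swapping the comparison group (whole league versus similar pitchers) changes both the mean being regressed to and the reliability, which is the point being made about choosing the population.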
Starting point is 00:46:17 We can also do a better job with this, trying to just project future performance because we've got better populations that we can represent pitchers with. Are you looking forward to carving time out of your professorial schedule over the next 50 or 75 years until you develop a large enough sample to inform these longer-term projections? Yeah. People often say, well, can't you just solve the whole problem, figure out everything about pitching? Somebody talked about coming up with a unified theory of pitching, but that's going to take a long time. So there's certainly a lot of different things that I can be working on here. And one of the nice things about being a professor is that you get to define your own research problems. So I've got quite a few of them here. Yeah.
Starting point is 00:47:07 Well, how often does baseball either become something that you can test out a method on, or how often do you get to bring something to bear that has nothing to do with baseball for baseball research, or you're teaching something in class, or you're reading something in your own field, and you think, oh, there are baseball applications of this? Pretty much every baseball problem I think of, I can go back in my past and say, this would be the best way. This would be some paper I read 11 years ago. This would be the way to try and address this problem. So I kind of have a unique situation in the sense that usually people go into a field and they try to learn the tools they need for that particular field. And I've been
Starting point is 00:47:53 studying all these other problems and just building up a broad background for quite a few years. And now all this baseball data appears and I've got the tools and here's the data. And so I can just go back and just about every problem I see, I can say it's similar in some sense to some other problem I've worked on before. And I can just pull some technique out of the past and say, let's apply this to this baseball problem. So it's everywhere. Yeah. And you did something similar to this with hitters, right? Using Statcast data in the
Starting point is 00:48:27 past? That's right. Yeah, I was using HITf/x for it, actually. But that work was called the intrinsic value of a batted ball. And the idea was to build a mapping from the three parameters, the exit speed and the launch angle and the horizontal angle, to batted ball intrinsic values. And this is, again, something that we can use to just compare guys between the majors and the minors, and in amateur and foreign leagues if we have the data, because it takes out all the context. So it removes the factors such as the ballpark and the defense and the weather. And it's also nice for trying to project the future because it allows skill separation. So if you think of looking at, for example, a guy's wOBA or his wOBA on contact, that depends on his batting ability.
Starting point is 00:49:15 And it also depends on his running speed. And you can't really pull those apart if you're looking at statistics that are based on outcomes. But this intrinsic value work, what we have is we have a measure of just the value of a batted ball just based on the contact parameters. And so we can build a statistic that's based on a batter's contact ability, and then we can build a separate statistic that just measures his running speed, and now Statcast has these measurements of sprint speed that we can use. And the reason this is important for projection models is that we can now age these different
Starting point is 00:49:54 skills separately. So one of the difficulties with trying to, for example, project wOBA forward is that it's got these mixtures of different skills, and a batter's contact skills age differently than a batter's running speed. And the reliability of those statistics, the year-to-year correlation of those statistics, are different. So it allows us to break those things into separate pieces. And then we get models that are specific to, say, just the batted ball skill of the batter. And then we did the same thing for pitchers. We looked at their batted ball skill. And then we can, in addition to getting just a skill-specific measurement, we can project them more accurately into the future. So yeah, anything, the theme of the work I've been doing for the last few years is taking the sensor
Starting point is 00:50:41 data and then trying to develop models that are as independent of the context as possible. So just try to take out ballpark, defense, weather, and if we're talking about pitchers, try to take out the framing, take out the umpire, take out the random luck, and then we can get much more accurate models for the players because we've gotten rid of all these different sources of variability in the models. So yeah, I've done that for batters and pitchers. So lastly, you're speaking at Saberseminar next month in Boston. Jeff and I will be there, so hopefully we'll see you there. I assume you're speaking about this research, and what else can people look forward to as far as being able to
Starting point is 00:51:28 play with this information? As you mentioned in the article, you link to something where people can look up individual pitchers, but what are the plans as far as making this available at a site or sites? Yeah, Dan Brooks was talking about making it available on Brooks Baseball, and then Harry Pavlidis was also talking about making it available on Baseball Prospectus. So the plan is to get the similarity measure out there so that people can use it. So you can just click on your favorite pitcher, and it'll bring up the top 10 comps based on a year. You can go back in history if you want, or really any of the different things that we put in the paper, it wouldn't be too hard to put those out there.
Starting point is 00:52:11 So hopefully Brooks Baseball and Baseball Prospectus. I hope it's available for individual pitch types at some point in the future too, because that would save us a lot of spreadsheeting. So that would be appreciated. But this is really cool, and we enjoyed reading it and looking at all the names, and I look forward to the future research that will be based on this. So we're glad we could have you on to talk about it. So thank you for doing it and for talking about it.
Starting point is 00:52:38 Well, thanks for having me. Thank you very much. Okay, you can support the podcast on Patreon by going to patreon.com slash effectively wild. Five listeners who have recently pledged their support include Chris Flanagan, Ivy Envy, Patrick Morgan, Scott Slezak, Sam Normington, and hey, what the heck, a bonus sixth supporter, Matthew Sanders. Thanks to all of you. You can join our Facebook group at facebook.com slash group slash effectively wild.
Starting point is 00:53:02 And you can rate and review and subscribe to the podcast on iTunes. If you need something else to listen to, Michael Baumann and I did a fun episode of the Ringer MLB show. We talked to five local writers about five unlikely veteran first-time all-stars, trying to explain their success. You can find that on the Ringer MLB show feed. Keep your questions and comments coming for me and Jeff at podcast@fangraphs.com or via the Patreon messaging system. We will talk to you soon. One day and I'll find you there By the great big lake
