I, Robin, like baseball. I could analyze reasons -- the discreteness of it, the heavily analytic tradition in the fan community, the peanuts, etc. -- but I think instead I'll just talk about baseball.
Estimating Expected Future Runs in an Inning
Over on the Idea Pad, Zhurnaly (m'dad) asked whether I could come up with some easy rules of thumb for a couple of relevant baseball probabilities -- first, what are the odds of a victory given n innings remaining and a difference in scores d, and second, what are the odds of getting some given number of runs in an inning given a particular number of outs and runners on the basepaths.
I proceeded to punt the latter of these two questions and ignore the former completely, but I think I came up with something handy.
The Data and The Ballpark Fit (Pun Intended)
From this Baseball Prospectus article you can find the 2003 expected number of runs given a particular situation on the basepaths:
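| Outs | None | 1st | 2nd | 3rd | 1st&2nd | 1st&3rd | 2nd&3rd | Loaded |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.531 | 0.919 | 1.177 | 1.380 | 1.551 | 1.869 | 2.023 | 2.474 |
| 1 | 0.282 | 0.535 | 0.706 | 1.032 | 0.909 | 1.211 | 1.428 | 1.544 |
| 2 | 0.109 | 0.237 | 0.341 | 0.384 | 0.454 | 0.518 | 0.541 | 0.797 |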
From these data, I worked out a one-significant-figure solution that might actually be usable by a human being during a game. It works as follows.
- When there are no outs and no runners, you start with .5.
- When you have at least one runner, start with the lead runner. For the no-outs situation, add .4, .3, and .2 in turn to advance the runner to first, then second, then third. (Incidentally, I note with amusement that the next term, .1, makes the net contribution from a solo homer exactly 1 run. If this has significance, I would be heartily surprised.)
- Second, correct for the number of outs. To go from no outs to one out, halve unless the lead runner is on second or third - with one out, a lead runner on second is worth 0.7 runs, and one on third is worth 1.0 runs. To go from no outs to two outs, divide by 4.
- To add a runner on first behind the leader, add .4, .2, or .1 runs with 0, 1, or 2 outs, respectively.
- To add a runner on second behind the leader, add .6, .4, or .2 runs with 0, 1, or 2 outs, respectively.
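The bullets above can be sketched as a small function. This is only one reading of the rule, not a definitive implementation: the name `estimate_runs`, the set-of-bases representation, and the choice to simply add both trailing-runner bonuses when the bases are loaded are all assumptions of the sketch.

```python
# Sketch of the rule of thumb above. `bases` is the set of occupied bases,
# e.g. {1, 3} for runners on first and third; `outs` is 0, 1, or 2.

LEAD_NO_OUTS = {1: 0.9, 2: 1.2, 3: 1.4}   # 0.5, then +0.4, +0.3, +0.2
LEAD_ONE_OUT = {1: 0.45, 2: 0.7, 3: 1.0}  # halved, except the stated 2nd/3rd values
TRAILER_ON_FIRST = (0.4, 0.2, 0.1)        # indexed by outs
TRAILER_ON_SECOND = (0.6, 0.4, 0.2)       # indexed by outs


def estimate_runs(bases, outs):
    """Rule-of-thumb expected future runs (Er) for the rest of the inning."""
    if outs >= 3:
        return 0.0
    bases = set(bases)
    if not bases:
        # No runners: 0.5, halved with one out, quartered with two.
        return 0.5 / (1, 2, 4)[outs]
    lead = max(bases)
    if outs == 0:
        er = LEAD_NO_OUTS[lead]
    elif outs == 1:
        er = LEAD_ONE_OUT[lead]
    else:
        er = LEAD_NO_OUTS[lead] / 4
    # Assumption: bonuses for trailing runners simply add (matters when loaded).
    if 1 in bases and lead > 1:
        er += TRAILER_ON_FIRST[outs]
    if 2 in bases and lead > 2:
        er += TRAILER_ON_SECOND[outs]
    return er


if __name__ == "__main__":
    for bases, outs in [(set(), 0), ({1}, 0), ({1, 2}, 0), ({2, 3}, 1), ({2, 3}, 2)]:
        print(sorted(bases), outs, round(estimate_runs(bases, outs), 3))
```

Under that reading, the largest gap from the 2003 table is bases loaded with two outs, where the rule gives 0.65 against the table's 0.797.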
First, I offer my initial test case: the June 23rd, 2012 interleague game with the Nationals at Oriole Park at Camden Yards in Baltimore. Specifically, the top of the 2nd. (Many thanks to nytimes.stats.com for providing the play-by-play I shamelessly cribbed from for these data.)
- Initially, the expected number of future runs (henceforth Er) is 0.5.
- Michael Morse hits a single to right. Er increases by 0.4, to 0.9.
- Adam LaRoche hits a single to right, advancing Morse to second. Er increases by 0.3 for advancing the lead runner, and you add 0.4 for having a runner on first behind him. Er is now 1.6.
- Ian Desmond grounds out, second baseman to first baseman, but the runners advance. Er for the lead runner goes from 1.2 for second with no outs to 1.0 for third with one out, and the 0.4 for having an extra runner on first with no outs becomes 0.4 for having an extra runner on second with one out. Er is now 1.4.
- Tyler Moore strikes out. This is extremely painful - instead of 1.0 for the runner on third, we now have 1.4/4 = 0.35, and instead of 0.4 for the runner on second, we have 0.2. Er plummets from 1.4 to 0.55 - a loss of .85 to the Nationals offense for one strikeout.
- Xavier Nady hits an infield single to third and Morse scores. With LaRoche on third and Nady on first, this is already a gain of .9 runs (+1 for the actual run, -0.2 to Er for no longer having an extra runner on second, and +0.1 to Er for having an extra runner on first), but LaRoche tries to score as well ... and does, on an error. Had he been thrown out, the third out would have dropped Er from 0.45 to zero, a loss of .45 runs; instead, the error cost the Orioles 1.2 runs (1 for the run plus Er = 0.2 for a runner on first with two outs).
- Jesus Flores grounds out, shortstop to first baseman. Er drops to zero with the third out.
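As a sanity check on this inning, the hand-computed Er values can be set beside the 2003 table. The sketch below is illustrative only: the `RUN_EXPECTANCY` dictionary just re-keys the table by (outs, occupied bases), and the `INNING` list is my transcription of the states in the walkthrough, so both names and the encoding are assumptions.

```python
# 2003 run expectancy from the table above, keyed by (outs, occupied bases).
# Bases are written as a string: "" is empty, "23" is runners on 2nd and 3rd.
RUN_EXPECTANCY = {
    (0, ""): 0.531, (0, "1"): 0.919, (0, "2"): 1.177, (0, "3"): 1.380,
    (0, "12"): 1.551, (0, "13"): 1.869, (0, "23"): 2.023, (0, "123"): 2.474,
    (1, ""): 0.282, (1, "1"): 0.535, (1, "2"): 0.706, (1, "3"): 1.032,
    (1, "12"): 0.909, (1, "13"): 1.211, (1, "23"): 1.428, (1, "123"): 1.544,
    (2, ""): 0.109, (2, "1"): 0.237, (2, "2"): 0.341, (2, "3"): 0.384,
    (2, "12"): 0.454, (2, "13"): 0.518, (2, "23"): 0.541, (2, "123"): 0.797,
}

# (event, outs afterwards, bases afterwards, Er quoted in the walkthrough)
INNING = [
    ("start of inning", 0, "", 0.5),
    ("Morse singles", 0, "1", 0.9),
    ("LaRoche singles", 0, "12", 1.6),
    ("Desmond grounds out", 1, "23", 1.4),
    ("Moore strikes out", 2, "23", 0.55),
    ("Nady singles, two score", 2, "1", 0.2),
]


def compare(inning):
    """Return (event, rule-of-thumb Er, table Er, difference) for each state."""
    rows = []
    for event, outs, bases, rule_er in inning:
        table_er = RUN_EXPECTANCY[(outs, bases)]
        rows.append((event, rule_er, table_er, round(rule_er - table_er, 3)))
    return rows


if __name__ == "__main__":
    for event, rule_er, table_er, diff in compare(INNING):
        print(f"{event:26s} rule {rule_er:5.2f}  table {table_er:5.3f}  diff {diff:+.3f}")
```

Every state in this half-inning lands within 0.05 runs of the table, and the table's own figure for Moore's strikeout is 1.428 - 0.541 = 0.887 runs.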
A second example: from the April 20th, 2012 game with the Miami Marlins at Nationals Park in DC - a game I went to and scored. Specifically, the bottom of the 8th. (Yes, I am choosing innings where the Nationals score. So I'm a fan.)
- We start, as always, at Er = 0.5.
- Xavier Nady strikes out. Er halves, to 0.25.
- Rick Ankiel hits ... well, I'm not sure. There's a question mark in my book. It's either a single and he advanced to second on an error, or it's a double. Either way, runner on 2nd, one out, Er = 0.7.
- Jesus Flores grounds out, pitcher to first base, and Ankiel advances. Runner on third with two outs is worth 0.35.
- Roger Bernadina pinch-hits for pitcher Ross Detwiler and walks. A runner on first behind the leader with two outs adds .1 of a run, so Er = 0.45.
- Ian Desmond hits a single. Ankiel scores, Bernadina to second. Desmond adds the run, and Er drops to 0.3 for a lead runner on second with two outs, plus 0.1 for the runner on first: 0.4.
- Danny Espinosa strikes out. Er drops to zero.
The Effect of Count on Batting Average
The next question I wanted to answer was inspired by a comment in the classic baseball book Moneyball. One of the rules of thumb articulated in that account was this: the nature of the first pitch - ball or strike - does not have a large effect on the batting average, but the outcome of two of the first three pitches does ... to the point that a mediocre player with 2 balls and 1 strike became a star, and a strong player with 1 ball and 2 strikes became a dud.
It occurs to me that I don't know the basis for this remark. Further, I don't even know the intensity of the effect.
Now, initially I thought that the solution would be somewhat difficult for one simple reason: I needed to collect data on what kind of batting performance people gave with different counts. Fortunately, after I posted a number of comments on the idea on Twitter, a friend-of-a-friend pointed me to Baseball Reference's "Splits" page - which contains exactly what I was looking for. This made it an almost instantaneous (although, I have now determined, entirely accidental) process to discover that I was completely mistaken.
The problem is samples.
Suppose you had only two batters. One drew 67% strikes, 17% balls, and 17% doubles (that is, 4/6, 1/6, and 1/6 of pitches), and the second drew 67% balls, 17% strikes, and 17% singles. In this case, the slugging percentage of each batter is unchanged by any walks they draw, and therefore by any balls thrown ... but, in the splits table, while each batter accounts for 50% of the 0-0 counts, the doubler accounts for only 20% of the 1-0 counts (1/6 of his pitches get him there, against 4/6 for the singles hitter) ... and therefore there would be the appearance of the slugging percentage dropping as a batter gets ahead on the count!
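The two-batter example can be checked with exact fractions. This is a toy sketch under stated assumptions: both batters come to the plate equally often, every ball in play is a hit, and "bases per hit" stands in for slugging on contact. The names `BATTERS`, `reach`, `doubler_share`, and `pooled_bases_per_hit` are made up for the illustration.

```python
from fractions import Fraction
from math import comb

# Per-pitch outcomes for the two hypothetical batters, in sixths so that the
# rounded 67% / 17% / 17% figures sum exactly to 1. Every ball in play is a hit.
BATTERS = {
    "doubler": {"strike": Fraction(4, 6), "ball": Fraction(1, 6),
                "hit": Fraction(1, 6), "bases": 2},
    "singles hitter": {"strike": Fraction(1, 6), "ball": Fraction(4, 6),
                       "hit": Fraction(1, 6), "bases": 1},
}


def reach(batter, balls, strikes):
    """Probability a plate appearance reaches this count (balls <= 3, strikes <= 2)."""
    orderings = comb(balls + strikes, balls)  # ways to order the balls and strikes
    return orderings * batter["ball"] ** balls * batter["strike"] ** strikes


def doubler_share(balls, strikes):
    """Fraction of plate appearances at this count that belong to the doubler."""
    weights = {name: reach(b, balls, strikes) for name, b in BATTERS.items()}
    return weights["doubler"] / sum(weights.values())


def pooled_bases_per_hit(balls, strikes):
    """Average bases per hit at this count, pooled over both batters."""
    hits = sum(reach(b, balls, strikes) * b["hit"] for b in BATTERS.values())
    bases = sum(reach(b, balls, strikes) * b["hit"] * b["bases"]
                for b in BATTERS.values())
    return bases / hits


if __name__ == "__main__":
    for count in [(0, 0), (1, 0), (0, 1)]:
        print(count, doubler_share(*count), pooled_bases_per_hit(*count))
```

Neither batter changes at all between counts in this model, yet the pooled figure falls from 3/2 at 0-0 to 6/5 at 1-0, purely because the mix of batters reaching 1-0 has shifted toward the singles hitter.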
The only solution I can see is a rather drastic one: take splits for individual batters after each count, compare the curves, and derive a formula that way. That, or close your eyes and pretend you don't notice.
More when I have made progress. Robin 00:40, 24 July 2012 (EDT)