Friday, March 5, 2010

Sports: The Problem With the Dead Ball Era

On Tuesday, I outlined some of the challenges I'm facing while attempting to rank baseball's all-time offensive greats in order. Today, I'm adding another: The Dead Ball Era.

Originally, I didn't see it as an issue. As long as data was complete, ratios would have no bias. Whether you hit 30 home runs in an era when the average was 15 or four when the average was two, your ratio and dominance over the league was considered the same.

I was comfortable with this assessment. I kept an open mind when Barry Bonds' 2001 season was ranked surprisingly low in terms of home run dominance for a season. The home run was at its most common point, after all. We are attracted to big numbers, so it's easy to be conditioned to think that 73 in such an era is greater than 40 in another (even if they are equals).

I saw it as an opportunity when all of these names I was unfamiliar with started popping up on the list. Pretty cool, really. I knew, though, that it would open me up to scrutiny. So it was important that I dig deep to verify that what I was doing made sense.

So I took a closer look at the most dominant home run seasons in order. I stumbled upon a very common theme: Dead Ball Era.

I have no problem with Dead Ball Era players appearing near the top. But when I took a closer look, I realized they were in the top 10, top 50, and top 100 at a disproportionate rate.

Understand that the only Dead Ball Era seasons in my analysis are those from 1885 through 1919. So even if we could expect an equal proportion from each era, we're looking at a max of 33% from that era. And understand, this would be a very favorable number. While the era makes up approximately one third of the years analyzed, it accounts for a far smaller share of the players.

Three Dead Ball players among the 10 most dominant home run seasons ever. Ok. Possible. Nine of the first 20. Twenty-five of the first 50. Forty-eight of the first 100.

Now we're looking at a problem.

It makes sense why. When the average player is hitting one or two home runs -- or even three or four -- reaching "dominance" with a few swings is much easier. And when that happens, not only do you get dominant ratios, but you increase the likelihood that several players from the same year will have high ratios. In other words, the bell curve is different during the Dead Ball Era. There is a higher concentration of players with ratios near the top.

Think of the example earlier. Let's say the average number of home runs during a given season is 1.33, like it was in 1918. Babe Ruth hit 11 that year (8.25 ratio). So did Tilly Walker. Gavvy Cravath hit eight (6.0). Frank Baker, George Burns, Walton Cruise and Cy Williams all hit six (4.5). Another group of six players hit five (3.75).
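For anyone who wants to check the arithmetic, here is a minimal sketch of the ratio calculation. The helper name `dominance_ratio` is made up for illustration, and the league average is assumed to be exactly 4/3 (the 1.33 above, unrounded), since that is the value that reproduces the ratios quoted.

```python
# Assumption: the 1918 league average of 1.33 is 4/3 before rounding.
LEAGUE_AVG_1918 = 4 / 3

# Home run totals for 1918 as quoted in the post.
HOME_RUNS_1918 = {
    "Babe Ruth": 11,
    "Tilly Walker": 11,
    "Gavvy Cravath": 8,
    "Frank Baker": 6,
}


def dominance_ratio(home_runs, league_avg):
    """Hypothetical helper: a player's total divided by the league average."""
    return home_runs / league_avg


# Ratio for each player, rounded to two decimals for display.
ratios_1918 = {
    name: round(dominance_ratio(hr, LEAGUE_AVG_1918), 2)
    for name, hr in HOME_RUNS_1918.items()
}

for name, ratio in ratios_1918.items():
    print(f"{name}: {ratio}")
```

Under that assumption, Ruth and Walker come out at 8.25, Cravath at 6.0 and Baker at 4.5, matching the figures above.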

Now let's put that into perspective. During 2001, the average number of home runs hit was 15.7. Barry Bonds hit 73 (4.65 ratio). If players in 2001 (one of the biggest home run seasons ever) hit over the average number at the same rate as these players in 1918, we'd get...

Two players with 130 home runs
One player with 94
Four players with 71
Six players with 59
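The totals above come from multiplying each 1918 ratio by the 2001 average. A rough sketch of that conversion, using only the figures quoted in this post (the function name `equivalent_total` is hypothetical):

```python
# League average for 2001 and the 1918 ratios, both as quoted in the post.
LEAGUE_AVG_2001 = 15.7
RATIOS_1918 = [8.25, 6.0, 4.5, 3.75]


def equivalent_total(ratio, league_avg):
    """Hypothetical helper: home runs needed to post `ratio` against `league_avg`."""
    return round(ratio * league_avg)


# 2001 home run totals that would match each 1918 ratio.
projected_2001 = [equivalent_total(r, LEAGUE_AVG_2001) for r in RATIOS_1918]

# Bonds' actual 2001 ratio, for comparison.
bonds_ratio = round(73 / LEAGUE_AVG_2001, 2)

print(projected_2001)
print(bonds_ratio)
```

This gives 130, 94, 71 and 59, against a ratio of 4.65 for Bonds' actual 73.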

Ultimately, we need to determine if these players were simply more dominant in 1918, or if there is something about the data that is unreliable.

While I'm open to Barry Bonds' 73 home runs not being at the top, it seemed incredibly odd that he was 167th (coincidentally, right behind Roger Maris' 1961 season at 166th). And far too many of the players ahead of them had put up small totals (12 are below 10 and 73 below 20).

While there will be similar variations across different stats and eras, I see this as the extreme. There is no other stat I am evaluating for which an average could be so close to zero. When that's the case -- and one or two home runs swing the perception of a player tremendously -- such an analysis is volatile.
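One way to see the volatility is to ask how much a single extra home run moves a player's ratio. This is only a sketch using the two league averages quoted in this post. The function name is made up, and the result is simply one divided by the average.

```python
# League averages as quoted in the post.
LEAGUE_AVG_1918 = 1.33
LEAGUE_AVG_2001 = 15.7


def ratio_change_per_home_run(league_avg):
    """Hypothetical helper: how far one extra home run moves the ratio."""
    return 1 / league_avg


swing_1918 = round(ratio_change_per_home_run(LEAGUE_AVG_1918), 2)
swing_2001 = round(ratio_change_per_home_run(LEAGUE_AVG_2001), 2)

print(swing_1918)
print(swing_2001)
```

With these averages, one more home run is worth about 0.75 ratio points in 1918 and about 0.06 in 2001, so a single swing counts roughly 12 times as much in the earlier season.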

Do we eliminate the home run stat? An argument could be made that we put far too much value in the home run. Such oddities are not evident when evaluating total bases, and this stat is probably a better measure of a player's ability anyway.

[It should be noted that, unlike with home runs, Dead Ball Era players are much more reasonably represented among the top seasons ever for total bases. Seventeen are in the top 100 (vs. 48 for home runs), with zero in the top 10, three in the top 20, nine in the top 50 and 29 in the top 150.]

Do we eliminate the era? The problem, of course, is that some great players are part of this era. Honus Wagner, Ty Cobb, Nap Lajoie, Tris Speaker and even Babe Ruth played at least a portion of their careers prior to 1920.

What is worse, eliminating these players from the discussion or keeping them in, thus providing some potentially questionable data? It may make sense to simply evaluate these years separately. But I worry that such an adjustment could be a slippery slope.

So on one hand, I can't imagine an analysis of baseball's all-time greats without including players from the Dead Ball Era. On the other, I can't imagine an analysis without home runs.

What do you think?


greebs on March 06, 2010 said...

I can't see not counting the HR, it's just too fundamentally a part of what baseball is.

And you can't eliminate the Dead Ball Era either, for the reasons you mentioned.

So maybe instead, you can - and I know this sort of goes against the whole idea - apply a factor to the Dead Ball Era, just counting it less than afterwards, so you don't get skewed data. Not sure that would please me, but that's all I can think of here.

Jon Loomer on March 06, 2010 said...

I hear ya, Greebs. Doesn't seem like an easy solution. And I really don't want to alter data or create formulas around an era.

It may make sense to evaluate the era in its own context prior to making the final list. Of course, making a final list will be difficult without comparing Dead Ball Era players to other eras.

