Tuesday, March 2, 2010

Sports: The Challenges of Rating the Best

This blog has gone relatively quiet during the past week, but for a reason. I'm holding back a lot of info that I've been working on behind the scenes.

As you know, I'm in the process of attacking "The Project," a comprehensive statistical analysis of baseball's hitters, in search of a final ranking. It's a lot of work, and the wife is getting tired of my spreadsheet obsession, but we must trudge on.

And trudge, I will. I initially planned to rank the top 300 players. Then I decided to go to 500. Now... I may just go all the way, thanks to Sean Lahman's Baseball Database. I now have easy access to stats dating back to 1871. No more copying and pasting from BaseballReference.com.

Of course, even with this easy access, there are plenty of challenges.

1) Data only goes through 2006. Lahman's work was completed after the 2006 season, so the data stops there. I could add in the 2007-09 data manually, but I have a bad feeling that extra step would create a new one- to two-week mess. Since I'm focusing only on retired players, it's possible I could stick with the data I have, or maybe manually add data only for those players who may be in the top 300-1000 (or 100,000). Of course, I hate adding in incomplete data. In all likelihood, I'll add in every missing line of data or none at all.

2) Volume of data. The first time I started using this data, it crashed my computer several times. I've sat through several frustrating waits while the data calculated, afraid to interrupt because I hadn't saved recently and feared losing my work. There are nearly 10,000 players I am evaluating, with more than 50,000 lines of data by season. I've had to get creative to make the data manageable, but this has meant creating dozens of separate docs. Currently, I'm going category by category and comparing the stats of all players from all seasons. I created a template that runs the calculations once I paste the data into it. Paste and wait 90 minutes for it to complete. It's good fun.

3) Which stats are important? This isn't the first time I've pondered the question. I want to consider runs, hits, doubles, triples, home runs, RBI, stolen bases, total bases, walks, AVG, OBP, SLG and OPS. However, you can't just view them all equally. Home runs are also part of total bases, SLG and OPS. OPS already includes OBP and SLG. Hits are a big part of AVG and OBP. Lots of duplication and overlap. I'm hesitant to create some crazy formula that weights categories differently because there is no true way to weight them accurately. Instead, I will likely view them all individually to assist with my decisions.
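
To make the overlap concrete, here is a minimal Python sketch of the standard rate-stat formulas. The function names and the sample stat line are made up for illustration, and the hit-by-pitch and sacrifice fly terms in OBP default to zero to keep the example short.

```python
def total_bases(hits, doubles, triples, home_runs):
    # Singles are whatever hits remain after the extra-base hits.
    singles = hits - doubles - triples - home_runs
    return singles + 2 * doubles + 3 * triples + 4 * home_runs

def avg(hits, at_bats):
    return hits / at_bats

def obp(hits, walks, at_bats, hbp=0, sf=0):
    return (hits + walks + hbp) / (at_bats + walks + hbp + sf)

def slg(hits, doubles, triples, home_runs, at_bats):
    return total_bases(hits, doubles, triples, home_runs) / at_bats

def ops(hits, doubles, triples, home_runs, walks, at_bats):
    # OPS is simply OBP plus SLG, so it inherits everything in both.
    return obp(hits, walks, at_bats) + slg(hits, doubles, triples, home_runs, at_bats)

# Hypothetical season: 500 AB, 150 H, 30 2B, 5 3B, 20 HR, 50 BB.
before = total_bases(150, 30, 5, 20)   # 250
# Turn one out into a home run: hits and home runs each go up by one.
after = total_bases(151, 30, 5, 21)    # 254

print(before, after)
print(round(ops(150, 30, 5, 20, 50, 500), 3))
print(round(ops(151, 30, 5, 21, 50, 500), 3))
```

One extra home run moves hits, home runs, total bases, AVG, OBP, SLG and OPS all at once, which is why treating the categories as independent votes would count the same swing several times.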

4) Quality or quantity? I've actually considered eliminating the qualitative (rate) stats entirely (AVG, OBP, SLG, OPS) when evaluating individual seasons. Ultimately, does it really matter what your batting average was? I want results, and if you had the season's best batting average over 500 plate appearances, but someone else had the league's most hits in 600 plate appearances, I'd take the most hits. It's also very tricky to differentiate averages when looking at an individual season. While I can eliminate anyone who didn't average 3.1 plate appearances per league game, there are still dilemmas in the evaluation. What is better, .350 in 500 plate appearances or .330 in 700? While I did create a formula for this way back in the day that considers comparison to league average and number of plate appearances, I haven't decided yet whether it's worth using in this case.
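
The formula mentioned above isn't reproduced here, so the following is not it. It is just one hypothetical way to combine "comparison to league average" with "number of plate appearances": count the hits a player collected beyond what a league-average hitter would have managed in the same playing time. The .270 league average is invented, and plate appearances are treated as at-bats purely to keep the arithmetic simple.

```python
def hits_above_average(player_avg, league_avg, at_bats):
    # Extra hits compared with a league-average hitter given the same at-bats.
    return (player_avg - league_avg) * at_bats

LEAGUE_AVG = 0.270  # assumed baseline, for illustration only

short_season = hits_above_average(0.350, LEAGUE_AVG, 500)
long_season = hits_above_average(0.330, LEAGUE_AVG, 700)

print(round(short_season, 1), round(long_season, 1))
```

Against a .270 league, the .330 hitter comes out slightly ahead (about 42 extra hits to about 40). Raise the baseline to .300 and the .350 hitter wins instead (about 25 to about 21), so the answer depends heavily on the era's average, which is part of what makes the question a dilemma.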

5) Seasons vs. career. I've decided that possibly the most important measure will be number of dominant seasons in one's career when compared to the league average. However, we can also look at career stats and compare how the player did over the course of their career (in totality) compared to the league average. In this case, there would still be value in the qualitative stats, although it is again something of a duplication of efforts (a player who was 10% above the league average over the course of his career in hits will be quite close to 10% above the league average in batting average, depending on number of walks against the league average).
Was Lou Brock the greatest base stealer of his era?

6) Definition of dominance. While I have a nice list of the most dominant seasons in each statistical category, working with this data is not easy. If Player A had the most dominant season ever in home runs, that doesn't make him the most dominant player ever, even in that category. Maybe Player B had the next three most dominant seasons. So you could add up all of the ratios the player had over the league average in a given category. However, if you do this, the player who lasted 24 years will naturally have the advantage (24 slightly above average seasons will trump 15 dominant ones in most cases). This would not accomplish my goal. So I am going to look at each player's most dominant season as well as his most dominant two, three, four, five, and so on up to 20 seasons. I may decide that dominance is ultimately defined by comparing the 10 best seasons of each player. Could be 15. Or maybe all will be taken into consideration. Still working it out. Either way, I think I'm a step closer than I once was.

One example I uncovered in between 90-minute calculations involves the most dominant base stealers. When compared to the league average, the most dominant season is owned by Maury Wills. He maintains the status of most dominant base stealer through seven seasons. However, Lou Brock surpasses Wills when you look at their eight most dominant seasons. And Rickey Henderson tops them both from season 18 on. Of course, Henderson accumulated more seasons than either player. When compared over the number of seasons the three share (14), Brock is the most dominant. If you want to focus only on a core number of seasons (say, five), Wills is the most dominant. If you want to factor in accumulation and longevity, Henderson is the most dominant. What is the right answer? I think Brock, particularly since Henderson only passed him once he started accumulating stats in seasons Brock didn't play. Henderson played a total of 25 seasons.
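
The "best N seasons" comparison can be sketched in a few lines of Python. This is a rough illustration, not the spreadsheet itself: the function names are hypothetical, and the ratios below are invented numbers for three made-up players, not real stolen base data.

```python
def top_n_total(ratios, n):
    # Sum of a player's n best season ratios (or all of them if he has fewer).
    return sum(sorted(ratios, reverse=True)[:n])

def leader_by_n(players, max_n):
    # For each n from 1 to max_n, find who has the highest top-n total.
    return {
        n: max(players, key=lambda name: top_n_total(players[name], n))
        for n in range(1, max_n + 1)
    }

# Invented ratios to league average, one number per season.
players = {
    "Player A": [6.0, 5.0, 0.5],        # huge peak, short career
    "Player B": [4.0, 4.0, 4.0, 4.0],   # consistently excellent
    "Player C": [3.0] * 7,              # very good for a long time
}

leaders = leader_by_n(players, 6)
for n, name in leaders.items():
    print(n, name)
```

With these numbers, Player A leads for one and two seasons, Player B for three through five, and Player C takes over at six, the same kind of handoff as Wills to Brock to Henderson. Capping `max_n` at the shortest career in the group (three seasons here) limits the comparison to the seasons everyone shares.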

I'm sure we'll see something very similar when it comes to hits and Pete Rose. It's a dilemma.

7) Incomplete data. Only a core number of statistics have been around since 1871. Some, like strikeouts and stolen bases, have gaps of years when we don't have the data. And for some years, particularly before 1885, we're missing data for some teams but not others. As a result, there was a skewed baseline for the average player, resulting in some inflated ratios. I eliminated statistics of all players before 1885 to solve this. While one could eliminate all seasons that don't have every core statistic that is available today, that would result in a very incomplete analysis. And I could, technically, eliminate the Dead Ball Era, but what fun would that be? That would wipe out core seasons of players like Ty Cobb and even some Babe Ruth years. While we have fewer stats for these seasons, the ones we have are complete. Ratios will still work for these years since player performance is compared to a league average that is accurate.
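
For anyone curious what the ratio calculation looks like outside a spreadsheet, here is a minimal Python sketch. It makes assumptions that go beyond what is described above: the league average is taken as a simple mean over every player line in a season, and the field names (`player`, `year`, `hr`) and sample rows are invented rather than taken from Lahman's database.

```python
from collections import defaultdict

FIRST_SEASON = 1885  # seasons before this are dropped, as described above

def season_ratios(rows, stat, first_season=FIRST_SEASON):
    # rows: list of dicts, one per player-season, with 'player', 'year' and stat keys.
    kept = [r for r in rows if r["year"] >= first_season]

    # Accumulate per-season totals so we can compute a league average.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in kept:
        totals[r["year"]] += r[stat]
        counts[r["year"]] += 1

    ratios = {}
    for r in kept:
        league_avg = totals[r["year"]] / counts[r["year"]]
        if league_avg > 0:  # skip seasons where nobody recorded the stat
            ratios[(r["player"], r["year"])] = r[stat] / league_avg
    return ratios

# Invented rows for illustration only.
rows = [
    {"player": "x", "year": 1884, "hr": 9},
    {"player": "a", "year": 1911, "hr": 12},
    {"player": "b", "year": 1911, "hr": 4},
    {"player": "c", "year": 1911, "hr": 2},
]

ratios = season_ratios(rows, "hr")
print(ratios[("a", 1911)])
```

In this toy 1911 season the average is 6 home runs, so a 12-homer season rates as twice the league average. A missing team would pull that average up or down, which is exactly how incomplete seasons produce inflated ratios.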

8) Readjust preconceived idea of greatness. When we include the Dead Ball Era, it's important to take the ratios seriously. Hitting 10, 15 or 20 home runs in some of those years is the equivalent of hitting 50 now. That may seem crazy -- and even flawed -- but it's greatness in the perspective of era. I would never consider 30 home runs now "great." Meanwhile, hitting 15 during some seasons was considered an amazing accomplishment. It's not because they were inferior players back then. While diet and average strength were certainly part of it, the overwhelming factors were cavernous parks, dead baseballs, and a pitcher's advantage. Prior to 1920, the ball was wound less tightly, a single ball would be used per game, and pitchers could freely spit on it and mark it up. By the end of the game, the ball was often misshapen and lopsided. Even Barry Bonds would have had trouble hitting home runs then.

I recently published the stats of Home Run Baker in my daily "Awesome Baseball Names" list on Twitter, and one response I received was that he should change his nickname because he finished with fewer than 100 career round-trippers. Though he never hit more than 12 in a season, he hit nearly three times the league average in his era. He is a Hall of Famer, and when taken in the context of his era, he earned his nickname.


