I have created some 84 spreadsheets, totaling more than 1.5 GB of file space, to help me with this analysis. And all 84 have helped me create the final, oh-so-important document.
I call it the RankingEngine. Yes, it's not the coolest name. But when you create 84 documents, you'd better name each file properly so that you can keep them straight. This is the doc that will get me to the promised land.
This doc has the records of nearly 10,000 hitters dating back to 1871, focusing on 23 offensive statistics (not including such things as games, plate appearances and at bats). Additionally, I have found the league average for each player in each season and compared every player's annual performance to that average player to create a ratio against the theoretical average player.
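For anyone who thinks better in code, the ratio idea could be sketched roughly like this in Python. This is only an illustration, not the spreadsheet's actual formulas; the function name, the dictionary layout and the sample numbers are all made up.

```python
def ratio_to_league(player_value, league_avg_value):
    """Return a player's season stat divided by the league-average player's stat.

    Returns None when the league average is zero, since the ratio is undefined.
    """
    if league_avg_value == 0:
        return None
    return player_value / league_avg_value


# Hypothetical single season for one player and the matching league averages.
season = {"HR": 30, "R": 90}
league = {"HR": 12, "R": 60}

ratios = {stat: ratio_to_league(season[stat], league[stat]) for stat in season}
print(ratios)
```

A ratio above 1.0 means the player beat the average player in that statistic that year; below 1.0 means he fell short.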
I've also compiled these ratios to determine each player's best one, two, three, four, five, down to 25 seasons in each statistic compared to the average. Which player had the best five home run seasons against the league average? Was a different player better over 10 years? Fifteen? This part of the analysis is critical when comparing players.
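One way the best-N compilation might work is sketched below. It assumes the "best N seasons" figure is a running total of a player's season ratios sorted from best to worst, and that the total simply stops growing once the player runs out of seasons. That reading is an assumption; an average of the top N ratios would be another reasonable choice. The function name is hypothetical.

```python
def best_n_totals(season_ratios, max_n=25):
    """Return (n, running_total, played) for n = 1..max_n.

    season_ratios: one stat's season-by-season ratios for a single player.
    played is False once n exceeds the number of seasons the player has,
    at which point the running total no longer increases.
    """
    ranked = sorted(season_ratios, reverse=True)  # best season first
    results = []
    running = 0.0
    for n in range(1, max_n + 1):
        played = n <= len(ranked)
        if played:
            running += ranked[n - 1]
        results.append((n, running, played))
    return results


# Hypothetical three-season player, checked out to four seasons.
for row in best_n_totals([1.5, 2.0, 0.5], max_n=4):
    print(row)
```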
There is also a look at each player's performance for their entire career versus the league average. This is done in two ways -- 1) comparing how the average player would do over the course of the player's career with the same number of plate appearances, and 2) comparing the player's performance to the theoretical average starting player, which includes considering how many plate appearances that average player would have had.
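The two career comparisons could be sketched like this, assuming the league average is stored as a per-plate-appearance rate for each season and that an average starter's plate appearances are known for each season. Both of those inputs, and every field name here, are assumptions for illustration only.

```python
def career_ratios(seasons):
    """Return (ratio_vs_same_pa, ratio_vs_avg_starter) for one counting stat.

    Each season is a dict with hypothetical keys:
      player_stat        - the player's total in the stat that season
      player_pa          - the player's plate appearances that season
      league_rate_per_pa - league-average amount of the stat per plate appearance
      avg_starter_pa     - plate appearances of a theoretical average starter
    """
    player_total = sum(s["player_stat"] for s in seasons)
    # 1) Average player given the same plate appearances as this player.
    same_pa = sum(s["league_rate_per_pa"] * s["player_pa"] for s in seasons)
    # 2) Average starting player, using that starter's own plate appearances.
    starter = sum(s["league_rate_per_pa"] * s["avg_starter_pa"] for s in seasons)
    return player_total / same_pa, player_total / starter


# Hypothetical two-season career.
career = [
    {"player_stat": 30, "player_pa": 500, "league_rate_per_pa": 0.02, "avg_starter_pa": 600},
    {"player_stat": 20, "player_pa": 250, "league_rate_per_pa": 0.02, "avg_starter_pa": 600},
]
vs_same_pa, vs_starter = career_ratios(career)
print(round(vs_same_pa, 3), round(vs_starter, 3))
```

The second ratio penalizes a part-time player more than the first does, because the baseline keeps piling up plate appearances whether or not the player was in the lineup.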
Everything I've laid out so far is split into separate tabs within one document. That part is static: 24 tabs. But the dynamic portion is where the magic starts.
I've created a final tab that allows me to select any two players and compare them. Take a look at the screen grab to the right to get a better idea.
Crazy stuff, right? In this case, we're comparing the careers of Craig Counsell and Jim Gantner (don't want to give away any surprises, so we're starting with some scrappy old vets). The main body of the sheet compares the two players in each category, starting with the best season and finishing with the 25 best seasons. The higher ratio appears in green and the lower ratio in red to help with quick comparisons (black italics indicate that a player did not play this many seasons and is therefore no longer accumulating statistics).
I also make it easier by tallying, at the top of the sheet, the number of "wins" each player has in best five, 10, 15, 20 and total years. Additionally, I found it necessary to isolate the "important" statistics so that my analysis wasn't improperly skewed.
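The "wins" tally could be sketched along these lines. The data layout (stat name, then number of seasons, then the best-N figure) is hypothetical, and treating ties as a win for neither player is an assumption.

```python
def count_wins(player_a, player_b, checkpoints=(5, 10, 15, 20)):
    """Count, at each checkpoint, how many stats each player wins.

    player_a / player_b: dict mapping stat -> dict mapping N -> best-N figure.
    Only stats present for both players are compared; ties count for neither.
    """
    wins = {n: {"a": 0, "b": 0} for n in checkpoints}
    for stat in player_a.keys() & player_b.keys():
        for n in checkpoints:
            a, b = player_a[stat][n], player_b[stat][n]
            if a > b:
                wins[n]["a"] += 1
            elif b > a:
                wins[n]["b"] += 1
    return wins


# Hypothetical figures for two players, two stats and two checkpoints.
a = {"HR": {5: 1.8, 10: 1.6}, "R": {5: 1.2, 10: 1.1}}
b = {"HR": {5: 1.5, 10: 1.7}, "R": {5: 1.2, 10: 1.0}}
tally = count_wins(a, b, checkpoints=(5, 10))
print(tally)
```

Restricting the input dictionaries to the "important" statistics gives the unskewed version of the same tally.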
While I will look at all 23 stats, my main focus will be on the following 12:
Runs
Home Runs
Runs Batted In
Stolen Bases
On Base Percentage
Total On Base (Walks + Hits + Hit By Pitch)
All Total Bases (Total Bases + Walks + Hit By Pitch)
Slugging Percentage based on All Total Bases (over Plate Appearances)
Equivalent Average
wOBA
Runs Created
OPS+
It's a nice mixture of conventional, advanced, and my own concoction (though my concoctions are very minor variations of conventional statistics). You'll notice that I find little value in hits and batting average, preferring On Base and On Base Percentage instead. Additionally, I have scrapped Total Bases and Slugging Percentage for versions that include walks and hit by pitch.
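The three homemade statistics follow directly from the definitions in the list above. Here is a small Python sketch of them; the function names and the sample stat line are invented for illustration.

```python
def total_on_base(hits, walks, hit_by_pitch):
    """Total On Base = Walks + Hits + Hit By Pitch."""
    return walks + hits + hit_by_pitch


def all_total_bases(total_bases, walks, hit_by_pitch):
    """All Total Bases = Total Bases + Walks + Hit By Pitch."""
    return total_bases + walks + hit_by_pitch


def atb_slugging(total_bases, walks, hit_by_pitch, plate_appearances):
    """Slugging based on All Total Bases, taken over plate appearances."""
    if plate_appearances == 0:
        return None  # undefined without any plate appearances
    return all_total_bases(total_bases, walks, hit_by_pitch) / plate_appearances


# Hypothetical season line.
tob = total_on_base(hits=150, walks=60, hit_by_pitch=5)
atb = all_total_bases(total_bases=250, walks=60, hit_by_pitch=5)
slg = atb_slugging(total_bases=250, walks=60, hit_by_pitch=5, plate_appearances=600)
print(tob, atb, round(slg, 3))
```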
Keep in mind that some of the more advanced statistics (EqA, wOBA, OPS+) need to be recreated. We often see a final version that accounts for park effects. I don't think this will be possible for my analysis. In the example of EqA, I am using Raw EqA for this very reason. I have recreated these statistics the best that I can and have found that I am very close to published statistics. I continue to tweak these stats, so if anyone has any advice on how to make them more precise, please let me know.
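As one example of what "raw" means here: the usual definition of OPS+ is 100 x (OBP / league OBP + SLG / league SLG - 1), with a park adjustment applied on top. Dropping the park adjustment leaves something like the sketch below. Whether this matches the recreation in the RankingEngine is an assumption, and the function name and inputs are hypothetical.

```python
def raw_ops_plus(obp, slg, league_obp, league_slg):
    """OPS+ without any park adjustment.

    100 means league average; 150 means 50 percent better than average.
    """
    return 100 * (obp / league_obp + slg / league_slg - 1)


# Hypothetical player season against hypothetical league averages.
print(round(raw_ops_plus(0.400, 0.500, 0.320, 0.400)))
```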
I am comparing player by player. You bet, this is going to take some time. But I feel this is the best way to do it, as opposed to coming up with some master formula to determine how players should be ranked. Instead, I am taking multiple factors into consideration. It's not a vacuum. So I will be looking at top five years separately from top 10, and top 15 separately from top 20. And not all statistics are created equal.
In some cases, it will be easy. If Player A is better than Player B in 95% of the metrics, I know my answer for how those two will be ranked. Anything under around 70% will need to be more closely scrutinized.
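That rule of thumb could be sketched as a simple check. The 70% cutoff is the approximate figure mentioned above, exposed as a parameter; the function name and the choice to ignore tied metrics are assumptions.

```python
def needs_scrutiny(wins_a, wins_b, threshold=0.70):
    """Return (leader_share, flag) for a head-to-head comparison.

    leader_share is the fraction of decided metrics won by whichever
    player won more. flag is True when that share is under the threshold,
    meaning the pair deserves a closer look.
    """
    decided = wins_a + wins_b
    if decided == 0:
        return None, True  # nothing decided, so nothing is clear
    leader_share = max(wins_a, wins_b) / decided
    return leader_share, leader_share < threshold


print(needs_scrutiny(19, 1))   # lopsided comparison
print(needs_scrutiny(13, 7))   # closer comparison
```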
So far, I've had a lot of fun ranking 40 players. While some players don't fall where I'd expect (or even want) them to fall, I am determined to stick to the stats and not be biased by perception, loyalties or popularity. I am keeping examples of this vague so as not to give any results away.
What was originally a weekend project turned into a one- or two-week project, and then into one that I will work on indefinitely. This may take several months. And while I work on it, I will keep the results close to the vest for two reasons: 1) it's fun to unveil the results bit by bit, and 2) I want to make sure that all results are final before revealing anything.
In the meantime, I'd love to get opinions on what you think of the path I'm taking. Do you agree with the statistical categories that are the focus of my analysis? Do you have any recommendations on how I might recreate advanced statistics for historical data most accurately?