Thursday, January 2, 2014

Unskewed Polls: Hall of Fame Edition

In my opinion, the most useful contributions to Baseball Hall of Fame voting and the attendant debate are the ballot aggregators of Twitter users @RRepoz and @leokitty. The former runs the comprehensive HOF Ballot Collecting Gizmo, while the latter maintains a Google spreadsheet with each individual ballot detailed. Together, they are the Hall of Fame equivalent of exit polling an election.

However, as those who work in politics know, every poll has a margin of error. Polls can even be flat wrong: remember when aides were calling John Kerry "Mr. President" after looking at the first wave of exits in 2004? And these Hall of Fame polls have one big disadvantage compared with the generally sound practice of political polling: they aren't representative, as scientific polls are made to be through weighting.

Through no fault of these aggregators, Hall of Fame exit polling is by definition skewed toward a self-selected pool: BBWAA members who are willing to make their ballots public. This pool tends to include more progressive scribes, those who value transparency, and to exclude those who stopped covering baseball 20 years ago (these retired reporters may not have an outlet to publish a Hall of Fame column even if they wanted to write one). In political terms, the poll over-represents certain demographics and undercounts other populations who still vote in high numbers.

Therefore, if you quote these aggregators' raw numbers as direct predictions of final vote totals—as many people on Twitter seem to be doing—you're going to be in for a surprise on January 8, when full results are announced. It's just as dangerous as relying on unweighted polling numbers in politics.

What we need to predict the Hall of Fame is more than a flawed poll—it's a model, of the sort used with great success by Nate Silver in the past few elections. Except ours is much simpler—all we have to do, in pollster terms, is tweak the numbers based on where past polls have historically fallen short.

You could say what we're doing here is the baseball version of Unskewed Polls, the ill-fated conservative alternative to FiveThirtyEight that manipulated 2012 polls and convinced many that Mitt Romney would actually win the election. Well, I prefer to think of it as the work any pollster must do to refine his or her raw data into a releasable scientific survey: weighting the numbers based on known facts and sound logic to get a representative sample.

Below is a comparison of the final exit polls for last year's Hall of Fame election and the actual results. We're using @RRepoz's polling here, since his sample was larger than @leokitty's, at 194 ballots out of an eventual 569 cast (34.1%):

You can see that the polls significantly short-changed certain players and over-hyped certain others. A similar discrepancy exists in the 2012 Hall of Fame exit polls, for which we use @leokitty's data. Her sample size that year was 114 out of 573 ballots cast (19.9%):

In 2011, @leokitty captured 122 votes out of an eventual 581 (21.0%):

And, finally, @leokitty again had the biggest available sample in 2010, at 92 out of 539 ballots (17.1%):
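For anyone who wants to check the sample shares quoted above, here is a minimal Python sketch. The ballot counts come straight from this post; the names `SAMPLES` and `sample_share` are my own, and I'm assuming "last year" means the 2013 election.

```python
# Public ballots collected vs. total ballots eventually cast, as quoted above.
# Assumption: "last year" is the 2013 election, since this post is dated January 2014.
SAMPLES = {
    2013: (194, 569),
    2012: (114, 573),
    2011: (122, 581),
    2010: (92, 539),
}


def sample_share(public, total):
    """Percentage of all ballots that were made public, to one decimal place."""
    return round(100 * public / total, 1)


for year, (public, total) in SAMPLES.items():
    print(year, f"{sample_share(public, total)}%")
```

In every year, the public ballots are well under half of the electorate, which is why the unweighted numbers can miss.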

Four years of data should be sufficient for our purposes, especially since a clear pattern has emerged. The exit polls understate support for "old-school" candidates like Jack Morris, Lee Smith, and Don Mattingly. They overstate support for more subtle greatness—especially Tim Raines—and controversial candidates like Barry Bonds and Roger Clemens.

Each individual is over- or under-sampled by different degrees, however. A simple average of each player's "margin of error" in the past four elections yields an adjustment factor that we can apply to this year's exit polling. The chart below extrapolates each candidate's projected vote share for 2014 based on this adjustment factor and @RRepoz's current (updated as of 9pm ET on January 5) exit polls.

(Note that, unfortunately, these projections don't work on first-time candidates, because there is no vote history to calculate an adjustment factor from. In these cases, I've left their projected vote totals as they are in the polls. However, a future post here will explore ways to guesstimate adjustment factors for these players as well.)
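The adjustment described above can be sketched in a few lines of Python. This is only an illustration under assumptions: I'm treating each "margin of error" as the actual vote share minus the polled share, in percentage points, and adding the simple average to the current poll. The candidates and numbers below are hypothetical placeholders, not real polling data, and the function names are mine.

```python
from statistics import mean


def adjustment_factor(history):
    """Average of (actual - poll), in percentage points, over past elections.

    `history` is a list of (poll_pct, actual_pct) pairs, one per year.
    """
    return mean(actual - poll for poll, actual in history)


def project(current_poll, history):
    """Apply the adjustment factor to this year's exit-poll number.

    First-time candidates have no history, so their poll number is left as is.
    """
    if not history:
        return current_poll
    projected = current_poll + adjustment_factor(history)
    # Keep the projection inside the valid 0-100% range.
    return max(0.0, min(100.0, projected))


# HYPOTHETICAL data for illustration only: (poll %, actual %) per past year.
past_results = {
    "Candidate A": [(60.0, 66.0), (58.0, 65.0), (50.0, 54.0), (45.0, 52.0)],
    "Candidate B": [(50.0, 45.0), (40.0, 37.0)],
    "Candidate C": [],  # first year on the ballot
}
current_polls = {"Candidate A": 62.0, "Candidate B": 55.0, "Candidate C": 80.0}

for name, poll in current_polls.items():
    print(f"{name}: polled {poll:.1f}%, projected {project(poll, past_results[name]):.1f}%")
```

With these made-up inputs, Candidate A has been under-polled by an average of 6 points and gets bumped up, Candidate B has been over-polled by 4 and gets marked down, and first-timer Candidate C stays put.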

The model projects that four candidates will be elected to the Hall this year: Greg Maddux, Tom Glavine, Frank Thomas, and, in a squeaker, Craig Biggio. Falling short, thanks in part to his negative adjustment factor, will be Mike Piazza. Meanwhile, the top candidate likely to benefit from the old-school boost, Jack Morris, simply has too much ground to make up. He's currently polling at 60.3%, and while he is almost certain to get more than that on Wednesday, the 14.7-point bump he would need to reach the 75% required for election would be an unprecedented margin of error. Rafael Palmeiro could fall off the ballot for a different reason: the raw numbers suggest he'll survive, but his adjustment puts him under the 5% needed to stay on. And while polls indicate Don Mattingly is in real danger of dropping off, precedent suggests he'll get a big boost on Election Day.
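The thresholds behind those calls are simple to put in code. Here is a rough Python sketch, assuming the standard BBWAA cutoffs of 75% for election and 5% to remain on the ballot; the function names are mine, and the sketch deliberately ignores other ways a candidate can leave the ballot, such as running out of years of eligibility.

```python
ELECTION_THRESHOLD = 75.0   # percent of ballots needed for election
RETENTION_THRESHOLD = 5.0   # percent needed to stay on next year's ballot


def points_needed(poll_pct, threshold=ELECTION_THRESHOLD):
    """Percentage points a candidate must gain over his poll number to reach a threshold."""
    return max(0.0, threshold - poll_pct)


def outcome(projected_pct):
    """Classify a projected vote share (simplified: ignores eligibility limits)."""
    if projected_pct >= ELECTION_THRESHOLD:
        return "elected"
    if projected_pct >= RETENTION_THRESHOLD:
        return "stays on the ballot"
    return "drops off the ballot"


# Jack Morris is polling at 60.3%, per the exit polls cited in this post.
print(round(points_needed(60.3), 1))  # 14.7
```

Feeding a candidate's adjusted projection into `outcome` gives the same kind of call made above: elected, safe for another year, or off the ballot.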

Of course, even a few days out, it's still fairly early in the voting process. @RRepoz has aggregated just 131 ballots; if history is any guide, more will soon be publicly released, which will increase the accuracy of exit polls and thus improve these projections. Stay tuned to Baseballot on Twitter to get daily updates as we count down the days to January 8.

1 comment:

  1. I'd suggest that recent performance is more informative than older performance, and that there should be greater weight given to the more recent ballotings. Secondly, some of the variance is due to random variation, and some regression to the mean should be built into the model. Finally, you could do a linkage analysis to give you some idea which direction the new candidates are going to move. If Mussina voters are highly correlated with Raines voters, one should expect his final number to be reduced from the Gizmo numbers; conversely, if Kent voters are highly correlated with Lee Smith voters, we should expect Kent to outperform the Gizmo.