On Polling Models, Skewed & Unskewed

There’s a very large gulf between my conclusion, explained on Friday, that Obama is toast on Election Day and confident projections like Nate Silver’s poll-reading model still giving the president (at last check) a 77.4% chance of victory. Let me explain why, and what that says about the difference between my approach and Nate’s.

The Limits of Mathematical Models

“A page of history is worth a volume of logic”
– Oliver Wendell Holmes

Mathematical models are all the rage these days, but you need to start with the most basic of facts: a model is only as good as the underlying data, and that data comes in two varieties: (1) actual raw data about the present and recent past, and (2) historical evidence used to project the future from that raw data, on the assumption that the future will behave like the past. Consider the models under closest scrutiny right now: weather models such as hurricane models. These are the best kind of model, in the sense that the raw data is derived from intensive real-time observation and the historical data is derived from a huge number of observations and thus not dependent on a tiny and potentially unrepresentative sample.

Yet, as you watch any storm develop, you see its projected path change, sometimes dramatically. Why? Because the models are highly sensitive to changes in raw data, and because storms are dynamic systems: their path follows a certain logic, but does not track a wholly predictable trajectory. The constant adjustments made to weather models ought to give us a little more humility in dealing with models that suffer from greater flaws in raw data observations, smaller sample sizes in their bases of historical data, or that purport to explain even more complex or dynamic systems – models like climate modeling, financial market forecasts, economic and budgetary forecasting, or the behavior of voters. Yet somehow, liberals in particular seem so enamored of such models that they decry any skepticism of their projections as a “War on Objectivity,” in the words of Paul Krugman. Conservatives get labeled “climate deniers” or “poll deniers” (by the likes of Tom Jensen of PPP, Markos Moulitsas, Jonathan Chait and the American Prospect) or, in the case of disagreeing with budgetary forecasts that aren’t really even forecasts, “liars.” But if history teaches us anything, it’s that the more abuse that’s directed towards skeptics, the greater the need for someone to play Socrates.

Consider an argument Michael Lewis makes in his book The Big Short: nearly everybody involved in the mortgage-backed securities market (buy-side, sell-side, ratings agencies, regulators) bought into mathematical models that valued MBS as low-risk, models whose historical data didn’t go back far enough to capture a collapse in housing prices. And it was precisely such a collapse that destroyed all the assumptions on which the models rested. But the people who saw the collapse coming weren’t people who built better models; they were people who questioned the assumptions in the existing models and figured out how dependent the models were on those unquestioned assumptions. Something similar is what I believe is going on today with poll averages and the polling models on which they are based. The 2008 electorate that put Barack Obama in the White House is the 2005 housing market, the Dow 36,000 of politics. And any model that directly or indirectly assumes its continuation in 2012 is – no matter how diligently applied – combining bad raw data with a flawed reading of the historical evidence.

Different sets of polls are, more or less, describing two alternate universes in terms of what the 2012 electorate will look like, one strongly favorable to Obama, one essentially decisive in favor of Romney. The pro-Obama view requires a number of things to happen that are effectively unprecedented in electoral history, but Nate Silver argues that we should trust them because state poll averages have a better track record in other elections than national polls. The pro-Romney view, by contrast, simply assumes that things have gone wrong in a number of the polls’ samples that have gone wrong before. Sean Trende argues that the national pollsters currently in the field are more reliable, and that this (rather than the history of state and national pollsters in the abstract) should be significant:

Among national pollsters, you have a battle-tested group with a long track record performing national polls. Of the 14 pollsters producing national surveys in October, all but three were doing the same in 2004 (although AP used Ipsos as its pollster that year rather than GfK, and I believe a few others may have changed their data-collection companies). Of the 14 pollsters surveying Ohio in October, only four did so in 2004 (five if you count CNN/USAToday/Gallup and CNN/Opinion Research as the same poll).

Pollsters such as ABC/Washington Post, Gallup, Pew, Battleground, and NBC/WSJ are well-funded, well-staffed organizations. It’s not immediately obvious why the Gravises, Purple Strategies and Marists of the world should be trusted as much as them, let alone more. And since virtually none of the present state pollsters were around in 1996 or 2000 (except Rasmussen Reports, which had a terrible year in 2000 and has since overhauled its methodology), it’s even less clear why we should now defer to state poll performance based upon those years.

In my opinion, which view is correct is not one that can be resolved by mathematical models, but rather by an examination of the competing assumptions underlying the two sets of polls and an assessment of their reasonableness in light of history and current political reality.

Where Polls Come From

Polling is “scientific,” in the sense that it attempts to follow well-established mathematical concepts of random sampling, but political polls remain as much art as science, and each polling cycle presents different challenges to pollsters’ ability to accurately capture public sentiment. Quick summary: dating back roughly to George Gallup’s introduction of modern political polling in the 1936 election, a pollster seeks to extrapolate the voting behavior of many millions of people (130 million people voted in the 2008 presidential election) from a poll of several hundred or a few thousand people. In a poll that seeks only the opinion of the public at large, the pollster will seek to use a variety of sampling techniques to ensure that the population called actually matches the population as a whole in terms of age, gender, race, geography and other demographic factors. In some cases, where the raw data doesn’t provide a random sample, the pollster may re-weight the sample to reflect a fair cross-section.
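For a sense of scale, here is a minimal sketch of the textbook margin-of-error formula for a simple random sample. It assumes the sample really is random, which real polls only approximate. The function name and the sample sizes are mine, chosen purely for illustration.

```python
import math

def margin_of_error(sample_size, share=0.5, z=1.96):
    """Approximate 95% margin of error for one candidate's share,
    assuming a simple random sample (an idealization)."""
    return z * math.sqrt(share * (1.0 - share) / sample_size)

# Hypothetical sample sizes in the "several hundred or a few thousand" range.
for n in (500, 1000, 2000):
    print(n, "respondents: +/-", round(100 * margin_of_error(n), 1), "points")
```

Quadrupling the sample only halves the sampling error, and none of this arithmetic says anything about whether the people sampled resemble the people who will actually vote.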

Political polling is a somewhat different animal, however: not all adults are registered voters, and not all registered voters show up to vote every time there’s an election. So, a pollster has to use a variety of different methods – in particular, a “likely voter” screen designed to tease out the poll respondent’s likelihood of voting – to try to figure out whether the pollster’s results have sampled a group of people who correspond to the actual electorate for a given election. This is complicated by the fact that voter turnout isn’t uniform: in some years and some states Republican enthusiasm runs higher, in others Democratic enthusiasm does. You can conduct the best poll in the world in terms of accurately ascertaining the views of a population that mirrors your sample – but if your sample doesn’t mirror that season’s electorate, your poll will mislead its readers in the same way that the Literary Digest’s unscientific poll did in 1936, or the RCP averages in the Senate elections in Colorado and Nevada in 2010, or the polls that failed to capture the GOP surge in 2002.

Technology, economics and other factors affect polling. The rise of caller ID in particular has dramatically reduced response rates – that is, pollsters have to call 8 or 9 people for every one who will answer their poll. That raises the level of difficulty in ensuring that the people who actually do answer the questions are a representative sample. Liberals argue that pollsters undersample people who have only cell phones (a disproportionately younger and/or poorer group) and non-English speakers; conservatives counter that Tea Partiers may be less likely to talk to pollsters and that polls in some cases can suffer a “Shy Tory Factor” where voters are less likely to admit to voting Republican. Partisans dispute the relative merits of in-person versus automated polling and the structure of polls that ask a lot of leading questions before asking for voter preferences. And the economics of the polling business itself is under stress, as news organizations have less money to spend on polls and pollsters do public political polling for a variety of business reasons, only some of which have anything to do with a desire to be accurate – some pollsters like PPP make most of their money off serving partisan clients, news organizations do it to drive news, universities do it for name recognition.

2012, even more so than past elections, is apt to produce another round of reflection and recrimination on all of these issues, as a great many of the individual polls we have seen so far have been largely or wholly irreconcilable, especially in terms of their view of the partisan makeup of the 2012 electorate. If you assume that (1) the various players in national and state polling have essentially random tendencies towards inaccuracy in modeling the electorate in all conceivable environments and (2) each state’s poll average includes a large enough sample of different polls by different pollsters to bear out this assumption – in that case, state polling averages and the models that rest on them should be good predictors of turnout, as they have been in most (but not all) past elections. But when you consider that 2008 was a very unusual environment and that every turnout indicator we have other than the state poll averages is pointing to a different electorate, these become far more questionable assumptions.
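To see why those two assumptions carry so much weight, consider a toy model (mine, not anything a pollster or poll aggregator has published) in which each poll’s error is the sum of an error shared by every poll, such as a common mistake about turnout, and an error of its own. Under that assumption, averaging shrinks only the independent part. The function name and the point values below are hypothetical.

```python
import math

def poll_average_error_sd(independent_sd, shared_sd, num_polls):
    """Standard deviation of the error in a simple average of polls.

    Toy assumption: each poll's error = one error common to all polls
    + an independent error unique to that poll.
    """
    return math.sqrt(shared_sd ** 2 + independent_sd ** 2 / num_polls)

# Nine polls, each with a 3-point independent error.
print(poll_average_error_sd(3.0, 0.0, 9))  # errors purely random
print(poll_average_error_sd(3.0, 2.0, 9))  # plus a 2-point error shared by all
```

With purely random errors, nine polls cut a 3-point error to 1 point. Add a shared 2-point error and the average is off by more than 2 points no matter how many polls go into it.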

Toplines and Internals

Nate Silver’s much-celebrated model is, like other poll averages, based simply on analyzing the toplines of public polls. This, more than any other factor, is where he and I part company.

If you read only the toplines of polls – the single number that says something like “Romney 48, Obama 47” – you would get the impression from a great many polls that this is a very tight race nationally, in which Obama has a steady lead in key swing states. In an ordinary year, the toplines of the polls eventually converge around the final result – but this year, there seem to be some stubborn splits among the poll toplines that reflect the pollsters’ struggles to come to agreement on who is going to vote.

Poll toplines are simply the sum of their internals: that is, the results among different subgroups within the sample. The internals poll-watchers track most closely are the partisan breakdowns: how each candidate is doing with Republican voters, Democratic voters and independent voters, two of which (the Rs & Ds) have relatively predictable voting patterns. Bridging the gap from those internals to the topline is the percentage of each group included in the poll, which of course derives from the likely-voter modeling and other sampling issues described above. And therein lies the controversy.
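The arithmetic is simple enough to sketch. In the example below, every number is hypothetical and chosen only to illustrate the mechanics; none is taken from an actual poll. A candidate’s topline is each group’s share of the sample multiplied by his support within that group, summed across groups.

```python
def topline(party_shares, support_by_party):
    """Topline = sum over groups of (group's share of sample) x (support in group)."""
    assert abs(sum(party_shares.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(party_shares[g] * support_by_party[g] for g in party_shares)

# Hypothetical internals for one candidate, identical in both scenarios.
support = {"R": 0.06, "D": 0.93, "I": 0.44}

# Two hypothetical partisan mixes for the same electorate.
even_turnout = {"R": 0.35, "D": 0.35, "I": 0.30}
dem_plus_7 = {"R": 0.31, "D": 0.38, "I": 0.31}

print(round(100 * topline(even_turnout, support), 1))
print(round(100 * topline(dem_plus_7, support), 1))
```

The same sentiment within each group yields about 47.9 percent under the even mix and about 50.8 percent under the Democratic-leaning mix. The roughly three-point gap comes entirely from the assumed makeup of the electorate.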

My thesis, and that of a good many conservative skeptics of the 538 model, is that these internals are telling an entirely different story than some of the toplines: that Obama is getting clobbered with independent voters, traditionally the largest variable in any election and especially in a presidential election, where both sides will usually have sophisticated, well-funded turnout operations in the field. He’s on track to lose independents by double digits nationally, and the last three candidates to do that were Dukakis, Mondale and Carter in 1980. And he’s not balancing that with any particular crossover advantage (i.e., drawing more crossover Republican voters than Romney is drawing crossover Democratic voters). Similar trends are apparent throughout the state-by-state polls, not in every single poll but in enough of them to show a clear trend all over the battleground states.

If you averaged Obama’s standing in all the internals, you’d capture a profile of a candidate that looks an awful lot like a whole lot of people who have gone down to defeat in the past, and nearly nobody who has won. Under such circumstances, Obama can only win if the electorate features a historically decisive turnout advantage for Democrats – an advantage that none of the historically predictive turnout metrics are seeing, with the sole exception of the poll samples used by some (but not all) pollsters. Thus, Obama’s position in the toplines depends entirely on whether those pollsters are correctly sampling the partisan turnout.

That’s where the importance of knowing and understanding electoral history comes in. Because if your model is relying entirely on toplines that don’t make any sense when you look at the internals with a knowledge of the past history of what winning campaigns look like, you need to start playing Socrates.

Moneyball and PECOTA’s World

Let me use an analogy from baseball statistics, which I think is appropriate here because it’s where both I and Nate Silver first learned to read statistics critically and first got an audience on the internet: in terms of their predictive power, poll toplines are like pitcher win-loss records or batter RBI.

At a very general level, the job of a baseball batter is to make runs score, and the job of a baseball pitcher is to win games, so traditionally people looked at W-L records and RBI as evidence of who was good at their jobs. And it’s true that any group of pitchers with really good W-L records will, on average, be better than a group with bad ones; any group of batters with a lot of RBI will, on average, be better than a group with very few RBI. If you built a model around those numbers, you’d be right more often than you’d be wrong.

But wins and RBI are not skills; they are the byproducts of other skills (striking people out, hitting home runs, etc.) combined with opportunities: you can’t drive in runners who aren’t on base, and you can’t win games if your team doesn’t score runs. If you build your team around acquiring guys who get a lot of RBI and wins, you may end up making an awful lot of mistakes. Similarly, you can’t win the votes of people who don’t come to the polls.

Baseball analysis has come a long way in recent decades, because baseball is a closed system: nearly everything is recorded and quantified, so statistical analysis is less likely to founder on hidden, uncounted variables. Yet, even highly sophisticated baseball models can still make mistakes if they rest on mistaken assumptions. Baseball Prospectus.com’s PECOTA player projection system – designed by Nate Silver and his colleagues at BP – is one of the best state-of-the-art systems in the business. But one of PECOTA’s more recent, well-known failures presents an object lesson. In 2009, PECOTA projected rookie Orioles catcher Matt Wieters to hit .311/.395/.546 (batting/on base percentage/slugging). As regular consumers of PECOTA know, this is just a probabilistic projection of his most likely performance, and the actual projection provided a range of possible outcomes. But the projection clearly was wrong, and not just unsuccessful. While Wieters has developed into a good player, nothing in his major league performance since has justified such optimism: Wieters hit .288/.340/.412 as a rookie, and .260/.328/.421 over his first four major league seasons. What went wrong? Wieters had batted .355/.454/.600 between AA and A ball in 2008, and systems like PECOTA are supposed to adjust those numbers downward for the difference in the level of competition between A ball, AA ball and the major leagues. But as Colin Wyers noted at the time, the problem was that the context adjustments used by PECOTA that season used an unusually generous translation, assuming that the two leagues Wieters had played in – the Eastern League and the Carolina League – were much more competitive in 2008 than they had been in previous years. By getting the baseline of the 2008 environment Wieters played in wrong, PECOTA got the projection wrong, a projection that was out of step with what other models were much more realistically projecting at the time.
The sophistication of the PECOTA system was no match for two bad inputs in the historical data.

My point is not to beat up on PECOTA, which as I said is a fantastic system and much better than anything I could design. Let’s consider for a further example one of PECOTA’s most notable successes, one where I questioned Nate Silver at the time and was wrong; I think it also illustrates the differing approaches at work here. In 2008, PECOTA projected the Tampa Bay Rays to win 88-89 games, a projection that Nate Silver touted in a widely-read Sports Illustrated article. It was a daring projection, seeing as the Rays had lost 95 or more games three years running and never won more than 70 games in franchise history. As Silver wrote, “[i]t’s in the field…that the Rays will make their biggest gains…the Rays’ defense projects to be 10 runs above average this year, an 82-run improvement.” I wrote at the time: “this is nuts. Last season, Tampa allowed 944 runs (5.83 per game), the highest in the majors by a margin of more than 50 runs. This season, BP is projecting them to allow 713 runs (4.40 per game), the lowest in the AL, third-lowest in the majors…and a 32% reduction from last season…it’s an incredibly ambitious goal.”

PECOTA was right, and if anything was too conservative. The Rays won 97 games and went to the World Series, without any improvement by their offense, almost entirely on the strength of an improved defense. I later calculated that their one-year defensive improvement was the largest since 1878. Looking at history and common sense, I was right that PECOTA was projecting an event nearly unprecedented in the history of the game, and I would raise the same objection again. But the model was right in seeing it coming.

If you looked closely, you could see why: the frontiers of statistical analysis had shifted. Michael Lewis’ book Moneyball, following the 2002 Oakland A’s, captured the era when statistical analysts stressed hitting and de-emphasized fielding on the theory that it was easier to use sophisticated metrics to find better hitters, but harder to quantify the benefits of defense. By 2008, the metrics were creating more opportunities to study defense, and – as captured in Jonah Keri’s book The Extra 2% (about the building of that Rays team) – the Rays took advantage.

But for the Rays, the 2008 environment was not so easily repeated in subsequent years. While still a successful club with a solid defense in a pitcher’s park (and still far better defensively than in 2007) they have led the league in “Defensive Efficiency Rating” only once in the past four years. It’s what Bill James called the Law of Competitive Balance: unsuccessful teams adapt more quickly to imitate the successes of the successful teams, bringing both sides closer to parity. Trende, in his book The Lost Majority, applies the same essential lesson to political coalitions. Assuming that the 2008 turnout models, which depended heavily on unusually low Republican turnout, still apply to Obama’s current campaign ignores the extent to which multiple factors favor a balance swinging back to the Republicans. And the polls that make up the averages – averages upon which Nate Silver’s model rests – are doing just that. Nate’s model might well work in an election where the relationship between the internals and the toplines was unchanged from 2008. But because that assumption is an unreasonable one, yet almost by definition not subject to question in his model, the model is delivering a conclusion at odds with current, observable political reality.

Painted Into A Corner

Poll analysis by campaign professionals often involves a large dollop of conscious partisan hackery: spinning the polls to suggest a result the campaigns know is not realistic, in the hopes of avoiding the bottom-drops-out loss of voter confidence that sets in when a campaign is visibly doomed. For the record, unlike some of my conservative colleagues, I don’t think Nate is a conscious partisan hack. I have a lot of respect for his intelligence and his thoroughness as a baseball analyst and we have mutual friends in the world of baseball analysis, and I think he undoubtedly recognizes that it will not be good for his credibility to be committed to the last ditch to defending Obama as a prohibitive favorite in an election he ends up losing. (It’s true that the 538 model is just probabilities, but as Prof. Jacobson notes, Nate won his reputation as an electoral forecaster with similar probabilistic projections in 2008; if you project a guy to have a 77% chance to win an election he loses, that will inevitably cause people to put less faith in your odds-laying later on).

I do, however, think that – for whatever reasons – Nate has painted himself into a corner from which there is no easy escape. If I’m right about the electorate and the polls are right about the internals, Romney wins – and if Romney wins, the 538 model will require some serious rethinking. There’s a bunch of reasons why he finds himself in this position. One is that his model has been oversold: he made his poll-reading reputation based on a single election cycle, in which he had access to non-public polls to check his work. Nate is, in fact, not the first poll-reader to get 49 states right: RedState’s own Gerry Daly did the same thing in 2004, missing only Wisconsin (which Bush lost by half a point) in his Election Day forecast, and Gerry did this through careful common-sense reading of the state-by-state polls checked against the national polls, not through a model that purported to do his thinking for him. (As it happened, the RCP averages at the end of the cycle did the same thing, as they did in 2008.) I’m inclined to listen to guys like Gerry who have been doing this for years and have not only recounted the numbers from past elections but lived through the reading of polls while they were happening. In 2010, the 538 model fared well – but no better than the poll averages at RCP. And that was only after Nate was much slower to pick up on the coming GOP wave than Scott Rasmussen, who called it a lot earlier in the cycle.

There is a raft of methodological quibbles with the 538 model (some larger than others), many of which reek of confirmation bias (i.e., the tendency to question bad news more closely than good). For example, while Nate’s commentaries have included lengthy broadsides against Rasmussen and Gallup, his model tends to give a lot of weight to partisan pollster PPP. Ted Frank noted one example that perfectly captures the value of knowing your history: the 538 model’s assumptions about how late-deciding undecided voters will break are tilted towards Obama by including the 2000 election, when Gore did far better on Election Day than the late-October polls suggested. But Gore wasn’t an incumbent, and there was a major event (the Bush DUI story) that had a major impact on turnout and undecided voters. If you make different assumptions based on a different reading of history, you get different conclusions. The spirit of open scientific inquiry should welcome this kind of scrutiny, even in the heat of election season.

None of this is a reason to conclude that the 538 concept is broken beyond repair. If you regard poll analysis as something like an objective calling, you can learn from your failures as well as your successes. If Obama wins, my own assumptions (and indeed, nearly everything we know about winning campaigns) will have to be re-examined. If Romney wins, the model of simply aggregating the topline state-by-state poll averages will have to be sent back to the drawing board. But there will be no hiding, in that case, from the fact of its failure.

Unskewed Polls

One of the more widely-discussed efforts to fix the problem of topline poll data varying by turnout models is Dean Chambers’ UnskewedPolls.com, which takes the internals of each poll and re-weights them for a more Romney-friendly turnout model. In concept, what Chambers is doing is on the right track, because it lets us separate how much of the poll toplines is due to the sentiments of different groups and how much is due to assumptions about turnout. But his execution is a methodological hash.

I haven’t pulled apart all the pieces of Chambers’ model, but my main objection to UnskewedPolls is that it re-weights the electorate twice:

The QStarNews poll works with the premise that the partisan makeup of the electorate 34.8 percent Republicans, 35.2 percent Democrats and 30.0 percent independent voters. Additionally, our model is based on the electorate including approximately 41.0 percent conservatives, 20.0 percent moderates and 39.0 percent liberals.

Republicans are 89 percent conservative, 9 percent moderate and 2 percent liberal. Among Democrats, 3 percent are conservative, 23 percent are moderate and 74 percent are liberal. Independents include 33 percent conservatives, 49 percent moderates and 18 percent liberals.

Our polls are doubly-weighted, to doubly insure the results are most accurate and not skewed, by both party identification and self-identified ideology. For instance, no matter how many Republicans answer our survey, they are weighted at 34.8 percent. If conservatives are over-represented among Republicans in the raw sample, they are still weighted at 89 percent of Republicans regardless.

The problem with this method is that neither the raw data (the current polls) nor a lot of the historical data (past years’ exit polls) has crosstabs showing how the votes of each partisan group break out by ideology. That is, for example, we have nearly no separate polling (certainly none on the polls Chambers is “unskewing”) showing how Romney is polling among independents who self-identify as moderates, or how Obama is polling among Democrats who self-identify as conservatives. That’s aside from the question of whether ideological self-ID is nearly as predictive a variable as party ID. Re-weighting the samples twice by these two separate variables, without access to those crosstabs, means you don’t really have any idea whether you are just adding a multiplier that double-counts your adjustments to the turnout model. It’s more alchemy than science.
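A quick arithmetic check, using only the figures quoted above, shows how the two weighting schemes can pull against each other. The sketch below multiplies each party’s stated share of the electorate by its stated ideological breakdown and sums the results. The variable and function names are mine, and I am assuming the quoted percentages are meant to describe the same electorate.

```python
# Figures quoted from the QStarNews description above.
party_share = {"R": 0.348, "D": 0.352, "I": 0.300}
ideology_within_party = {
    "R": {"conservative": 0.89, "moderate": 0.09, "liberal": 0.02},
    "D": {"conservative": 0.03, "moderate": 0.23, "liberal": 0.74},
    "I": {"conservative": 0.33, "moderate": 0.49, "liberal": 0.18},
}
stated_ideology = {"conservative": 0.41, "moderate": 0.20, "liberal": 0.39}

def implied_ideology(party_share, ideology_within_party):
    """Electorate-wide ideology shares implied by the party-level figures."""
    totals = {}
    for party, share in party_share.items():
        for ideology, frac in ideology_within_party[party].items():
            totals[ideology] = totals.get(ideology, 0.0) + share * frac
    return totals

implied = implied_ideology(party_share, ideology_within_party)
for ideology, stated in stated_ideology.items():
    print(ideology, round(100 * implied[ideology], 1), "implied vs",
          round(100 * stated, 1), "stated")
```

If I have read the description correctly, the party-level figures imply an electorate that is roughly 41.9 percent conservative, 25.9 percent moderate and 32.1 percent liberal, against the stated 41/20/39. The two sets of targets cannot both be hit at once, which is the kind of trouble you invite by weighting twice without the crosstabs to check your work.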

Conclusion

We can’t know until Election Day who is right. I stand by my view that Obama is losing independent voters decisively, because the national and state polls both support that thesis. I stand by my view that Republican turnout will be up significantly from recent-historic lows in 2008 in the key swing states (Ohio, Wisconsin, Colorado) and nationally, because the post-2008 elections, the party registration data, the early-voting and absentee-ballot numbers, and the Rasmussen and Gallup national party-ID surveys (both of which have solid track records) all point to this conclusion. I stand by my view that no countervailing evidence outside of poll samples shows a similar surge above 2008 levels in Democratic voter turnout, as would be needed to offset Romney’s advantage with independents and increased GOP voter turnout. And I stand by the view that a mechanical reading of polling averages is an inadequate basis to project an event unprecedented in American history: the re-election of a sitting president without a clear-cut victory in the national popular vote.

Perhaps, despite the paucity of evidence to the contrary, these assumptions are wrong. But if they are correct, no mathematical model can provide a convincing explanation of how Obama is going to win re-election. He remains toast.