The last essay I wrote can be summed up in a sentence: measurement won’t tell you anything unless you know what you’re measuring and why you’re measuring it. This seems obvious and reasonable, which is why it invariably attracts criticism.
There are two bad responses and one good:
“We teach this in 101, you’re not saying anything new.” On the contrary, I’d have no reason to write any of this without the catastrophe of modern academia. Anyone remotely invested in the advancement of human knowledge and/or the use of that knowledge should be screaming right now.
“It’s the fault of media.” This is partially true, inasmuch as the media picks and chooses and can’t read a study to save its life, but even a good review of the evidence would be bad because the evidence is bad.
The more interesting response, approximating a comment so as not to put anyone on blast: “Numbers – even imperfect approximations of qualitative judgments – are still a helpful tool for making judgments. For instance, Apgar scores rely on some degree of qualitative opinion enumerated, but they have a definite use and it would be bad to lose them. You’re overplaying your hand by attacking them as inherently bad.”
Normally I’d just reply to the comment, but I have a feeling I’ll be using these democracy essays going forward and I want to get it clear.
I don’t mean to say that all qualitative and/or subjective (ignore that conflation for now) made-into-quantitative measurements are bad and useless. Some have clear and compelling reasons to exist. I’m grateful for the Apgar example because it helps draw out a few distinctions. As [comment] pointed out, “grimace” isn’t really much better than “democratic-ish” as far as qualitative judgments go. First, this is kind of unfair: I doubt many people would prefer Apgar scores to full examinations given infinite time, and it’s only due to the need for a measure right at that second that we default to them. Democracy scores aren’t really like that – I can’t think of a single situation where someone would need to assign one or look one up within a five-minute time span. Maybe if they’re trying to flunk a 101 social science class.
Still, the real difference isn’t time limitations, it’s subsequent tests. They’re imprecise but helpful, and we know this because a) doctors agree on at least a few of the letters and, b) they have solid predictive power:
For 13,399 infants born before term (at 26 to 36 weeks of gestation), the neonatal mortality rate was 315 per 1000 for infants with five-minute Apgar scores of 0 to 3, as compared with 5 per 1000 for infants with five-minute Apgar scores of 7 to 10. For 132,228 infants born at term (37 weeks of gestation or later), the mortality rate was 244 per 1000 for infants with five-minute Apgar scores of 0 to 3, as compared with 0.2 per 1000 for infants with five-minute Apgar scores of 7 to 10. The risk of neonatal death in term infants with five-minute Apgar scores of 0 to 3 (relative risk, 1460; 95 percent confidence interval, 835 to 2555) was eight times the risk in term infants with umbilical-artery blood pH values of 7.0 or less (relative risk, 180; 95 percent confidence interval, 97 to 334).
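As a sanity check, the crude rate ratios implied by the quoted per-1000 figures can be computed directly. This is just arithmetic on the numbers above, not the study’s own relative-risk model (which reports 1460 for term infants after its own adjustments):

```python
# Neonatal mortality per 1000 births, by five-minute Apgar band, as quoted above.
preterm = {"apgar_0_3": 315, "apgar_7_10": 5}   # born at 26-36 weeks
term = {"apgar_0_3": 244, "apgar_7_10": 0.2}    # born at 37+ weeks

def rate_ratio(rates):
    """How many times deadlier a low (0-3) Apgar is than a high (7-10) one."""
    return rates["apgar_0_3"] / rates["apgar_7_10"]

print(f"preterm: {rate_ratio(preterm):.0f}x")   # 63x
print(f"term:    {rate_ratio(term):.0f}x")      # 1220x
```

Even the crude ratio is enormous, which is the point: the score, qualitative components and all, predicts an external outcome.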
Apgar scores aren’t, of course, useful for everything. As the study discusses, the Apgar score gets misapplied:
We cannot dispute the contemporary viewpoint that use of the Apgar score for the prediction of long-term neurologic outcome is inappropriate. However, the poor performance of the Apgar system as a predictor of neurologic development, a purpose for which it was never intended, should not undermine the continuing value of assigning Apgar scores to newborn infants.
Saying “it’s a question of what something is used for” is obvious, but the really interesting thing is this: not only can we tell when Apgar scores are properly used, there’s another observation that informs us of their improper use.
One problem with democracy scores is that they can only ever rely on other democracy scores for evidence. That is to say, since democracy is a human categorization of political organization, there’s always going to be an element of human judgment. That’s not unique to democracy, and I can think of cases where there are still uses for seemingly-totally-arbitrary scales.
For instance, meters do not exist in nature. They’re an arbitrary length we adopted for various purposes, but despite their arbitrary nature, everyone agrees on what a meter is. There’s a consensus that allows for their use. Democracy is not like this, as should be clear from the fact that all these measurements disagree, e.g. is “mandatory voting” more or less democratic than “free voting”?
There are cases where completely-arbitrary and also-lacking-consensus scales might serve a purpose. Films are different enough that everyone understands and accepts that “3.5 thumbs up” doesn’t mean anything objective. The films that fall into that particular configuration of digits could be anything, they’re all different, whatever. It’s still reasonably useful if you trust the reviewer, which is the point. If [Film critic you like] gives it a 7/9, then it’s helpful inasmuch as you’re interpreting it this way: [Critic] and you have similar tastes, you bet that could be your 7/9. The opposite is also true: [Absolute goon] gives [film] 10/10, a perfect boredom, and now you know to avoid it like the plague.
It would be stupid to assume that those are useful as measurements of the films alone. It’s better to say that they’re measurements of films based on your measurements of the person.
Of course, there’s an easy test here. I can watch the films myself to decide whether Ebert and I share tastes, and use that information to gauge subsequent reviews. I can also read his reasons behind that. If someone gives you consistently bad recs, if their Top 10 is all Dorm Room posters, “Scarface is underrated”, then you have something to check against. There is an object outside the person to correct your estimation of their estimations, and I’m sorry that a highly intuitive statement just looked so convoluted.
Now what of democracy?
I say that [democracy index] fails at all three of these. It lacks an empirical correlate for the Apgar-style-qualitative-measurements, inasmuch as “democracy” is not a natural phenomenon. It lacks the social consensus of the “meter.” It’s closest to the film review if anything, but there’s no easy way to determine how much you trust the critic. More:
Two things can be true at once: One, [Index] uses entirely good information, absolutely perfect, empirically unassailable. Two, its method of (a) collecting, (b) weighting, (c) interpreting, makes it completely unacceptable.
I think this is mostly what’s happening here. Folded into all of these indices is good information that tells you things. For example: voting records, self-reported trust in institutions, crime rates, etc. I don’t mean that those are all unproblematic to collect, but they generally tell you a real thing about the country. Moreover, any of those has a use of its own. If I want to head to [place] and I’m worried about crime, then looking at relative crime rates is a good idea. If I’m writing a study on international voting habits, I might look at registration and participation.
The issue is that none of those are the report. The purpose of the report is the scale: Democracy 1-10, that’s all it has to offer, that’s what sets it apart from the disaggregated data. The rest of that information is public, and what is not public is the expert opinion that goes into the ranking. In other words, the utility of the disaggregated information is not relevant to a criticism of a Democracy Index’s methodological flaws. All the Index adds are those flaws.
The trick (and it is a trick) is that they rely on the strength of someone else’s data to establish their credentials. In other words, they’re confusing the empirical work with their interpretations. “[Index] is a combination of miraculously incorruptible public records, death, and taxes.” So you say: must be legit, those are all uncontroversial and certain. But none of those are the index, so to understand the index you need to know why those were chosen and what the purpose of the scale is.
My argument is simple. In bullet points:
- Much of what goes into [score] is qualitative estimation.
- A bad qualitative argument gives a “3”; a good qualitative argument also gives a “3.”
- These are, notably, both just “3.” Everything that would have been useful is now lost. In the rare event that they produce a paragraph or two of justification, I’ll allow for that, but then what use is the scale? Just give me more paragraphs with all the space saved.
- This incentivizes extremely sloppy arguments and partisanship, inasmuch as people can just label a thing [number] and expect it to fly. Note that I say incentivizes rather than “allows for.” This is because the multiplicity of scales – a measurement for every bias – encourages groups to assign different ratings. There’s always a buyer.
- Because you cannot see the argumentation behind it, there’s not even the “do I trust this critic?” test that you’ll have with a film review. All you have is the credentials, backed by being-the-person-who-did-the-study.
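To make the information-loss point concrete, here’s a toy sketch. The indicators, values, and averaging scheme are all invented for illustration; nothing here comes from any real index:

```python
# Hypothetical composite "democracy score": an unweighted mean of three made-up
# sub-indicators, each on a 0-10 scale.
def composite(profile):
    return round(sum(profile.values()) / len(profile), 1)

# Competitive elections, but no free press and no rule of law:
country_a = {"elections": 9.0, "press_freedom": 0.0, "rule_of_law": 0.0}
# No real elections, but moderate press freedom and rule of law:
country_b = {"elections": 0.0, "press_freedom": 4.0, "rule_of_law": 5.0}

print(composite(country_a), composite(country_b))  # both 3.0
```

Two regimes with opposite profiles collapse to the same “3.” Everything that distinguished them is gone by the time the reader sees the number.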
Let’s ground ourselves in the scale of the problem, otherwise it’ll just be me insisting that no, actually, this is very bad.
Scales only agree on clear cases (Turkmenistan – autocracy; Norway – democracy). Right off the bat: if you’re pouring money into an institution whose very best results could be replicated by locking a few undergrads in a room with Wikipedia for an hour, then you should probably rethink that funding. Whatever. It’s instructive to look at hybrid regimes. From Gunitsky:
The fact that many indices are highly correlated may suggest that they are measuring roughly the same phenomenon. Polity IV and Freedom House, for example, have a Pearson’s R of 0.88. But as Casper and Tufis (2003) point out, highly correlated measures can produce very different regression results. Using three measures of democracy (Polyarchy, Freedom House, and Polity IV), they show that regressions of these highly correlated measures on variables like education and economic development often produce inconsistent results. Moreover, the correlation among measures is highest for clear-cut cases. The correlation between Polity and Freedom House drops to 0.64 when democracies are excluded from the analysis. And when both clear-cut autocracies and democracies are excluded, the correlation drops further to 0.50. This is especially problematic since democracy promotion and foreign aid is often directed at countries in the middle of the range. In these cases, the substantive results of any analysis may be greatly affected by the choice of measure.
That may superficially resemble checking Apgar scores against neonatal mortality rates, but note the difference: Apgar scores predict an external outcome, while democracy indices are only being checked against each other. Far more concerning is the fact that the scores actually diverge over time. That is to say, they not only disagree about current democracy levels, they also track democratization radically differently:
Such discrepancy is greatest in Russia, where the two scores diverge widely. Starting in the mid-1990s, Freedom House records a steep decline in democratic quality, while the Polity index records a slight rise followed by a leveling off.
But even this is too abstract. I mocked the EIU last time, but it’s not even the worst of the lot. Try Polity IV. Everyone disagrees over what to measure, but there are probably a few factors we’d all agree make something “democratic.” For instance, one assumes that “ability to vote” is a reasonably important aspect of a democratic nation. Polity IV blew that off and blasted ahead, preferring to measure “quality of democracy” by competition and power transfer among society’s elites.
Read Manfred Schmidt’s (pdf) Measuring democracy and autocracy if you want a full breakdown, but you can get the sense from a few excerpts:
According to Polity IV, democracy is characterised by three key items: 1) institutions and processes that allow citizens to effectively express their political preferences and to combine these preferences into a package of alternatives from which they can choose, 2) institutional constraints on the executive and 3) guaranteed civil rights and liberties for all citizens of the state. If all of these conditions are met, the regime in question is classified as an institutionalized democracy (Marshall/Jaggers/Gurr 2014: 14). When the degree of democracy of a regime type is being measured, though, only the first and second key items are included in the calculations; the third key item, civil rights and liberties, is not used (ibid.: 14).
The indicators deal with the constitutional reality only in part and with the existence and realization of civil rights and liberties not at all. The basic idea of measuring the constraints on the executive needs to have a more complex measurement added, for instance a measurement on the model of the index of counter-majoritarian institutions (Schmidt 2010: 332, table 8) or on the model of the veto player theory (Tsebelis 2002). In addition, the Polity Project’s measurements of democracy and autocracy are rather executive-heavy. For one thing, the difference between suffrage for the few and suffrage for all adult citizens is not taken fully into consideration in these measurements. This is also true of the treatment of the relative sizes of electorates and of the voters’ ability to have a say in voting the political leadership in and out of office. This has resulted in serious errors.
This actually gets worse: the data is weighted in a strange enough way that several studies had to be done just to figure out how it was actually ranking states. The most famous of them is Double Take: A Reexamination of Democracy and Autocracy in Modern Polities by Kristian S. Gleditsch and Michael D. Ward. Admittedly, it used Polity III data, but what it found was absolutely fascinating and still relevant. From the abstract:
[The authors] show how the analytical composition of the well-known democracy and autocracy scores is not upheld by an empirical analysis of the component measurements and demonstrate that democracy, as measured by the Polity indicators, is fundamentally a reflection of decisional constraints on the chief executive. The recruitment and participation dimensions are shown to be empirically extraneous despite their centrality in democratic theory.
A later review points out that Gleditsch and Ward were using a slightly more subjective measurement than desirable (decision trees). More reliable methods lead to:
It seems that Gleditsch and Ward’s finding (that executive constraint is the most important component of all of the components) holds up to some degree. It is clear, however, that executive constraint does not “virtually determine” scores on any of the three scales, as argued by Gleditsch and Ward. Participation competitiveness is much more important in determining democracy scores relative to executive constraint. Autocracy scores and the aggregated scale both show executive constraint to be the most important variable, but the other components also have non-zero importance.
The point here is that “democracy measures” not only fail to measure democracy in any sense anyone means it; some aren’t even aware of what they genuinely measure.
That last author closes with this:
That so much empirical work that theoretically discusses “democracy,” but is in fact measuring something akin to a (or a few) component(s) of democracy (depending on what scale you are using) should be a major issue in all relevant literature.
This plea is thus far unanswered.
A clever contrarian: “Calling Polity IV the ‘worst’ implies that there are relative merits to the studies, which implies that one is better than another, which means it’s not entirely arbitrary.”
This is true, but I don’t think democracy is entirely arbitrary. I just don’t think that quantifying it in this way serves any purpose. The best information we have – and by far the most useful – is going to be contained in everything that makes up those numbers, not the numbers themselves. The issue is use, and I think the use here depends on a particular role that “measurement” provides.
So, what is the reason to have a democracy scale?
A) I can’t really think of a purpose for a democracy measurement, specifically, that’s separate from a broader political argument. Newspapers use it for opinion pieces, academics need it to make measurements for [reasons], governments and charities want to make determinations for effective aid. It’s extremely rare that the average citizen is going to look one up “just because,” and even in the case that they do, would they prefer an off-the-cuff measure (as above) or something more rigorous? It seems insane to me that one would argue that “less information subject to no easy scrutiny” is actually a good thing for how we determine foreign policy, but here we are.
B) No measure will be good, but at least there could be a poorly representative consensus. The issue is that “democracy” is not merely a subjective, difficult, and abstract word with multiple competing interpretations and uses. It carries an implicit moral weight; it’s not a value-neutral term. In other words: it invites exploitation.
Of course, this is the general rule, but it’s also far too broad. I’ll give an example of what I mean, provided you understand this is about more than democracy: Everyone wants “democracy” on their side, and they also all want the empirical evidence to back them. This is for three reasons: 1) Democracy is The Good; 2) Empirical evidence means that you’re objectively correct; 3) It allows you to hide your argument. No one can tell what you’re actually doing, the numbers will be accepted anyway.
I think this is part of the “it’s the newspapers’ fault” story, and it kind of is. But they aren’t falsifying data; they aren’t even misrepresenting the scales. The issue is that we, as a society, told a bunch of researchers to make us arguments and make them look objective. This is for a host of reasons, but the easiest to grasp is probably this: now we aren’t burdened with actually making our case.
Take two examples:
- Polity IV data, reproduced in an “objective” way that immediately set off culture war; heading title: “The Empirical View”
- Trump ranked last among presidents by 170 Political Scientists
Which is preferable?
At least the latter is explicit about what it’s doing – it may be the opinion of experts, but it’s still presented as an opinion. Polity IV data is presented as empirical evidence, despite being no such thing; that it incorporates empirical evidence does not mean it itself is. “This is the empirical view” hides an ellipsis: “…because it relies on empirical data,” which hides another ellipsis: “…which was compiled subjectively and weighted subjectively based on a subjective interpretation of what democracy is.” The measurement, the number provided, is thus a numerical representation of that dude’s subjective views of democracy. “This democracy gets 3 thumbs up.” But who are you and why should I trust you?
Polity IV’s data is used by Our World in Data, which is in turn used by Steven Pinker. I don’t disagree with very much of his data but this is a pretty clear case where one should, and it’s a pretty clear case where he chose a terrible measurement that just happened to confirm all his biases. He’s not uniquely bad in this, and the fact that he’s normally careful is actually most of my point. For everyone railing about biased journalists or the illiterate masses, even the most-careful-pretend-objective-popularizations cannot help but use their favorite data. It’s not a problem with bias, or it is, but not only that: it’s a problem with the fact that we’ve made all arguments appear equally valid. Bad argument “3” is the same as good argument “3”, may as well choose the 3 on your team.
As a side note, it is also kind of funny to note that Pinker is accused of technocratic neoliberalism, by which the accusers mean that he ignores democracy in favor of elite control. To prove them wrong, he shows that democracy is increasing according to a measure of democracy without civil rights or the ability of the citizenry to participate in political questions.
For all of these reasons: yes, democracy measures are still useless and bad. The same is true for almost any quantitative measure of a qualitative question that – and this is the critical part – involves questions of morals, terminal values, and social approval. I didn’t overstate my case, I understated it.
Now I’m going to risk overstating my case, because you must hear what I will tell you. Here are four things that fall out of this. I’m reasonably certain of the first two. The third I’m reasonably certain of, but unsure how big of a deal to make of it. The fourth is an ought, not an is. All deserve more time than I’ll give them.
1) One of the weirder responses to the replication crisis (and adjacent crises) is to focus on and critique only technical proficiency. It’s not that I think this is unimportant – the story Jacob tells here about his psychologist friend gives me nightmares – but proficiency is not the only problem.
If you read enough Andrew Gelman – and you should read more than enough Andrew Gelman – one of the things you realize is that for every mathematically challenged study there’s one that just doesn’t understand its own tools. At times, technical skill is a net-negative, as in people running robustness checks in order to hide their findings. I think that’s rare, but it does happen. More common is people just not getting what statistical tools are used for or how they relate to the object under study. So far as I can tell, the vast majority of the replication crisis doesn’t come from – or at least isn’t because of – technical problems. It comes from a completely batshit mixture of assumption and misunderstanding and relying on bad studies that were themselves never critiqued.
I think the easiest way to put this is “not understanding how to form qualitative arguments,” as in: not knowing when to apply which method, never really learning why certain tools do certain things. You see it here, in everything I wrote above: the researchers appear not to understand that they were assigning numerical values to their own opinions, because they do not understand when or why or how to assign a quantitative scale.
I’ll just say flat out: we’ve become bad thinkers, we cannot make qualitative arguments, we do not even understand the tools that we’re using. Trying to fix technical problems without addressing this will only make it worse, but for some reason that’s what we’ve mostly settled on. Also: I have no idea how to address this. Maybe read Plato or something.
2) It’s not just soft sciences, and it’s not just from misunderstandings. Scott:
We know many scientific studies are false. But we usually find this out one-at-a-time. This – again, assuming the new study is true, which it might not be – is a massacre. It offers an unusually good chance for reflection.
And looking over the brutal aftermath, I’m struck by how prosocial a lot of the felled studies are. Neurogenesis shows you should exercise more! Neurogenesis shows antidepressants work! Neurogenesis shows we need more enriched environments! Neurogenesis proves growth mindset! I’m in favor of exercise and antidepressants and enriched environments, but this emphasizes how if we want to believe something, it will accrete a protective layer of positive studies whether it’s true or not.
“Prosocial, therefore true” is a subset of the much larger problem: sneaking moral judgments into “scientific language,” which is to say: acting as though you’re being value-neutral when you are not. I have zero idea whether this is conscious, but I don’t think it really matters much.
That the social sciences are worse than the others, that they’re failing to replicate and that the results above (“democracy”) are so fucked as to be impossible to even begin replicating, is a natural progression. The social sciences study the things absolutely closest to us, the things we consider most important as a society. They’re practically designed to enumerate bias (literally), and so of course they do.
When I ask for “better qualitative research” and the elimination of these terrible faux-empirical scales, it’s not because I think that’s an actual solution. It’s because I want to force the actual methodology into the light. Arguing for them from “objective” standpoints isn’t defending the sanctity of quantitative methods, it’s providing cover for bad qualitative ones.
Which leads into:
Every other week the thinkeria says that American voters are shockingly ignorant, as proved by [thing], and they vote according to [group loyalty] and [irrational beliefs]. All true. Anyway, then some Extremely Serious Man will pronounce that we should train people to follow the scientific results, make their case based on the empirical evidence, input big data output big democracy.
This seems extremely rational, a very common sense solution, expectoration of divine reason. Allow me to suggest that it’s the single best way to destroy both science and also humans.
Proclaiming that your preferred policy is objectively correct because of [questionable data] is obnoxious, sure, but I don’t care about that. Think about what it incentivizes in the oppo. The moment people can only argue by passing their subjective opinions off as objective data – the moment that’s the only way to have a viable opinion – they’re going to do everything possible to produce the results the argument needs.
This is not to say that we shouldn’t train people to be as rational as possible – although I have no idea what such a thing means – but that we need to be exceptionally careful about what argument goes in what box. If you’re making a case for terminal values but it has to be objective, then expect that paper to be infected beyond measure. Rather than making the careful, philosophical argument such a thing deserves, you’re hiding all of that behind falsified numbers. You might be wondering why this is “worse” inasmuch as it’s basically the same “irrationality” and motivated reasoning as before, and you would be right. But note that it incentivizes different behavior in academia, and thus in journalism, and then…
When I say something like “empirical truth is a terrible lone value for protecting truth” it sounds like some edgy humanities shit, but everything above is the kind of thing I’m talking about. We will lie anyway, and making “facts” the measure of an opinion’s worth only incentivizes the proliferation of bad facts. Instead of there being certain genuine facts and then a bunch of uncertain but viable arguments, there’s a proliferation of terrible arguments presented as terrible fact. How do you plan to counter that? Their 3 is as good as yours, have you paid no attention to social sciences for the past decade? Consider that different terminal values exist, that respecting them as arguments in themselves at least eases the demand for terrible, falsified reports to confirm those as “the Truth.”
You’ll note that this is “reasonably certain” inasmuch as it’s already happening. I expect it to worsen in linear relationship with the demand that all arguments have some data behind them, data which will be aggregated opinion.
3) Far less certain, and mostly social sciences specific: One of the more interesting questions that I have no idea how to answer is to what degree social science has begun to report on itself. I mean the following: the reports and studies and books and articles that come out referencing or relying on academia don’t pop into existence and disappear just as quickly. It’s all well and good to talk about the anti-democratic nature of the IMF or failures of USAID or [whatever], but those organizations are using the tools provided by the academics that later condemn them. Last time I talked about a direct case of this (the EIU using studies that used the EIU to produce the study), but almost none of it is going to be that direct. It will take place in terms of, a) government reliance on academic work and, b) media presentation of those reports that influence the populace.
I’m putting this in the “far less certain” category and already admitted to risking overstatement, so let’s be as direct as possible: for all the talk of the “anti-democratic effects” of [organization], there’s basically none on the fact that most political methods, half the terminal values, and all the “real empirical research” from which it derives are provided by an unelected bureaucracy of academics and fellow-travelers who do not even realize the power they have. Moreover, it’s hard not to notice that if so, most of their reports are going to be explanations, in some way, of the impact of previous reports: some monstrous and ungainly turtle-ladder of misunderstood and misapplied research methods is being built.
I don’t know quite how new this is, but there’s definitely been an increased demand for technocratic solutions over the past century, and it would not shock me at all to learn that Polity-IV-style data science has very recently assumed a billion more powers on the politometer than it previously had. If social science is increasingly reporting on the effects of social science, there’s a bizarre feedback loop in its own data that’s almost impossible to untangle. “No longer a democracy,” says the EIU, so people lose faith, which makes us drop on the democracy scale, which makes… One would assume that this ought to cause some degree of consternation. One would be wrong. Most causes are just going to be “the government” or “[such and such] firm,” as though “the government” and “that firm” make decisions without reference to the sciences.
This is obvious for anyone with a particular taste in philosophy, but it’s not nearly as obvious as it should be to academia proper: presumably these studies have some effect, and presumably subsequent studies are studying some part of that effect. How big is that? Is it minor? Or is it like a missionary forgetting their role and claiming that the people converted on their own, and that this is a sign of the divine? At what point does a single researcher choosing between 3 and 4 have a significant say in who gets what developmental loan for democratization efforts? And when the inevitable later study examines the “Impact of [such-and-such] Effort”, how seriously are they going to take that 3-4 waffling?
You might wonder “why does anyone put up with this?” and the obvious answer is that they don’t. There’s tremendous consternation in the media (and the academy) about “anti-intellectualism” and “lack of faith in the media,” and these are real phenomena. The question is where they came from, and why now. The media offers thirty dozen equally ingenious explanations, but they all presuppose the same thing: that people are wrong to have lost that faith, that it’s some flaw in them rather than that our media and its academy have failed to earn that trust. That the plebs are right, and our technocrats are elitist without being elite, blinkered by bad studies and their inability to know why these are bad studies.
“Why do they hate us?” Why shouldn’t they? If I had a semi-benevolent overlord that couldn’t tie his shoes I’d hate him too. I know this because I do and I do.
4) The absolute worst response to any of this is to abandon the social sciences and the humanities for being “uncertain” and “subjective.” a) Qualitative does not mean subjective; learn to read words. b) Tune-in-drop-out only works if you’re a moron with a trust fund. That you’re relatively unaffected does not mean other human beings get off so easy.
If you can hack it then it’s your duty to enter the field and fight. There are no other options.
top image from The Decline of Western Civilization