LLMs make metascience easier, but that doesn't increase metascientific validity.
In a week of incredibly annoying and inaccurate AI discourse, I’m instead drawn to write about metascientific squabbles. Engaging productively with AI discourse means ignoring the ignorant bait articles that falsely cast academia as a unipolar axis defined by a professor in the Pacific Northwest, and looking instead at how AI is shaping, and being shaped by, our social systems.
So yes, I’m intrigued by this new AI-driven political science paper by Ryan Briggs, Jonathan Mellon, and Vincent Arel-Bundock. They use large language models to scour over one hundred thousand political science papers published after 2010 and find that only 2% report null findings in the abstract. They use a statistical model to argue that this number should be 80%. As Briggs said on social media, the authors think this finding is “disastrous” for political science.
This is the latest in the genre of papers that invariably appear on my timeline claiming to have scientific proof that all science is wrong. Longtime readers know I’m no fan of quantitative social science. However, I’m even less of a fan of quantitative metascience. I’ve written multiple posts about this bizarre practice. There’s no shorter path to a paper than data mining existing papers and tut-tutting a scientific community for its questionable research practices, a move that pads CVs while failing science. People keep writing these metascience papers, and LLMs make writing them infinitely easier. I guess that means I have to keep writing these blog posts.
So let me ask you, argmin reader: what should the statistical summarization of political science abstracts look like? What is the stochastic process that takes in pseudorandom numbers and spits out arXiv PDFs? What are the sufficient statistics for this distribution? The authors don’t say. Instead, their paper posits a simplistic statistical model that Berna Devezer pejoratively calls the “urn model of science.” Political scientists reach into a magical Pólya urn to find a hypothesis. The urn contains 75% true hypotheses and 25% false hypotheses. Then they test the hypothesis using a perfectly normally distributed experiment. Whether the null hypothesis gets rejected is governed by the nebulous holy parameters statisticians call “power” and “size.” Under this model, where the authors set the power to 25% (a number plucked from a previous metascientific screed by one of the coauthors), they compute the eighty percent number I mentioned above.
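For concreteness, here’s how I reconstruct that arithmetic. This is a minimal sketch, not the authors’ code: the 75/25 urn mix and the 25% power are their numbers as described above, while the 5% test size is my assumption (the usual convention).

```python
# Predicted share of null results under the urn model (my reconstruction, not the authors' code).
p_true = 0.75  # share of true hypotheses in the urn (the paper's assumption)
power = 0.25   # probability of rejecting the null when the hypothesis is true (the paper's assumption)
size = 0.05    # probability of rejecting the null when the hypothesis is false (assumed conventional level)

# A null result occurs whenever the test fails to reject: true hypotheses the
# underpowered test misses, plus false hypotheses the test correctly leaves alone.
null_rate = p_true * (1 - power) + (1 - p_true) * (1 - size)
print(f"{null_rate:.2f}")  # 0.80
```

Most of that 80% is true hypotheses that an underpowered test fails to detect (0.75 × 0.75 = 0.5625); the rest is false hypotheses the test correctly declines to reject (0.25 × 0.95 = 0.2375).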
The authors think the gap between 80% and 2% is large and evidence of widespread research malpractice. They conclude, “The publication filter that we document here is not a benign pathology. It is a structural feature of our reporting practices that threatens the credibility and accumulation of knowledge in political science.”
Now, here’s the thing. I understand that metascientists think they can just appeal to a caricatured version of Karl Popper’s philosophy of science: get a finding like this, and you have refuted the idea that people are practicing good research. But the one thing every scientist should learn is the immediate counterargument to Popperian falsification, the Quine-Duhem problem. When formulating a prediction about an experimental outcome, a hypothesis never stands alone. To make any prediction, you need to append a long chain of auxiliary hypotheses about your mathematical models, your theoretical constructs, your ceteris paribus conditions, your measurement devices, and so on. A prediction thus follows from a logical conjunction of hypotheses, and when an observation contradicts the prediction, any clause in that conjunction could be to blame: the hypothesis itself or any one of the auxiliaries.
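The same point in symbols (my shorthand, not anyone’s formalism): if H ∧ A₁ ∧ ⋯ ∧ Aₙ entails a prediction P and we observe ¬P, modus tollens only licenses ¬H ∨ ¬A₁ ∨ ⋯ ∨ ¬Aₙ. The observation never tells you which disjunct to blame.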
In the case of statistical metascience, if your toy statistical model predicts a certain curve under “ideal” research practices, and you find a different curve, it’s possible that the curve derived from undergraduate probability has nothing to do with scientific practice.
I mean, come on, the urn model of scientific practice is more insulting than the stochastic parrots model of LLMs. We don’t do research by picking random experiments independently. Test choice is informed by past practice, advisors’ tastes, measurement constraints, and the whims of journal reviewers. We certainly don’t write papers about random tests.
And we should ask ourselves why these significance tests litter social science papers. It’s an unfortunate convention that everyone knows is harmful. To first order, everyone hates null hypothesis significance tests. Most people realize that there’s no faster way to strangle a discipline than with the logically incoherent garrote of the significance test.
Unfortunately, some die-hard proponents still believe that null-hypothesis significance testing will prevent people from being fooled by things that are too good to be true. Bizarrely, 100 years of significance testing has not yet convinced them that the promise of significance testing was always too good to be true.
Indeed, we’ve known it’s too good to be true since Neyman proposed the potential outcomes model. Even in the social sciences, we’ve known the statistical testing framework is useless for over fifty years. Significance testing “is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.” The null ritual of power analyses and 0.05 rejections is incoherent, and it’s just a game evolutionarily designed to grease the publication market. As Mark Copelovitch perfectly put it, “the entire edifice causal identification champions have built over the last decade is mainly barriers to entry and primarily about methodological tastes.”
The hardcore statistical wing of metascience has strong, peculiar normative beliefs about what science should be. Somehow, we fix all of science if we preregister every possible significance test and publish all observed outcomes. This view is not scientific, of course. There is no evidence whatsoever that ornate causal identification strategies, complex regressions, and dozens of pages of robustness checks “fix” quantitative social science. And yet, the proponents disguise their irrational normative claims about mathematical statistics in a language of rationality. All knowledge, apparently, hinges on significance tests and complex causal inference machinery.
However, as Copelovitch put it, the credibility revolution’s central tenet, that the only meaningful test of causation is a randomized controlled trial, whether real or imagined, is itself a matter of taste. There are many confusing mathematical tools you can learn to play this charade of credibility, but it’s just a framework of expertise that crowds out alternative explanations, interpretations, and interventions.
There is an allure to the quantitative end of metascience. Scientists feel they are best equipped to clean their own house. They think the historians, qualitative sociologists, and theorists who have described the nuances, idiosyncrasies, and power structures of contemporary science are all postmodernist weirdos ready to eat up the next Sokal hoax. And yet, maybe we need those historians, philosophers, and sociologists, and their disciplinary norms, to understand more clearly the normative world that mainstream metascience wants.


