I am opening a new thread to discuss issues arising from the Grilling Jones post. It has to do with data sharing and the relation between datasets and theoretical models, so please stay on this topic when commenting. The reference is to a sociological study (thanks to Jin W for alerting me!). In case you find it difficult to access the article (free our data!), I reproduce the main findings from its conclusion below.
Young, Cristobal (2009). "Model Uncertainty in Sociological Research: An Application to Religion and Economic Growth." American Sociological Review 74 (June): 380–397.
...In methodological terms, this article illustrates how research findings can contain a great deal of model uncertainty that is not revealed in conventional significance tests. A point estimate and its standard error is not a reliable guide to what the next study is likely to find, even if it uses the same data. This is true even if, as in this case, the research is conducted by a highly respected author and is published in a top journal. Below, I outline a number of specific steps that could help improve the transparency and credibility of statistical research in sociology.
1. Pay greater attention to model uncertainty. The more that researchers (and editors and reviewers) are attuned to the issue of model uncertainty, it seems likely that more sensitivity analyses will be reported. Researchers with results they know are strong will look for ways to signal that information (i.e., to report estimates from a wider range of models). Results that depend on an exact specification, and unravel with sensible model changes, are not reliable findings. When this is openly acknowledged, the extensiveness of sensitivity analysis will, more and more, augment significance tests as the measure of a strong finding.
2. Make replication easier. Authors should submit complete replication packages (dataset and statistical code) to journals as a condition of publication, so that skeptical readers can easily interrogate the results themselves (Freese 2007). This is particularly important for methodologically complex papers where it can be quite difficult and time consuming to perform even basic replications from scratch (Glaeser 2006). Asking authors for replication materials often seems confrontational, and authors often do not respond well to their prospective replicators. In psychology, an audit study found that only 27 percent of authors complied with data requests for replication (Wicherts et al. 2006). Barro’s openness in welcoming this replication—readily providing the data and even offering encouragement—seems to be a rare quality. Social science should not have to rely on strong personal integrity of this sort to facilitate replication. The institutional structure that publishes research should also ensure that any publication can be subject to critical inspection.[16]
3. Establish routine random testing of published results. Pushing the previous point a bit further, Gerber and Malhotra (2006) suggest establishing a formal venue within journals for randomly selected replications of published articles. The idea is to develop a semiregular section titled "Replications," with its own designated editor, in which several of the statistical papers each year are announced as randomly selected for detailed scrutiny, with wide distribution of the data and code, and the range of findings reported in brief form (as in Table 1).

Indeed, this could provide ideal applied exercises for graduate statistics seminars. Even if only a dozen or so departments across the country incorporate it into their classes, this alone would provide a remarkably thorough robustness check. The degree of model variance would quickly become transparent. Moreover, the prospect of such scrutiny would no doubt encourage researchers to preemptively critique their own findings and report more rigorous sensitivity analyses.
4. Encourage pre-specification of model design. One of the problems in statistics today is that authors have no way to credibly signal when they have conducted a true (classical) hypothesis test. Suppose a researcher diligently plans out her model specifications before she sees the data and then simply reports those findings. This researcher would be strategically better off to conduct specification searches to improve her results because readers cannot tell the difference between a true hypothesis test and a data mining exercise.

The situation would be greatly improved if there were some infrastructure to facilitate credible signaling. A research registry could be a partial solution. In medical research, clinical trials must be reported to a registry—giving a detailed account of how they will conduct the study and analyze the data—before beginning the trial.[17] A social science registry would similarly allow authors to specify their models before the data become available (Neumark 2001). This is feasible for established researchers using materials like time series data or future waves of the major surveys (e.g., NLSY, PSID, and GSS). This will, for the subset of work that is registered, bring us back to a time when model specification had to be carefully planned out in advance. Authors could then report the results of their pre-specified designs (i.e., their true hypothesis tests), as well as search for alternative, potentially better, specifications that can be tested again when the next round of data becomes available.
Because most data already exist, and authors can only credibly pre-specify for future data, this would be a long-term strategy for raising the transparency of statistical research and reducing the information asymmetry between analyst and reader. Thirty years ago, model uncertainty existed but computational limitations created a "veil of ignorance"—neither analyst nor reader knew much about how model specification affected the results. Today, authors know (or can learn) much more about the reliability of their estimates—how much results change from model to model—than their readers.

As Hoeting and colleagues (1999:399) argue, it seems clear that in the future, "accounting for model uncertainty will become an integral part of statistical modeling." All of the steps outlined here would go far, as Leamer (1983) humorously put it, to "take the con out of econometrics."

(pp. 394-95)
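To make point 1 concrete, here is a minimal sketch in Python of what a specification sensitivity analysis can look like: the same coefficient of interest is estimated under every sensible combination of control variables, and the range of estimates is reported rather than a single point estimate. The data and variable names are invented for illustration; this is the generic idea, not Young's actual models or his Table 1.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fake data standing in for a cross-country dataset (names are hypothetical).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "growth": rng.normal(size=n),
    "religiosity": rng.normal(size=n),   # the variable of interest
    "education": rng.normal(size=n),
    "investment": rng.normal(size=n),
    "openness": rng.normal(size=n),
})

controls = ["education", "investment", "openness"]
estimates = []
# Fit the model of interest under every subset of the candidate controls.
for k in range(len(controls) + 1):
    for subset in combinations(controls, k):
        formula = "growth ~ religiosity" + "".join(f" + {c}" for c in subset)
        fit = smf.ols(formula, data=df).fit()
        estimates.append((formula,
                          fit.params["religiosity"],
                          fit.pvalues["religiosity"]))

coefs = [b for _, b, _ in estimates]
n_sig = sum(1 for _, _, p in estimates if p < 0.05)
print(f"{len(estimates)} specifications; coefficient on religiosity ranges "
      f"from {min(coefs):.3f} to {max(coefs):.3f}; significant in {n_sig} of them")
```

A robust finding should have a coefficient range that stays on one side of zero across the sensible specifications; a result that flips sign or loses significance under small, defensible changes is exactly the kind of fragility the quoted passage warns about.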
In his amazingly wide-ranging The Black Swan, Nassim Nicholas Taleb comments on "confirmation bias" (a lesser point among everything he covers, but still significant).
The Confirmation Bias, or Why None of Us are Really Skeptics
Human nature makes us want to confirm our own theories, and anyone who looks likely to overturn them is obviously a threat.
This suggests that an institutional approach to allowing/encouraging replication and falsification would be a huge step forward. Perhaps obvious in hindsight?
This seems especially desirable in climate science, many aspects of which are quite new and unclear. One example is the theoretical basis for "ensembles of models" instead of "a model", which is unclear and yet an important pillar in attributing 20th-century climate change to CO2.
Unfortunately, the huge stakes involved have made it that much harder for everyone to take a step back and review the strength of the evidence.
This only increases the desirability of a framework that pushes hard in the other direction, against the natural instincts of the scientists promoting their science and a worthy cause.
A resounding YES to all 4 points!
Reiner,
I wonder how your comment relates to the situation in climate research. Climate models are not statistical models, but process-based models. Indeed, the meaning of the word "model" differs very strongly from community to community. (See, e.g., Müller, P., and H. von Storch, 2004: Computer Modelling in Atmospheric and Oceanic Sciences - Building Knowledge. Springer Verlag, Berlin - Heidelberg - New York, 304 pp, ISSN 1437-028X)
Hans
I am afraid I am not the right person to comment on this question. I thought, however, that there were similarities to the paleoclimate controversy.
Climate models are full of processes, but they also contain a lot of statistical components. It would be helpful if error range calculations were carried along with every process step, parameterized adjustment, and random variation. Of course, some other climate work, such as paleoclimate study, is full of statistical tasks for which these guidelines would be very helpful.
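As a toy illustration of what "carrying error ranges along with every process step" could look like (not a claim about how any actual climate model does it), here is a minimal Monte Carlo sketch in Python; the two process steps and all numbers are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical uncertain input and uncertain tuning parameter.
forcing = rng.normal(loc=3.7, scale=0.4, size=n)    # uncertain input
feedback = rng.normal(loc=0.5, scale=0.1, size=n)   # uncertain parameterization constant

raw_response = 0.3 * forcing                 # step 1: a toy process
adjusted = raw_response / (1.0 - feedback)   # step 2: a toy parameterized adjustment

# The "error range" is carried through each step and reported at each stage.
for name, x in [("forcing", forcing),
                ("raw response", raw_response),
                ("adjusted response", adjusted)]:
    print(f"{name:17s} mean = {x.mean():6.2f},  spread (std) = {x.std():.2f}")
```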
Good article. VERY hard to read with no blank lines in between paragraphs.
Fixed -- better now?
ReplyDeleteWhen I joined a faculty of geography after being educated in meteorology, I did not understand models of human geographers. After one year, I realized that their "models" are our "parameterizations". It was fortunate that we had corresponding (or "commensurable") concepts.
Here I understand that the title of the article was written in the dialect of social scientists. In climate modelers' dialect, it reads "Parameterization uncertainty".
@ 5
AnonyMoose, could you explain a bit more what you mean by 'but there are a lot of statistical components' and 'random variations'?
If we run a climate simulation with the same model, on the same computer, and with the same initial conditions, we get the same result. The only random component is in the choice of initial conditions. Other than that, there is no randomness included in the simulations (not counting odd errors in the parallelization or memory management and the like, which I do not think you were referring to).
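A toy illustration of that point (a simple chaotic system standing in for a climate model, not a claim about how real models are built): the deterministic integration reproduces bit for bit on rerun, and an "ensemble" arises only from perturbing the initial conditions.

```python
import numpy as np

def lorenz_trajectory(x0, y0, z0, steps=10_000, dt=0.002,
                      sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz-63 system with a simple Euler scheme."""
    x, y, z = x0, y0, z0
    for _ in range(steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    return x, y, z

# Same model, same machine, same initial conditions: identical result.
run1 = lorenz_trajectory(1.0, 1.0, 1.0)
run2 = lorenz_trajectory(1.0, 1.0, 1.0)
print("identical reruns:", run1 == run2)

# An "ensemble": the only randomness is in tiny initial-condition perturbations.
rng = np.random.default_rng(42)
finals = [lorenz_trajectory(1.0 + rng.normal(scale=1e-6), 1.0, 1.0)[0]
          for _ in range(10)]
print("spread of final x across the ensemble:", max(finals) - min(finals))
```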
Hello,
I would like to add some comments as someone who is familiar with software engineering and has significant experience in programming. I have degrees in Physics and Mathematics.
A bedrock of science is reproducibility. If computer-generated output plays a part in the science, the code must be made public; it is not enough to give specifications or methods. What happens if another scientist, using those specifications or methods, gets different results? What if one of the programs had an error; who made the error? There is a saying in software engineering that the program is its own documentation. Many times a programmer makes seemingly small changes and forgets to document them.
A computer program, if it is useful, almost always undergoes changes as long as it is used. Perhaps new features are added or bugs are corrected. Confusion can arise if one publishes a paper using one version of a program and later changes are made to the program. It is not difficult to keep track of different versions.
CVS is a version control system used in Open Source software that does just that. To use CVS, you make your changes and commit the file, and you get a new current version. CVS works by creating diff files that keep track of the changes made since the previous version. One can retrieve version 1, apply a diff file to get version 2, apply another diff file to get version 3, and so on. It is not difficult to recreate any previous version of a file.
One can also use a versioning system like CVS for data files. The base file of, say, weather station data may be version 1, and changes from homogenization or other adjustments may be applied as later versions. This would allow one to reproduce a changing data file as it stood at a particular time.
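A minimal Python sketch of that idea (a hand-rolled stand-in for what CVS does with diff files; the station names and adjustment values are invented for illustration):

```python
# Toy version history for a station-data file: version 1 is the raw base,
# and each later version is stored only as the adjustments applied on top
# of the previous one (the role that diff files play in CVS).
base = {"station_A": 14.2, "station_B": 9.7, "station_C": 11.5}    # version 1 (raw)

adjustments = [
    {"station_B": 9.9},                      # version 2: homogenization of station_B
    {"station_A": 14.0, "station_C": 11.8},  # version 3: further corrections
]

def reconstruct(base, adjustments, version):
    """Rebuild the data file as it stood at a given version number."""
    data = dict(base)
    for step in adjustments[: version - 1]:
        data.update(step)
    return data

for v in (1, 2, 3):
    print("version", v, "->", reconstruct(base, adjustments, v))
```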
Often intermediate files are created. Needless to say, the programs that create them must be saved as well.
Finally, one should have a script (or a makefile) that goes through all the steps, including creating the intermediate files, to get the results. Open Source projects usually use makefiles. The data files and programs are generally put together in what is called a tarball, along with a README file.
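As a sketch of such a driver, here is a minimal Python version (the comment suggests a shell script or makefile; the file names and processing steps below are hypothetical, and a tiny fake raw file is written first only so the sketch runs standalone):

```python
"""Toy end-to-end driver: raw data -> intermediate file -> final result."""
import csv
import statistics

RAW = "raw_station_data.csv"
INTERMEDIATE = "intermediate_clean.csv"

def make_demo_raw_file():
    # Only so the sketch runs standalone: write a tiny fake raw data file.
    with open(RAW, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["station", "temp_c"])
        w.writerows([["A", "14.2"], ["B", ""], ["C", "11.5"]])

def step1_clean():
    # Create the intermediate file: drop records with missing temperatures.
    with open(RAW, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r["temp_c"] != ""]
    with open(INTERMEDIATE, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["station", "temp_c"])
        w.writeheader()
        w.writerows(rows)

def step2_result():
    # Compute the final number from the intermediate file.
    with open(INTERMEDIATE, newline="") as f:
        return statistics.mean(float(r["temp_c"]) for r in csv.DictReader(f))

if __name__ == "__main__":
    make_demo_raw_file()
    step1_clean()
    print("result:", step2_result())
```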
Not only does something like the above allow results to be easily reproduced, but in my experience such a procedure can also save the programmer time. Admittedly there may be a steep learning curve, but if one spends five years developing and using a large software system, it probably pays to automate as much of the work as possible. A large system by definition involves many interacting parts, and it is hard to keep everything in one's head. Written documentation on how to generate the results from the data files generally does not work, because it may not be kept up to date. The script or makefile is the documentation.
klee12
This article deals with quantitative social science, but I do see a lot of parallels with climate modeling (though I admit that is something I have no experience with).
The journal publication format has never been suited for the sort of data and model accountability advocated here. I agree with the need, and am happy that the computer age makes this possible, but I can also understand the resistance.
Random testing by graduate students? (and fodder for one's critics?) Preparing a complete replication package in addition to a journal article?
Hmpf... maybe I'll just take my work to another journal.
The pre-specification of model design seems to be something more particular to the social sciences, where it is very easy to mine datasets for statistically significant correlations and present that as a finding.
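A quick simulation of why that mining is so easy (pure noise with no real relationships; the variable counts are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_obs, n_predictors = 100, 50

# Pure noise: an outcome and 50 candidate predictors with no real relationship.
outcome = rng.normal(size=n_obs)
predictors = rng.normal(size=(n_obs, n_predictors))

# Test every candidate predictor against the outcome.
significant = 0
for j in range(n_predictors):
    r, p = stats.pearsonr(predictors[:, j], outcome)
    if p < 0.05:
        significant += 1

print(f"{significant} of {n_predictors} noise predictors are 'significant' at p < 0.05")
```

On average, a few of the fifty noise variables will clear the 5 percent threshold, which is exactly why a report of one "significant" correlation, absent pre-specification, is hard to interpret.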
I have had real trouble with the quantification of uncertainty in climate model predictions. Sure, we can check the variance across multiple models, but that seems only to scratch the surface of the possible uncertainty.
As for the historical temperature record of the last century - it would be great to see how sensitive these records (or models if you will) are to various specifications.
I like the preceding discussion of CVS - wouldn't it be handy if such a tool had been used from the outset on climate datasets (along with other kinds of wishful thinking about what might have been done if people had known how the data would be used).