Wednesday, July 11, 2012

Trendy trends


A recurrent issue in statistical climatology is how to deal with long-term trends when one is trying to estimate correlations between two time series. A reader sent us the following question, posed to him by a friend of his, related to our test of the method applied by Mann et al. to produce the hockey-stick curve in 1998:

Incidentally, the allegedly false hockey stick (keyword: Climategate) has not only been refuted; the climate sceptics now have their own Climategate, since, of all things, a calculation error was demonstrated in a study by a climate sceptic (von Storch). He admitted it, but did not correct it in the journal in which he had published the article, but rather in a completely insignificant one, which can be interpreted as an attempt at a cover-up. And it is precisely this study - wrong by the author's own admission - that is frequently cited by climate sceptics.



I will not enter into the question of why realclimate linked to the Comment published in Science by Wahl et al. (2006) and not to our response, both published side by side. Sometimes - actually, never - should one trust a blog as the sole source of information. Interestingly, the fact that 'the friend' was not aware that a response did exist and had been published in the same journal led him immediately to assume dishonest behaviour. He/she did not bother either to check for himself whether the comment published in Science had prompted a response. 'Confirmation bias' is present everywhere. It is probably unavoidable, but a useful Chinese proverb may offer some help: 'if you think you are 100% right, then you are wrong with 95% probability'.

Instead of delving into Chinese philosophy formulated in terms of IPCC likelihoods, I will use this example to illustrate how the nasty trends present in many climate records pose challenges to the design of regression models. The basic problem is that two series that display a prominent trend will always appear correlated, independently of whether or not they are indeed physically related. In the press release on the Science Comment at that time (2006) we showed a nice example of a completely false inference based on the correlation between two trendy series: the Northern Hemisphere mean temperature and unemployment in West Germany over the last decades. Taking this correlation at face value, one could design a statistical model that predicts the Northern Hemisphere temperature from the unemployment figures, and this statistical model would even deliver a nice value of a validation statistic that is commonly used in climate reconstructions, the Reduction of Error (RE). This diagnostic weights the agreement between the mean values of the reconstruction and the target data more strongly than the correlation between the two series does, the correlation focusing instead on the agreement between their short-term wiggles. Depending on how strong the interannual variability is relative to the long-term trend, either the RE or the correlation will provide the more faithful measure of the skill of the estimation.

In this example, the apparent agreement between both time series is obviously an artefact, since temperatures and unemployment are unrelated, but the problem illustrated here is present in many attempts to calibrate proxy records over the 20th century. As soon as a proxy record exhibits a trend, positive or negative, it will display an apparent correlation with the global mean temperature, and it might thus be taken as an adequate proxy to reconstruct the global mean in past times as well. It may happen that this correlation is physically sound, and thus correctly interpreted, but when the series are trendy one cannot be sure. The relationship between proxies and climate is often not physically obvious.
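
For readers who prefer to see this rather than take it on faith, here is a minimal numerical sketch (synthetic data and hypothetical numbers, not the series from the press release) in which two unrelated but trending series produce both a high correlation and a seemingly skilful Reduction of Error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
t = np.arange(n)

# two physically unrelated series that both happen to trend upwards
temperature = 0.02 * t + rng.normal(0, 0.15, n)    # stand-in for "NH temperature"
unemployment = 0.05 * t + rng.normal(0, 0.40, n)   # stand-in for "unemployment"

print("correlation:", np.corrcoef(temperature, unemployment)[0, 1])

# calibrate a linear model on the first half, validate on the second half
cal, val = slice(0, n // 2), slice(n // 2, n)
slope, intercept = np.polyfit(unemployment[cal], temperature[cal], 1)
reconstruction = slope * unemployment[val] + intercept

# Reduction of Error: 1 - SSE(reconstruction) / SSE(calibration-period mean)
sse_rec = np.sum((temperature[val] - reconstruction) ** 2)
sse_clim = np.sum((temperature[val] - temperature[cal].mean()) ** 2)
print("RE:", 1 - sse_rec / sse_clim)
```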

Mann et al. (1998), in their lengthy description of their reconstruction method, mentioned at some stages that they had used 'detrended variables' to calculate some diagnostics of the skill of their method. We interpreted, wrongly as it turned out, that they had detrended the proxy and temperature series to calibrate their statistical model. This was very soon taken as proof that we had committed a calculation error and that the whole analysis was flawed. Our response to their Comment showed that, in essence, detrending the data or not did not make a material difference, and that in both cases the method applied to produce the hockey stick would underestimate the long-term variations in most circumstances. Interestingly, our colleague and friend Gerd Bürger had also submitted a comment to Science in 2005 that raised very similar questions. A more elaborate version was eventually published in Geophysical Research Letters. But the journal Science thought in 2005 that Gerd's manuscript was not interesting enough to warrant publication. A few months later it changed its opinion, convincing me that for Science all authors are equal but some authors are more equal than others.
The stage was already set for prejudices to unfold and for the climate aficionados to choose their preferred sides. The paper by von Storch et al. (2004) was perceived by some as an attack on the hockey stick and, by the same token, on the larger corpus of anthropogenic warming - something it was not. The mannistas and the anti-mannistas were poised to fend off the forays of their respective adversaries into their own territory, independently of the contents of the Wahl et al. Comment or of our response, which quite likely very few people took the time to read.
This little episode had, however, a positive ending: later on I had the chance to meet Eugene Wahl in person, one of the nicest scientists you can imagine, both personally and professionally, and, ironically, one of the most unfairly treated since Climategate.

14 comments:

hvw said...

Thanks Eduardo, that is interesting.

For now, just a broken link report:
The last link to Geophys. Res. Lett. needs to be

http://www.agu.org/pubs/crossref/2005/2005GL024155.shtml

wllacer said...

Eduardo
An interesting warning re. the very often neglected "Correlation is not causation" meme.

I was really shocked to read H. v. Storch and you qualified as "skeptics". It sadly reminds me of when everyone to the right of the Party was labeled a "fascist" ...

Will you publish an entry about "your" latest article? It's making a bit of noise ...

hvw said...

Eduardo,
"I will not enter into the question of why realclimate linked to the Comment published in Science by Wahl et al. (2006) and not to our response, both published side by side."

In fact, the realclimate article has a link to your response to Wahl et al. in Science, 2006. The reader you referred to just did not read properly.

However, I perceive the realclimate article as a bit unfairly slanted, because it blames you for not responding to the critique before the critique was published - but then, you must have been aware of it anyway. RC seems to justify this extraordinary demand by the assumption that this critique of your paper invalidates its conclusions. Judging from your response, however, you see this differently, and from that perspective there isn't any need, by any standard, to reply to an inconsequential little oversight.

I also find your press-release example (unemployment) a bit misleading. It simply illustrates the high-school knowledge that correlation does not imply causation. However, to extrapolate temperature from proxies we need not only causation but also a good idea about the nature of this causation. This, in principle, cannot come from the time series themselves. Fortunately there is a huge body of knowledge concerning the physical, biological and chemical processes that relate various proxies to temperature (as opposed to the literature pertaining to unemployment as a function of temperature). So your argument for removing low-frequency variation (the linear trend) from the time series before regression, i.e. to avoid inflation of the validation measure due to correlation "by chance", seems misguided. The validation helps to select the best model among candidates; its value in validating the underlying assumptions is very limited indeed. Intuitively it seems right to me to leave the low-frequency signal in the calibration, if the low-frequency variation is what we are interested in. Conversely, removing the trend seems to rely on the assumption of scale-invariance of the temperature-proxy relationship.


What I would like to know: 14 years after pretty much the very first attempt at a multiproxy reconstruction (MBH98), to which your article referred, and six years after the exchange described above, what is the state of the art for this problem? The linear regression methods, direct or inverse, with PCA or not, and particularly the results of Bürger & Cubasch (2005), tell me that the statistical methodology taken into consideration at the time was just very much ad hoc and pedestrian. Do we have more powerful methods today and better guidance about which method works in which case? Wouldn't this, for example, be a poster-child problem for Bayesian approaches? Any recommendation for a recent review paper?

Hans von Storch said...

The original questioner (see Eduardo's text) wrote to me in reaction to Eduardo's post:

"many thanks!

Why can this discussion not take place without tricks and disinformation???

I read many comments by scientists on the topic and often encounter the most heated and often unobjective disputes, which I can only explain by the fact that no unambiguous scientific results exist, so that much of it is personal opinion and interpretation. And for that we taxpayers spend tens of billions (CO2 tax, EEG-Gesetz, etc. etc.)…
"

I think this should give us all pause for thought: we should try to name the common consensus - and the dissent, that is, the agreement about what we do not agree on.

OBothe said...

two things.

First: I'll second hvw's question of whether there is a recent review. I should know about them, but the only thing that comes to mind is Jason Smerdon's WIREs Climate Change paper on pseudo-proxy work (see here or here). There is the editorial by Hughes and Ammann, and there is Tingley et al.'s "Piecing together the past: statistical insights into paleoclimatic reconstructions", but a "complete" review of the methodologies? Some insights may come from blog posts (lucia, SMcI, JeffID etc.) or the discussions surrounding Bo Christiansen's publications of the last years.

Tingley's BARCAST may be more powerful than the regression-based methods. Which leads back again to Smerdon's publication page.

My second point: consent about dissent. Yes, but it's kind of astonishing how quickly the scientific discussion becomes infested with emotions. One only has to look at the most recent spectacle. Interestingly, it runs along trenches quite similar to those in Eduardo's description above.

Hans von Storch said...

O Bothe,

let us assume that p_t is a series of proxy data, and f_t the geophysical variable of interest. Let us further assume that p_t and f_t are stationary random variables, which is, with respect to p_t, a nontrivial assumption (without it, statistical analysis makes little sense; one can weaken this assumption by going to quasi-stationarity or other, more complex constellations, but I have never seen this done).

When building a statistical link, you assume that you learn something from the joint variability of the pairs (p_t, f_t). To do so, you must have several, or even better many, samples of (p_t, f_t). Also, you should know how often a new pair tells you something NEW about the joint generating process; that is, how often (p_t+1, f_t+1) is essentially the same constellation already described by (p_t, f_t). In particular, you do not want to see unrelated trends in both variables. Unfortunately, statistics can hardly tell you whether the trends are related or not, unless you refer to the difference-stationary time series analysis methods known from econometrics (Schmith, T., S. Johansen, and P. Thejll, 2007: Comment on “A Semi-Empirical Approach to Projecting Future Sea-Level Rise”, Science, 10.1126/science.1143286). Thus, what matters is the number of independent sample pairs; the assumptions about the sampling process are a key element whenever statements about the reality of a link between the variations are made.

Now, let's write p = p* + p' and f = f* + f', with p* being the archive for variations in f, and f* the archive for variations in p. [Symmetry here, because both forward and inverted regression are in use.] In case of a forward regression, the model is p = alpha f + random error, i.e. p* = alpha f, with alpha = Cov(p,f)/Var(f), and Var(p*) = alpha^2 Var(f) = Cov(p,f)^2/Var(f) = Corr(p,f)^2 Var(p). Since the correlation is in all practical situations less than 1, we find Var(p*) < Var(p). [The same way the other way around.] Independently of whether we use forward or inverted regression, we have Var(p*) ≠ Var(p) and Var(f*) ≠ Var(f). Which is obvious, because we have the nonzero contributions f' and p', which are part of p and f but do not leave traces on the other variable: Cov(p',f) = 0 and Cov(f',p) = 0. (I hope my calculations are complete.)
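
[In case the algebra above is easier to digest numerically, here is a small sketch with synthetic data and hypothetical numbers - not any published reconstruction - that checks Var(p*) = Corr(p,f)^2 Var(p) < Var(p) for a noisy proxy:]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
f = rng.normal(0.0, 1.0, n)               # "true" geophysical variable
p = 0.7 * f + rng.normal(0.0, 1.0, n)     # proxy = signal plus unrelated noise p'

# forward-regression coefficient alpha = Cov(p, f) / Var(f)
alpha = np.mean((p - p.mean()) * (f - f.mean())) / np.var(f)
p_star = alpha * f                         # part of p explained by f

corr = np.corrcoef(p, f)[0, 1]
print("Var(p)            :", np.var(p))
print("Var(p*)           :", np.var(p_star))
print("Corr(p,f)^2*Var(p):", corr**2 * np.var(p))   # equals Var(p*)
```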

With statistical analysis, 100% of the variance of f, or of p, cannot be recovered by screening p (or f). Some part of the original variability is lost, and lost for good, unless one could recover f' (or p'), which very likely is not just noise. The same applies when more sophisticated links are established, such as neural nets or whatever - methods which, in general, need many more samples to reach reasonably small estimation errors (please check).

An often used trick is to employ "inflation", that is, to merely multiply the *-series by a suitable factor so that Var(p*) = Var(p). This implicitly assumes that p' = 0, or Corr(p,f) = 1, which is obviously an invalid assumption. All proxies contain variations which are related not to whatever we want to take them as representative of - temperature, precipitation etc. - but to other influences, such as local environmental changes, ranging from bug contamination to landslides.
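
[A companion sketch, same synthetic setup and purely illustrative: rescaling p* so that its variance matches Var(p) fixes the variance bookkeeping, but the mismatch to the actual proxy series grows unless Corr(p,f) = 1:]

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
f = rng.normal(0.0, 1.0, n)
p = 0.7 * f + rng.normal(0.0, 1.0, n)       # proxy = signal plus noise p'

alpha = np.mean((p - p.mean()) * (f - f.mean())) / np.var(f)
p_star = alpha * f                          # regression estimate of p
r = np.corrcoef(p, f)[0, 1]
p_inflated = p_star / abs(r)                # "inflation": variance now matches Var(p)

print("Var(p), Var(inflated)    :", np.var(p), np.var(p_inflated))
print("error variance, plain    :", np.var(p - p_star))
print("error variance, inflated :", np.var(p - p_inflated))   # larger, unless Corr = 1
```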

Another trick is to add complexity to the statistical method, in the hope that more simple-minded people would not understand such methods and would trust that the complexity adds reliability to the result. In general this is not the case.

In short: the problem that proxy reconstructions tell us only part of what happened is an intrinsic property of the approach and cannot be overcome by statistical analysis alone. A possible solution may be process-based modelling that uses proxy data to constrain the dynamical modelling (cf. data assimilation) - but, on the other hand, what is lost is lost. Proxies do not tell us past states, but only part of past states and variations.

Hans von Storch said...

Forgot two points:
a) the link between f and p, proxy and geophysical data, may not be stationary.
b) Correlations in this business are often about 0.7 or less, corresponding to 1/2 or less of the variance explained, i.e. 1/2 or more of the variance remains "unexplained".

Anonymous said...

@ hvw

"what is the state of the art for this problem?"

Maybe ensemble reconstructions, as described here:
http://www.climate.unibe.ch/~joos/papers/frank10nat.pdf

The idea is that, when you have no chance of finding the "best" reconstruction, you can obtain valuable information about the uncertainties by using several methods and creating an ensemble of reconstructions.

Andreas

Anonymous said...

@ Eduardo

Thanks for your interesting contribution. I wish you had told us more about your encounter with E. Wahl.

PS:
I've just read an excellent article and learned a lot about MWP and LIA and their relevance for climate projections.
http://www.st-andrews.ac.uk/~rjsw/all%20pdfs/Franketal2010.pdf
Thanks!

Andreas

eduardo said...

@ 2

Wallacer,
the issue with the trends is actually a step prior to the 'correlation is not causation' meme. I would rather describe it as 'common trends are not correlation'. All series with a long-term trend appear correlated, but, when tested properly, that correlation is not statistically significant. The number of degrees of freedom is much smaller than the number of time steps.
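
[To make this concrete, here is a minimal Monte-Carlo sketch (synthetic series, hypothetical numbers, assuming NumPy and SciPy are available) in which two mutually independent series share nothing but a deterministic trend, yet a naive significance test that treats every time step as an independent sample flags the correlation as significant almost every time:]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials = 100, 1000
t = np.arange(n)

flagged = 0
for _ in range(trials):
    # two mutually independent series sharing only a deterministic trend
    x = 0.01 * t + rng.normal(0, 0.3, n)
    y = 0.01 * t + rng.normal(0, 0.3, n)
    r, p_value = stats.pearsonr(x, y)       # test assumes n independent samples
    if p_value < 0.05:
        flagged += 1

print("fraction flagged 'significant':", flagged / trials)   # close to 1, not 0.05
```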

eduardo said...

@3 hvw,

well, that link was included at some point later. I remember posting a comment there to make the realclimate readers aware of the existence of the response, but it was 'moderated'. Anyway, this is not really important now, after all the years that have passed.

With the perspective of these few years - and independently of the issue of detrending - the problem of the underestimation of the variance has been confirmed by many other studies. A nice review was written by Smerdon just a few months ago (linked in comment 5), which, if I remember properly, does not include the recent application of Bayesian methods. From the results that we are getting in other projects - still unpublished - I would say that the Bayesian hierarchical methods still suffer from this underestimation. A previous attempt with Bayesian methods, including not only proxy information but also information about the external forcing, was published by Lee and others.

As Hans explained before, this can be a fundamental property of a large family of statistical models.

I would mention that other methods show promising results: methods based on the local calibration of one proxy record with one instrumental temperature record, using inverse regression (also known as classical calibration; the predictor is the instrumental variable, the predictand is the proxy). Bo Christiansen blogged about this here in the Klimazwiebel some time ago.

Leaving the low-frequency signal in would be, in my opinion, justified if we were completely sure that the proxy is reacting to climate and we just wished to calibrate the proxy as accurately as possible. Unfortunately, this is not the case. There are many proxy records around - though by no means all - that simply do not contain any climate signal. They have sometimes been interpreted as a temperature signal, then later as a precipitation signal, later as a mixed signal that flips in certain periods... etc. In other cases, for instance stalagmites, records from the same cave look quite different, and the experts here claim that you need very developed mechanistic knowledge of the proxy to identify the best locations within a single cave. In the Mann et al. (1998) study, some precipitation records were clearly wrongly interpreted as containing a temperature signal, something that was much criticized in the paleo community at that time.

I would say that most of us still have quite a lot to learn from the professionals, but they don't show much interest, with some exceptions. One initiative was the workshop organized last year.

hvw said...

OBothe, eduardo

thanks for the pointers and extensive comments. Actually, I find Tingley et al. (2012) quite informative. They give a nice overview of what is out there, from a Bayesian perspective, which makes it conceptually simpler. The paper is not at all rigorous or deep, but that makes it a very accessible, easy read.

Some article-comment-response exchanges are elucidating on a meta-level and nicely illustrate how people with a classical (frequentist) background can misunderstand those who try to advance a modern (Bayesian) approach (Christiansen's (2012) LOC, Tingley's (2012) comment and Christiansen's reply).

The take-away for me is:
1) Damn hard problem

2) The state of the art, performing at least as well as everything else in the majority of evaluations and being currently practically applicable, is RegEM (Schneider 2011)

3) This by no means implies that RegEM is "good enough", on the contrary.

4) The way forward is Bayesian Hierarchical Models (BHMs), because only this framework allows for a clean inclusion of all the information available (e.g. the spatiotemporal covariance structure of both predictand and predictor) and proper uncertainty propagation. Most importantly, and referring to eduardo's last paragraph, BHMs can easily (well, conceptually easily) incorporate detailed, complicated (aka realistic) process-level models, and this to me seems the proper way to constrain the uncertainty of proxy reconstructions - as opposed to simulating uncertainty assessment through GCM-driven pseudo-proxies, which are constructed so that they behave in a way that is doable in the chosen statistical framework (aka linear) but don't incorporate (and possibly even contradict) the empirical process knowledge available.

5) To actually do 4) you need Bayesian statisticians, paleo people, climatologists, a good scientific programmer and a huge cluster, all working nicely together. But I am pretty sure that projects in that direction are underway.

hvw said...

HvS, #6
"Another trick is to add complexity to the statistical method, in the hope that more simple-minded people would not understand such methods and would trust that the complexity adds reliability to the result. In general this is not the case."

Such evil tricks might actually happen, and I am certainly among the "simple minded people" who do not understand purposely obfuscated methods (but I am not so simple-minded as to automatically assume they work). But obfuscated methods that do not work will have no impact in the long run. What worries me more is the opposite, and you actually point to a nice example of this attitude: a valid statistical criticism (Schmith et al. (2007)) of a study that is flawed by an oversimplistic statistical approach is handwavingly discarded.