We are pleased to see that some of the authors whose methods we criticized put forth a response for the press. We look forward to responding to a full rebuttal in a peer reviewed journal, but in the meantime, this is a brief response.
You can find the original text of the rebuttal to which we responded on Jan 15th mirrored here: voodoorebuttal.pdf. These authors also have an evolving rebuttal, but it has changed at least once since we replied to it, so we can’t be sure whether our comments below will address the points in this version: http://www.bcn-nic.nl/replyVul.pdf.
Below we synopsize the points of the rebuttal and offer brief reactions.
1. Correction for multiple comparisons safeguards against inflation of correlations.
It appears that the authors of the rebuttal misunderstand what correction for multiple comparisons provides. This correction decreases the probability of a false positive — the probability of ‘detecting’ a signal if no signal deviating from zero is present. That is all that it provides.
Choosing only voxels that exceed a high threshold (regardless of what that threshold represents, or what calculations were used to select it) truncates the noise distribution. Necessarily, therefore, this procedure yields a set of measures that have, on average, been exaggerated by chance. In the situations we examined, the signals are correlations, so the bias results in inflated correlations. Correction for multiple comparisons does not change this fact (and may even exacerbate it; see Appendix 2).
This overestimation, which arises whenever one selects and estimates based upon the same noisy measurements, is confirmed by the simulations presented in our paper (Appendix 2), and it has long been recognized in statistics and machine learning.
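The truncation effect described above can be seen in a few lines of simulation. The sketch below is not the simulation from our paper; it is a minimal, hypothetical illustration (the 10,000 voxels, 20 subjects, and 2-SD threshold are made-up values) in which every "voxel" has a true effect of exactly zero, yet the voxels surviving a high threshold have a clearly positive average.

```python
import numpy as np

rng = np.random.default_rng(0)

n_voxels, n_subjects = 10_000, 20
true_effect = 0.0                          # the true signal at every voxel is zero
noise_sd = 1.0 / np.sqrt(n_subjects - 3)   # rough sampling SD of a Fisher-z correlation
observed = rng.normal(true_effect, noise_sd, size=n_voxels)

threshold = 2.0 * noise_sd                 # keep only voxels exceeding a high threshold
selected = observed[observed > threshold]

print(f"mean of all voxels:      {observed.mean():+.3f}")   # near zero
print(f"mean of selected voxels: {selected.mean():+.3f}")   # well above zero
```

Thresholding has discarded the left side of the noise distribution, so the surviving measurements overestimate the (zero) true effect by construction; no multiple-comparisons correction applied to the selection step can undo that.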
2. Our claims based on calculations of an ‘upper bound’ on the correlations are inappropriate.
We used the term “impossibly high” as a casual way of referring to highly improbable correlations. As we reviewed in our paper, reliabilities of both questionnaire measures and fMRI measures have generally been found to be modest, and we are aware of no reason to suspect that these reliabilities would have been any higher in the studies we write about (see discussion in the supplementary Q and A). Given these modest expected reliabilities, the large number of extremely high correlations reported in this literature seemed suspicious to us, and we contend that these suspicions have been confirmed by what we have learned about the non-independent methods used in many studies (point 1 above). The variability and uncertainty associated with reliability estimates seems to us unlikely to explain why this literature has featured so many enormous correlations.
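For concreteness, the ‘upper bound’ in question is the classical test-theory attenuation ceiling: even a perfect true correlation between two constructs cannot be observed to exceed the geometric mean of the two measures’ reliabilities. A minimal sketch, in which the reliability values 0.8 and 0.7 are purely illustrative assumptions of the modest magnitudes discussed above:

```python
import math

def max_observable_r(rel_x: float, rel_y: float) -> float:
    """Classical test-theory ceiling on an observed correlation:
    even a true correlation of 1.0 is attenuated to sqrt(rel_x * rel_y)."""
    return math.sqrt(rel_x * rel_y)

# illustrative reliabilities for a behavioral measure and an fMRI measure
print(round(max_observable_r(0.8, 0.7), 2))  # 0.75
```

With reliabilities in this range, observed correlations much above roughly 0.75 are, in expectation, not achievable, which is why reports of correlations near 0.9 struck us as implausible on their face.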
3. Our simulations are misleading about false alarm rates.
Our first simulation was designed to show, in its simplest form, the main point of our paper: that one cannot apply a threshold to a set of noisy measurements and take the central tendency (or worse, the peak) of the above-threshold distribution as an estimate of the true values of those measurements. We did not intend to make any assertions about the rate of false alarms, nor to claim that all the correlations we contend are inflated are false alarms.
4. Non-independent analyses sometimes yield low or non-significant correlations.
In the rebuttal, the authors assert that the sorts of non-independent analyses we describe do not always produce substantial correlations. However, they do not provide specific examples, so we are not able to meaningfully comment.
5. Correlation magnitude is not so important.
The authors suggest that the magnitude of the correlations really doesn’t matter, implying that providing accurate measurements of correlations is not all that vital. We addressed this issue in our paper, and explained why effect size measures are of great relevance to both the practical and theoretical importance of findings in any field. Whether or not the authors themselves care about the magnitude of the correlations, their procedures for producing these correlation estimates produce inflated numbers. The scientific literature should, where possible, be free of such erroneous measurements.
6. If non-independent analyses are so untrustworthy, why are they producing replicable results?
This is a very important point: if what we say is true, why do replications of the measured correlations occur? Assessing this claim requires an in-depth examination of specific literatures, which is beyond the scope of this rapid response, but we look forward to examining some specific cases in the future. For the moment, we would just note:
a. Merely finding a significant correlation in a nearby spot on two occasions does not validate a correlation magnitude estimate, which is the major point on which we have expressed skepticism.
b. If, as we contend, non-independent analyses will regularly inflate correlations, doing a non-independent analysis can hardly be said to validate another non-independent analysis.
c. It is not completely clear what constitutes a replication of a finding that specifies a particular cluster showing a correlation with a particular behavioral measure. However, what would clearly be impressive would be analyzing the data of a second study using the anatomical markers identified in an earlier study (e.g., the Talairach coordinates) and then testing the correlation in that specific region. If the findings we criticize can be taken at face value, we think such replications ought to be successful. We have not encountered any examples of such true replication, and would challenge investigators to undertake such analyses.
7. Our survey was misleading and confusing.
The critical point here would seem to be whether we misclassified the methods of some studies, counting them as having conducted non-independent analyses when in fact they had not. If this happened, we would regret it, and any authors who feel that their papers have been misclassified should please contact us and provide details.
Thus far, however, no one has told us of any such misclassifications. This makes us think that our survey may not have been so very confusing.
It has also been noted that we did not inquire about multiple comparisons correction methods. This is true — we didn’t ask because such corrections are irrelevant to our criticism (see point 1). However, as described in the paper, we did uncover what appear to be very serious misapplications of multiple comparisons correction procedures in some cases.
8. Our suggested split-half analyses are not necessarily non-independent.
Here we think the authors of the rebuttal raise an excellent point. There are different potential sources of noise in fMRI data. Much of the noise (experience would suggest, perhaps the vast majority) arises from individual fMRI measurements, but some noise is shared across an fMRI session. The split-half analysis we proposed — following common practice in fields such as machine learning — would make the measurement noise independent across halves, but the session noise would be shared. Insofar as there is any shared session noise, the measures will not be fully independent.

While we agree on this point, the correlation of noise in our proposed analysis is surely well below 1.0, and thus the analysis we propose suffers from the non-independence problem to a much lower degree than the analyses we criticize. Moreover, our paper explicitly advocates that this split-half procedure be used as a supplement to a simpler analysis — the one we suspect most readers of these papers will have assumed was done in the first place — namely, specifying regions on an a priori anatomical or functional basis and averaging over those regions. There is evidently no single perfect analysis of brain-behavior correlations, but the procedures we suggest should offer a major improvement over the non-independent approaches now in wide use.
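The logic of the split-half procedure can be sketched as follows. This is an idealized toy, not any study's actual pipeline: it assumes, contrary to the caveat above, no shared session noise, and the sample sizes and pure-noise data are invented for illustration. Voxels are selected on one half of the data, and the correlation is then estimated on the other, held-out half.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subj, n_vox = 40, 500

behavior = rng.normal(size=n_subj)
# pure-noise "voxel" signals measured in two halves of a session;
# in this idealized sketch the halves are fully independent
half1 = rng.normal(size=(n_subj, n_vox))
half2 = rng.normal(size=(n_subj, n_vox))

def corr_with_behavior(data: np.ndarray) -> np.ndarray:
    """Pearson correlation of each voxel column with the behavioral score."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    b = (behavior - behavior.mean()) / behavior.std()
    return (z * b[:, None]).mean(axis=0)

r1 = corr_with_behavior(half1)
selected = np.argsort(r1)[-10:]      # the 10 "best" voxels, chosen on half 1

non_independent = r1[selected].mean()                    # select and estimate on the same data
split_half = corr_with_behavior(half2)[selected].mean()  # estimate on the held-out half

print(f"non-independent estimate: {non_independent:.2f}")  # inflated, despite zero true signal
print(f"split-half estimate:      {split_half:.2f}")       # near the true value of zero
```

Because the selection noise in half 1 is independent of the measurement noise in half 2, the held-out estimate is not pushed upward by the selection step; to whatever extent session noise is in fact shared between halves, some residual inflation would survive, which is the rebuttal's point.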