Hey—here’s a tip from the biology literature: If your correlation is .02, try binning your data to get a correlation of .8 or .9!

Josh Cherry writes:

This isn’t in the social sciences, but it’s an egregious example of statistical malpractice:

Below the abstract you can find my [Cherry’s] comment on the problem, which was submitted as a letter to the journal, but rejected on the grounds that the issue does not affect the main conclusions of the article (sheesh!). These folks had variables with Spearman correlations ranging from 0.02 to 0.07, but they report “strong correlations” (0.74-0.96) that they obtain by binning and averaging, essentially averaging away unexplained variance. This sort of thing has been done in other fields as well.

The paper in question, by A. Diament, R. Y. Pinter, and T. Tuller, is called, “Three-dimensional eukaryotic genomic organization is strongly correlated with codon usage expression and function.” I don’t know from eukaryotic genomic organization, nor have I ever heard of “codon”—maybe I’m using the stuff all the time without realizing it!—but I have heard of “strongly correlated.” Actually, in the abstract of the paper it gets upgraded to “very strongly correlated.”

In the months since Cherry sent this to me, more comments have appeared at the above Pubmed commons link, including this one from Joshua Plotkin, which shares the original referee report that he and Premal Shah wrote in 2014 recommending rejection of the paper. Key comment:

Every single correlation reported in the paper is based on binned data. Although it is sometimes appropriate to bin the data for visualization purposes, it is entirely without merit to report correlation coefficients (and associated p-values) on binned data . . . Based on their own figures 3D and S2A, it seems clear that their results either have a very small effect or do not hold at all when analyzing the actual raw data.

And this:

Moreover, the correlation coefficients reported in most of their plots make no sense whatsoever. For instance, in Fig1B, the best-fit regression line of CUBS vs PPI barely passes through the bulk of the data, and yet the authors report a perfect correlation of R=1.

A follow-up comment by Plotkin has some numbers:

In the paper by Diament et al 2014, the authors never reported the actual correlation (r = 0.022) between two genomic measurements; instead they reported correlations on binned data (r = 0.86).

I think we can all agree that .022 is a low correlation and .86 is a high correlation.

But then there’s this from Tuller:

In Shah P, 2013 Plotkin & Shah report in the abstract a correlation which is in fact very weak (according to their definitions here), r = 0.12, without controlling for relevant additional fundamental variables, and include a figure of binned values related to this correlation. This correlation (0.12) is reported in their study as “a strong positive correlation”.

So now I’m thinking that everyone in this field should just stop calling correlations high or low or strong or weak. Better just to report the damn number.

Tuller also writes:

If the number of points in a typical systems biology study is ~300, the number of points analyzed in our study is 1,230,000-fold higher (!); a priori, a researcher with some minimal experience in the field should not expect to see similar levels of correlations in the two cases. Everyone also knows that increasing the number of points, specifically when dealing with nontrivial NGS data, also tends to very significantly decrease the correlation.

Huh? I have no idea what they’re talking about here.

But, in all seriousness, it sounds to me like all these researchers should stop talking about correlation. If you have a measure that gets weaker and weaker as your sample size increases, that doesn’t seem like good science to me! I’m glad that Cherry put in the effort to fight this one.

27 thoughts on “Hey—here’s a tip from the biology literature: If your correlation is .02, try binning your data to get a correlation of .8 or .9!”

  1. Doesn’t this change in magnitude, when observed on large-sample data, tell us something interesting? The large sample allows us to infer that the result is unlikely to be due to sampling, though that cannot be ruled out entirely. It tells me that the small variations in x are not meaningfully related to y. It makes me ask: why is this so? What about the two phenomena might explain this finding? It makes me want to design a study to better understand why it is so and to test some causal hypotheses.

    • It also makes me wonder whether the statistical model being applied is adequately accounting for error or uncertainty. And whether there is a better way to model the data.

  2. How does one deal with this problem in an intuitive way? You measure the height and weight of cows. If you measure with high precision, you will find a high correlation, e.g., 0.8. If you have large measurement error, you will get a correlation close to zero. When your sample size is sufficiently large, you can show that there is a relation by binning, averaging, and showing a systematic, relevant relation between weight and height.

    Thus the question, dear statisticians: how should we report this without you moaning about it? (A toy version of this scenario is sketched just below.)
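
    Something like this, in Python, with all numbers invented for illustration: a genuinely strong height-weight relation is hidden behind large measurement error, so the raw correlation is tiny, while the correlation of 50 bin averages is large.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(0)                      # arbitrary seed
      n = 100_000
      true_h = rng.normal(1.4, 0.1, n)                    # metres
      true_w = 500 * true_h + rng.normal(0, 20, n)        # kg; the real relation is strong
      df = pd.DataFrame({
          "h": true_h + rng.normal(0, 0.5, n),            # very noisy height measurement
          "w": true_w + rng.normal(0, 300, n),            # very noisy weight measurement
      })

      print("raw r:   ", df["h"].corr(df["w"]))                          # roughly 0.03
      binned = df.groupby(pd.qcut(df["h"], 50), observed=True).mean()    # 50 quantile bins
      print("binned r:", binned["h"].corr(binned["w"]))                  # roughly 0.8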

    • Is this what was happening in this case? Was something measured with poor precision and the authors recovered the original association? Does this method only recover original associations that were diminished by measurement error, or does it sometimes inflate small associations when there is no measurement error? How can you tell when it’s doing what?

    • If the problem is measurement error, then one way to resolve it might be to take repeated measurements on each experimental unit. In your cow example you could, for example, measure the height and weight of each cow three times. If your measurement error is i.i.d. and unrelated to the underlying variable of interest, then it’d be a fairly simple matter to partition the measurement error from the error in the relationship (in this case, between the cow-level mean height and weight) with a multi-level model. (A rough sketch of this idea follows below.)
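
      Here is a back-of-the-envelope version of that idea in Python, using the classical correction for attenuation rather than a full multi-level model (all numbers invented): the within-cow variance across the three measurements estimates the measurement-error variance, which lets you estimate how strong the underlying relation really is.

        import numpy as np

        rng = np.random.default_rng(1)
        n_cows, n_rep = 20_000, 3
        true_h = rng.normal(1.4, 0.1, n_cows)
        true_w = 500 * true_h + rng.normal(0, 20, n_cows)          # true r(h, w) is about 0.93
        h = true_h[:, None] + rng.normal(0, 0.2, (n_cows, n_rep))  # three noisy height measures per cow
        w = true_w[:, None] + rng.normal(0, 150, (n_cows, n_rep))  # three noisy weight measures per cow

        def reliability(x):
            # share of a single measurement's variance that is real cow-to-cow variation
            err_var = x.var(axis=1, ddof=1).mean()                 # within-cow = measurement variance
            return 1 - err_var / x[:, 0].var(ddof=1)

        r_obs = np.corrcoef(h[:, 0], w[:, 0])[0, 1]                # attenuated; roughly 0.14 here
        r_est = r_obs / np.sqrt(reliability(h) * reliability(w))   # estimate of the underlying r
        print(r_obs, r_est)                                        # corrected value near 0.93 (and noisy)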

      • My understanding is that this is not done due to limitations in funding and time. Some of these genomics experiments can take weeks to perform once and cost tens of thousands of dollars. This then leads to the question, is it better to measure one thousand cows three times, or three thousand cows once? The latter option provides the opportunity to include other experimental conditions while the former may not.

        It seems that the general solution is to measure three thousand cows once in a number of conditions, and then follow up if results seem to look interesting. This does mean that there are greater risks for false conclusions, but it allows greater breadth of study in a field filled with exciting potential avenues of research.

    • You could fit a straight line to your data and report the slope and intercept with uncertainties. (Centre the data first by subtracting the x mean from the x values, or your uncertainties will be artificially inflated by having to extrapolate to the intercept.) Remember, regression is basically finding the mean y at each value of x, but without binning anything.

      The uncertainty in the slope should decrease with more data, and will reflect how well the extra data is compensating for the measurement error. As you say, the correlation value in this case reflects your measurement error more than anything else, and is really not giving you what you actually want.

      And the slope actually tells you something about the relationship, which correlation completely ignores.

      (Plotting the binned averages along with the fitted line would probably be a useful _visual_ tool for checking that the relationship is actually linear — which is really important! Just don’t do calculations using binned averages.) A sketch of this recipe follows below.
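
      Something like this, in Python, with illustrative numbers: centre x, fit a line, report the slope and intercept with uncertainties, and compute binned means only for plotting.

        import numpy as np

        rng = np.random.default_rng(2)
        n = 100_000
        x = rng.normal(50, 10, n)
        y = 0.3 * x + rng.normal(0, 20, n)           # a weak-looking but real linear relation

        xc = x - x.mean()                            # centre x before fitting
        coef, cov = np.polyfit(xc, y, 1, cov=True)
        slope_se, int_se = np.sqrt(np.diag(cov))
        print(f"slope: {coef[0]:.3f} +/- {slope_se:.3f}")         # about 0.300 +/- 0.006
        print(f"intercept at mean x: {coef[1]:.2f} +/- {int_se:.2f}")
        print("r, for comparison:", np.corrcoef(x, y)[0, 1])      # only about 0.15

        # binned means, for a visual linearity check only (overplot them on the fitted line)
        edges = np.quantile(x, np.linspace(0, 1, 21))
        ybar = [y[(x >= lo) & (x <= hi)].mean() for lo, hi in zip(edges[:-1], edges[1:])]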

    • Indeed, if you assume that any weakness of the weight/height (w/h) correlation is due only to measurement error, then by binning you will learn (approximately) the true correlation between height and weight. But you are then begging the question. You can’t learn the strength of the w/h correlation by first assuming its strength. You also can’t say that, for example, 90% of the variation in weight is explainable by variation in height, because your binning by height means that you have cancelled out most of the other relevant variation, such as by waist circumference (wc), by averaging together cows with very different wc (but with almost the same heights).

      So if you are binning the cows by height, you will similarly show falsely increased w/h correlations even if the low w/h correlation was due to a real (nonmeasurement) factor, such as widely varying wc. Your intercept and your slope will be very close to the same with binning, and your p-value may not change by much, but your r-squared will increase by a lot.

      You can show this readily in Excel (I just did) by generating normally distributed random data for height and wc, and weight exactly proportional to the height times the waist circumference squared. So there’s no measurement error, and no factors other than height and wc determining weight, but you get the same results (inflated r-squared) from binning. If you bin by height, you will exaggerate the w/h correlation, and if you bin by wc, you will exaggerate the w/wc correlation. This problem is no less severe when you bin 1,000,000 cows down to 10,000 height bins than when you bin 1,000 cows down to 10 height bins; in either case, you are averaging out essentially all of the differences in wc by averaging together the wc in groups of 100 cows, while your binning/averaging does not eliminate any significant part of the differences in height. (A quick re-creation of this exercise in code follows below.)
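
      A quick re-creation of that exercise in Python (the constants are invented): no measurement error, weight determined exactly by height and waist circumference, yet binning a million cows into 10,000 height bins still inflates the weight/height correlation while leaving the slope essentially unchanged.

        import numpy as np

        rng = np.random.default_rng(3)
        n = 1_000_000
        height = rng.normal(1.4, 0.1, n)
        wc = rng.normal(2.0, 0.4, n)                 # waist circumference
        weight = 180 * height * wc**2                # exact: no measurement error, no other factors

        print("raw r(w, h):   ", np.corrcoef(height, weight)[0, 1])    # weak, roughly 0.2

        # bin the million cows into 10,000 height bins of about 100 cows each
        edges = np.quantile(height, np.linspace(0, 1, 10_001))
        idx = np.clip(np.searchsorted(edges, height, side="right") - 1, 0, 9_999)
        hbar = np.bincount(idx, weights=height) / np.bincount(idx)
        wbar = np.bincount(idx, weights=weight) / np.bincount(idx)

        print("binned r(w, h):", np.corrcoef(hbar, wbar)[0, 1])        # jumps to roughly 0.9
        print("raw slope:     ", np.polyfit(height, weight, 1)[0])     # essentially the same ...
        print("binned slope:  ", np.polyfit(hbar, wbar, 1)[0])         # ... before and after binning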

    • I was curious about this “binning” procedure and went quickly through the paper. I work in bioinformatics and always try to go to great lengths to avoid falling into similar traps. There is no “I’m not a statistician” when you do statistics or multidisciplinary science. It is now my opinion that, unlike codons, these bins should not be used.

      Where the CUFS similarity measure is concerned, there are two measures they relate it to that provide the biological selling point of the article, and probably what landed it in Nature Comm. The first is that CUFS can be used as a functional measure (genes with a large codon usage frequency similarity will be annotated the same way, participate in the same biological pathways, have similar functions). This one is less outrageous, since sequence similarity is one of the best predictors we have of function and … CUFS is taken from sequence. But it’s still a major stretch, because codons are selected due to pressure on protein structure and regulation of expression. Long shot…

      The second measure is 3D “geographic” proximity in chromosome structure. From my initial perspective, I would think this is silly. Any residual minor association between those would carry over solely from the link with sequence similarity, and I would expect it to be weaker and dodgier than just doing the same analysis with raw sequence similarity. There are sequence motifs that might be used to predict proximity, due to the way that all the cellular machinery works. Codon usage … I doubt it. The onus was on them to show us convincingly that it works, and what they did is arbitrarily bin their new continuous measure and correlate it?! The adjustments for genome length between species make no sense at all (CUFS is calculated on a per-gene basis; if genes are larger you get a better estimate of similarity, so why adjust?), and they are just further fiddling for the sad cases where even that failed to produce the desired “p-values”.

      Biology- and stats-wise… bin the bins (and the article).

  3. Perhaps reporting the noun “correlation” without an accompanying adjective should be banned. I guess most of the time the default is the Pearson product moment but there are a host of other “correlations”: Spearman, Kendall, intraclass, Yule’s Q, and others about which I know even less.

  4. Is this a special case of ecological correlation or is it somehow different? Correlations among group averages are usually far stronger than for individual observations (at times, the correlations may have opposite signs). The groups are usually grouped by some other variable (e.g., demographic) but aren’t bins another sort of grouping? If so, then this practice of measuring correlations according to the bin averages is quite common. As but one (important) example, much of the Equality of Opportunity Project (http://www.equality-of-opportunity.org/index.php/data) analyses also show correlations between binned data. I have been wondering about their research along the same lines as the issues raised in this post – are the “strong” correlations merely the result of binning the data?

  5. For those like Andrew who don’t know the word “codon”: it refers to a sequence of three nucleotides (the building blocks of DNA/RNA) that encodes a single amino acid (a building block of a protein). So, yes Andrew, you’re using the stuff all the time! :-)

    I’m not primarily trained as a biologist, but I have worked with biologists a fair amount, and my take on the abstract of this paper is that they are looking at how genes encode proteins (the choice of codons, for example some amino acids are encoded by more than one codon) and they’re trying to argue that something about the coding sequence corresponds to the way in which the DNA actually bends and twists around itself to form a 3D structure, so that for example sometimes sequences which are far away from each other when written out in a string are actually close together in 3D space.

    As far as I can tell, by binning and averaging, their goal is really to do a regression of 3D distance on some score they call CUFS, and the “high” correlations they’re reporting really just mean that there is a definite trend (a correlation between the local average of Y and the local average of X). Another way to say this would have been to do a nonlinear fit using a basis expansion and note that the spline coefficients are statistically distinguishable from 0. (A rough sketch of that alternative is given below.)

    My impression is that this kind of binning and averaging is common when people have little modeling background and a ton of data.
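
    For what it’s worth, here is a minimal sketch of that basis-expansion alternative in Python, on synthetic stand-in data (the column names cufs/dist3d and the shape of the trend are invented; this is not the paper’s data): fit a spline regression on the raw points and test whether the spline coefficients are jointly zero, with no binning anywhere.

      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(4)
      n = 50_000
      cufs = rng.uniform(0, 1, n)                                   # stand-in for the CUFS score
      dist3d = 2.0 + 0.5 * np.sqrt(cufs) + rng.normal(0, 1.0, n)    # weak, slightly nonlinear trend
      df = pd.DataFrame({"cufs": cufs, "dist3d": dist3d})

      fit = smf.ols("dist3d ~ bs(cufs, df=5)", data=df).fit()       # B-spline basis via patsy
      # the overall F-test asks whether the spline coefficients are jointly zero,
      # i.e. whether there is any trend at all, without ever binning the data
      print(fit.fvalue, fit.f_pvalue)
      print(fit.rsquared)    # still small: the trend is real but explains little variance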

    • Daniel, I like your answer. Putting it more generally: how should one convey intuitively that two things (e.g., height and weight of trees) are related, and that the weight of trees might actually be largely determined by their height, if we can measure weight and height only with low precision?

      If we measure with high precision, we report a large correlation coefficient, e.g. 0.8, and a small p-value (and the type of test statistic used), and everyone believes it.

      If we measure with comparatively low precision, but have a large sample of trees, we will have a small correlation coefficient, e.g., 0.06, but, provided the sample is sufficiently large, a small p-value. (1) Reporting this (sample size, correlation coefficient, p-value), our fellow biologists will say: this is an unimportant finding, since the correlation coefficient is small, and p-values always get small provided the sample is sufficiently large. If we (2) bin and average and plot, we will be told that we should provide hard numbers rather than nice pictures. If we (3) bin and average and calculate correlation coefficients, we will again have large correlation coefficients, but our fellow statisticians will rightly find it fishy. If we (4) fit a linear model, we are asked to show the data rather than relying on a model.

      So how, in this case, should we convey the message that the height and weight of trees are related in an important way? I’m in favour of (1), (2), and (4) but got hefty push-back from some anonymous reviewer.

        • Thanks. To discuss the magnitude of the regression coefficients from the linear model, we used and referenced http://onlinelibrary.wiley.com/doi/10.1111/1467-985X.00164/abstract. The figure of the binned data had 95% CI around each point.

          The criticism we received focused on the small correlation coefficients.

          Maybe the only thing that remains to be done is to sample a second set of reviewers … and perhaps report larger correlation coefficients derived from averages of binned data :)

        • Why quote correlation coefficients at all?

          If you must, I’d just mention that they represent the raw measurement error and not the underlying relationship. (In fact I’d expect correlation values to be pretty much insensitive to sample size, as others have mentioned. If they change a lot with more data then it’s probably not linear!)

      • I think the right way to go if indeed you have low-precision measurements whose precision you understand, is with a measurement error model. Each measurement represents information about an unseen “true” value. Then, you run a nonlinear regression on unseen true values given your model for the relationship. In Stan pseudocode imagining these things as vectors, using a normal errors model (which you might choose to alter given real world info about measurement error):

        // (names as above; all distributions are illustrative placeholders)
        weightfun_params ~ some_distrib_of_params();  // prior on the parameters of the weight-vs-height curve
        trueheight ~ exponential(1/meanoverallheight);  // prior on each cow's unseen true height
        measuredheight ~ normal(trueheight, h_measure_precision);  // noisy height measurement
        trueweight ~ normal(weightfun(trueheight, weightfun_params), model_precision);  // real biological scatter around the curve
        measuredweight ~ normal(trueweight, w_measure_precision);  // noisy weight measurement

        then, a spaghetti plot of weightfun given the posterior distribution of weightfun_params overplotted with a scatterplot of trueheight vs trueweight

        What this does is de-emphasize the role of measurement error and let you visualize the uncertainty in the model. However, reviewers in many fields are likely to find this completely opaque, as it is a hard-core Bayesian model (that is, there’s no sense in which you can talk about the “true” values having any frequentist probability distributions, and there’s a good chance that if you could somehow observe them, the actual observed frequency distribution of modeling errors would be very different from normal with precision “model_precision”). None of those things are problematic when understood in a Bayesian way, but they would be very problematic for many people trained in “classical” statistics.

      • Short answer; I’ll try to write more later:

        The ‘correlation coefficient’ is conceptually completely different from the ‘effect size.’ Using your example, suppose I measure heights and weights of trees. I could calculate the slope of the height v. weight relationship (assuming a linear model, wrong for trees of course). The magnitude and uncertainty of that slope is a well-defined thing, and (w/ caveats about binning and weighting) is the *same* whether I bin the data or consider the cloud of points. It is (for example, in some units) 8 +/- 3, and so I say confidently that it’s nonzero. This effect has some size, with some uncertainty.

        The correlation is sensitive to the scatter — i.e. how well does that line reflect the variance in the data. It could be tightly scattered around a line of slope 8, or widely distributed around a line of slope 8. In both cases, the slope is 8. The latter will have a poor correlation coefficient — knowing the weight of the tree doesn’t enable a precise prediction of the height of any particular tree.

        Binning and reporting the correlation coefficient of the binned points makes no sense. Of course there is less scatter of the binned points around the regression — that’s what binning does. The “high correlation” doesn’t actually correspond to any great predictive power of the relationship between height and weight for individual trees, and trees don’t naturally come in bins.

        About (1): Why would one want to report “sample size, correlation coefficient, p-value” instead of the actual slope (or whatever effect-size measure)? Isn’t that effect size what one actually cares about? If you add that to the list: good.

        About (4): This sounds great, but why would you show the model without the data?

        About (2): In some things, I’ve shown all points (e.g. in gray), binned points (which are visually informative), and a fit.

        Maybe this wasn’t so short! Hopefully it’s not too terse, and is helpful — I worry I’m just stating the obvious and not actually addressing your concern.

      • “how should one convey intuitively that two things (e.g., height and weight of trees) are related and that the weight of trees might actually largely be determined by their height, if we can measure weight and height with low precision only.”

        It seems the underlying model is that the data are determined by three mechanisms — the effect of interest, other effects of similar nature to the effect of interest, and measurement error. Is this correct?

        Using these data, to be able to say that the weight of trees is largely determined by their height you would need to be able to separate out the measurement error from the other effects, and I can’t see how you do this. A measurement error model is great if you can characterize the error, but is there prior knowledge of the magnitude of the error? This would be a strong motivation to collect multiple observations per unit. If it is reasonable to assume the magnitude of measurement error is independent of everything else, you would only need two observations per unit (I think). Otherwise I am not sure how you make the statement that the effect is big… but for the measurement error.

        I find the binned correlation coefficients troubling as an estimator — presumably with more data you would have more per bin, therefore for a given underlying correlation in the unbinned data, the estimator is a function of sample size, which would always asymptote to 1.0 or -1.0 as the sample size gets massive (if we assume there are no true zero effects). (A quick check of this is sketched below.)
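
        A quick check of that asymptote in Python (the setup is invented): hold the true correlation at 0.1 and the number of bins at 50, and let the sample size grow.

          import numpy as np

          rng = np.random.default_rng(5)

          def binned_r(n, n_bins=50, true_r=0.1):
              x = rng.normal(size=n)
              y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
              order = np.argsort(x)                                        # equal-count bins by x
              xbar = [c.mean() for c in np.array_split(x[order], n_bins)]
              ybar = [c.mean() for c in np.array_split(y[order], n_bins)]
              return np.corrcoef(xbar, ybar)[0, 1]

          for n in (5_000, 50_000, 500_000):
              print(n, round(binned_r(n), 3))   # creeps toward 1 as n grows; the true r stays 0.1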

  6. So the correlation is from 3D distance maps generated using this Hi-C procedure and their codon usage frequency similarity (CUFS) metric:

    “Briefly, the traditional Hi-C assay consists of six steps: (1) crosslinking cells with formaldehyde, (2) digesting the DNA with a restriction enzyme that leaves sticky ends, (3) filling in the sticky ends and marking them with biotin, (4) ligating the crosslinked fragments, (5) shearing the resulting DNA and pulling down the fragments with biotin, and (6) sequencing the pulled down fragments using paired-end reads. This procedure produces a genome-wide sequencing library that provides a proxy for measuring the three-dimensional distances among all possible locus pairs in the genome.”
    https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0745-7

    “Briefly, given a pair of open reading frames (ORFs) the CUFS returns a distance estimation that is related to the codon content and distribution in the two genes: the more similar genes are in terms of the frequency of their codons (and amino acids), the shorter the distance between them.”
    http://www.nature.com/ncomms/2014/141216/ncomms6876/full/ncomms6876.html

    This sounds dangerous to me: if the DNA sequence affects the Hi-C procedure at all, then this paper amounts to discovering that DNA sequence is correlated with DNA sequence. A quick search found this comment in a thesis:

    “Hi-C experiments are further confounded by coverage biases, where some parts of the genome are under- or overrepresented in the sequencing reads (Yaffe and Tanay, 2011). Some of these biases are shared with other high-throughput sequencing assays, such as ChIP-Seq and RNA-Seq, while some are unique to genome conformation-type assays.

    Mappability (Koehler et al., 2010) represents the uniqueness of the genome sequence, where low-mappability (repetitive) regions will register fewer unique sequencing reads compared to high-mappability regions. GC-content primarily affects the Polymerase Chain Reaction (PCR) amplification step in the Illumina sequencing process, resulting in a different amplification efficiency for GC-rich and GC-poor sequences (Benjamini and Speed, 2012).

    Restriction fragment length, unique to experiments using restriction enzymes, affects the resulting sequences in multiple ways. Regions with a high density of restriction fragments tend to be overrepresented in the library, as there are more possible fragment ends to ligate to. Moreover, differently sized fragments have different propensity of forming ligation products with other fragments with longer fragments appearing more frequently in true ligation events compared to shorter ones.”
    https://www.ebi.ac.uk/sites/ebi.ac.uk/files/shared/documents/phdtheses/RobertThesis_2014-12-12_v03_CORRECTED.pdf

    I would also think it is a good idea to inspect the code used to process these data. I bet there are various places where decisions are made depending on the input sequence(s), which will also introduce a correlation.

  7. The comment about correlation coefficients going down as sample size goes up — this is GOLD. Seems like unwitting code for ‘we chase noise’.

  8. I had a lawyer once tell me to “Report the correlation coefficient, not the R^2… After all, it’s bigger!” I also had a lawyer make me change a regression I’d formed using ln(A/B) as the dependent variable into a regression with ln(A) as the dependent variable and ln(B) on the right-hand side (even though ln(B) had a coefficient of -.98 with se .06 and A/B was really the variable of interest), because in my formulation R^2 was 0.07 and in his it was 0.94. I hate both correlation coefficients and their squares, since they are really only there to substitute for the English word “related,” which is much clearer in its ambiguity.

  9. A friend of mine just came to me with a stats question about a paper he was reviewing (in the atmospheric sciences). The authors basically did the same thing as here. My friend is not a statistician, but he felt uneasy about the authors’ data-binning strategy and wanted to confer with someone who knew something about statistics. He came to my office, drew a plot on my board, and then plotted a bunch of points around x=2. It was immediately apparent that the authors had just used the extreme observations, those that were “significantly” different from zero (i.e., 2 sd’s from 0), and run correlations only on those data.

    I believe he left my office debating between: “reject”, or “interesting paper idea, but redo the entire thing”, or “you need to re-write this and report the results from the other correlations you ran, the ones with all the data that didn’t have the p-values you liked. I imagine the conclusions will change considerably.”

  10. I haven’t read this paper, but in general, when treated properly, I have seen many cases where binning data has an insignificant negative impact. For certain time-series problems I have found that the difference between binning and using a model with full autocorrelations built in was insignificant; of course, for certain parameters this may not be the case.

    The worst R^2 inflation technique I have seen is writing a model in the form Y(T)=Y(T-1)+…, where Y is something like the log of the total volume of bank deposits. (A small demonstration of this follows below.)

    Nevertheless, I still get hoarse trying to remind colleagues, auditors, and regulators that R^2 does not measure the quality of the model.
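
    A small demonstration of that lagged-level trick in Python (the series is made up): regress the level of a drifting random-walk “log deposits” series on its own lag and the R^2 is essentially 1, even though the model adds nothing; in differences there is nothing there.

      import numpy as np

      rng = np.random.default_rng(6)
      T = 2_000
      log_deposits = np.cumsum(rng.normal(0.01, 0.05, T))    # a random walk with drift

      y, ylag = log_deposits[1:], log_deposits[:-1]
      slope, intercept = np.polyfit(ylag, y, 1)
      resid = y - (slope * ylag + intercept)
      print("R^2, level on lagged level:", 1 - resid.var() / y.var())   # essentially 1

      dy = np.diff(log_deposits)                              # the same data in differences
      print("R^2, differences on lagged differences:",
            np.corrcoef(dy[1:], dy[:-1])[0, 1] ** 2)                    # essentially 0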
