Forking paths come from choices in data processing and also from choices in analysis

Michael Wiebe writes:

I’m a PhD student in economics at UBC. I’m trying to get a good understanding of the garden of forking paths, and I have some questions about your paper with Eric Loken.

You describe the garden of forking paths as “researcher degrees of freedom without fishing” (#3), where the researcher only performs one test. However, in your example of partisan differences in math skills, you discuss the multiple potential comparisons that could be made: an effect for men and not women, an effect for women and not men, a significant difference, etc. I would describe this as multiple testing: the researcher is running many regressions, and reporting the significant ones. Am I misunderstanding?

The case where the researcher only performs one test is when the degrees of freedom come only from data processing. For example, the researcher only tests for a significant difference between men and women, but because they have flexibility in measuring partisanship, classifying independents, etc, they can still run multiple versions of the same test and find significance that way.

So we can classify researcher degrees of freedom as coming from (1) multiple potential comparisons and (2) flexibility in data processing. In the extreme case, the degrees of freedom come only from (2), and the researcher only performs one test. But that doesn’t seem to be how you use the term “garden of forking paths” in practice.

My reply:

– You point to an example of multiple potential comparisons and write that you “would describe this as multiple testing: the researcher is running many regressions, and reporting the significant ones.” I’d say it’s multiple potential testing: the researcher might perform one analysis, but he or she gets to choose which analysis to do, based on the data. For example, the researcher notices a striking pattern among men but not women, and so performs that comparison, computes the significance level, etc. Later on, someone else points to the other comparisons that could’ve been done, and the original researcher replies, “No, I only did one comparison, so I couldn’t’ve been p-hacking.” Loken and I would reply that, as long as the analysis is affected by the data that were seen, there’s a multiple potential comparisons problem, even if only one particular comparison was done on the particular data at hand.

– You distinguish between choices in the data analysis and choices in the data processing. I don’t see these as being much different; either way, you have researcher degrees of freedom, and both sets of choices give you forking paths.

– Finally, let me emphasize that my preferred solution is not to perform just one, preregistered, comparison, nor is it to take the most extreme comparison and then perform a multiplicity correction. Rather, I recommend analyzing and presenting the grid of all relevant comparisons, ideally combining them in a multilevel model.
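To make that recommendation concrete, here is a minimal sketch of "present the whole grid, partially pooled." The six cell estimates and the standard error are made-up numbers, and the simple empirical-Bayes normal-normal shrinkage here is only a stand-in for a full multilevel fit:

```python
# Hypothetical sketch: six cell estimates (3 dosages x 2 sexes, made-up
# numbers) are shrunk toward their grand mean with a simple empirical-Bayes
# normal-normal model, a stand-in for a full multilevel fit.
import itertools
import statistics

cells = {  # made-up effect estimates, all with standard error 0.3
    ("low", "F"): 0.8, ("low", "M"): -0.3, ("mid", "F"): 0.5,
    ("mid", "M"): 0.1, ("high", "F"): 1.4, ("high", "M"): 0.2,
}
se = 0.3
grand = statistics.mean(cells.values())
# Method-of-moments estimate of the between-cell variance tau^2.
tau2 = max(statistics.pvariance(list(cells.values())) - se**2, 0.0)
shrink = tau2 / (tau2 + se**2)  # 0 => complete pooling, 1 => no pooling
pooled = {k: grand + shrink * (v - grand) for k, v in cells.items()}

# Report the whole grid of 15 pairwise comparisons, not just the flashiest.
for a, b in itertools.combinations(sorted(pooled), 2):
    print(a, "vs", b, round(pooled[a] - pooled[b], 2))
```

The point is that every comparison gets reported, and each one is pulled toward the grand mean by the same data-determined amount, rather than one noisy extreme being singled out.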

19 Comments

  1. Nick Patterson says:

    I’ve worked in Bayesian stats a long time (maybe longer than anyone still active — started in 1972(!))
    but I think Andrew is too rigid here. Data can produce surprises! As a memorable example, I was working
    on a cancer data set (classify tumors). Inspection of the data showed a very strong effect for the
    day of week the biopsy was done. This would never have been preregistered, and if no effect of day of week
    was apparent would never have appeared in a published analysis. Data can be surprising and in real life
    analysis is often steered by inspection of the data. I think honest analysts have a very good feel if this
    is seriously skewing results — and unfortunately there are less than honest analysts, and a good deal of incompetence.

    Nick

    • Andrew says:

      Nick:

      I think we’re in agreement. Look at the last paragraph in my above post.

    • Anoneuoid says:

      Inspection of the data showed a very strong effect for the day of week the biopsy was done. This would never have been preregistered, and if no effect of day of week was apparent would never have appeared in a published analysis.

      This indicates a serious problem with the research culture more than anything. Everything in biology is dynamic and cyclical. The first thing to look at is how a phenomenon changes over time, part of that being by time of day, day of week, day of month, day of year…. I remember an old Deming paper saying the same in general:

      It is important to remember that the mean, the variance, the standard error, likelihood, and many other functions of a set of numbers, are symmetric. Interchange of any two observations x_i and x_j leaves unchanged the mean, the variance, and even the distribution itself. Obviously, then, use of variance and elaborate methods of estimation buries the information contained in the order of appearance in the original data, and must therefore be presumed inefficient until cleared.
      […]
      If a statistician in practice were to make a statement like the one quoted, he would lose his job summarily, or ought to.

      To extract the information from so costly an experiment, I should wish to have in hand the original data, to be able to plot the time of burning of every chute tested, in the order of test.

      • Nick says:

        Not really; it’s extremely unlikely (and I think false) that tumor genetics would change
        radically by day of week. Here the problem was that the lab work for a tumor collected on
        a Friday was done on the following Monday but for Mon-Thu tumor was processed next day.
        Extra time in the freezer mattered.
        I’m not smart enough to guess that. But the real point is that data may surprise you.
        So preregistration may not always work. After looking at the data there may be things to do
        that you didn’t expect. Note that in self defense before you fire me I did look for a day of week
        effect and spotted the problem. But I didn’t actually expect to see one. Many many checks are
        never reported in practice, or papers would be absurdly long, boring and unreadable.

  2. Willis says:

    ☺ …never liked that garden of forking paths term; it seems a meaningless catchphrase to most people, unless explained in detail.
    Kinda prefer the ancient India parable of Blind Men & Elephant that most people are familiar with… and is still popular in some academic philosophy courses.

    That parable describes a small group of journeying blind men who encounter an elephant, having never come across one before. Each blind man then touches a different part of the elephant's body, but only one part. Each then conceptualizes & describes the "elephant" based on his partial experience — but their descriptions widely disagree and heated arguments follow.

    Moral of the parable is that humans tend to project their own partial experiences as the whole truth, ignoring other information and interpretations; one should thus be wary of drawing broad conclusions from partial and often subjective information methods.

    • Joe Hoover says:

      I think I have to disagree. The garden of forking paths metaphor highlights the fact that an analysis is constituted by a sequence of subjective decision points that are conditionally dependent on all sorts of information. The path you choose determines the reality you observe. Thus, it is irrational to afford some sort of epistemological superiority to the one path you happened to take and the consequent reality you observed.

      And, the metaphor ties very nicely to Andrew’s suggested alternative. In Borges story, the garden of forking paths, or rather the novel described by the phrase, attempts to describe a reality in which all possible (chains of) events occur simultaneously and thus lead to all possible outcomes.

      I agree that the blind men and their elephant are relevant to analysis and that their parable overlaps some with the GFP; but, their parable has rather different implications and lacks the richness of the GFP, IMO.

      • Kyle C says:

        To me, as a layperson, the garden of forking paths metaphor is completely intelligible and helpful, whereas the blind men and the elephant is a cliche, like the forest and the trees, that adds no value.

        • Martha (Smith) says:

          The blind men and elephant metaphor and the garden of forking paths metaphor each have their place, but each has a different “moral of the story”:

          BMAE makes the point that different people may perceive different aspects of a situation, and so each focuses just on what they perceive.

          GOFP, on the other hand, focuses on the possibility that one person may become aware of different aspects of a situation, and may use that “big picture” to pick out just one aspect to focus on (e.g., picking out the factor that appears to give the greatest effect).

          I agree with Andrew’s last paragraph as the best approach to addressing the GOFP problem.

          The best approach to the BMAE problem is for researchers to communicate with each other to help all perceive the “big picture,” rather than each focusing on (metaphorically) a one-dimensional projection of the multidimensional “big picture”.

    • There is always a struggle to keep the fallacies of division & composition straight in any research query. I foresee a necessary evolution in measurement and, if possible, a move beyond the Fisherian/Neyman-Pearson controversies. I suppose we are trying to, in some ways.

  3. Thanatos Savehn says:

    If "the analysis is affected by the data that were seen, there's a multiple potential comparisons problem."

    That’s my epiphany for the day. Many thanks. I see a bit further.

    • Martha (Smith) says:

      An example I like to give:

      A group of researchers plans to compare three dosages of a drug in a clinical trial.
      There’s no pre-planned intent to compare effects broken
      down by sex, but the sex of the subjects is routinely recorded.
      Looking at the data broken down by combination of sex and dosage, the researchers notice that the results for women in the high dosage group look much better than the results for the
      men in the low dosage group, and decide to perform a
      hypothesis test to check that out.

      They have informally made fifteen comparisons:
      There are 3×2 = 6 dosage-by-sex combinations, and hence
      (6×5)/2 = 15 pairs of dosage-by-sex combinations
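      A quick simulation makes the cost of those 15 informal comparisons visible. This is a sketch of the scenario above under a pure-null model (no true effects anywhere, made-up sample sizes), using a normal approximation to the two-sample t-test to stay dependency-free:

      ```python
      # Simulate 6 dosage-by-sex cells with NO true differences, run all 15
      # pairwise tests, and count how often the most extreme pair clears
      # p < 0.05. Sample sizes are made up; the t-test p-value uses a normal
      # approximation (adequate at n=20 per cell).
      import itertools
      import math
      import random

      random.seed(1)

      def t_test_p(x, y):
          """Two-sample Welch-style p-value via a normal approximation."""
          nx, ny = len(x), len(y)
          mx, my = sum(x) / nx, sum(y) / ny
          vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
          vy = sum((v - my) ** 2 for v in y) / (ny - 1)
          t = (mx - my) / math.sqrt(vx / nx + vy / ny)
          return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

      n_sims, hits = 2000, 0
      for _ in range(n_sims):
          groups = [[random.gauss(0, 1) for _ in range(20)] for _ in range(6)]
          pvals = [t_test_p(a, b) for a, b in itertools.combinations(groups, 2)]
          hits += min(pvals) < 0.05

      print(len(list(itertools.combinations(range(6), 2))))  # 15 comparisons
      print(hits / n_sims)  # familywise error rate, far above 0.05
      ```

      Even though the researchers formally ran one test, the chance that *some* pair looks "significant" under the null is several times the nominal 5%.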

      • Ben Prytherch says:

        This is a great example; thanks for sharing.

      • That’s fine, though, as long as you note that you then need a p-value that is 15 times stronger to be as statistically significant as p < .05 is when there is only one way to look at the numbers.

        So p < .0034 is the new p < .05
        (0.9966 ^ 15 = 0.95)

        For studies that are not preregistered (with enough detail to eliminate all forks in the path), publications should require such significance (in this case, p < .0034 rather than p < .05). Such a rule would of course encourage preregistration, while keeping some flexibility for researchers to note unexpected findings.
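        For the record, the adjustment being described is the Šidák correction; a couple of lines verify the arithmetic above:

        ```python
        # Sidak correction: choose a per-test threshold alpha_star so that 15
        # independent tests at alpha_star give a 5% familywise error rate.
        m = 15
        alpha_family = 0.05
        alpha_star = 1 - (1 - alpha_family) ** (1 / m)
        print(round(alpha_star, 4))  # 0.0034, matching p < .0034 above
        print(round((1 - alpha_star) ** m, 4))  # 0.95, i.e. 0.9966^15
        ```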

        • Martha (Smith) says:

          David: See the last paragraph in Andrew’s post for what I have become convinced is a better approach:

          "Finally, let me emphasize that my preferred solution is not to perform just one, preregistered, comparison, nor is it to take the most extreme comparison and then perform a multiplicity correction. Rather, I recommend analyzing and presenting the grid [of] all relevant comparisons, ideally combining them in a multilevel model."

          As he has pointed out elsewhere, using an adjusted p-value for multiple comparisons requires using an adjusted formula for confidence intervals, which lengthens them compared to what is obtained in a multilevel or other “regularized” inference method.

          • Andrew says:

            Martha:

            Yes. And, also, selecting the most extreme pattern in the data is a recipe for noise mining. There's nothing special about a pattern that's p=0.003, compared to some other pattern that's p=0.3. The difference between "significant" and "not significant" is not itself statistically significant, etc.

          • Keith O'Rourke says:

            I recall being a bit surprised about how much more sensible multilevel intervals were compared to multiplicity adjusted intervals – with centers being stuck at their noisy estimate they have to be widened quite a bit to maintain nominal coverage.
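            A toy calculation illustrates Keith's point, under the simplest possible assumptions (known standard error, a made-up between-comparison sd, normal-normal shrinkage standing in for a full multilevel model): the Šidák-adjusted interval keeps the noisy point estimate and widens, while the partially pooled interval shrinks the center and narrows.

            ```python
            # Sidak-adjusted vs. partial-pooling interval half-widths for one of
            # m=15 comparison estimates. se and tau are made-up illustrative
            # numbers; the pooling model is normal-normal with known variances.
            from statistics import NormalDist

            se = 1.0        # standard error of each comparison estimate
            tau = 0.5       # assumed between-comparison sd (multilevel ingredient)
            m, alpha = 15, 0.05

            # Sidak: same noisy center, wider interval.
            alpha_star = 1 - (1 - alpha) ** (1 / m)
            z_sidak = NormalDist().inv_cdf(1 - alpha_star / 2)
            half_sidak = z_sidak * se

            # Partial pooling: posterior sd = sqrt(se^2 * tau^2 / (se^2 + tau^2)).
            post_sd = (se**2 * tau**2 / (se**2 + tau**2)) ** 0.5
            z = NormalDist().inv_cdf(1 - alpha / 2)
            half_pooled = z * post_sd

            print(round(half_sidak, 2), round(half_pooled, 2))
            ```

            The widened interval has to cover the noisy center wherever it landed; the pooled interval gets to move the center first, which is why it can be so much shorter while maintaining coverage.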

  4. Clay Ford says:

    I recommend analyzing and presenting the grid [of] all relevant comparisons, ideally combining them in a multilevel model.

    Can anyone recommend some articles where this is done?
