Plausibility vs. probability, prior distributions, and the garden of forking paths

Posted on January 4, 2016 10:41 AM by Andrew

I’ll start off this blog on the first work day of the new year with an important post connecting some ideas we’ve been talking about a lot lately.

Someone rolls a die four times, and he tells you he got the numbers 1, 4, 3, 6. Is this a plausible outcome? Sure. Is it probable? No. The prior probability of this event is only 1 in 1296.

The point is, lots and lots of things are plausible, but they can’t all be probable, cos total probability sums to 1.
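
To make the arithmetic concrete, here is a minimal sketch in Python that enumerates every ordered sequence of four rolls. It assumes a fair six-sided die and independent rolls; the variable names are illustrative only. Each sequence, including 1, 4, 3, 6, gets the same small probability, and those probabilities add up to exactly 1.

```python
from fractions import Fraction
from itertools import product

# Assumption: a fair six-sided die, with independent rolls.
FACES = range(1, 7)

# Every ordered sequence of four rolls is equally likely.
sequences = list(product(FACES, repeat=4))
prob_each = Fraction(1, len(sequences))

# The reported outcome is just one of those sequences.
reported = (1, 4, 3, 6)
prob_reported = prob_each if reported in sequences else Fraction(0)

# Total probability over all sequences must be exactly 1.
total = prob_each * len(sequences)

print(len(sequences), prob_reported, total)
```

Every one of the 1296 sequences is plausible in the everyday sense, yet none of them can carry more than 1/1296 of the total.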

I was thinking about this when responding to a comment about a post on a recent bit of noise from Psychological Science. Somehow we ended up discussing those classic and notorious “embodied cognition” experiments, and the exchange went like this:

Me: These sorts of studies get published, and publicized, in part because of their surprise value. (Elderly-related words can prime slow walking! Wow, what a surprise!) But when a claim is surprising, that indeed can imply that a reasonable prior will give that claim a low probability.

Daniel: With findings about priming and embodied cognition, it doesn’t seem particularly outrageous to me that a response to physical instability could easily influence our perceptions of other things in the moment as well, including romantic relationships. Part of the reason we use the peer-review process, as flawed as it is, is because experts in the field have the judgment to decide not only if a study is worthy, but if its claims are reasonably supported in the context of the theories and findings in a particular field. The hypotheses in this paper are supported by a vast amount of work on priming and embodied cognition research . . .

Martha: I was under the impression that most of the findings about priming and embodied cognition had been discredited (e.g., failure to replicate, flaws in analysis).

Daniel: Priming itself is an extremely well-replicated phenomenon. With social primes (a subset of priming research), there have been a large number of studies demonstrating the effect with a few notable failures to replicate (Doyen et al. being one of the most notable due to the online discussion that ensued). Many of the replications haven’t taken into account advances in moderators that flip around the behavioural effects of social priming research. . . .

Me: Regarding plausibility, one might equally argue, for example, that being primed with elderly-related words will make college students walk faster, as this would prime their self-identities as young people. Whatever. Theories are a dime a dozen. These papers get published because they purport to have definitive evidence, and they don’t.

Daniel: Regarding priming, I’m assuming you’re referring to Bargh’s paper and the failed replication attempt by Doyen et al. One of the most frustrating things about that exchange is that the field had already moved on to a deeper understanding of social category priming, with the Cesario, Higgins and Plaks 2006 “motivated preparation to interact account”. They found that people who like the elderly walk slower when primed, while people who dislike the elderly walk faster (among other findings). I don’t really care whether Bargh got lucky with his sample or what was going on, but a replication attempt should take into account advances in the field, particularly with a moderator that can flip the effect around.

Me: I see no strong evidence. What I see is noise-mining. Each new study finds a new interaction, meanwhile the original claimed effects disappear upon closer examination.

We could go on forever on the specifics, but here let me make a combinatorial argument. It goes like this. There could be a main effect of elderly-related words on walking speed. As discussed above, I think a main effect in either direction would be plausible. Or there could be an interaction with whether you like the elderly. Again, I can see either a positive or negative interaction making sense. There could be an interaction with sex, or socioeconomic status, or whether you have an older brother or sister. Or an interaction with your political attitudes, or with your own age, or with the age of your parents, or with whether your parents are alive, or with whether your grandparents are alive. Or an interaction with your marital status, or with your relationship status, or with whether you have kids, or with the sex of your first child. It would not be difficult to come up with 1296 of these—maybe even 1296 possibilities, each of which has appeared somewhere as a moderator in the psychology literature.

And here’s the point: any of these interactions are plausible, but they can’t all be probable. It’s not as simple as the die-rolling scenario at the top of this post—the different possible interactions are not quite mutually exclusive—but the basic mathematical idea is still there, that it’s not possible for all these large effects to be happening together.

OK, embodied cognition is a joke. No need for me to pile on it here. It’s a fading fad that eventually even Psychological Science, PPNAS, and those Ted-talk people will tire of.

Here I’m interested in something larger, which is how to think about prior distributions in this sort of open-ended research scenario.

One reason I like the idea of analyzing all comparisons at once using multilevel modeling is that this will automatically stabilize estimates. A multilevel model is a mathematical expression of the idea that these interactions can’t all be large, indeed most of them have to be small.
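
Here is a rough simulation of that stabilizing effect. It is only a sketch: it swaps in a simple normal-normal empirical-Bayes shrinkage rule for a full multilevel fit, and the number of effects, the true effect spread, and the standard error are all hypothetical values chosen for illustration.

```python
import random
import statistics

def simulate(n_effects=1296, true_sd=0.05, se=0.5, seed=1):
    """Simulate many small true effects, each estimated with noise.

    All numbers are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, true_sd) for _ in range(n_effects)]
    raw = [t + rng.gauss(0.0, se) for t in truth]
    return truth, raw

def partial_pool(raw, se):
    """Shrink each raw estimate toward the overall mean."""
    mean = statistics.fmean(raw)
    # Method-of-moments estimate of the variance among true effects.
    tau2 = max(0.0, statistics.pvariance(raw) - se ** 2)
    weight = tau2 / (tau2 + se ** 2)  # 0 = complete pooling, 1 = no pooling
    return [mean + weight * (y - mean) for y in raw]

def rmse(estimates, truth):
    """Root mean squared error of the estimates against the true effects."""
    return (sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)) ** 0.5

truth, raw = simulate()
pooled = partial_pool(raw, se=0.5)
print("largest raw estimate:   ", round(max(abs(y) for y in raw), 2))
print("largest pooled estimate:", round(max(abs(y) for y in pooled), 2))
print("RMSE raw / pooled:      ", round(rmse(raw, truth), 3), round(rmse(pooled, truth), 3))
```

Under these assumptions the raw estimates contain many apparently large effects that are pure noise, while the pooled estimates are pulled toward zero and sit much closer to the truth on average.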

This is not to say that no large effects can ever exist—remember, I did say that these hypotheses are individually plausible. They’re just individually improbable, which is why we need strong evidence to move forward on them. Somebody doing a study where they found an interaction with “p less than .05,” no, that’s not strong evidence. That’s where forking paths comes in. Forking paths comes into the p-value calculation, and forking paths comes into the prior. If you want to go full Bayes, that’s fine with me, then you don’t have to worry about other analyses the researcher might have done, you just have to worry about other models of the world that are just as plausible as your current favorite (for example, “people who like the elderly walk slower when primed, while people who dislike the elderly walk faster”).
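
As a crude illustration of why a single “p less than .05” is weak evidence here, the sketch below computes the chance that at least one of several comparisons crosses the threshold when every true effect is zero. It assumes the comparisons are independent, which real forking paths are not, and the function name is made up for this example. Note too that forking paths can inflate error rates even when only one analysis is actually run, so this calculation understates the issue in one sense and oversimplifies it in another.

```python
def prob_at_least_one_hit(n_comparisons, alpha=0.05):
    """Chance that at least one of n independent null comparisons has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_comparisons

# Hypothetical numbers of available comparisons.
for k in (1, 5, 20, 100):
    print(k, round(prob_at_least_one_hit(k), 3))
```

With 20 candidate comparisons, the chance of at least one “significant” result from pure noise is already about 64%.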

P.S. Commenter Garnett asked why I say that all these large effects can’t be happening together. So here’s a slight elaboration of this point:

The basic idea is that you can’t have dozens of large effects and interactions all floating around at once, as this would imply an unrealistic distribution of outcomes. Consider that notorious example of menstrual cycle and clothing, which then was said to interact with temperature and could also interact with age, relationship status, number of siblings, political and religious attitudes, etc etc etc. If many of these interactions were large, you’d end up with silly claims such as that 50% of women in some category would be wearing red on some prespecified day in their cycle.

In a predictive model, there’s only so much “juice” to go around. We’re used to thinking of this in the context of R-squared or out-of-sample prediction error, but the same principle applies when considering comparisons or effects. And, thus, once you’re considering zillions of possible effects and interactions, their common distribution has to be peaked around zero.
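
One way to sketch the “juice” argument, under simplifying assumptions that are not part of the original post: suppose the outcome y is an additive function of K mutually uncorrelated contrasts z_k, each coded as plus or minus 1/2 with equal probability, so that each coefficient is the difference in expected outcome between the two levels of that contrast, and suppose the error term is uncorrelated with all the contrasts.

```latex
\begin{align*}
y &= \mu + \sum_{k=1}^{K} \beta_k z_k + \varepsilon,
  \qquad \operatorname{Var}(z_k) = \tfrac{1}{4}, \\
\operatorname{Var}(y) &= \tfrac{1}{4} \sum_{k=1}^{K} \beta_k^2 + \operatorname{Var}(\varepsilon)
  \;\ge\; \tfrac{1}{4} \sum_{k=1}^{K} \beta_k^2, \\
\frac{1}{K} \sum_{k=1}^{K} \beta_k^2 &\le \frac{4 \operatorname{Var}(y)}{K}.
\end{align*}
```

For a binary outcome the variance is at most 1/4, so under these assumptions the root-mean-square effect is at most 1 over the square root of K. With K = 1296 that is 1/36, under 3 percentage points, which is one way to see why the common distribution of effects has to be concentrated near zero.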

I think there’s a theorem in there for someone who’d like to do some digging.

96 thoughts on “Plausibility vs. probability, prior distributions, and the garden of forking paths”

Daniel Lakeland on January 4, 2016 11:42 AM at 11:42 am said:

Andrew, the nomenclature you’re choosing here is potentially a little confusing. Cox/Jaynes axiomatization basically says “if you want a measure of plausibility, and you want it to have xyz properties, then it’s isomorphic to probability”, so in that context plausibility = probability.

We’ve had a variety of discussions here with Alexandre Patriota in which he’s advocated something else called “possibility” theory. His thingy actually sounds a lot like what you’re calling “plausibility”. In particular, I think his “possibility” isn’t an additive measure and can measure sets that are infinite without decaying possibility.

That is, you can give all your different scenarios. It’s totally possible (conceivable, not ruled out, has to be given some level of attention) that one or a few of the ideas is actually correct. But it’s not plausible (which is the same as probable in Cox/Jaynes Bayesianism) that they all could simultaneously be happening, nor is any individual one plausible=probable a-priori.

Normally I would think this level of pickiness is pure pedantry, but in this case it’s unfortunate that there’s a whole existing literature on plausibility measures being isomorphic to probability theory that it seems a little confusing for you to choose to create a distinction between these two words.

- Martha on January 4, 2016 11:47 AM at 11:47 am said:
  
  I would agree with “plausibility = probability,” but I would not agree with “plausible = probable” (even though I would say that “probability expresses the degree of plausibility”). It’s subtle.
  
  - Daniel Lakeland on January 4, 2016 12:08 PM at 12:08 pm said:
    
    Well, I guess the point is that there are common usages of words like “plausible” and “probable” and then there are technical usages, and it’s a little confusing to somehow make a distinction here but not explain the distinction.
    
    I think what you’re saying is that common usage of the terms can be mapped into probability statements that are something like:
    
    “x is probable” means something like “p(x) is large”, whereas “x is plausible” means something like “p(x) > \epsilon” for some small but not too small epsilon, and “x is improbable” is the same as “x is implausible” and means “p(x) < \epsilon”.
    
    and while different people might have different senses of where they put the \epsilon, at least the order of magnitude is somewhat similar from one person to another. (ie. log_10(\epsilon) typically between around -1 to -3)
    
    - Rahul on January 4, 2016 12:12 PM at 12:12 pm said:
      
      Statistics seems to love creating confusion by defining a technical usage in a non-intuitive sense & conflicting with common usage.
      
      e.g. “significant”
    - Martha on January 4, 2016 12:15 PM at 12:15 pm said:
      
      “intuitive” and “non-intuitive” can vary from person to person; so can common usage.
      
      It’s not that statistics loves creating confusion, it’s that finding good names for technical terms is inherently difficult.
    - Rahul on January 4, 2016 12:27 PM at 12:27 pm said:
      
      Agreed.
      
      One pitfall to avoid is re-using a relatively common English word like “significant”. Remapping readers’ minds from the lay usage to the technical connotation becomes very hard.
      
      Neologisms are one way out. Greek / Latin terms also work well because they rarely have a “common” connotation any more.
      
      e.g. Enthalpy / Entropy / Fugacity / Adiabatic are good choices. “Activity” & “reversible” are bad ones.
    - Daniel Lakeland on January 4, 2016 12:30 PM at 12:30 pm said:
      
      Connecting technical terms to real-world situations is inherently a difficult thing. It’s the essence of what it means to “model” something, and not only is modeling something that only a small fraction of the population actually does, it’s even a smallish fraction of the technically inclined numerate scientific population. Lots of scientists just do experiments and collect data and don’t really know how to “model” it. (esp. biology, medicine, and in my experience engineering is only marginally capable of modeling, most engineers have a “good enough for govt work” attitude and put up with extremely ugly and unintuitive models)
      
      That being said, I suspect “significant” wasn’t originally chosen to deceive, but the possibility for exploitation was rapidly capitalized upon. These days “significant” is used BECAUSE it is deceptive in many cases.
      
      For example, a dentist tried to sell me some dental treatment and gave me a brochure, the brochure constantly talked about “significant improvements” in periodontal pocket depth… but when you looked at the data in the fine print it was like less than 0.2 mm on average, but they had 2000 participants or whatever, so they could detect the difference. 0.2mm is like the thickness of 4 or 5 epithelial cells…
      
      http://www.ncbi.nlm.nih.gov/pubmed/21850882
    - Rahul on January 4, 2016 12:34 PM at 12:34 pm said:
      
      Interesting.
      
      I’m curious which are the fields / domains you correlate with better modelers? I would have picked Engineers but you don’t seem to.
    - Daniel Lakeland on January 4, 2016 1:14 PM at 1:14 pm said:
      
      Engineers are kind of “ok”. For example there are lots of engineering models where it’s just “fit a piecewise polynomial through some data”. Many models have the flavor of “calculate foo, if foo is between a,b then do X if between b and c do Y and if greater than c do Z…” and then X and Y and Z are themselves kind of ad-hoc rules for doing stuff.
      
      The history of Engineering is filled with large books full of tables.
      
      I’m constantly drawn back to the buckling of columns for my canonical example. There’s the AISC steel handbook and its canonical model for non-Euler (short) columns. See this pdf for example:
      
      http://user.engineering.uiowa.edu/~design1/StructuralDesignII/CompresionDesign.pdf
      
      It’s divided ad-hoc into 3 regimes and some convenient formula is used piecewise to match between the two asymptotic regimes (too short to buckle, and long enough to buckle according to Euler’s formula). I once saw a great website on column buckling with a proper dimensionless analysis, a single formula comes out of it all, and the formula is totally motivated by physical conditions at every level. Unfortunately I can no longer find that website via googling…. sigh
      
      I suspect that Physics ranks top for modeling, and second down are kind of “specialized physics” such as physical oceanography, climate/weather models, geophysics, planetary physics, building sciences (heat transfer, building-fluid interaction, building envelope issues etc).
      
      The next step down is various forms of Engineering, not bad, but a tendency to do a lot of ad-hoc stuff. It tends to be that what matters is that some number can be calculated and relied on, not that it’s based on any real explanation.
      
      Some groups/people stand out, for example Zdeněk Bažant at Northwestern is constantly doing advanced mathematical analysis and using techniques like asymptotic matching and so forth. But, his work and that of his students stands out because it’s so clean. For example:
      
      http://www.scholars.northwestern.edu/pubDetail.asp?n=Zden%C4%9Bk+P+Ba%C5%BEant&u_id=133&oe_id=1&o_id=&id=84940553970
      
      The fact that this kind of clear clean analysis stands out compared to the typical stuff actually in use in engineering codes etc is why I claim Engineers have issues, let’s call it, “room for improvement”.
    - Daniel Lakeland on January 4, 2016 1:40 PM at 1:40 pm said:
      
      Also there are a lot of design problems that engineering undergrads do that have the flavor:
      
      “Guess the size of the pipe, from the guessed value, calculate some parameter foo, plug this foo into an equation from which you calculate the required diameter of the pipe, take the new diameter and go back to the beginning… do this until the output matches the input within the tolerance”
      
      which is … ok but potentially problematic (does the iteration ALWAYS converge? Does it always converge to a correct answer?), and teaching people to do this kind of thing by hand is problematic too, it’s more appropriate for a computer program. Finally, in most cases if you look in the literature there is some explicit formula that someone else has come up with… but the explicit formula is only an *approximation*, it’s never mentioned that the results of the canonical iterative method are also only an approximation, or that the explicit formula could actually be a BETTER approximation than the original.
      
      that’s the kind of thing that I think of when I say that even in engineering the concept of modeling is not so well established.
    - Rahul on January 4, 2016 1:55 PM at 1:55 pm said:
      
      @Daniel
      
      I didn’t get your argument against the iterative methods as bad strategy: Sure they may not always converge but can an equation based approach guarantee convergence always?
      
      I see some merits: Often the explicit formula turns out to be very ugly. The stage-wise calculation uses simpler piecewise formulae that can be intuitively easier to understand.
      
      Sometimes the intermediate results “foo” by themselves are meaningful combinations (e.g. At the end not only do I know the optimum pipe size but I also know that we are in the turbulent regime & only barely so)
    - Rahul on January 4, 2016 2:03 PM at 2:03 pm said:
      
      Some more merits I can think of:
      
      Engineering problems are almost always non-continuum. My pipes and girders come in discrete sizes. So far as I know this sort of thing is easier managed via an algorithmic approach (e.g. If foo < 2.5 select the next higher pipe size available) than via conventional equations. Maybe you could recast it as a dynamic programming problem?
      
      The rules you mention e.g. "if foo is between a,b then do X" often are a natural result of how the engineering reality operates. e.g. you might weld a small pipe joint but roll a large one and hence different tolerances etc.
      
      Another point: Often Engineering Problems are hierarchical and with an approach based on explicit equations we might need one equation for every combination. Potentially a big, ugly equation.
      
      Using an iterative, stepwise approach allows us to manage the combinatorial explosion.
    - Daniel Lakeland on January 4, 2016 2:22 PM at 2:22 pm said:
      
      I’m not opposed to iterative methods, I guess what I’m opposed to is the “canonicalization” of certain ad-hoc models.
      
      Typically the history is that in the dark days before pocket calculators, it was easier to calculate 3.24 * log_10(Q) using lookup tables and a slide-rule than it was to calculate 2.1041 * sqrt(1+Q^2) or whatever. So rather than use a basis expansion in multiquadric radial basis functions which has a straightforward “I’m approximating a function using a standard set of approximants” type explanation, an easy to calculate but opaque formula was used (why that formula? Because it fit the data and was easy to calculate with tables and slide-rule). (note, also people hadn’t yet developed multiquadrics, but they had Fourier series etc.)
      
      Next, years later, the fancy opaque formula is not observed to be “easy to calculate and it fit the data pretty well” it’s taken to be “some amazing person back in the day figured out that THIS IS THE *REAL* FORMULA”. Now, every new formula has to be thought of as an approximation to THE REAL FORMULA… instead of “gee here’s kind of an easier and more straightforward way to model this data now that we have computers”
      
      It’s not so much that any given model is poor practice, but that the *modeling thought process* is not embedded in the education of engineers, or in the processes used to update and evaluate code recommendations and so forth.
      
      Most of the education of engineers builds in “some smart person figured out that this is the set of steps you need to learn to calculate X” and the average working engineer is not going to be able to evaluate the “goodness” of your new model relative to the old one, they were told “X was the right way” and since your new model is “NOT X” then it must be “NOT RIGHT”….
    - Rahul on January 4, 2016 2:40 PM at 2:40 pm said:
      
      I think the real situation is quite complicated.
      
      e.g. Often in my field there are archaic look up tables and nomograms etc. that were developed in the pre-computer era. It’d be a hell of a lot easier now to work with a spline or something, but one problem is there’s a lot of grunt work involved in converting the legacy formats to equations. Not much glory in doing that so it does not get done.
      
      Another huge concern is validation: “THE REAL FORMULA” has been thoroughly tested & designers have become aware where the minefields lie.
      
      With change comes risk. And most professional engineers must be very risk averse by the nature of the things they do. You don’t want to be the guy testifying before the NTSB that the tanker blew up because you didn’t realize the ugly equation had another root you missed or something.
      
      So unless the payoff in terms of efficiency is huge no one wants to bite the bullet. Finally, it’s just a more elegant way to the same answer.
      
      What would be stupid is if some engineer today developed an ugly nomogram or look up table for a new application & novel data. I’m not saying it doesn’t happen but thankfully that’s rare.
    - Daniel Lakeland on January 4, 2016 2:59 PM at 2:59 pm said:
      
      Rahul:
      
      It’s important to consider the distinction between “a good model” as a model, and “a good model” as a means to accomplishing some goal like “it’s easy to calculate and we can make money quickly using it”
      
      I think the buckling of a column is a perfect example of my point. Here we need a continuum formula because although there are 8247 different discrete rolled shapes available in the market today, perhaps tomorrow there will be a few more shapes available, or the qualities of the steel will change a little (higher yield strength or whatever) so that all the shapes wind up at a different place on the continuum… etc.
      
      So, we need a continuum model in the dimensionless length parameter, and the cutoff points are purely there because it makes the piecewise fit work. Now, although you could explicitly have a single formula, you don’t, and because of that, if you want to solve some derived problem (such as “find the length of this column that minimizes the cost of construction subject to the constraint that its total load is less than the allowed value calculated by the allowed strength formula”) you are now solving an ugly problem that could actually be a simple problem (infinitely smoothly differentiable) but isn’t.
      
      A similar thing happens in the bending strength of reinforced concrete beams. If you give an explicit formula for compressive stress vs strain, then a computer can easily calculate the failure strength of any cross sectional shape via integration, and what’s more can calculate the plastic failure energy and other things that you’d like to know.
      
      Instead, the “equivalent stress block” is used as the canonical model. This is essentially a linear regression someone did in 1965 or whatever using a by-hand plot and a ruler. It should be “here’s the stress vs strain curve for various concrete mixture parameters, you may always calculate with the stress vs strain model, but for any of the following shapes you may use the equivalent stress block instead” but instead it’s “calculate using the equivalent stress block!”
      
      The main reason to have “good models” (ie. models that are smooth, explicit, physically motivated, etc) is their generalizability. The discrete example of pipe diameters is a perfect example of how shortcuts in engineering could cause real problems down the line. For example, you could have a table for all the different “allowable” pipe diameters vs volumetric flow rates for water and publish this in a book… Need to pipe water somewhere at a certain rate? Look up the diameter of the pipe needed…
      
      and then… someone will want to pipe benzene or natural gas or olive oil… or pipe olive oil through a pipe surrounded by heating elements to output it at a precise temperature, but the viscosity is highly temperature dependent!
      
      so anyway, getting a “good model” qua model is often given secondary or tertiary status in engineering and it causes issues down the road.
    - Daniel Lakeland on January 4, 2016 3:27 PM at 3:27 pm said:
      
      I don’t disagree with you that there are reasons to continue to use old ugly models… I’m mainly giving examples of how the skills of *making good models* are not necessarily well developed in Engineering, nor are the advantages of doing a good job necessarily very appreciated broadly within Engineering.
      
      If you have some new application, and you want to “do it right” you might need to look long and hard to find an Engineer who has the skills to do a good job.
      
      Also, if two researchers come to a standards body with models, one of which is well motivated by physics and applied mathematics, the other is a couple of ad-hoc piecewise curves fit to a lot of data. The two methods produce similar predictions for the entire range of possibilities…. The ad-hoc method requires many fewer key-presses on a pocket calculator and looks like the kind of stuff engineers are used to, the other model is infinitely smooth and easy to differentiate and use in optimization problems or to generalize to different types of scenarios (eg different gravitational accelerations or different material densities) but understanding the motivation requires knowing something Engineers may not usually deal with (ie. renormalization group or something) which one will get chosen?
      
      I fear number of keypresses on a pocket calculator and “similarity to the kinds of stuff we’ve done in other fields” is still a significant factor in model choice.
    - Bob on January 4, 2016 4:06 PM at 4:06 pm said:
      
      Well, I think you need to think about what field of engineering you are discussing.
      
      My understanding is that electrical engineers designing circuits use modeling tools all the time (e.g., Spice) as do engineers designing wireless handsets or wireless systems. Spice uses the Gummel-Poon model of a transistor. See
      https://en.wikipedia.org/wiki/Gummel%E2%80%93Poon_model.
      
      A bipolar junction transistor’s Gummel-Poon model has 41 parameters. The model comes from the physics—not empirical curve fitting.
      
      I understand that aeronautical engineers use large computational fluid dynamics tools all the time to tune designs. Then they shift to testing in wind tunnels.
      
      Bob
    - Daniel Lakeland on January 4, 2016 4:46 PM at 4:46 pm said:
      
      Bob. Yes, I agree. I may be a little glib here, but also I think some of the examples you mention are imported models. Fluid mechanics, and electronics are both areas where people you’d more likely call physicists had a lot of input early on. Most Engineers might be able to analyze a circuit with Spice, or a turbine with a CFD package, but would not necessarily be able to generate or evaluate the transistor model you mention, or determine whether the CFD package was capable of handling the case where raindrops were being sucked into the input of the turbine… etc.
      
      I am definitely NOT saying that Engineers are BAD at modeling. I just think that modeling in a physically motivated and “clean” way is less valued in many of the more messy areas of Engineering, and that many people are able to get through a full masters degree in Engineering, or for some even a PhD, without taking a single course that would be required to do a good job of modeling say the effect of particle size distribution on the resistance to flow of a fluid through a filter bed… or the progression to failure of a nail through two flat pieces of material of varying types when loaded in shear… or something like that…
      
      I think these people are capable of doing a good job in the abstract, they could learn more about modeling if taught, and they have the general numerical skills, and so forth. But, it’s not a major emphasis in most programs of study.
    - Daniel Lakeland on January 4, 2016 5:06 PM at 5:06 pm said:
      
      For example; Hermann Gummel is very distinctly in the category of Physicist.
      
      Fluid mechanics is an area where people educated as Engineers certainly did a lot of great modeling work. But at the same time, many of them were educated in a way very different from the modern engineering education.
      
      Osborne Reynolds for example studied mathematics with an interest in applying it to Mechanics. Prandtl was appointed head of the “Institute for Technical Physics” and von Kármán was Prandtl’s student…
      
      Although I think many of them were probably fine Engineers, they were exceptional at modeling compared to the average Engineer and certainly compared to the average say recently graduated Masters Student in Engineering out of a US school. It’s just not the point of a modern education to churn out Engineers who might be able to attack fluid mechanics principles from the basic laws of motion, or whatever. We’ve already got that down pretty well, so we work on churning out students who know how to fiddle around with meshes on finite element solvers and so forth instead.
      
      Perhaps it was an advantage that Prandtl and Reynolds and von Kármán and so forth didn’t have to learn a large body of well developed codified stuff.
    - mark on January 4, 2016 12:47 PM at 12:47 pm said:
      
      My own personal favorite is recursive and non-recursive in the SEM literature.
    - Elin on January 4, 2016 1:46 PM at 1:46 pm said:
      
      +1
    - Garnett McMillan on January 4, 2016 12:27 PM at 12:27 pm said:
      
      This is a very helpful description!
Martha on January 4, 2016 11:44 AM at 11:44 am said:

Yes, yes, yes to emphasizing the difference between “plausible” and “probable”. I think this is where a lot of people go down the primrose path (including in the garden of forking paths — which, to perhaps overextend the metaphor, has a lot of primrose paths; so I guess I should have said “a primrose path” rather than “the primrose path”).

Garnett McMillan on January 4, 2016 11:46 AM at 11:46 am said:

“…it’s not possible for all these large effects to be happening together.”
Why not? I agree that it seems unlikely, but why CAN’T they be happening together?

- Martha on January 4, 2016 11:49 AM at 11:49 am said:
  
  This is a case where “highly implausible” and “highly improbable” both fit.
  
- Andrew on January 4, 2016 3:41 PM at 3:41 pm said:
  
  Garnett:
  
  See P.S. added above.
  
  - Garnett McMillan on January 4, 2016 11:02 PM at 11:02 pm said:
    
    Thanks! I missed the “large” qualifier for the effects in the original discussion.
    
Rahul on January 4, 2016 11:49 AM at 11:49 am said:

So to label an outcome “probable” what should be the probability, roughly? If instead of 1 in 1296 it were 1 in 100 can I call it “probable”? 1 in 10?

Is this an absolute designation or should “probable” be relative to something else?

- Martha on January 4, 2016 11:53 AM at 11:53 am said:
  
  I would not call something “probable” unless its probability were more than 50%.
  
  - Rahul on January 4, 2016 11:58 AM at 11:58 am said:
    
    What if there were hundreds of thousands of very very small probability states but only one distinct outcome with 20% probability?
    
    Would you call that “probable”?
    
    So can we have a most probable outcome that’s not probable?
    
    - Martha on January 4, 2016 12:07 PM at 12:07 pm said:
      
      “Would you call that “probable”?”
      No
      
      “So can we have a most probable outcome that’s not probable?”
      Yes
      
      If this seems paradoxical: I am interpreting “probable” in an absolute sense; “most probable” in a relative sense.
    - psyoskeptic on January 4, 2016 12:44 PM at 12:44 pm said:
      
      Then why select as low as 50%? Half of the time your outcome will not occur.
    - Daniel Lakeland on January 4, 2016 6:06 PM at 6:06 pm said:
      
      greater than or equal to 50%. It corresponds to “more (or as) probable than not”.
    - Kyle C on January 4, 2016 1:11 PM at 1:11 pm said:
      
      “So can we have a most probable outcome that’s not probable?”
      
      Quite often, when there are more than three possible outcomes of interest. It happens all the time in litigation, for example [win big, win a little, settle, lose].
    - Alex on January 4, 2016 12:36 PM at 12:36 pm said:
      
      You have this every year with sports playoffs, maybe most notably March Madness (college basketball). There are typically one or two obvious best teams in the playoffs, but there are still a lot of good teams and the eventual champion will have to win five or six straight games. The ‘favorite’ team will be something like 15-30% to win, which means the ‘favorite’ is actually any other team. The best team winning is not probable but is the most probable outcome. To give a concrete example, you could turn to the NFL, which just set its playoff schedule. The best team in the league is likely the Carolina Panthers, but they are underdogs versus the field to actually win the championship (http://projects.fivethirtyeight.com/2015-nfl-predictions/).
psyoskeptic on January 4, 2016 12:50 PM at 12:50 pm said:

Daniel, you’re seriously going to appeal to Cesario et al 2006 for support? That paper isn’t just a garden of forking paths, it’s a Texas ranch. After I teach Simmons et al’s false-positive psychology paper even junior undergrads drive a truck right through that one.

- Daniel Lakeland on January 4, 2016 1:20 PM at 1:20 pm said:
  
  Just to be perfectly clear, the “Daniel” in Andrew’s article is some other Daniel. I comment here a lot and maybe some people who see “Daniel” assume it might be me, so I wanted to clarify.
  
  - psyoskeptic on January 4, 2016 1:23 PM at 1:23 pm said:
    
    Daniel Lakeland, I didn’t assume either way. What I hoped was that the Daniel in question reads the blog. Knowing your blog I thought it both implausible and improbable that it was you.
    
- mark on January 4, 2016 4:27 PM at 4:27 pm said:
  
  Yeah – it is a wonderful example to use in class of what not to do and what research findings we can pretty safely ignore.
  
Nick Menzies on January 4, 2016 12:55 PM at 12:55 pm said:

Hi Andrew,

If I understand correctly your definition of plausibility extends beyond the features of the event of interest. For example, from the blog post I would surmise that rolling a 27 on a fair, million-sided die would be plausible, but that rolling the same number would not be plausible if all the other faces of the die were modified to read 25 (i.e. p(rolls a 27) = 1/1e6, p(rolls a 25) =1-1/1e6). Is this correct?

If so, it seems that the distinction you are making involves the unstated number of alternative hypotheses that are available to be tested, correct? By this logic, if someone first wrote 27 on a piece of paper, it would then be implausible to obtain the same number with either die (since now we are really interested in one hypothesis).

- Z on January 4, 2016 3:14 PM at 3:14 pm said:
  
  +1 in that I’d also like to see this question answered
  
Eric Loken on January 4, 2016 1:48 PM at 1:48 pm said:

maybe “believable”? Reasonable? We like it when a story’s events were not predictable a priori, but still make sense once revealed. (Of course something ideological/didactic might be more satisfying to some precisely because it coheres perfectly with a priori beliefs.)

Then there’s the social science issue that the “plausibility” is often required only of the existence of the non-zero effect. The discussion turns to claiming that things are “true” or “demonstrated”, conveniently ignoring the “plausibility” of 23% voter swings or 3:1 odds in clothing choices or whatever the finding might be.

Manoel Galdino on January 4, 2016 3:12 PM at 3:12 pm said:

Probably a mistake, but a few thoughts provoked by this post and some comments above:
1. Is this equivalent to stating that the many potential outcomes (in the Rubin causal model) are all plausible but not probable?
2. I never know, in practice, how to use hierarchical models to tackle this kind of problem, especially because we don’t know the number of interaction effects to study in advance.
3. How is this different from the way people try to find “cures” for some diseases? I mean, I suppose they come up with, say, a new hypothesis that a certain drug can fight a certain type of breast cancer, and test it. If it doesn’t work, they try it on a different type of cancer, or maybe try a new drug on the same type of cancer, etc. All drugs are plausible, but not probable as defined by Andrew.

Fernando on January 4, 2016 4:27 PM at 4:27 pm said:

This thread shows why good definitions are important. Andrew draws a distinction btw “plausible” and “probable”

A Google search for plausible returns this definition:

> (of an argument or statement) seeming reasonable or **probable**. (my emphasis)

Without good definitions it is semantics and talking past each other all the way down.

PS I always thought the distinction here is btw “possible” and “probable” or, better still, “possible” and “relevant”.

- Andrew on January 4, 2016 4:46 PM at 4:46 pm said:
  
  Fernando:
  
  Definitions are fine but create their own problems. I gave examples so that my message would be clear. If you want, just replace my words plausible and probable with X and Y and read it in that way.
  
  - Fernando on January 4, 2016 11:23 PM at 11:23 pm said:
    
    Andrew:
    
    You assume the definiens is clear while the definiendums are simply labels with no particular value. In practice “plausible” and “probable” already come loaded with meaning due to their association in common definitions to other definiens. They are not like X or Y at all. Hence, they are likely to get in the way of your examples.
    
    Put differently, the choice of words is not innocent, good examples notwithstanding.
    
- Martha on January 4, 2016 11:18 PM at 11:18 pm said:
  
  @Fernando:
  
  1. Different people define the same terms differently. So Google searches or looking something up in a dictionary can be misleading, especially in technical subjects. (In fact, I’ve found that dictionary definitions are often circular.)
  
  2. I read the Google definition you gave as offering two alternative definitions of plausible: reasonable is one; probable is the other.
  
  3. Semantics are what it’s all about when we are talking about definitions! So it’s important to (try to) state one’s definitions.
  
  4. I consider “possible” to be essentially the same as “plausible”. (That’s essentially my definition of “plausible”.)
  
  - Andrew on January 4, 2016 11:20 PM at 11:20 pm said:
    
    Fernando, Martha:
    
    Blog comments are useful. I want to do more work in this area, and when I next write this up, I will try to use clearer terms.
    
    - Martha on January 5, 2016 4:14 PM at 4:14 pm said:
      
      It occurred to me that my use of “plausible” is influenced by its use in mathematics, and that Andrew’s might also have that influence.
      
      To elaborate: In mathematics, we distinguish between “plausible reasoning” (or a “plausible argument”) and proof. Two types of situations come to mind:
      
      1. Someone might give a plausible argument why something might be true. That is not considered a proof, but is considered good reason to consider the statement as a viable conjecture and/or to try to prove it.
      
      2. In teaching (especially courses such as calculus where applications are the goal), we may give a plausible argument for something rather than giving a rigorous proof or just stating it as a fact. As a somewhat simplistic example, consider the statement, “If f is continuous on the interval [a,b] and if f(a) < 0 and f(b) > 0, then there is some c between a and b with f(c) = 0.” A plausibility argument might go something like this: “Continuous means I can draw the graph of f without lifting my pen from the paper. Well [drawing as I proceed], I start here at (a, f(a)), which is below the x-axis, and end here at (b, f(b)), which is above the x-axis, so since I can get there without lifting my pen off the paper, at some point I have to cross the x-axis, and that point is (c, 0), which means that f(c) = 0 for that c.”
  - Bruce on January 5, 2016 8:12 AM at 8:12 am said:
    
    To me, there is a distinction inherent to common usage of “possible” versus “plausible” in that, although both are statements of probability, “possible” connotes an absolute probability (i.e. p>0), and “plausible” connotes a relative probability (i.e. p > some reference value, such as the probability of some alternative outcome).
    
    As above, common usage of “probable” seems to typically connote an absolute probability (p > 0.5), though the word is also used in a relative sense.
    
    - Curious on January 5, 2016 10:03 AM at 10:03 am said:
      
      Bruce:
      
      I am not following your logic. Why would the probability of “possible” be greater than 0?
    - Curious on January 5, 2016 10:08 AM at 10:08 am said:
      
      Wouldn’t it more accurately be captured by pr(x) >= 0.
    - Daniel Lakeland on January 5, 2016 11:45 AM at 11:45 am said:
      
      p=0 implies the outcome can not occur (sort of, there’s an issue with continuous random variables… but let’s ignore this technicality for the moment).
      
      Therefore p > 0 is “possible” and p=0 is “impossible”
    - Curious on January 5, 2016 4:37 PM at 4:37 pm said:
      
      And you would characterize a binary event occurrence in that way? Even if finite?
    - Curious on January 5, 2016 4:49 PM at 4:49 pm said:
      
      If something (x) is constrained (c) such that it cannot possibly happen, wouldn’t that be the equivalent of pr(x|c) = 0?
    - Daniel Lakeland on January 5, 2016 5:13 PM at 5:13 pm said:
      
      yes, so for something to be “possible” it needs to have probability GREATER than 0.
  - Daniel Lakeland on January 5, 2016 10:38 AM at 10:38 am said:
    
    Martha:
    
    You pull a 10 out of a random number generator. Is it “plausible” that it’s a unit normal random number generator? Is it “possible”?
    
    I’d say, No, it’s not plausible (p is too small), but yes, it is technically possible (p is not actually zero).
    
    - Martha on January 5, 2016 4:02 PM at 4:02 pm said:
      
      I guess I’d have to say that you and I have different criteria for something to be plausible.
    - Daniel Lakeland on January 5, 2016 4:58 PM at 4:58 pm said:
      
      I guess so. I think my definition, that probability shouldn’t be TOO low, is probably closer to common usage. The Mythbusters use “Plausible” in their show to denote something that they think could maybe have happened. Pulling a value greater than 10 from a unit normal has probability 7.62×10^-24, which is so close to zero that I just don’t think that it’s plausible that it really came from a unit normal. You could generate random numbers a million times a second for the life of the universe and you’d only get about pi of them :-)
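
      The tail probability quoted above can be checked with a minimal Python sketch. It uses the standard identity P(Z > z) = erfc(z/√2)/2 for a unit normal. The age of the universe (about 13.8 billion years) and the helper name `upper_tail_unit_normal` are assumptions made for illustration, not figures from the comment.

      ```python
      import math

      def upper_tail_unit_normal(z):
          """P(Z > z) for Z ~ Normal(0, 1), via the complementary error function."""
          return 0.5 * math.erfc(z / math.sqrt(2.0))

      p_tail = upper_tail_unit_normal(10.0)

      # Assumed figures: ~13.8 billion years for the age of the universe,
      # and one million draws per second for that whole time.
      seconds_per_year = 365.25 * 24 * 3600
      n_draws = 13.8e9 * seconds_per_year * 1e6
      expected_hits = n_draws * p_tail

      print(f"P(Z > 10) = {p_tail:.3e}")
      print(f"expected draws above 10 = {expected_hits:.2f}")
      ```

      Under those assumptions the expected count comes out at roughly three, which is consistent with the “about pi” remark.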
    - Martha on January 5, 2016 5:21 PM at 5:21 pm said:
      
      Mea culpa — I didn’t read your question carefully enough.
    - Martha on January 6, 2016 11:29 AM at 11:29 am said:
      
      More accurately: I didn’t understand your question. So let’s try to see whether or not I understand it now:
      
      1. Am I correct that by “unit normal random number generator” you mean a rng that generates draws from a normal distribution with mean 0 and variance 1?
      
      2. When I read your question, I assumed that by p you meant the probability that the rng was indeed a unit normal. But your reply suggested that by p you meant the probability of “pulling a value greater than 10 from a unit normal”
      
      Here is what I would say now in response to your question:
      
      It is not highly plausible that the 10 was drawn from a unit normal — your argument is essentially a plausibility argument to this effect. But it is more plausible that the 10 came from a unit normal than from a uniform distribution on [0,1], since the latter possibility is impossible. But it is highly plausible that the 10 came from a normal with mean 10 and variance 2; even more plausible that it came from a normal with mean 10 and variance 1, etc.
      
      In other words, “plausibility” (to me) has a relative aspect to it (i.e., compared to other alternatives, of which there may be many); it’s not black and white.
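
      One way to make this relative reading concrete, offered only as a sketch, is to compare how much density each candidate distribution puts on the observed value 10. Treating density at the observation as a stand-in for “plausibility” is an assumption of this example, and the helper names `normal_pdf` and `uniform_pdf` are hypothetical.

      ```python
      import math

      def normal_pdf(x, mean, variance):
          """Density of Normal(mean, variance) at x."""
          return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

      def uniform_pdf(x, low, high):
          """Density of Uniform[low, high] at x (zero outside the interval)."""
          return 1.0 / (high - low) if low <= x <= high else 0.0

      x = 10.0  # the observed draw
      densities = {
          "uniform on [0, 1]": uniform_pdf(x, 0.0, 1.0),
          "normal(mean=0, var=1)": normal_pdf(x, 0.0, 1.0),
          "normal(mean=10, var=2)": normal_pdf(x, 10.0, 2.0),
          "normal(mean=10, var=1)": normal_pdf(x, 10.0, 1.0),
      }

      # Print the candidates from least to most supportive of the observation.
      for name, density in sorted(densities.items(), key=lambda kv: kv[1]):
          print(f"{name:24s} density at 10 = {density:.3e}")
      ```

      The ordering matches the comment: the uniform gives exactly zero, the unit normal gives a tiny but positive value, and the two normals centered at 10 give the most, with variance 1 ahead of variance 2.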
    - Daniel Lakeland on January 6, 2016 12:32 PM at 12:32 pm said:
      
      Yes, unit normal meant mean=0 standard deviation=1.
      
      by p I meant “the probability that something close to 10 would come out of a unit normal” which is so small that I can reject the unit normal as the distribution for the RNG. This is one of the few cases where a p value is a truly appropriate thing to calculate ;-)
      
      Although I’m fine with your “more or less plausible” wording per se, I think that when people say “it’s plausible that the brakes failed on the car causing the crash” they don’t mean something like “out of a million crashes it will happen 1 or 2 times” they mean something more like “more than 1 in 1000 crashes happen like that”
      
      and if told that 1 in a million crashes actually happen like that, then they’re going to say it’s “implausible”.
      
      I think we’ve already established an agreement that “plausibility” is isomorphic to “probability” so if we’re trying to make a distinction between “plausible” and “probable” (as distinct from “plausibility” and “probability”) then we’re going to have to talk about some possibly fuzzy absolute level of probability aren’t we?
Rahul on January 4, 2016 8:39 PM at 8:39 pm said:

@Daniel:

Couple of points.

First, Reynolds, Prandtl etc. were exceptional. That doesn’t say much about Engineering Education today nor about modeling skills of engineers. That’s akin to judging the average Physics graduate of US colleges today by a comparison with Heisenberg or Bohr.

Secondly, in most domains (but perhaps more in Engineering?) the majority of graduates will become model *users* and only a few will be model builders. And that’s how things should be. Naturally, you are seeing the same sort of skew in training focus and observed skillsets.

- Daniel Lakeland on January 4, 2016 10:29 PM at 10:29 pm said:
  
  @Rahul, yes that’s what I’m saying, I think we agree here. You asked who had the most high quality model builders, I basically said “physicists” especially in some of the more “applied” physics areas (geophys, oceanography, climate, weather, materials science, tribology etc) and then we went on to elaborate on ways in which most of the “good model builders” in Engineering were “more physicsy” (fluid mech guys, Timoshenko, etc) and now I think we agree that it’s not really a goal of modern engineering education to create “model builders”, and that because of this, we wind up with either a lot of old irritating lookup tables and nomograms, or the modern equivalent of irritating nomograms such as piecewise polynomial regressions or whatever. Also, there is resistance within Engineering to using the results from people like Bazant in code books and things because the models don’t have the “flavor” of what Engineers are used to.
  
- Daniel Lakeland on January 4, 2016 10:45 PM at 10:45 pm said:
  
  Also, to give credit where it’s due, a bunch of really great model building HAS been done by Engineers. That’s true in Hydrology, in extreme value theory, in certain stochastic methods (the Stratonovich integral eg) in structural mechanics, the invention of the Finite Element Method was pretty much out of the UC Berkeley Civil Engineering dept if I understand the history correctly, fluid mechanics, porous media flow (petroleum engineering especially), wear, friction, and tribology.
  
  But, a lot of good stuff has also come out of some Engineer who could describe a problem pretty well talking with some applied mathematician or physicist who could work through the model building process pretty well. There are some great examples in a wonderful little book: “Practical Applied Mathematics” by Sam Howison who was at Oxford Center for Industrial and Applied Mathematics. Stuff like “why do female birds constantly turn their eggs, and why do zoos lose their eggs when they don’t turn them?” or “why do we keep breaking off the cables that we’re trying to lay across the ocean?” or “why does paint act really weird when we spray it in certain conditions?”
  
  My impression is that MOST applied mathematicians are NOT model builders, but that a certain group of them are really into that and they do a lot of good work.
  
  All of this plays into my original point, which is that modeling is hard, and very few people really do it, even very few otherwise numerate and scientific people.
  
Christian Hennig on January 4, 2016 9:01 PM at 9:01 pm said:

All your arguments sounding reasonable enough, justifying any specific prior would still be very tough, wouldn’t it?

So a frequentist who’d say, OK, true, with forking paths etc. P<0.05 indeed is not strong evidence, but if we observe such a thing, let's have a follow-up study that tests this specific effect in a pre-defined way and exclusively, no forking of paths anymore, and if it is significant again we may actually be on to something, would it be that stupid? (I'm assuming this person is reasonable enough to have a proper appreciation of effect sizes etc.)

- Andrew on January 4, 2016 9:08 PM at 9:08 pm said:
  
  Christian:
  
  Sure, if you want to go the significance test route, you need to preregister. But if noise is high compared to signal, you’ll still have difficulty interpreting the p-value because of Type S and Type M error. Most of my writing on the garden of forking paths is all about how the p-value can be wrong. The present post has an interesting (I think) new angle in that it’s connecting the multiplicity of the forking paths to a constraint on the prior distribution.
  
  And, sure, justifying any specific prior might be tough, but it might be possible to come up with reasonable bounds on a prior. The more forks in the path, the more possible effects and interactions, and the smaller they must be on average.
  
  - Christian Hennig on January 5, 2016 7:49 AM at 7:49 am said:
    
    Andrew: I have mixed feelings about this. On one hand I agree that if a research finding is “surprising” against a backdrop of solid previous knowledge, the bar for it to pass should somehow be higher, which is what Bayesian analysis can achieve. On the other hand, I don’t like statements of the kind “the posterior probability of hypothesis so-and-so is such-and-such” based on priors that have been chosen involving high doses of creativity and convenience. I don’t believe that such probabilities mean much. In particular they’re probabilities assigned to the “truth” of certain probability models that nobody should believe to be literally true anyway (be it in a frequentist sense, or be it in the sense of being a *precise* reflection of anybody’s true belief or state of knowledge).
    
    Overall, the confidence I have in a Bayesian analysis is proportional to how convincing I find the justification of the prior, and from being exposed in the literature and as reviewer to some random collection of Bayesian work, my impression is that things are not all that healthy in this department (although I’m not denying that good work does exist and that much of your work is very helpful in making people think better about their priors).
    
    One probably helpful distinction here is between priors that are based on solid knowledge that is at least fairly closely related to the hypothesis in question, like previous work on the hypothesis itself or on something that quite clearly has implications on it, and on the other hand priors that are based on general considerations such as “effects are rarely large in studies of this kind”, “this kind” typically including research hypotheses that have nothing at all to do with the issue in question. In such a case at the very least I’d like to see a proper definition of the population of studies from which the present one is interpreted to be a random draw, and some evidence coming from a sample of studies coming from this population backing up the choice of the prior, and I have hardly seen any such thing. It seems hard to me to have such a thing, given that I’d expect that the population of studies in question itself will be strongly affected by forking paths and all kinds of further statistical and measurement-related issues so that we are on rather weak ground making any kind of statement regarding a “proportion of true hypotheses”, “distribution of true effect sizes” etc. (which is probably my major issue with recently discussed work such as Ioannidis or Colquhoun on “true research findings”).
    
    Even if it could be done, it will still be controversial whether the chosen population of studies is appropriate as a reference class for the specific study in question, and of course there are forking paths in a Bayesian analysis, too, such as which transformation of variables to analyse, should substantial probability be assigned on a point null hypothesis etc.
    
    - Andrew on January 5, 2016 10:18 AM at 10:18 am said:
      
      Christian:
      
      Like it or not, all aspects of these models are chosen based on “high doses of creativity and convenience.” Just take a look at Psychological Science some time. Given this, I’d much prefer an analysis of a large set of effects and interactions of interest, rather than the researcher picking just one thing and then trying to get a p-value for it, a summary of evidence that will depend on what the researcher would have done, had the data been different. That’s the sort of Monty Hall mess that seems to me to be distant from the scientific goals.
      
      Regarding your question about the population of studies, my point in the present post is that there’s a population of potential effects and interactions even in any single study.
    - Martha on January 5, 2016 4:20 PM at 4:20 pm said:
      
      I can agree with Andrew’s statement, “Given this, I’d much prefer an analysis of a large set of effects and interactions of interest, rather than the researcher picking just one thing and then trying to get a p-value for it, a summary of evidence that will depend on what the researcher would have done, had the data been different.”
      
      But I also agree with Christian’s statement, “Overall, the confidence I have in a Bayesian analysis is proportional to how convincing I find the justification of the prior”
    - Daniel Lakeland on January 5, 2016 5:46 PM at 5:46 pm said:
      
      It’s interesting, because I almost never am too worried about the prior compared to how much I want to scrutinize the likelihood. I mean, sure, a blatantly skewed prior is a problem, but if your likelihood doesn’t reflect a reasonable model of the world, then you’re really blowing smoke…
      
      For example, if you’re trying to find out how dense the air is by throwing balls through it, and you’re using an Aristotelian theory of Impetus, it doesn’t really matter what you come up with for the prior over the density of air, since the connection between the density of air and the observed quantity (distance traveled) is just grossly wrong.
    - Christian Hennig on January 5, 2016 6:46 PM at 6:46 pm said:
      
      Daniel: I’m fine with scrutinizing the likelihood, but if you need a prior and a likelihood, you better worry about them both, don’t you?
      
      Andrew: No disagreement about whether your Bayesian approach will improve on what’s currently done there. I’m just saying that a frequentist can do better as well, without priors (which is a disadvantage in one respect, because the Bayesian has a convenient way to incorporate prior knowledge, but an advantage in another respect because it’s one problematic thing less to worry about).
    - Andrew on January 5, 2016 7:09 PM at 7:09 pm said:
      
      Christian:
      
      Yes, no doubt a non-Bayesian can do better too. Also, it’s possible to incorporate prior information in non-Bayesian ways, for example using penalty functions or other regularizations, or by evaluating frequency properties over a restricted subset of parameter space that corresponds to reasonable parameter values.
    - Daniel Lakeland on January 5, 2016 7:26 PM at 7:26 pm said:
      
      I just think the likelihood is usually the place where people are more likely to go wrong, and it typically enters into the posterior probability N times, once for each data point (under iid assumptions, for example). As long as the prior contains the true value in its typical set, the prior is ok, and this is typically not really that hard to do. The main thing is to be honest about how uncertain you are about the parameter. But if the likelihood misrepresents how the parameters connect to the data… then it screws up the whole thing.
    - Christian Hennig on January 6, 2016 10:17 AM at 10:17 am said:
      
      Daniel: In many cases (we’re not talking about physical constants such as the speed of light here) the “true value” is only defined in the framework of a model, and the model is an idealisation and is itself not “true” even if the likelihood is chosen as well as possible. So I don’t think it’s that easy to have the “true value in the typical set” or even to define what is meant by this.
      
      Of course frequentists talk a lot about “true values”, too, and I don’t like most of this. However, one thing I like about p-values is that I can more modestly and precisely interpret them as probabilities computed under the H0-model, measuring to what extent the observed data are consistent with this model, without having to imply anything about the “truth” of the model.
    - Keith O'Rourke on January 6, 2016 11:52 AM at 11:52 am said:
      
      Christian:
      
      For “more modestly and precisely interpret them as probabilities” if you have yet to, you might wish to read an interesting paper by Keli Liu and Xiao-Li Meng – http://arxiv.org/abs/1510.08539
      
      Also, perhaps this interview with Don Rubin re:evaluating frequency properties over a restricted subset of parameter space http://arxiv.org/abs/1404.1789
      
      I understand, you might feel otherwise.
    - Daniel Lakeland on January 6, 2016 12:03 PM at 12:03 pm said:
      
      If you don’t like “true value” then use “the value that will minimize some measure of the total deviation of the predictions from the measured quantities in the full population to which the model might be applied”. It’s at least in-principle verifiable what this value is (survey the whole population and get their measurements, and then find the parameter value that best fits the whole population).
    - Christian Hennig on January 11, 2016 8:29 AM at 8:29 am said:
      
      Daniel: Good point. I thought about this a bit. I accept such measurements as some kind of “in principle observable truth”. But this means that we’d need a combination of prior and likelihood so that the vector of “in principle observable truths” is in the typical region. I don’t see how, if you accept that choosing the likelihood is troublesome (which essentially refers to the same truth), the choice of the prior is not. If the likelihood is badly chosen, getting the prior right in your sense is not going to help much. This seems to underline your point about the likelihood, with which I actually agree. But if you say that choosing the likelihood is difficult, how can you then think that choosing the prior is easy, which basically deals with the same problem, namely the same truth?
      
      Another issue I’m concerned about is that there are lots of different priors that get the true value in their typical region, so are you saying that all these are equally legitimate? If the dataset is not huge, they may lead to vastly different inferences, or not? Is the variation in this explored anywhere? (Obviously, also it is not possible to see without knowing the truth which priors are legitimate in this sense and which are not.)
    - Daniel Lakeland on January 11, 2016 11:36 AM at 11:36 am said:
      
      @Christian:
      
      When you choose a prior, you can choose it to be “big” that is, cover a lot of possible values. But when you choose the likelihood you are in essence describing how you think the world works. ie. what kind of data you’re likely to see out of the process. You need this to be as sharp as possible if you want to make your data informative. Furthermore, you need to not leave out lots of possibilities or have a model that predicts things that are very wrong.
      
      It’s not that priors aren’t problematic, it’s just that likelihoods are in many ways even more so, as they can extract very precise but very wrong information from the data if they are basically wrong themselves.
      
      For example, suppose you observe some function in time… and little do you know that it can oscillate fairly wildly at certain times where you only have a data point or two, and at other times you have lots of data, but it can’t oscillate at that time. So, you provide a likelihood that is very smooth, maybe a linear regression or a low order polynomial or a smooth radial basis function or a spline or something…
      
      well, by failing to take into account the fact that the function can oscillate, you wind up fitting through the peaks of the oscillations or something like that, and you get a heavily biased estimate for some kind of time-constant related to the rate of change of the function.
      
      Now, even if your prior has enough width to cover both fast and slow rate of change, the likelihood has excluded the fast oscillations from consideration.
      
      That’s the kind of thing that I am thinking of, where your model of how the world works is wrong, and generally “the way the world works” is mostly contained in the likelihood.
    - Christian Hennig on January 11, 2016 11:40 AM at 11:40 am said:
      
      Daniel: Thanks for the discussion but as I said before, there’s no need to convince me that it’s important to not get the likelihood (all too) wrong. I was more interested in your apparently initially rather sloppy attitude about the prior.
    - Daniel Lakeland on January 11, 2016 2:17 PM at 2:17 pm said:
      
      @Christian:
      
      “there are lots of different priors that get the true value in their typical region, so are you saying that all these are equally legitimate…?”
      
      So, maybe not “equally” legitimate, but any prior which puts the, let’s call it “proper” rather than “true”, value of the parameter in the typical set is at least “legitimate”, that is, it expresses a legitimate probabilistic fact “x is somewhere in this region”.
      
      Degrees of legitimacy should really be measured relative to the approximate “state of knowledge” that leads you to choose the prior. For example, suppose you only know that the value is “somewhere between 3 and 5” but you choose normal(4,0.2), and it happens to contain the true value in the typical set. It’s “legitimate” in that it gets the value correct, but it’s not legitimate as an expression of your state of knowledge, which would be better served by something like normal(4,1) or normal(4,2).
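
A rough numerical sketch of the point above, using Python's standard-library `statistics.NormalDist`. Two things here are assumptions for illustration and not from the comment: normal(m, s) is read as mean m and standard deviation s, and a central 95% interval stands in for the "typical set".

```python
from statistics import NormalDist

def central_interval(mean, sd, coverage=0.95):
    """Central interval holding `coverage` of a normal(mean, sd) prior."""
    tail = (1.0 - coverage) / 2.0
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

# Candidate priors from the comment, read as normal(mean, sd) (an assumption).
intervals = {sd: central_interval(4.0, sd) for sd in (0.2, 1.0, 2.0)}

for sd, (lo, hi) in intervals.items():
    print(f"normal(4, {sd}): central 95% interval = ({lo:.2f}, {hi:.2f})")
```

Under these assumptions the normal(4,0.2) interval sits well inside (3, 5), so it treats values such as 3.2 or 4.8 as very unlikely even though the stated knowledge does not rule them out. The two wider priors cover the whole range.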
      
      Why is the prior less of an issue? If you’re honest about your degree of knowledge, it’s usually not that hard to make the “typical set” big enough to contain the “proper” value. In many cases this is a consequence of how tiny we are in the universe. There’s always some energy, length, monetary, time, or other bound which can be appealed to.
      
      You might argue that models for reported constructed scales on surveys are problematic, and I’d sort of agree with that, but even there you typically have a finite range of the construct, so “how much pain are you in” can’t legitimately be answered 1.214×10^36
      
      Why is the likelihood more of a problem? It’s no good if you specify a “legitimate” but “very blunt” likelihood. Like for example, if I plan to calculate the mass of some tumors as an outcome of a drug treatment, and I specify the likelihood as normal(mu,10^36) in kg to get inference on mu, it won’t work, because without a reasonably sharp likelihood you can’t really differentiate between different measurements across the whole scale of the possible.
      
      So although that likelihood expresses probabilistic truth (when mu is correct, the measurements will all be in the typical set of the massively wide normal) it doesn’t extract very much information.
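
The "blunt likelihood" point can be sketched with the textbook conjugate normal update for a mean with known measurement standard deviation. Everything numeric below except the 10^36 scale is hypothetical and chosen only for illustration: the prior normal(0.1, 0.1) in kg, the five tumor masses, and the "sharp" measurement sd of 0.005 kg.

```python
import math

def posterior_for_mu(prior_mean, prior_sd, data, noise_sd):
    """Conjugate normal update for mu when the measurement sd is known."""
    n = len(data)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / noise_sd ** 2
    post_precision = prior_precision + data_precision
    ybar = sum(data) / n
    post_mean = (prior_precision * prior_mean + data_precision * ybar) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)

# Hypothetical tumor masses in kg and a hypothetical prior on mu.
masses_kg = [0.021, 0.019, 0.020, 0.022, 0.018]
prior_mean, prior_sd = 0.1, 0.1

# Blunt likelihood from the comment: normal(mu, 10^36).
blunt = posterior_for_mu(prior_mean, prior_sd, masses_kg, noise_sd=1e36)
# Sharp likelihood (assumed measurement sd of 5 grams).
sharp = posterior_for_mu(prior_mean, prior_sd, masses_kg, noise_sd=0.005)

print("blunt posterior (mean, sd):", blunt)
print("sharp posterior (mean, sd):", sharp)
```

With the blunt likelihood the posterior is numerically indistinguishable from the prior, so the data contribute essentially nothing. With the sharp one the posterior mean moves close to the sample mean and the posterior sd shrinks to a few grams.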
    - Christian Hennig on January 13, 2016 at 8:20 am said:
      
      Could it be that different priors that roughly formalise more or less the same prior information may lead to quite different posterior assessments of uncertainty, even though all of these may lead to posteriors with the true value still in the high density region unless the data is hugely outlandish?
      
      That’s probably a question to do a bit of research on because the answer may depend on the details of the situation and the assumptions one is willing to make about priors formalising roughly the same information. Do you know whether such work already exists?
      This thing (currently discussed on Mayo’s blog) seems to be related and is not very optimistic:
      http://epubs.siam.org/doi/10.1137/130938633
    - Andrew on January 13, 2016 at 8:26 am said:
      
      Christian:
      
      I followed that link, and the paper you mention seems to have the usual problem of straining at the gnat of the prior while swallowing the camel that is the likelihood.
    - Keith O'Rourke on January 13, 2016 at 8:41 am said:
      
      Andrew: I thought this was a likelihood problem that could be finessed by coarsening it – the authors and others _confirmed_ this.
      
      So not swallowing a precise (mistakenly taken as so precise?) camel that is the likelihood seems to avoid it.
      
      Am I missing something else here?
    - Andrew on January 13, 2016 at 8:59 am said:
      
      Keith:
      
      Here’s the abstract of that paper:
      
      With the advent of high-performance computing, Bayesian methods are becoming increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods can impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is a pressing question to which there currently exist positive and negative answers. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they could be generically brittle when applied to continuous systems (and their discretizations) with finite information on the data-generating distribution. If closeness is defined in terms of the total variation (TV) metric or the matching of a finite system of generalized moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusion. The mechanism causing brittleness/robustness suggests that learning and robustness are antagonistic requirements, which raises the possibility of a missing stability condition when using Bayesian inference in a continuous world under finite information.
      
      Tell me a method that “can impact the making of sometimes critical decisions in increasingly complicated contexts” that does not depend on assumptions. If only Daryl Bem hadn’t used those silly Bayesian methods, he wouldn’t’ve . . . ummmm . . .
      
      I’ve just seen enough of these anti-Bayesian arguments over the past few decades. To me they read like what Tyler Cowen would call “mood affiliation.” These researchers want to be rigorous, there’s some idea that Bayesian methods are sloppy, so aha there’s some theorem that says so. It all seems to me to be meaningless. I mean, sure, it’s good to understand the mapping from assumptions and data to conclusions. But it seems naive to think that there’s something particularly “brittle” about Bayes, as compared to some other approach that keeps its assumptions hidden.
    - Christian Hennig on January 13, 2016 at 9:23 am said:
      
      Andrew: I didn’t post this to say that I agree with the conclusions or with what’s written in the abstract. I posted it because it is research of the kind “what happens to the conclusions if the prior is changed in such a way that one could argue that it still encodes pretty much the same prior information”. One can challenge any interpretation of this (one issue is the word “same” in “same prior information”; one may easily doubt this).
      I’d be very keen on being pointed to other literature that addresses this question, which I think is an important one for Bayesians; it doesn’t have to be interpreted in an anti-Bayesian way.
      Frequentism has lots of robustness issues, too, but the people who investigated them are not regarded as anti-frequentists.
    - Laplace on January 13, 2016 at 10:59 am said:
      
      “Frequentism has lots of robustness issues, too, but the people who investigated them are not regarded as anti-frequentists.”
      
      Mayo, who can’t possibly understand a single mathematical detail of that paper, posted it precisely because many were taking it as a devastating critique of Bayes.
      
      So there are two hypotheses here: (1) Bayes is seriously flawed, or (2) some anti-Bayesians are interpreting a mathematical result to mean something very different than it really does.
      
      Andrew seems to be in camp (2) for his own reasons, probably because he’s never seen the supposed “brittleness” in real life. I recall when that paper came out there was considerable push-back regarding the paper’s criteria for “closeness” as being pretty bogus.
    - Laplace on January 13, 2016 at 11:56 am said:
      
      This paper reminds me of the Marginalization Paradox from the 70’s, which was also widely proclaimed as a big flaw (and maybe a fatal one) of Bayes.
      
      At least one Bayesian took the time to explain it all. He was widely dismissed and is still being ridiculed over “denying” the marginalization paradox today. Yet 4 decades later no Bayesian analysis has ever blown up because of the supposedly inherent “flaw” brought out by the marginalization paradox. In the meantime, it’s difficult to find a frequentist paper that hasn’t fallen apart.
      
      All Bayesians got out of that previous episode was some wasted time. So how much time do you think Bayesians should waste on this latest paper?
    - Keith O'Rourke on January 13, 2016 at 9:15 am said:
      
      > these anti-Bayesian arguments
      It was my perception of that aspect that motivated me to post an immediate way to avoid the brittleness and remove any _bite_.
      
      But I also think there are good reasons to coarsen rather than just avoiding the brittleness/anti-Bayesian kerfuffle.
      
      E.g. see – https://www.researchgate.net/publication/278969335_Robust_Bayesian_inference_via_coarsening
  - Keith O'Rourke on January 5, 2016 at 9:11 am said:
    
    > you don’t have to worry about other analyses the researcher might have done, you just have to worry about other models of the world that are just as plausible
    
    Moving from what someone might have done to what things they should have thought of as plausible?
    – Interesting.
    
John Mashey on January 4, 2016 at 11:28 pm said:

The “not too many big effects” rule has sometimes been disobeyed by Silicon Valley venture capitalists, who (in aggregate) might invest in 20+ companies in some market segment (PCs, disk drives, minisupercomputers, networking, DotComs, etc.), each with biz plans to show they’d get 30% of the market in their segment.

- Fernando on January 4, 2016 at 11:56 pm said:
  
  There is nothing illogical — unless you take the business plan projections to be certain.
  
  Rather the opposite. If 100 firms are counting on capturing 100% of the same market segment then, either they are wildly optimistic, or the market is winner take all (e.g. social networks like Facebook). You probably want to invest in some of these firms.
  
  re “not too many big effects” I think this makes sense from Occam’s razor. In practice, the discussion here reminds me of Radford Neal’s work on priors over infinite neural networks.
  
Melanie on January 5, 2016 at 6:46 am said:

My newest favourite quote: noise mining. Well said. Type I error reigns!

fpqc on January 5, 2016 at 11:51 am said:

> OK, embodied cognition is a joke.

As a math major with a lot of interest in embodied cognition, can someone please point me to some criticisms? Thanks.

jrc on January 5, 2016 at 6:42 pm said:

“I think there’s a theorem in there for someone who’d like to do some digging.”

So suppose that some experimental treatment varies in effect based on two characteristics of people, X1 and X2, by magnitude Beta1 and Beta2.

And suppose that we can observe individual outcomes Yt and Yc, giving us distributions of outcomes Dt and Dc (for (t) treatment and (c) control respectively), as well as the joint distribution of X1 and X2 (well – an estimate of the distribution from the sample).

My intuition is that we should be able to bound the sizes of the effect modifying coefficients (Betas) from just this information. For example, suppose both Betas are big and positive and X1 is orthogonal to X2: some person would have high X1 and high X2 and then, given the high Betas, that would put them outside Dt (even assuming that in the control group they would have been on the far opposite end of Dc – hence a really conservative bound).

I feel like my nomenclature sucks here (for instance it is confusing since X1/2 don’t affect Y at all here, they just change the effect of the experimental treatment), and also like this has been solved forever ago in a just slightly different context. But I also think that someone mathy could probably work this out pretty quick (or, like, post a link below to show me that this was done in 1972 in a much smarter way).

- Daniel Lakeland on January 5, 2016 at 7:00 pm said:
  
  You may be thinking of “concentration of measure” type results? If you have a bunch of different factors that can contribute to something, they either kind of collapse down to a lower dimensional state (i.e., factor A and factor B always come approximately together) or, if they have sufficient independence, they can’t really span the whole range of possibilities; it becomes far too implausible in high dimensions.
  
  https://en.wikipedia.org/wiki/Concentration_of_measure
  
  informal wikipedia statement:
  ‘Informally, it states that “A random variable that depends in a Lipschitz way on many independent variables (but not too much on any of them) is essentially constant”. [1]’
  
  For example, you draw 20 unit Gaussian random numbers. If you calculate the radius of the vector, sqrt(sum(X_i^2)), then this number depends on 20 independent random variables, and it is basically always a constant sqrt(20) = 4.472, within epsilon of about 0.7.
  
  Or you may just be referring to the fact that if Y = F(X) and the distribution of Y is known and the distribution of X is known, but the connection between the two isn’t known, then you can get some kind of maximum and minimum bounds on the range of F over the region in which X has high probability, even if you don’t know F.
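
The 20-Gaussian radius example above can be checked with a small simulation. This is only a sketch using the standard library; the number of draws and the seed are arbitrary choices, not from the comment.

```python
import math
import random
import statistics

def radius_samples(dim=20, draws=10_000, seed=0):
    """Radii sqrt(sum(X_i^2)) of `draws` vectors of `dim` unit Gaussians."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(draws)
    ]

radii = radius_samples()
mean_radius = statistics.fmean(radii)
sd_radius = statistics.stdev(radii)

print(f"sqrt(20)       = {math.sqrt(20):.3f}")
print(f"mean of radius = {mean_radius:.3f}")
print(f"sd of radius   = {sd_radius:.3f}")
```

The simulated radii cluster near sqrt(20) (the mean should land slightly below it, since the square root is concave), with a spread of roughly 0.7, in line with the "within epsilon" figure quoted in the comment.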
  

Statistical Modeling, Causal Inference, and Social Science


96 thoughts on “Plausibility vs. probability, prior distributions, and the garden of forking paths”
