But shouldn’t it also be informative to run the same model with different priors based on previous findings? That is, using only the effects (summary statistics, not posterior draws) from similar studies, and just seeing how much the results change, as a check on robustness/reliability.

]]>To be fair, I see that narrowing the prior can be justified from a purely probabilistic point of view. If you have the “correct” prior for the “clean” case, for example the effect of true beauty on sex ratio is effectively sampled from a N(0,0.002) distribution, then knowing that there is a certain level of attenuation you can easily derive the prior for the effect of measured beauty on sex ratio. At least that holds if the “measured beauty” is only partially correlated with the “true beauty” and is not correlated at all with any other factors that could affect the sex ratio. If it is partially measuring beauty and partially measuring something else, the net effect is not trivial to determine. If the “noise” is completely random, you will have in the extreme case (measured beauty uncorrelated with true beauty) a prior concentrated at zero.

In summary, it’s not impossible that you chose your prior by assuming first a precise prior for the effect of true beauty and then a precise amount of classification error. I guess I cannot accuse you of over-precision, given that you said that’s a “fully informative prior”.

]]>Sure, if there is an effect it will be smaller. The attenuation will result in weaker data and the likelihood will move towards zero. Even if you don’t change the prior, the posterior will change as expected. I guess that if you had a prior centered at some value other than zero it would make sense to move the prior accordingly (to reflect the attenuation in the expected effect). I’m not so sure about changing the variance of the prior.

> In answer to your second point: No, I don’t know there’s no difference.

Ok, let me rephrase it. You know that the difference is small (much lower than 1%) and even the most extreme outcome wouldn’t provide enough evidence to suggest otherwise.

]]>You write:

> Why would the prior depend on the noisiness of the measure of attractiveness? Say I have a prior for some experimental setting. If I had a similar setting with more noise I think I would still use the same prior for the parameter of interest (but maybe there would be a nuisance parameter related to the noise).
>
> I also find that prior very strong. If the beautiful parents had *only girls*, you would estimate the population difference to be just 0.1%. Maybe that’s your point, that the whole study makes no sense because you know that there is no difference and even in the most extreme outcome you wouldn’t really change your mind?

In answer to your first point: noise in x will attenuate the correlation between x and y. Suppose, for example, that there’s some precisely measured “beauty” variable x for which the more beautiful parents are 0.1% more likely to have girls. Now suppose you don’t observe x, instead you observe z, a noisy measure of x, and then you compare the proportion of girls among parents who have high and low values of z. This difference will then be less than 0.1%. It’s called attenuation in econometrics and it’s easy to show analytically or by simulation.
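The "by simulation" part can be sketched in a few lines. This is only an illustration under assumptions of my own: a continuous outcome with a slope of 0.5 stands in for the (much smaller) difference in Pr(girl), the measurement noise has the same variance as x, and the sample size is picked just to make the pattern visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                        # hypothetical sample size

x = rng.normal(size=n)             # precisely measured predictor
y = 0.5 * x + rng.normal(size=n)   # outcome; true slope 0.5 is an arbitrary illustrative value
z = x + rng.normal(size=n)         # noisy measure of x, noise variance equal to var(x)

def slope(pred, out):
    """Least-squares slope of out regressed on pred."""
    return np.cov(pred, out)[0, 1] / np.var(pred, ddof=1)

slope_x = slope(x, y)   # close to the true slope, 0.5
slope_z = slope(z, y)   # close to 0.5 * var(x) / (var(x) + var(noise)) = 0.25
```

With equal signal and noise variance the slope on z is roughly half the slope on x, which is the attenuation described above.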

In answer to your second point: No, I don’t know there’s no difference. There *is* a difference, it’s not zero. Older mothers and younger mothers have (small) differences in Pr(girl), white mothers and black mothers have differences in Pr(girl), etc. Take any two groups and you’ll get different probabilities. But, given all the empirical research on sex ratios (and there’s a lot, because N is huge and the data are just out there for free in birth records), we know that these differences are small. Not zero. Small.

https://helix.northwestern.edu/blog/2010/11/cell-biology-animation-and-reality

links to actual atomic force microscopy of the Myosin V molecule, which does have little feet that bind and unbind at a regular interval…

]]>https://youtu.be/yKW4F0Nu-UY?t=3m40s

You could suppose for illustration purposes that, say, the “feet” of this protein could absorb microwaves selectively because they “walk” at some 1000 MHz or whatever (or the microwave energy is a 1st, 2nd, or 3rd harmonic of whatever they do). If you add microwave energy, perhaps they vibrate back and forth rather than moving forward, hence a certain thing doesn’t get transported to its appropriate place as quickly, and so some chemical reaction does or does not occur fast enough to prevent some naturally occurring damage. This is more of a heuristic than anything else; obviously I have no particular candidate process in mind, just the idea that the intricate mechanical processes that large bio-molecules undergo could be selectively disrupted due to resonance at microwave frequencies. The more I learn about biology the more impressed I am at how complex it is, but also robust.

]]>This will only work for identifiable parameters though, i.e. those that just need enough data to estimate.

]]>Nevertheless, I agree with you about your skepticism that economics will begin to do this. I just don’t think this is because doing it is hard, or wrong, or anything like that, it’s because of politics etc.

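]]>data {
  real<lower=0> some_scale; // width of the Gaussian smoothing, supplied by the user
}
parameters {
  real<lower=-1000, upper=1000> p0; // the bounds give p0 an implicit uniform(-1000, 1000) prior
  real dp;
}
transformed parameters {
  real p;
  p = p0 + dp; // a convolution of a uniform with a normal
}
model {
  dp ~ normal(0, some_scale);
}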

thereby giving you a nice flat plateau in -1000,1000 but convolved with some gaussian to give an infinitely smooth prior over the whole real line.
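To see what that prior looks like, here is a rough sketch of the implied density of p, assuming p0 is uniform on (-1000, 1000) and using a made-up smoothing scale of 50. The closed form is the difference of two normal CDFs divided by the plateau width.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def smoothed_uniform_density(p, half_width=1000.0, scale=50.0):
    """Density of p0 + dp, with p0 ~ uniform(-half_width, half_width)
    and dp ~ normal(0, scale). The scale of 50 is a hypothetical choice."""
    upper = normal_cdf((p + half_width) / scale)
    lower = normal_cdf((p - half_width) / scale)
    return (upper - lower) / (2.0 * half_width)

at_centre = smoothed_uniform_density(0.0)      # plateau height, 1/2000
at_edge = smoothed_uniform_density(1000.0)     # half the plateau height
far_out = smoothed_uniform_density(1500.0)     # essentially zero, but the density is smooth everywhere
```

The density sits at the plateau height in the middle, drops to half that at the nominal boundary, and falls off smoothly rather than hitting a hard wall.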

]]>Taking their abstract at face value (possibly not a good idea, but a starting point since I don’t have access to the full text), they suggest that resonant absorption of microwaves can certainly affect proteins selectively. It’s at least plausible. Yet I fully agree with you on the basic point you’re making, that policy is being made by people with strong but uninformed priors.

When it comes to power lines at 60 Hz I think the results are completely different: objects as small as proteins are likely to see 60 Hz as essentially DC, and resonant absorption should be up in the range of microwave ovens, certainly above 500 MHz.

]]>Regarding the effect of EM on cells: the problem is that, not only is the radiation non-ionizing, it’s not even comparable to thermal energy. So any effect involving some activation barrier being surmounted by the radiation would already be blown past by ambient thermal noise. Robert Adair (Yale physicist) treated this issue at great length in the early ’90s (back when there were scares about power lines), albeit focusing on lower frequencies where the issue is even more clear cut. (My physics chops are a little rusty, and I don’t have a strong intuition about the resonance idea, except that it seems unlikely at those energies. Cell phone frequencies are below the blackbody peak at room temperature and I’m pretty sure there are a gazillion energy levels accessible to pretty much any large molecule in those ranges, particularly in a liquid environment.)

But at any rate, this is just to emphasize my mechanistic prior, which is evidently different from that of the Chronicle’s health writer and the Berkeley city council, who seem ready to use the uninformed (ha!) prior that every modern technology is carcinogenic unless proven otherwise, and also the studies showing otherwise should be ignored (because they disagree with said prior too strongly).

]]>But the point is that these differences are deeply rooted in philosophical differences in how people believe the economy works. There is no consensus. We could establish several priors corresponding to different schools of thought and then examine the same evidence in each case. That would be instructive and I would support that. But I don’t think you will see that any time soon – it makes these schools of thought less “scientific” and more “subjective.” If you want to claim these beliefs are wrong, I’m in agreement with you. But I think it is part of the fundamental reason why economists, at least, would resist the advice in Andrew’s post (of course, I could be wrong, since I can’t really speak for most economists).

]]>The product of the different likelihood distributions is centered at approximately a weighted average of the individual estimates, and if the log-likelihoods are quadratic it is exactly the inverse-variance weighted average.

Something more thoughtful is advisable, and if that can’t be discerned, flatten the multiplied-together likelihood to reflect more uncertainty, e.g. raise it to some power less than one (called something like fractional likelihood).
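For the quadratic case both points can be sketched directly. This assumes every study reports a normal likelihood summarised by an estimate and a standard error. The function name and the numbers are hypothetical.

```python
def combine_normal_likelihoods(estimates, std_errors, power=1.0):
    """Multiply normal (quadratic log-) likelihoods for a common mean.

    Raising the product to `power` < 1 flattens it: the centre stays put,
    while the variance is inflated by a factor of 1 / power.
    """
    precisions = [power / se ** 2 for se in std_errors]
    total = sum(precisions)
    mean = sum(w * m for w, m in zip(precisions, estimates)) / total
    sd = total ** -0.5
    return mean, sd

# Two made-up studies with equal standard errors.
full_mean, full_sd = combine_normal_likelihoods([1.0, 3.0], [1.0, 1.0])
flat_mean, flat_sd = combine_normal_likelihoods([1.0, 3.0], [1.0, 1.0], power=0.5)
```

With a power of 0.5 the combined estimate is still 2.0, but its standard deviation goes from about 0.71 to 1.0.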

There also will be a related post later this afternoon.

]]>If you did get some kind of Bayesian posterior downstream, there’s the problem of how to compute with it if it’s not conjugate. That’s one of the reasons working directly with other data is easier.

]]>1. The probability the parameter is in (-1000, 1000) is only 0.1%

2. The probability the parameter is outside of (-1000, 1000) is 99.9%.

That’s probably not the information you want to provide to your Bayesian model if you don’t expect the parameter to have values outside of (-1000, 1000). I keep meaning to write a case study that shows how this works (along with the truncation you get that Daniel Lakeland describes above if you err on the other side and make the boundaries too tight). Andrew’s already written papers showing how the original diffuse inverse gamma priors suggested in the original BUGS examples led to overinflated variance estimates.
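The arithmetic behind those two numbers is short. The sketch below assumes a uniform prior on (-10^6, 10^6), which is my guess at bounds consistent with the 0.1% figure rather than something stated here.

```python
def uniform_interval_probability(lo, hi, prior_lo, prior_hi):
    """Probability that a uniform(prior_lo, prior_hi) variable falls in (lo, hi)."""
    overlap = max(0.0, min(hi, prior_hi) - max(lo, prior_lo))
    return overlap / (prior_hi - prior_lo)

# Assumed prior bounds of +/- 1e6; the interval of interest is (-1000, 1000).
inside = uniform_interval_probability(-1000.0, 1000.0, -1e6, 1e6)
outside = 1.0 - inside
```

Widening the bounds further only pushes more prior mass away from the region where the parameter is actually expected to be.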

]]>For context, I mean if all the 600 kids from beautiful parents in the study were girls.

]]>If the point estimate obtained from a flat prior is ok but the posterior distribution is too wide maybe the problem is with the likelihood function and not with the flat prior. In any case, I don’t think the problem is that the prior specifies a 99.9% chance an effect size has absolute value greater than 10^305.

Andrew,

I agree and I think I said something similar myself (“What matters is what may be the effect on the inference when this prior is used in the context of the model once we include the data.”). Regarding the paper you link to:

“For a fully informative prior for δ, we might choose normal with mean 0 because we see no prior reason to expect the population difference to be positive or negative and standard deviation 0.001 because we expect any differences in the population to be small, given the general stability of sex ratios and the noisiness of the measure of attractiveness.”

Why would the prior depend on the noisiness of the measure of attractiveness? Say I have a prior for some experimental setting. If I had a similar setting with more noise I think I would still use the same prior for the parameter of interest (but maybe there would be a nuisance parameter related to the noise).

I also find that prior very strong. If the beautiful parents had *only girls*, you would estimate the population difference to be just 0.1%. Maybe that’s your point, that the whole study makes no sense because you know that there is no difference and even in the most extreme outcome you wouldn’t really change your mind?

]]>.

Just a simple uniform prior on an interval. That prior says it’s very unlikely that the value of the parameter is small, because

.

]]>Also, hi from an ex colleague, assuming there aren’t too many Oren Cheyettes in the SF Bay area.

]]>Both studies continue to get media attention – out here in the Bay Area, we were just treated to an alarmist story by the SF Chronicle’s health writer on the risk of smart watches, quoting heavily from two go-to figures in the “cell phones will give us all cancer” community and mentioning the NIH/NTP report. Particularly at the local level, a lot of questionable policy gets made based on these sorts of reports – e.g., Berkeley on cell phone warnings and Petaluma on herbicides used by public maintenance staff.

]]>The prior probability can be based on a series of data sets, each assumed to share the same ‘true’ mean but each with its own likelihood distribution. The likelihood densities of the different likelihood distributions can be multiplied together to form a joint likelihood distribution, which is then ‘normalised’ so that all the posterior probabilities sum to 1 (normalisation always assumes that the ‘baseline’ prior is uniform or flat for random sampling, which is correct; see my blog: https://blog.oup.com/2017/06/suspected-fake-results-in-science/). The resulting posterior probability becomes the prior probability distribution for the new study. This is multiplied by the likelihood distribution of the new study data and normalised again to give the latest updated posterior probability distribution (to be discussed in the ‘discussion’ section of the paper).
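The multiply-and-normalise procedure can be sketched on a grid. Everything numeric here is invented for illustration: three earlier studies and one new study, each summarised by an estimate and a standard error with a normal likelihood, and a grid of candidate values for the shared mean.

```python
import numpy as np

grid = np.linspace(-5.0, 5.0, 2001)   # candidate values for the shared 'true' mean

def normal_likelihood(estimate, std_error):
    """Unnormalised normal likelihood of each grid value given one study."""
    return np.exp(-0.5 * ((grid - estimate) / std_error) ** 2)

def normalise(weights):
    """Rescale so the probabilities over the grid sum to 1."""
    return weights / weights.sum()

# Hypothetical earlier studies as (estimate, standard error) pairs.
previous_studies = [(0.8, 0.5), (0.2, 0.4), (0.5, 0.6)]

prior = normalise(np.ones_like(grid))            # flat baseline
for estimate, std_error in previous_studies:
    prior = normalise(prior * normal_likelihood(estimate, std_error))

new_study = (0.6, 0.3)                           # hypothetical new data
posterior = normalise(prior * normal_likelihood(*new_study))
posterior_mean = float((grid * posterior).sum())
```

Because these likelihoods are all normal, the posterior mean should agree with the inverse-variance weighted average of the four estimates. With non-normal likelihoods the same grid procedure still applies.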

]]>Talking about resistance, I spent the morning trying to figure out how to convince an action editor that a bunch of low-powered big effects is not as convincing as a small effect from a large-sample study. First I have to demonstrate how Type M error arises… the news has apparently not reached psychology.

]]>Following on some thoughts on priors for economic “multiplier effects” but we’d run out of reply room above.

Let’s let t be defined in years, and the “one year future total consumption per capita” function be

C1(t) = integrate(C(t+s)ds,s,0,1)

Where C(t) is the sum of all transactions that occur on a given day divided by the population N divided by 1/365 to put C(t) in units of dollars per person per year. C(t) is a piecewise constant function over each day.

Now, I take the 1 year multiplier effect to be

C1(t) if we have the government spend G dollars per capita (call this C1_G(t)), where G dollars is any number between 0.001 times GDP/capita and 0.01 times GDP/capita (we assume an intermediate asymptotic stability of the effect for these moderately small spending levels)

minus

C1(t) if we don’t spend the G dollars per capita

divided by G

M = (C1_G(t) - C1(t))/G
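As a sanity check on the definition, here is a minimal sketch that computes C1 and M from daily series. The function names are mine, the two series are invented (a flat baseline of 57000 $/person/year and a counterfactual that runs 150 higher), and G = 100 is just a value inside the stated range.

```python
def one_year_consumption(daily_rate, start_day=0):
    """C1(t): integral of C(t+s) ds over one year, with C piecewise constant per day.

    daily_rate[d] is C on day d, already in dollars per person per year,
    so each day contributes rate * (1/365) to the integral.
    """
    window = daily_rate[start_day:start_day + 365]
    if len(window) < 365:
        raise ValueError("need a full year of daily values after start_day")
    return sum(window) / 365.0

def multiplier(rate_with_g, rate_without_g, g, start_day=0):
    """M = (C1_G(t) - C1(t)) / G."""
    with_g = one_year_consumption(rate_with_g, start_day)
    without_g = one_year_consumption(rate_without_g, start_day)
    return (with_g - without_g) / g

baseline = [57000.0] * 365      # hypothetical: no government spending
stimulated = [57150.0] * 365    # hypothetical: consumption rate raised by 150 all year
m = multiplier(stimulated, baseline, g=100.0)
```

In this toy case an extra 150 dollars of consumption per person for 100 dollars of spending gives M = 1.5. The hard part is of course the counterfactual series, which is never observed.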

Now clearly, this quantity depends on our choice of 1 year as the time period of interest, but we might expect that we’d get a similar effect for a range of window lengths from say 1/2 year to 2 years and so it’s *not extremely sensitive* to the window length. This is partly due to the fact that we average over 320 million people, and that we integrate our function over a full year or so, thereby smoothing out short term fluctuations quite a bit.

Next we note that logically we can in fact get quite large negative values, as I say if everyone in the country goes on strike because the Nazi party comes into power and whatnot… then C1_G(t) could go to zero, while C1(t) the counterfactual would have been something like 57000 $/person but… it’s extremely unlikely

In fact, for the most part, we’d expect this number to be something like 1 as the increase in GDP caused by spending G dollars per person would be something like G dollars per person, divided by G we’d get 1. So probably the peak of the prior density should be 1.

Furthermore it also seems like we could easily get 0, where each dollar spent by the govt causes someone to withhold a dollar of spending. This would be the case where we’re pretty much just doing a straight transfer from one group of people to another…. So the prior should be wide enough that 0 has density that is not so much lower than the density at 1. Finally, it’s reasonable that you might activate a lot of activity by your government spending, if it’s targeted properly (maybe you stimulate the economy of a depressed region, where lots of labor is available but little free cash for example). So you should be considering quantities out into the range of 2 or 3.

With all this in mind… an initial prior seems like normal(1.0,2.0) would be a good place to start, including values well into the negative range, and well above 1.0 but giving 1.0 the peak.
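It’s easy to check whether normal(1.0, 2.0) says what the verbal description wants. The sketch below reads the second argument as a standard deviation, as Stan does, which is an assumption about the intended parameterisation.

```python
import math

PRIOR_MEAN = 1.0
PRIOR_SD = 2.0   # assumed to be a standard deviation, as in Stan's normal()

def prior_cdf(m):
    """P(M <= m) under the normal(1.0, 2.0) prior."""
    z = (m - PRIOR_MEAN) / PRIOR_SD
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def density_relative_to_peak(m):
    """Prior density at m divided by the density at the peak (M = 1)."""
    z = (m - PRIOR_MEAN) / PRIOR_SD
    return math.exp(-0.5 * z * z)

ratio_at_zero = density_relative_to_peak(0.0)   # exp(-1/8), roughly 0.88
prob_negative = prior_cdf(0.0)                  # roughly 0.31
prob_above_three = 1.0 - prior_cdf(3.0)         # roughly 0.16
```

So the density at 0 is not much below the peak, as intended, and there is real mass both above 3 and below 0. Whether 31% on negative multipliers is too generous is a judgment call about the prior, not something the arithmetic settles.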

]]>However, once you create enough missing data you cannot estimate the models anymore, because some of the statistics are no longer observed often enough to estimate the parameters (e.g. I had posteriors from -100 to +150). Luckily I came across a YouTube video of one of Andrew’s presentations about weakly informative priors, where he discussed a similar issue in which a parameter could not be estimated because there was (nearly?) no data for it. Now, using these priors, Normal(0,4), the models converge nicely with 50% of the data missing (which in networks means that for many statistics you have 75% of the data missing). My point is that my main reason to choose this prior is pragmatic: you cannot run the model with a flat prior. I therefore wonder how much of this discussion applies directly to my choice of prior?

*I am not a native English speaker (as you might have guessed), but is there a difference between -icians and -ists? It seems to me the -icians (statist-, econometr-, psychometr-, mathemat-,…) have a better understanding of what they are doing compared to the -ists (psycholog-, sociolog-, biolog-,…).

]]>Actually, I phrased that last sentence poorly. I hear “let the data speak for itself” a lot, and like you I disagree with it, in two ways:

In a Bayesian/Frequentist context I prefer Bayesian which says that we need to make prior knowledge (common wisdom, our assumptions, etc) explicitly part of the model and then let the data push things around, speaking more loudly or more softly depending on how much and how strong it is.

In a general Data Science context, the methods and models we use will find a signal, if that’s possible. But the signal may not be what we hope it is. It could be a “leak from the future” in the data, which is very common. It could be a bot “clicking” on links rather than a potential customer. Heck, almost every engagement I go into doesn’t have a data dictionary and that data doesn’t speak for itself. (In fact, when I make the mistake of thinking I hear it talking based on the name of a field, I’m often deceived because that name doesn’t mean what I think it means.) So the data doesn’t actually speak for itself in this context either.

Only in the narrow sense of “don’t necessarily believe what ‘experts’ say about the data” does “let the data speak for itself” make sense to me.

]]>20 minute podcast here http://www.cbc.ca/radio/thecurrent/the-current-for-march-30-2017-1.4045972/march-30-2017-full-episode-transcript-1.4048646#segment2

]]>Suppose we take C(t) to be the total consumption by all members of the US at time t, a continuous function of time. Well, of course we know, like in the stock market, that consumption is not continuous. When I buy a sandwich a few dollars is transferred all at once. This is not the same thing as saying that all day long I spent a few pennies each hour…

You might think this is pedantic, but it seems to me the “multiplier” effect is some kind of derivative, how much total consumption changes when some particular amount of consumption by a certain party occurs. d something / d something

But the derivative is an unbounded operator, and it doesn’t even exist for a discrete series of transactions… and so we can really only discuss this in terms of taking the real series of discrete transactions, smoothing them in some way, and then defining our derivative of this smoothed thing… Fine, but then the result we get is dependent on the way in which we do the smoothing… Is there a way to define all of this in such a way that the result is largely independent of our choice of smoothing method for a wide range of smoothing methods? If so, we’re in the same situation as we get when trying to represent a steel bar using continuum mechanics, sure it’s atoms, but if we smooth the atoms by a smoothing kernel of width greater than 100 atomic distances and less than 1mm which is quite a few orders of magnitude… the results are nearly the same.

It’s less obvious to me how this would work for consumption. First off, consumption clearly has a very strong daily oscillation. I buy very little at midnight, and quite a bit more at noon. So any smoothing we do must be over a timescale large with respect to a day. But there are also clearly seasonal effects in consumption: Christmas is big for retail, summer is big for travel… so smoothing seems to need to be large with respect to a year! But over decades technology and policy and things all change a lot. So I don’t think we’re ever in any regime where a smoothing-based view of what’s going on really applies very well.

Now of course we’re interested in a causal effect: spending G government dollars causes some change in something, over some time period, relative to what it would have been if the G event hadn’t occurred…. So it’s not a simple derivative in time, it’s a counterfactual about how much consumption would occur in some time period after the G event compared to what would have happened in the absence of G… But defining this in a way that is insensitive to the choice of time period still seems impossible. You could for example do a truncated Laplace transform (i.e. discount all future consumption out to some window according to some discount rate) but then you’ll wind up with a result that’s very sensitive to the discount rate and the truncation window.

So, if you want to do a particular analysis, and you want to choose a particular way of doing the calculation, then I can give some particulars of the appropriate prior. All this is to back-up the assertion that Andrew made in a recent paper: The choice of prior is intimately connected to the choice of likelihood / data model.

]]>(Actually that was the reason the journal editor gave for rejecting the paper – not enough technical innovation to justify publication in my prestigious journal)

]]>It was the original motivation for the work I did in meta-analysis (to get prior for cost/benefit analysis of funding for clinical trials).

A little bit of thought about this soon suggests you don’t want some weighted average of the (mostly crappy) studies that happened to get published. Or maybe it takes more than a little thought…

]]>As we discuss in this paper, the prior can often only be understood in the context of the likelihood. In particular, a sample average or maximum likelihood estimate can be “quite reasonable” in some contexts but not in others. In a setting where measurements are accurate and plentiful and the goal is an estimate of a simple parameter whose value is not near the boundary of parameter space, then, sure, the flat prior can work. In a setting where measurements are noisy, sample size is not huge, and the goal is something more specific, then maximum likelihood or Bayesian inference with a flat prior can give bad answers: estimates with bad frequency properties, with high bias, high variance, high type M errors, high type S errors, the whole deal.
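A quick simulation shows how type M and type S errors arise in the noisy setting. The numbers are hypothetical: a true effect of 0.1 measured with standard error 1, and a flat-prior estimate that is reported only when it clears the usual 1.96 standard error threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.1     # hypothetical small true effect
std_error = 1.0       # hypothetical standard error of each study's estimate
n_sims = 1_000_000

# Each draw is one study's unbiased estimate (flat prior / maximum likelihood).
estimates = rng.normal(true_effect, std_error, size=n_sims)
significant = np.abs(estimates) > 1.96 * std_error

power = significant.mean()                                      # chance of reaching significance
type_s = (estimates[significant] < 0).mean()                    # wrong sign, given significance
type_m = np.abs(estimates[significant]).mean() / true_effect    # exaggeration ratio
```

Under these assumptions only about 5% of studies reach significance, a large share of those have the wrong sign, and the significant estimates overstate the true effect by more than an order of magnitude.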

]]>This is of course by design for the person who distrusts priors, nevertheless as soon as you want to construct a measure of uncertainty or a risk and utility based decision you have a different story.

The risks associated with point estimation when outcomes and their consequences can vary widely are significant. If a posterior distribution is tightly peaked near your point estimate then things are ok. If there is nontrivial width then that flat prior can be deadly for your decisions, as you wind up considering possibilities well outside what anyone actually thinks might happen, simply because no one wants to be in charge of justifying a prior choice. Wald’s theorem applies whether the user of statistics likes it or not.

]]>You can start here (http://marginalrevolution.com/?s=multiplier). Of course, that is not an authoritative source and it represents the more right wing side of economics – Krugman would have a somewhat different take. But I have no doubt you can generate a prior – or even two or three. And, I believe doing that would be superior to conducting a new study using some data and declaring a confidence interval for the *true* size of the multiplier from that single study. I am not disagreeing with the post or your comments here – I am providing my view for much of the underlying resistance to change and clinging to these frequentist methods. If our estimates for the size of the multiplier shift depending on which prior you choose – and I believe they would – then it exposes the entire enterprise to be a sort of mathematical trick, a way to couch a subjective belief as “scientific.” And, who wants to do that? (only real scientists perhaps). ]]>

A flat prior doesn’t assume that it *is* enormous, it assumes that it *could be* enormous. An informative prior may be better, but an uninformative prior is not obviously stupid. What matters is what may be the effect on the inference when this prior is used in the context of the model once we include the data.

If you say that the flat prior means that you expect the value of interest to be greater than 10^305 you make it look stupid.

If you say that the flat prior means that you will take the mean of the data to estimate the value of interest it looks much less stupid, actually it looks quite reasonable.

Let’s say you measure the height of a sample of people to estimate the average height in the population and you get mean=170cm. Maybe you have reasons to think you should correct it a bit in either direction, but taking the 170cm at face value is not obviously stupid. If you get mean=512km there are issues with your model or experimental setup much worse than the fact that the prior doesn’t rule out that value.
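For the height example the connection between the flat prior and the sample mean can be written out, assuming a normal data model with known standard deviation σ and a normal prior on the population mean μ (the symbols μ₀ and τ are introduced here only for illustration):

```latex
\[
\bar{y} \mid \mu \sim \mathrm{N}\!\left(\mu,\ \sigma^2/n\right),
\qquad
\mu \sim \mathrm{N}\!\left(\mu_0,\ \tau^2\right)
\quad\Longrightarrow\quad
\mathrm{E}[\mu \mid \bar{y}]
  = \frac{\mu_0/\tau^2 + n\,\bar{y}/\sigma^2}{1/\tau^2 + n/\sigma^2}
  \;\xrightarrow{\ \tau \to \infty\ }\; \bar{y}.
\]
```

As the prior gets flatter the posterior mean tends to the sample mean, so with mean=170cm the flat-prior estimate is just 170cm. An informative prior pulls the estimate toward μ₀ by an amount that shrinks as n grows.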

Of course nothing is normal, all models are wrong, etc. Everyone understands that if we say that the height in a population is normally distributed with such and such mean and standard deviation this is just an approximation. The median and the mode might be different from the mean, the shape of the distribution around the mean might be far from normal, and surely there are no negative heights or heights larger than 10^305.

]]>