I got this email from a journalist:

This seems . . . irresponsible to me.

Particularly:

For the first 100 years that meteorologists kept weather records at Central Park, from 1869 through 1996, they recorded just two snowstorms that dumped 20 inches or more. But since 1996, counting this week’s storm, there have been six. (You’ll find similar stats for other major East Coast cities.)

Basically, we’ve become accustomed to something that used to be very rare.

The link points to a post by Eric Holthaus on grist.org, and I agree with the person who sent this to me that the argument is pretty bad, at least as presented.

First, let’s compute the simple statistical comparison. The previous rate was 2 out of 128, and the new rate is 6 out of 22. So:

y <- c(2, 6)
n <- c(128, 22)
p_hat <- y/n
diff <- p_hat[2] - p_hat[1]
se_diff <- sqrt(sum(p_hat*(1-p_hat)/n))

The difference between the two probabilities is 0.26 and the standard error is 0.10. So, sure, it's more than 2 standard errors from zero: good enough for grist.org, PPNAS, NPR, and your friendly neighborhood TED talk.

But not good enough for the rest of us. The researcher degrees of freedom are obvious: the choice of 1996 as a cutpoint, the choice of 20 inches as a cutpoint, and the decision to pick out snowstorms as the outcome of interest. Also there can be changes at various time scales, so it's not quite right to treat each year as an independent data point. In summary, just by chance alone, we'd expect to be able to see lots of apparently statistically significant patterns by sifting through weather data in this way.
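To see how easy it is to manufacture a 2-standard-error comparison this way, here's a quick simulation sketch (Python; the 5%-per-year storm rate and 150-year span are assumptions for illustration, not fitted to the data): generate trendless noise, scan over all cut-years, and keep the most dramatic before/after split.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_z(storms):
    """Largest two-proportion z-score over all before/after cut-years."""
    n = len(storms)
    best = 0.0
    for cut in range(20, n - 20):  # keep both periods reasonably long
        y1, y2 = storms[:cut].sum(), storms[cut:].sum()
        p1, p2 = y1 / cut, y2 / (n - cut)
        se = np.sqrt(p1 * (1 - p1) / cut + p2 * (1 - p2) / (n - cut))
        if se > 0:
            best = max(best, abs(p2 - p1) / se)
    return best

# 150 years of i.i.d. "big storm" indicators: a constant 5% rate, no trend at all
sims = [max_abs_z(rng.random(150) < 0.05) for _ in range(500)]
frac_significant = np.mean([z > 2 for z in sims])
```

In this sketch, well over 5% of pure-noise datasets clear the |z| > 2 bar from the cut-year search alone, and that's before also searching over snowfall thresholds.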

The story here is the usual: To the extent that this evidence is presented in support of a clear theory, it could be meaningful. By itself, it's noise.

By clear theory do you mean a model with point predictions about # of inches? In that case you would be using a different test statistic.

If you had a model that predicted when it’s greater than 20 inches, then you could test it with the pre-1996 data only, or pool.

In what case would .26 be the realization of an appropriate test statistic?

I guess your point is that there is no clear theory for which this would be an appropriate test?

By the way, the author posted his caveat emptor on Twitter yesterday.

Before:

By “clear theory” I mean a theory that relates global warming to snowfall. With such a theory we’d have something much better to look at than “snowstorms that dumped 20 inches or more,” which is a horrible data summary that throws away almost all the available information.

There weren’t any errors in the quoted text. The conclusion

“Basically, we’ve become accustomed to something that used to be very rare.” seems spot on. There was a change and that’s a fact. What caused the change may be up for debate, but the fact is there was a change. Consequently our perception of “normal” can change regardless of what caused the increase. Looking at the article, most of it is purely observational like that. Only one paragraph talks about possible causes and mentions the physical effects of warming oceans. Data isn’t noise, it’s history.

Statisticians may think there’s some probability p of a 20 inch snow storm and imagine history is a random sample from it, but that’s just their lame attempt at bypassing the hard work of understanding physics+meteorology. All this stuff about “researcher degrees of freedom”, p-hacking, garden-of-forked-paths and so on is just a way to save face, because statisticians know by now not to make strong claims based on their fantasies/misunderstandings which can easily be contradicted by real science.

That’s what happens when you’re wrong about fundamentals. You need to have ready at hand a great many all-purpose excuses for the constant failures. That’s why no one from Archimedes to Feynman ever felt the need to mention “researcher degrees of freedom”. They were getting fundamentals right.

“Data isn’t noise, it’s history”

Boy, there is a lot packed into that statement! On one hand, it is quite insightful. It happened. We may not understand why, but it is undeniable that it happened. However, the statement that “Basically, we’ve become accustomed to something that used to be very rare” is a different matter. This is not fact – this calls for several interpretive statements – what does “accustomed” mean and what does “very rare” mean? That is where statistics enters and researcher degrees of freedom, etc. I don’t think Feynman could define “very rare” on the basis of fundamentals.

Laplace:

The issue is that a lot of things happened, and by itself you can’t learn much from a picked-out statistic such as “they recorded just two snowstorms that dumped 20 inches or more.” It’s not Bayesian if you don’t condition on all the data.

First, the only “noise” in this example is measurement error, so unless the error in snowfall measurement is something like +/- 15 inches we do in fact learn that there has been an increase.

Second, they’re not conditioning on anything. They’re observing. The conditioning comes in potentially when looking for causes, which as I noted they only discuss one potential cause (ocean warming) very briefly and don’t draw any conclusions from it.

Third, Bayesians can condition on anything they want, including subsets of what they actually know, or even made up assumptions. Whether it’s useful to do so is a different matter, but we’re free to do the calculation any time we feel like it. The Bayesian police won’t show up and arrest us.

I don’t really like the usual use of the term “noise”. Rather than being a category, “noisiness” should be a continuous measure related to compressibility of the data. I haven’t seen anyone really investigate that, but it seems pretty obvious so is probably out there.

Laplace wrote:

“There was a change and that’s a fact.”

You flip a coin fifty times and you never get more than two heads or tails in a row. Then you flip six straight heads. There was a change and that’s a fact.

Get it?

Didn’t think so.

Matt: The change Laplace is referring to is the subjective experience of the observer. 6 heads in a row is a very real change in that respect, even if there is no change in regime.

Of course people do not base perceptions on >20″ snowfall, and likely do not discriminate between 15″ snowfall and 20″ snowfall.

It’s a measure of just how screwed up statistics is that whenever one person looks at 1,2,3,4,5,6,7 and says “these numbers increase” a statistician has to jump in and say “I don’t think so”.

Laplace:

The data in question are not “1,2,3,4,5,6,7.” The data are 150 years of continuous multivariate observations, out of which one particular scalar binary comparison has been extracted, without a clear motivation.

There is a clear motivation. That kind of increase in large snowfalls, over that time period, is enough for many 60 year olds to perceive even without special record keeping. Note: the article concentrated on that change in perception.

Laplace

Read David Marcus’ comment below. I don’t think you are on solid ground here. Your nice quote about data being history aside, you are starting to sound like a troll. It is not clear to me that anything has changed regarding the size of snowstorms.

If you measure two numbers 10 +/-1 and 20 +/-1, it’s ok to say the second is bigger than the first. Given small enough measurement errors, no formal calculation is needed, it’s just a simple observation. Unless the measurement error in snowfall is ridiculously large the observation in the article stands.

If you wish to do a formal calculation for this, then you’ll get a Bayesian “significance test” of exactly the kind Laplace used 200 years ago to great effect. He seems to have single-handedly done more real science with it than all living statisticians today seem capable of.

Dale,

I apologize to you. I misunderstood your comment. Just looked at David Marcus’ comment and loved it. He actually considered real measurement errors instead of insanely irrelevant statistical fantasies.

Laplace,

1. Numbers applied to observations is a good first step, but certainly not the final step (measurement).

2. 60 years of memory is perhaps not as error free as your comments assume (error).

3. A calculation is math, but is not science without a bit more (measurement + error + statistics).

4. Science requires an attempt to understand causality (statistics + methods).

5. Causal relationships are the focus of science (measurement + error + statistics + theory).

I’m not sure who you’re lecturing. Given measurements 100, 110, 120 we can ask: was there a genuine increase in the real world or not?

The answer depends on the error bars. With +/- 70 uncertainty on each measurement, we don’t know. With +/- .01 uncertainty on each measurement we can safely say the increase actually happened.

Instead of considering the actual measurement errors, Gelman and others want to answer this question using a model which couldn’t be more fictitious or less relevant to the physical situation, and then mix in some blabitity blah about how the possibility of asking different questions throws off the analysis.

It’s that simple.

Also, where does this come from ” 60 years of memory is perhaps not as error free as your comments assume (error).”?

I didn’t assume anything. Gelman claimed the choice of time intervals had no relevance. I pointed out it actually was relevant to the question the article wanted to address, namely the setting or adjusting of expectations based on recent history.

Your assertion is generally incorrect, which is the point of my comment above.

Your assertion is correct IF AND ONLY IF each of those numbers is produced by the exact same process within the exact same time frame with the exact same values on all relevant causal variables controlled for experimentally (or perhaps instrumentally).

Otherwise each of those values as an estimate of a process at different points in time could have wildly different intervals within which a set of true values may be appropriately modeled such that the interval around 120 may include 100.

Curious,

I continue to be astounded by how difficult it is to get statisticians to accept the trivially true.

If two measured numbers differ by far more than their measurement errors, then it’s safe to conclude the true values really were different.

What implications can be drawn from that difference is a another matter, and in this physics/meteorology example would be difficult to determine. BUT THEY REALLY ARE DIFFERENT.

Edit:

Laplace,

You are the one who is making the purely statistical argument based on cardinality of 3 distinct data points.

The rest of us are trying to point out that when it comes to science, difference implies a variation as the result of some process and not simply the rank ordering of outcomes based on the numbers assigned to them.

Below the example of roulette has been used. However, I believe the analogy misses the point slightly in that it would be more similar to trying to detect a rigged roulette wheel with only information about the payouts and without any information about the numbers that came up.

That “researcher degrees of freedom” stuff is a fine critique of the completely unphysical/fantastical blather you based your R calculation on. It is not a critique of anything in the original article.

This is how this plays out: statisticians do their silly analysis (like you did) and get results which are nearly randomly connected to reality. To avoid embarrassment, they use their intuition to spot the really embarrassing conclusions. They then trot out some standard excuse (p-hacking, significance filter, whatever), and tell some cute story about how it invalidates the embarrassing conclusion. This is all pure philosophy so no one can really argue with it. They can always tell this story. The story is conveniently ignored if their intuition likes the conclusion.

I’ll let others comment on the irony that the frequentist philosophy of stats requires a large dose of prior intuition to fudge results back to even a slight facsimile of reality.

Laplace, I’m confused by your first statement. The R calculation is based on the statistic cited in the 4th paragraph of the article.

I’m also confused by your criticism of the criticism. Do you consider it irrelevant that the 2/128 vs. 6/22 comparison (from which so much of the remainder of the article follows) is based on cut points chosen specifically for the purpose of making the comparison as dramatic as possible? Is it relevant that choosing different cut points would produce less dramatic numbers?

If you gave this info to a Physicist with no statistics education, they would look at the numbers, check the measurement errors on the data, conclude the increase was real, and then jump right into thinking about the physics (causes).

If you give this info to a Statistician, they consider a model where snow days are drawn at random from an urn. They then conclude the change might have been due to which sample from that mythical magical urn turned up in this universe and therefore doesn’t represent a “real” increase.

Any sensible person looking at both reactions concludes Statistics education is the most effective IQ reducer ever invented.

Actually it’s worse than that. Gelman did a calculation without considering either (1) the actual measurement errors in the data, or (2) the real physics behind it, and concluded it was “real”. Statistics is such an amazing thing! You don’t need to bother with the last 2000 years of carefully worked out physics. All you need is a short R script evidently.

But then Gelman turned around and concluded it probably isn’t “real” because we might have asked a different question than the one we did ask. “Asking questions” is such a powerful force in the universe apparently that it can completely negate all data.

Like I said, any objective observer would conclude Statistics is the most powerful IQ reducer ever invented.

Laplace:

You write, “Gelman turned around and concluded it probably isn’t ‘real’ . . .” No, I never said that. Troll all you want (within limits) but don’t misquote me.

Laplace:

Perhaps you’d care to explain how a physicist would go about building a model of snowfall in the vicinity of Central Park — or any emergent/aggregate phenomena, really — without the use of statistical modeling.

And then I do hope you’ll show how such methods generalize to other “aggregate” sciences (the traditional natural/social science distinction is inappropriate here, since meteorology/climatology is surely not a social science, but neither is it reductionist in nature nor does it address systems whose initial and boundary conditions are well understood) such as psychology. In particular, I’m hoping you are up to the task of describing in mechanical detail how a hypothetical physicist would go about building and testing a theory of power pose, which must be real, because “the data are history.”

I look forward to your thoughtful reply.

This is exasperating. “The numbers” to be checked by our hypothetical physicist are a function of arbitrary cutoffs chosen by a human being (not a physicist, I’m guessing – but that doesn’t matter). They weren’t presented to us by God as The Correct Way To Dichotomize The Data. Choose different cutoffs, get different numbers. I know a few physicists and I’m certain they understand this. I can go ask one tomorrow if you’d like. I sincerely doubt he’ll say “as a physicist, I would interpret this data in light of the most impressive difference in # of days that I could possibly obtain, given flexibility in where to set arbitrary snowfall and time cutoffs”.

If what you’re really saying is that we should look at *total* snowfall (rather than days in which snowfall exceeded some threshold) across *all* years (rather than dividing time into pre/post 1996), then I doubt you’ll get any disagreement from anyone here.

Ok Andrew, you actually said “By itself, it’s noise.”

Uh no. The increase actually happened. Because it actually happened that implies all kinds of consequences. On the physics/meteorology side it’s difficult to say what those are. On the human side, which the article writer concentrated on, it can cause people to adjust or set their expectations accordingly. All perfectly legitimate. In fact the article writer did a much better job than most.

And by the way Andrew, that’s not me trolling. This is me trolling:

Did you hear about the Statistician on the battlefield? Bullets were flying everywhere. One hit the Statistician’s belly. A medic seeing the wound ran up to him and said “you’re hit”. The Statistician responded “I’m not hit, it’s statistically indistinguishable from noise, and since you chose to look at my stomach and not someone else’s, your assessment is flawed by researcher degrees of freedom.”

That’d be trolling.

Laplace:

Only clouds exist, though clouds of very different degrees of cloudiness…

Was not aware of this paper by Popper http://www.the-rathouse.com/2011/Clouds-and-Clocks.html which discusses Peirce’s position on this.

I agree with Laplace in that what the original posting said was technically about a measure over the complete population in the past (all storms from 1869 to the present). If I flip a coin 10 times and get six heads, there’s no question that 60% of my set of coin flips were heads.

Whether this continues in the future is a different question, in which we want to look into posterior probabilities, etc.

That’s bound to be the sort of response that drives Laplace crazy.

“Data isn’t noise, it’s history.”

That’s a great line. The label “noise” is a manifestation of how difficult it is to simply observe and describe. We call the data noisy when we think it is unlikely the observed variation has been brought about by some theory we have in mind.

On the logit scale, the difference is much larger compared to the standard error (3.5 standard errors, compared to 2.5).

This code assumes no year had more than one 20-inch storm; the results are similar with `bayesglm`:

y <- c(rep(0, 126), rep(1, 2),
       rep(0, 16), rep(1, 6))
x <- c(rep(0, 128), rep(1, 22))
summary(glm(y ~ x, family = binomial))  # estimated effect: 3.2, +/- 0.9 (SE)

The point about researcher degrees of freedom is completely valid, though. As is the point about non-independence among years.

The 1996 cutoff seems ok – before 1996, there was one in 1948 and one in 1888. It’s not like there were several just before the 1996 cutoff.

Sure, that’s one degree of freedom, but if your cutoff is 15+, you can add 11 outcomes to pre-1996, and 2 outcomes to post-1996.
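As a concrete sketch (Python; the 15-inch counts below are the ones suggested in this thread, 13/128 vs. 8/22, and should be treated as illustrative), you can watch the simple two-proportion comparison move as the cutoff moves:

```python
import math

def two_prop_z(y1, n1, y2, n2):
    """Difference in proportions divided by its (unpooled) standard error."""
    p1, p2 = y1 / n1, y2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) / se

z_20_inch = two_prop_z(2, 128, 6, 22)   # the article's 20-inch cutoff, about 2.7
z_15_inch = two_prop_z(13, 128, 8, 22)  # 15-inch cutoff: 11 more pre, 2 more post
```

This particular alternative cutoff weakens the comparison somewhat; the deeper issue is that only the most dramatic of the many available cutoffs got reported.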

+1

If I counted correctly, the rate of snowstorms that dumped 8 inches or more was slightly higher before 1970 (77 in 102 years) than after 1970 (35 in 48 years). We live in a multiverse, unless 20 inches and the year 1996 are meaningful.

Wolf:

But what’s the rate of himmicanes? That’s what I wanna know.

I assume that 20 was chosen as a cutoff, rather than, say, 19, because we have 10 fingers and toes. So, yes, the cutoff at 20 is arbitrary vis-a-vis 19. But it is not arbitrary vis-a-vis 8 inches. Eight inches is “how ’bout that weather today!” Twenty inches is “the subway is shut down today.” Also, it is certainly possible that a look at all of the snowfall data shows that the dist has become increasingly bimodal in the last 20 yrs, and the 20 inch cutoff captures that, at least in broad sense.

agreed that 8 and 20 are different in a meaningful way. but what is still arbitrary is that “the subway is shut down today” is more informative than “how ’bout that weather today!”

I don’t think I understand what you mean. Those two categories are meaningful ones, at least for some people. If I were thinking of buying a place in, say, Queens, and my job were in Manhattan, I might want to know how often my commute is likely to be screwed up. Those two categories would be quite informative. OTOH, if I am in CA and am wondering what water supplies will be like this year, those two categories don’t tell me nearly as much, since I am more concerned with total snowfall. But my NYC self is not being arbitrary in using those categories (as he probably would be if he used a 19/20 cutoff)

If the cutoff were selected to be, say, 1980, then there would be 2 snowstorms of 20 inches or more within n = 81 years and 6 within n = 69 years. A bit different from the n and p values extracted from the article.

I presume the 1996 cutoff comes from the fact that the data table on Twitter indicates the first date in the last couple of decades with snow >= 20 inches is in 1996. But if data first started to be recorded in this way in 1869 and the first snowstorm >= 20 inches happened in 1888, it would seem you should at least add the 19 years from 1869 to 1888 to the start of the cutoff point, making the cutoff point 1977.

Any time cutoff is justifiable, really. And that’s a big issue.

On top of arbitrary snowfall cutoff (is 19.8 close enough to 20?), the fact we’re only using snowfall (what about all precipitation?), etc.

Taking note of an interesting data curiosity is good. Co-mingling it with other data is better. Making bold statements about what an interesting data curiosity can tell us without supportive substantive theory is not so good.

AG: To the extent that this evidence is presented in support of a clear theory, it could be meaningful. By itself, it’s noise.

GS: Does the above statement pertain to all data or just those specifically being referred to (and data presumably “like them” in some respect)?

The way snowfall is measured changed: https://www2.ucar.edu/atmosnews/perspective/14009/snowfall-measurement-flaky-history

The result is that recent measurements are larger because the older measurements let the snow compress before it was measured.

Thanks.

+11

Nice! This is exhibit A re why the comments here are such a pleasure to read (esp compared to the horror show that is the norm on so many other sites)

The length of an inch has changed during that period. Did they correct for that?

https://en.wikipedia.org/wiki/Inch#History

I found this article helpful in illustrating the results of seemingly arbitrary cutoffs:

Wainer, H., Gessaroli, M., & Verdi, M. (2006). Visual Revelations: Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect. Chance, 19(1), 49-52. http://www.tandfonline.com/doi/abs/10.1080/09332480.2006.10722771?journalCode=ucha20

This is also quite good:

Royston P, Altman DG, Sauerbrei W. “Dichotomizing continuous predictors in multiple regression: a bad idea.”

Stat Med. 2006 Jan 15;25(1):127-41.

https://www.ncbi.nlm.nih.gov/pubmed/16217841

I hate arbitrary classifications. I found the source data with the actual inches of snowfall here:

http://www.weather.gov/media/okx/Climate/CentralPark/BiggestSnowstorms.pdf

I decided to look at inches of snowfall with relation to year, and also (per the link David Marcus provided about how snow depth measurement has changed over time) with inches prior to 1990 increased by 17%.

A scatterplot of this relation looks pretty random.

To see whether any relationship might be present, I used a generalized additive model (Simon Wood’s gam, from his mgcv library) to model this relationship allowing for a non-linear change in inches over time (via a penalized spline).

With the original data, the model suggests 3 estimated degrees of freedom, F=1.138, p=.297. Some might describe this as a hint of a trend (I wouldn’t be one of them, but for the sake of discussion…). Plotted, the curve is horizontal to about 1990, then curves upwards a few inches approaching the present. Naturally, the 95% confidence interval is pretty broad.

With the data adjusted so that inches prior to 1990 are increased by 17% to compensate for the change in snow depth measurement methods, there is even less hint of a trend. The model suggests 1 estimated degree of freedom, F=.013, p=.909. That’s a horizontal line. One caveat… the original data is cut-off below 12 inches, so increasing the pre-1990 portion by 17% would have brought some of that cut-off data into the picture. This may affect the trend. I suppose a censored model might be attempted (Tobit), but I don’t know of a GAM approach to this.

I’d post pictures if I could. Anyway, the net message seems to be that there’s no persuasive evidence of anything going on here. Probably someone could come up with more complete data and get around the censoring problem.

You can post images here: https://postimage.io/ . No account or login required.

Here is an illustration of the problem that many scientists have regarding statisticians who are perhaps too eager to shout, “You’re just mining noise!!!”

Supposing a scientist and a statistician are in Las Vegas drinking cocktails and observing two roulette wheels, and they are thinking about rumors that one of the wheels is rigged. Now roulette wheels have 38 numbers on them, and let’s just say they are labeled 1,2,..,38. Now, one million spins of each wheel are observed. Let’s look at each wheel:

1) For the first wheel the results are seemingly random, 5,22,33,4,……. Now the argument is provided by a cocktail waitress that the probability of that sequence of numbers (under the assumption of a fair wheel) is (1/38) raised to the one millionth power – super small. Therefore, one might conclude that there’s a high probability that the wheel is rigged. The statistician then points out that every sequence of one million outcomes is equally rare, and since the sequence was not posited ahead of time, no conclusion can be drawn. Fair enough.

2) Now suppose that the second wheel produces numbers as 1,2,3…,36,37,38,1,2,3,…,36,37,38, etc etc. In other words, the outcomes are in order, over and over and over again, day after day after day. Here the statistician gives his same argument, that every sequence is equally likely under the null, and that this is just noise again. The scientist, correctly, disagrees.

The problem with the statistician’s conclusion in (2) is that the results are consistent with a theory of human intervention. The ordering observed is a human invented ordering. Further, suppose it’s discovered that the second wheel is being run by “Slippery Sam”, who’s been caught in the past rigging wheels to produce ordered outcomes for his friends to bet on. The correct argument here is that the rareness of the outcome, in conjunction with a reasonable substantive theory, do produce results that strongly suggest that the second wheel is not operating as it should. Despite the absence of a pre-defined hypothesis predicting these ordered outcomes, there is indeed something to be learned here.

Perhaps a statistician would try to model the result of the next spin N by using the result of the prior spin P.

In scenario 1, the statistician would not be able to build a model, since the data is random and there is no relationship between N and P.

In scenario 2, the statistician would find that you could predict the value of N pretty effectively given P; after all, N = P + 1 in nearly every case.

There is nothing rare about the outcome in Scenario 2, it is entirely predictable, it is the opposite of rare.

“There is nothing rare about the outcome in Scenario 2, it is entirely predictable, it is the opposite of rare.”

Nooooo! Assuming a fair wheel, the probability in each scenario (1 and 2) is equally rare. This is actually a basic stat 101 type question. It generally goes something like, “Suppose I roll two dice 6 times. The first comes up Dice1: 4,2,6,2,1,5 and the second Dice2: 6,6,6,6,6,6. Which sequence of outcomes is more likely, assuming each die is fair?” The answer is that the probability of each sequence is the same, (1/6)^6. Ask any competent statistician.

Now I agree with your first comment. However, the point of my original post is that, using a standard (frequentist) statistical argument, there is no reason to believe that what occurred in scenario 2 is anything but noise. It takes a combination of statistics and substantive theory to flag the problematic wheel.

But there is a standard frequentist argument to conclude that the wheel is rigged, in fact there are many. For instance, you could look at the correlation between successive spins. Under the “fair wheel” assumption (which I take to mean that each spin is i.i.d. from the discrete uniform distribution on 1-38), this correlation is 0, while the sequence 1,2,3,4,5,…. gives a pairwise correlation of 1.
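A rough version of that check (Python; simulated stand-ins for the two wheels — note the wraparound pair (38, 1) keeps the rigged wheel's lag-1 correlation around 0.85 rather than exactly 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_corr(spins):
    """Correlation between each spin and the one that follows it."""
    return np.corrcoef(spins[:-1], spins[1:])[0, 1]

fair = rng.integers(1, 39, size=10_000).astype(float)            # i.i.d. uniform on 1..38
rigged = np.tile(np.arange(1, 39), 264)[:10_000].astype(float)   # 1,2,...,38 repeating

fair_corr = lag1_corr(fair)      # near 0 for independent spins
rigged_corr = lag1_corr(rigged)  # large and positive, wildly inconsistent with a fair wheel
```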

Great comment! However, I would argue that the decision to chose such a pairwise correlation test would be based on some kind of theory or expectation of what is to be found or based on what was found.

Perhaps another example would be better. In scenario 2, suppose that you assign letters to the first 26 numbers in a fairly standard manner, as in 1-A, 2-B, 3-C, etc. Continue with the rest of the numbers. You’re gaming at the Bellagio. You discover that the outcomes of the wheel spell out things like “Bellagio is the best! Caesars Palace sucks!” Is there a frequentist approach that would flag such outcomes as resulting from a rigged wheel? (There may be but I can’t immediately think of one.)

It would seem that substantive theory would help out in flagging the rigged wheel.

Oh, and one other thing. If you did construct a correlation test based on observed data, you would implicitly be setting up hypotheses, and some would accuse you of HARKing (Hypothesis after Results Known) and you would be vilified on this blog.

By definition any sequence of outcomes is equally likely if the wheel is ‘fair’. If you want to prove the wheel is fair by looking at sequences then you need to observe a lot more than 2 sequences; perhaps you should observe 1 million sequences. However, if the task is simply to figure out whether the wheels are fair in both scenarios, you just need to look at the distributions conditional on the previous spin. So, when a 1 came up, what’s the distribution of the next spin? In scenario 1, all numbers 1-38 would come up the same # of times. In scenario 2, the result would always be 2. That’s all you need to show the wheel in scenario 2 is not fair, and you don’t need any theory in advance to reach that conclusion; you just need a frequency distribution.
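That conditional-frequency argument is easy to sketch (Python; simulated stand-ins for the two wheels):

```python
import numpy as np

rng = np.random.default_rng(0)

fair = rng.integers(1, 39, size=10_000)           # scenario 1: i.i.d. uniform on 1..38
rigged = np.tile(np.arange(1, 39), 264)[:10_000]  # scenario 2: 1,2,...,38 repeating

def followers_of(spins, value):
    """The spins that come immediately after each occurrence of `value`."""
    idx = np.where(spins[:-1] == value)[0] + 1
    return spins[idx]

# After a 1, the fair wheel produces all sorts of numbers;
# the rigged wheel produces a 2 every single time.
fair_after_1 = set(followers_of(fair, 1))
rigged_after_1 = set(followers_of(rigged, 1))
```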

This is not a situation where there would be disagreement between the statistician and the scientist. This is a straightforward simple statistical problem, and you wouldn’t even need millions of spins.

“This is not a situation where there would be disagreement between the statistician and the scientist. “

But of course not, it’s a rhetorical argument. The point being that if you were presented with data from a million spins for each scenario above, which wheel would you choose as the problematic one? Statistically, the outcomes are the same assuming a fair wheel. From a purely statistical point of view, they both represent noise. See my reply above.

You assume a fair wheel and then find that your assumption results in a conclusion of fair wheel…..

The appropriate Bayesian calculation is to compare posterior probabilities for fair wheel against an alternative model. If no alternative is considered obviously no alternative conclusion is possible

Daniel,

I’m referring to frequentist-style thinking where one computes the probability of outcomes assuming a null (fair wheel) assumption. With this way of thinking, the outcomes in both scenarios are equally rare.

The Bayesian way of thinking, building and testing alternative theories based on substantive theory, helps resolve the problem.

Sam, you’re just putting up a strawman. The frequentist is certainly allowed to calculate e.g. the proportion of times a cell n is followed by a cell n+1, calculate (or simulate) the sampling distribution under the null (which seems like it might be binomial(1/38)? (if you count mod 38)) and then note that the observed proportion is exceedingly unlikely to have arisen from this distribution.

But look at what you’re doing. You’re observing a pattern, and then constructing a test to discern it.

You can always do this. Similar to the example stated above, suppose you assign letters to the numbers and discover that the wheel spells out the words to Moby Dick. Or suppose wheel two produces the same number sequence observed previously by wheel one. You can of course construct hypotheses/tests to sniff out the anomalies. However, again you are observing a pattern and then constructing a hypothesis/test to assess it. Many would cry HARKing.

I don’t think many people cry HARKing when presented with p-values of 10^-50, which I think is a good rough guess for what we’d get if we compared a sequence of numbers spelling out Moby Dick to a sampling distribution under the null. Accusations of HARKing are much more likely to be leveled against people making post-hoc claims using p-values of 10^-2.
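A back-of-envelope check on that 10^-50 order of magnitude: on a fair 38-slot wheel, the probability of any one prescribed sequence of k spins is (1/38)^k, so only about 32 prescribed spins are already enough to get below 10^-50:

```python
from math import log10, ceil

# Probability of any one prescribed spin on a fair 38-slot wheel.
p_one = 1 / 38

# Spins needed so one specific prescribed sequence has probability below 1e-50.
k = ceil(50 / -log10(p_one))   # log10(38) is about 1.58
prob = p_one ** k
```

A sequence long enough to spell out Moby Dick would of course be vastly longer than 32 symbols, so the actual probability would be far smaller still.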

Why? If the model is wrong the p-value is only a matter of how much you are willing to spend on data collection… I see no reason why well funded researchers should be immune from accusations of HARKing.

I have in mind the simultaneous accusation of HARKing and noise mining, where we say “this looks like something you got by dumb luck or exploitation of researcher degrees of freedom”. Given a little bit of flexibility, it is easy to get p-values of 0.03 from noise. If the p-value is extremely small, I’d be less likely to say “it looks like that came from noise”. But it certainly could have come from a bad model and lead to bad post-hoc storytelling.

A frequentist can do a test on the compressibility of the sequence. The implicit null model here does a good job of capturing what a frequentist means by “random”.
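One crude way to operationalize such a compressibility test is to use compressed length as the test statistic (zlib here as a stand-in for a proper universal code; the sequences are simulated as in the scenarios above):

```python
import random
import zlib

random.seed(2)
N = 10_000

# Scenario 1: uniform spins on 1..38; scenario 2: the repeating 1, 2, ..., 38 cycle.
fair = bytes(random.randint(1, 38) for _ in range(N))
rigged = bytes((i % 38) + 1 for i in range(N))

# Compressed length as a rough proxy for Kolmogorov complexity:
# a "random" sequence is nearly incompressible, a patterned one is not.
len_fair = len(zlib.compress(fair, 9))
len_rigged = len(zlib.compress(rigged, 9))
```

The rigged sequence compresses to a tiny fraction of the fair one, which is exactly the sense in which it fails to look random.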

I’m shocked this question has received this much attention without the obvious explanation.

I, personally, like the frequentist interpretation of probability as just being a measure over sets. I like Bayesian views too, but sometimes it’s best to stay in the frequentist interpretation when possible.

Now, with that view, we look at this problem. We have sequence (1), which looks completely random to a human. We also have sequence (2), which does not look random to a human. The real difference between (1) and (2) is that (2) looks like a human recognizable pattern, while (1) does not. I will note that if you claim you don’t care about human recognizable sequences, then you should not care at all about seeing {1, 2, 3, …}; it’s just one sequence among many!

Now, we are not surprised at seeing (1), as we roughly estimate that the set of non-human patterns is the vast majority of sequences. We are surprised when we see (2), because the set of human recognizable patterns is an extremely small set of all possible equally probable sequences under the idea of pure randomness. To get our exact p-value, we would want to precisely define the sets of sequences, but as long as we are not overly liberal with what we call a human recognizable pattern, then it’s pretty clear that the set of human recognizable patterns will be extremely small.

I realize this is a strawman statistician you’ve presented. The strawman scientist considers the set of all sequences where there are more odds than evens a human recognizable pattern, and the set of all sequences where there are more evens than odds to be a human recognizable pattern, and then considers the fact that they observed a human recognizable pattern proof of Slippery Sam’s misdeeds.

Sidenote: shouldn’t n be the number of storms in the database, such that the rate is the proportion of big storms out of all storms, rather than the number of years? Some years have seen more than one storm.

A way to handle this issue is to examine snowfall data (assuming the data I’ll mention exist) – you know…visually. Plot, say, total inches/year, number of snowstorms/year, and deepest snowfall/year for starters. Are there clear trends in the data? Perhaps all of these measures have steadily increased across this span? Then you would know there is something “new” about the amount of snowfall, but not something exactly unexpected. Maybe the data would not be so straightforward, of course…one might observe, for example, relatively discrete periods of increased activity where the year-to-year variability within a “period” is not substantially different from the variability during the “regular time.” So…anyway, the point is that simply looking at the data might temper what one says about the journalist’s claim that “we’ve become accustomed to something that used to be very rare,” and doing so would not require much calculation or a theory – just looking at the data. Of course, whatever one said about the data in question could change if a much longer period of data (on the order of thousands or tens of thousands of years) could be examined.

[I’ve twice posted the following, once yesterday afternoon, again a few minutes ago, but it didn’t show-up. Guessing perhaps caught by spam filter due to links? Going to try trimming-out the image thumbnail links]

I requested all the snowfall data for Central Park from NCDC CDO (https://www.ncdc.noaa.gov/cdo-web/datasets) from 1890 to 2017 and found total snow and maximum daily snow for each year. I modeled each with a generalized additive model, with a penalized spline to accommodate any non-linear relation over time. I did this both with snowfall as given, and with 17% added to the measurements prior to 1990 to accommodate the change in measurement per the link David Marcus provided.
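The 17% measurement adjustment is just a scale-up of the pre-1990 values; a minimal sketch of the pipeline, assuming synthetic data (the real NCDC series isn’t reproduced here, and a simple polynomial trend stands in for the penalized-spline GAM fit in the comment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Central Park series (the real data come from
# NCDC CDO); annual total snowfall in inches is made up for illustration.
years = np.arange(1890, 2018)
snow = rng.gamma(shape=4.0, scale=7.0, size=years.size)

# The measurement-change adjustment: add 17% to values recorded before 1990.
adjusted = np.where(years < 1990, snow * 1.17, snow)

# A low-order polynomial trend as a crude stand-in for the penalized spline;
# centering the years keeps the fit numerically well-conditioned.
x = years - years.mean()
coefs = np.polyfit(x, adjusted, deg=3)
trend = np.polyval(coefs, x)
```

A real replication would use a GAM package (e.g. mgcv in R, as the comment’s analysis apparently did) rather than a fixed-degree polynomial.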

For total annual snowfall (unadjusted), the curve had 3.2 estimated degrees of freedom, F=1.7, p=.17.

For total annual snowfall (adjusted), the curve had 3.9 estimated degrees of freedom, F=.9, p=.4.

Here’s a link to the graphs (dotted line is the mean value):

https://postimg.org/image/4xz0r6xtt/

For annual maximum daily snowfall (unadjusted), the curve had 2.7 estimated degrees of freedom, F=2.4, p=.055.

For annual maximum daily snowfall (adjusted), the curve had 1 estimated degree of freedom, F=1.7, p=.2.

Here’s a link to the graphs (dotted line is the mean value):

https://postimg.org/image/tqdr27qpl/

You can make your own interpretations from the above. I’m not persuaded that there’s evidence of a recent change in snowfall, especially once the adjustment for measurements prior to 1990 is made. Of course, this is data from a single site. A smarter way to do this would be to combine data from multiple sites, then model the combined data — this would be very interesting (and probably has already been done and reported), but the topic of this post was Central Park. I’m sure there are other ways to improve the modeling, but this is just for fun. As Glen Sizemore mentions, much more could be done with the data, but I’m not planning to publish anything.