Nice interface, poor content

Jim Windle writes:

This might interest you if you haven’t seen it, and I don’t think you’ve blogged about it. I’ve only checked out a bit of the content but it seems a pretty good explanation of basic statistical concepts using some nice graphics.

My reply: Nice interface, but their 3 topics of Statistical Inference are Confidence Intervals, p-Values, and Hypothesis Testing.

Or, as I would put it: No, No, and No.

Maybe someone can work with these people, replacing the content but keeping the interface. Interface and content are both important—neither alone can do the job—so I hope someone will be able to get something useful out of the work that’s been put into the project.

108 thoughts on “Nice interface, poor content”

  1. From their About section: “The goal of the project is to make statistics more accessible to a wider range of students through interactive visualizations.”

    I don’t like this. We don’t need “a wider range of students” who have some idea of statistics. We need researchers and scientists who are experts in their fields and who are not afraid to ask for professional statistical support.

    • And those professionals are already overworked and also struggle to explain basic concepts to those who consult them. So it’s good to have more people enter the field as well as spread general knowledge to facilitate communication.

      • I agree with Paul here that getting help from someone who knows more than you is a good general tip no matter what you’re doing and what your level of expertise. I don’t know if I qualify as a “professional statistician” by Paul’s criteria, but whenever I tackle an applied problem, I find it super helpful to consult with people like Andrew. In fact, it was my coming out of industry to consult with Andrew and Jennifer Hill that got me hooked on real Bayesian stats in the first place. I’m also spoiled by being surrounded by great applied statisticians in addition to Andrew, like Ben Goodrich, Aki Vehtari, Dan Simpson, and all of our users doing really neat analyses with Stan.

        As far as the utility of consulting clients knowing some stats, I agree with Statsgirl. There’s a huge amount of subtlety, but stats is very hard to communicate to someone who doesn’t know the basics. And when we’re consulting on applications, we don’t want to do all the work ourselves—there’s just too little of our time to go around. So I always feel we need to meet the applied people halfway. We need to understand a bit about their domain and they need to understand a bit about stats.

        Also, for people like me, who want to actually learn stats properly, most of the intro textbooks leave a lot to be desired. I struggled through Larsen and Marx and through DeGroot and Schervish, but I find them both overly complicated and way way too verbose for an intro. On the other hand, I really like Bulmer’s intro, but it’s not quite rigorous enough. I could really see something like this interface, which is fairly precise for what it is, being really helpful for a beginner trying to learn stats.

    • Wow just wow. So people who run stores shouldn’t learn to calculate means and standard deviations so they can plan how much stock to order? Parents shouldn’t know what median weight for a child of a given height means? No one but people who will get PhDs in statistics should know why the lottery is a losing proposition? Personally, I think every high school graduate should have basic knowledge of descriptive statistics and probability just so they can navigate daily life effectively not to mention read the advertising and package inserts.

      • Elin:

        Yes, I agree. Perhaps a good starting point for an intro book would be to work backward and first list a bunch of skills that we would like people to have. Computing p-values (that is, computing the probability of seeing something as extreme or more extreme than an observed test statistic, had the data come from a specified random number generator) would not be on my list. But what should be on the list, that’s the question. Some sense of sampling variability should definitely be on that list.

        • Computing p-values wouldn’t be on my list either. I’d like some of these, in the order that I thought of them:

          1) Plotting scatter plots and choosing which plots to plot
          2) Visualizing variability through density plots and histograms
          3) Transforming variables, such as through the sigmoid function, the logarithm, x^2, sqrt(x), etc.
          4) Dimensionless analysis of variables: how to choose meaningful scales and eliminate scaling redundancy.
          5) Understanding random number generators, and generating fake data through a generating process.
          6) Random sampling, biased/filtered random sampling (for example telephone surveys with 10% response rate), convenience sampling, identifying sources of possible bias.
          7) Converting a description of a generating process into a factorization of a probability distribution (i.e., practice with p(a | b) p(b) type algebra, and with marginalization)
          8) Doing inference on a simple model by generating and rejecting (using ABC; see the sketch after this list)
          9) A bunch of practice at coming up with algebraic models for simple example problems: essentially a crap-load of algebra word problems done in depth. Do you get more or less wet by running in the rain? Does filling a ball with helium make it go farther? Does tire friction on the road depend on velocity? How much does it cost you to own and operate a car? Under what conditions does it make sense to buy a used vs. new car financially? Given a mortgage and a variable interest rate and a tax rule, what is the best strategy for minimizing housing costs: pay off accelerated, refinance and pay at regular rate, or keep and pay at regular rate?

          10) Some concept of concentration of measure: plotting functions of many random variables, including means, but also more complicated functions of many random arguments, e.g. earnings rate of a casino = (total winnings per person) * (total people per time); show how weekly vs. monthly vs. annual earnings compare…
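          To make item 8 concrete, here is a minimal rejection-ABC sketch in R; the coin-flip setup, observed count, prior, and all the numbers are invented for illustration, and the only point is the generate-and-reject step:

            ## Rejection ABC for a coin's bias. Everything here (observed count,
            ## prior, number of simulations) is made up for illustration.
            set.seed(1)
            n_flips  <- 50       # flips per simulated experiment
            observed <- 31       # hypothetical observed number of heads
            n_sims   <- 100000   # candidate parameter draws

            theta <- runif(n_sims)                   # prior: theta ~ Uniform(0, 1)
            fake  <- rbinom(n_sims, n_flips, theta)  # fake data generated from each theta

            # Keep the thetas whose fake data match the observation exactly
            # (with a richer summary statistic you would keep |fake - observed| <= tol).
            posterior_draws <- theta[fake == observed]

            hist(posterior_draws, breaks = 30, main = "ABC posterior for the coin's bias", xlab = "theta")
            quantile(posterior_draws, c(0.05, 0.5, 0.95))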

        • Yes, I agree. I’ve really dropped having students do anything with them. We use R and I actually don’t even have them look at the summary with p values until late, and that is only because students see them in articles.

        • Also, about sampling variability, one thing that I’ve done is to give every student in the class their own sample. For example, last semester we had the Titanic data and everyone took a random sample of 100 from that. (Some semesters I’ve had each student with two samples but that is really a bridge too far.) So then when they are doing analyses or visualizations they are all different, yet mostly pretty similar. I’ve also in the past had them use ACS data from randomly selected counties.
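          A minimal R sketch of that “one sample per student” setup, assuming a data frame titanic of individual passenger records (the built-in R Titanic object is a contingency table, so the passenger-level data would come from elsewhere):

            ## Give each student a reproducible random sample of 100 rows.
            ## `titanic` is assumed to be a data frame of passenger records.
            give_student_sample <- function(data, student_id, n = 100) {
              set.seed(student_id)           # reproducible, but different per student
              data[sample(nrow(data), n), ]
            }

            # e.g. student 17 gets their own sample of 100 passengers:
            # my_sample <- give_student_sample(titanic, student_id = 17)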

      • I don’t believe anyone on this blog would contest the notion that it is a good idea to teach the broader population how to increase their proficiency at quantitative / qualitative reasoning. Paul, I, and others are not suggesting anything to the contrary, and I think a knee-jerk reaction to some of our (perhaps provocative) statements can be settled if we take a moment to discuss where different people are coming from.

        It would seem to me that most in this thread appear to think it’s a good (or at least not bad) idea to take rather nuanced concepts and distill them down to simple examples to help foster learning (in the case of the website, adding a little interaction, which is a nice touch). Presumably this is viewed as a stepping stone to either more in-depth study or is deemed sufficient if the student/reader has no need to dive deeper (e.g. their discipline doesn’t require it). There is nothing particularly wrong with this view. It’s what we mostly do in high school after all (not many learn about quantum mechanics in grade 11 physics)!!

        However, I, along with some others here (I think), don’t necessarily share that view for all areas of teaching. For me, some subjects are inherently complicated and pretending that they are less so than they really are (which I think this approach does, and I believe this is true of most introductory statistics texts as well) is doing a disservice to students. I think it’s doing a disservice precisely because MOST will not venture beyond the material to learn the caveats and nuances required to reason about the uses and limitations of the methods taught; so although it’s billed as a way to teach statistics to a broader audience, I think it harms their ability to think critically about the very subject it was supposed to help them with. I have no issue with the material as presented (and think it’s quite neat in fact) for students that do go on to further study; it is the ones who do not that I am most concerned about.

        But at the end of the day the author is the one actually trying new methods, which is far more than I or others are doing, and it is a laudable effort.

        Let me finish by saying that this IS AN INCREDIBLY IMPORTANT SUBJECT and that is why I have engaged in this discussion.

      • I agree, we need to think carefully about what we should and can get across to various audiences.

        But I think Paul has a point – especially since the inference demos are about how researchers should make sense of the experiments and report on them. When materials are perceived that way, I am fairly sure they will do more harm than good.

        Some of my experience here was with clinical reviewers who had read intro Bayes papers and were taking what they thought they understood way too uncritically in important work.

        When I taught intro stat at Duke, I repeatedly stressed to students that if they ever were involved in an actual study – don’t try and make do with what you have learned here – try to get help from a statistician or experienced researcher.

        So there will be content we can and should try to get across – but for making sense of actual research – we are a long way from being able to do that with current introductory material.

        This might be relevant http://bulletin.imstat.org/2015/07/xl-files-the-abc-of-wine-and-of-statistics/

  2. Here is their definition of a p-value:

    “The p-value is the probability that a statistical summary, such as mean, would be the same as or more extreme than an observed result given a probability distribution for the statistical summary.”

    Wat?

    • That is the definition of a p-value, is it not? Replace “statistical summary” with “test statistic” and “a probability distribution” with “the null distribution” and you should have a more familiar phrasing. What do you take issue with here?

      • Well, even if we did those replacements, the p-value is phrased as an unconditional probability. The important piece still missing is that this probability is conditional on some null hypothesis being true…

        • I don’t think it’s charity so much as that their definition’s really spot on for the level at which they’ve chosen to pitch it.

          Of course, getting the definition right doesn’t solve any of the problems with p-values that Andrew’s been beating on for the last couple decades on this blog!

        • Bob:

          Just to clarify: I don’t think the problem is the “level,” as if null hypothesis significance testing is basic material and then you can move to more advanced levels. I think null hypothesis significance testing is essentially useless (yes, there are occasional scenarios where null hypothesis significance testing is appropriate, but I think this is outweighed by all the scenarios where it is used inappropriately, thus I think that teaching this material except in a “historical” or “cross-cultural” sense is counterproductive), and I think one can teach material at the same or even lower mathematical level that is much more relevant to applied statistics.

    • This looks like a re-phrasing of the definition the ASA gave in their recent statement on p-values:

      “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”

      It’s clumsy, but it has to be, otherwise it wouldn’t be a correct definition of a p-value. A simple and straightforward definition would have to be false as well (as we’ve seen time and time again).
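      One way to see why the clumsiness is unavoidable is that the definition is really a description of a simulation: the p-value is the fraction of summaries, generated under the specified model, that are at least as extreme as the observed one. A minimal R sketch, with invented data and a made-up null model:

        ## p-value as a tail probability under a specified model (all numbers invented).
        set.seed(123)
        y     <- c(0.8, 1.3, -0.2, 2.1, 0.9, 1.7, 0.4, 1.1)   # hypothetical observations
        t_obs <- mean(y)                                       # the "statistical summary"

        n_sims <- 100000
        t_null <- replicate(n_sims, mean(rnorm(length(y), mean = 0, sd = 1)))  # null: N(0, 1)

        mean(abs(t_null) >= abs(t_obs))   # two-sided p-value: fraction at least as extreme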

  3. A lot of these are really good… I like the CLT and correlation demos. The confidence interval demo is also well done in that it clearly demonstrates how the procedure works.

    More generally… I don’t love teaching hypothesis testing, but given its ubiquity don’t we want to do the best job possible of explaining how it works to people who might otherwise misuse it or buy into someone else’s mistaken beliefs about how it works? We can and should take a critical approach and highlight the many negative consequences of hypothesis testing’s ubiquity – but understanding those criticisms requires an understanding of how the method works in the first place. I’d rather someone learn hypothesis testing from a competent (and critical) statistician than from a research peer or a popular article or a naive textbook.

    • I went through this whole thing in linguistics, where there was always a debate about whether we’d teach the Chomsky canon (so that the students could understand the 95% of the field working in that paradigm) or whether we’d concentrate on more modern, mathematical and computationally sound linguistic theories. I’m still not sure what the right answer is. As much as you can understand the locked-in majority, they rarely listen. After a dozen or so years of trying to fight it, I finally concluded we should just ignore the Chomskyans and go our own way—nothing fruitful ever came out of working with them or trying to talk to them. The last talk I ever tried to give on linguistics proper was at NYU, but they shouted me down after my intro slide for having the audacity to suggest we could use speech recognition for exploratory data analysis in theories of spoken language and dialogue; the Chomskyan line is that what people say isn’t data for linguistics; I think they would’ve thrown rotten fruit at me if they had it. I slunk away and decided linguistics was doomed as a scientific endeavour in the short term. I think the future of linguistics is in papers being written outside the field, like this one from Tim Vieira and Jason Eisner.

      • The Dutch categorial grammarians were also very rude to you back in Utrecht at esslli in 1999 or so. And i have seen hpsg guys from one subschool eviscerating a stanford student of Ivan Sag’s. Basically, linguists hate everything except the narrow approach they learnt from their advisor. The Chomsyans’ behavior is part of a broader culture in linguistics.

        • Shravan:

          Speaking of linguistics: I was scanning thru the comments and came across this sentence of yours: “And i have seen hpsg guys from one subschool eviscerating a stanford student of Ivan Sag’s” and at first I thought this was just gobbledygook, the sort of thing that spambots produce using some algorithm that strings together random words and phrases. I came this close to flagging it as spam.

        • The ‘good’ thing about an iPhone is that I can read and write comments on stats and other things while doing mundane tasks or waiting in line for things…the bad thing is that I leave behind a stream of ill-formed sentences, missing or inappropriately autocorrected words…sometimes I even try to write maths, which never turns out well

        • Yup, the splits in linguistics are fractal, as I think they are in many fields. But I always thought I was on the same side as the Dutch categorial grammarians for the most part. I spent a sabbatical in Utrecht working with Michael Moortgat and his students. Ivan Sag and I were co-teaching an HPSG and categorial grammar seminar to a national Dutch grad school in logic, language, and information where students came in from all over the country (not hard with a small country with awesome train service—except the week of student protests where they sat on the train lines all over the country). That class directly addresses this point, too. Ivan and Carl Pollard were arguing about empty categories, which pushed back the publication of the second HPSG book by over a year and wound up with the two developers of the theory on opposite sides of a micro-debate. My portion of the course was showing the isomorphism between metarules and empty categories, but being linguists, that didn’t stop the debate. It pretty much put an end to my faith that even the non-Chomskyans would see reason and logic (Carl was ABD in math before doing linguistics, so it wasn’t lack of formal training). Ivan was a student of Chomsky’s and saw HPSG as being within the Chomskyan paradigm even if Chomsky and his followers didn’t (this was what was labeled as “not linguistics” by our Chomskyan colleagues at University of Pittsburgh—they called it “engineering” and the ensuing fight led to the dissolution of our joint Ph.D. program). I shared an office with Carl at Carnegie Mellon and I still remember our very first meeting at a Middle Eastern restaurant in Shadyside—he told me he didn’t want to hear about categorial grammar until it could solve the “pied piping” problem. I solved it in my book way better than they did in HPSG, but not surprisingly, Carl wasn’t interested. Perhaps I should’ve replied, “I won’t talk to you about HPSG until it solves coordination and quantifier scope?” and gone looking for a different field back in 1988.

          I always thought I was on the same side as the Dutch categorial grammarians! It was Mark Steedman at Edinburgh who was on the other side with his CCG (it’s another marvel of closed-circle peer review that kept that theory alive in my opinion). They invited me for sabbaticals and talks. I spent a lot of time at the board with Michael Moortgat. They weren’t so keen that I was doing things like proving their multimodal categorial grammars were Turing equivalent, but that wasn’t because they didn’t like me. I would’ve liked them to have focused more on language, the same way I’d like all the MCMC researchers to focus more on applied problems with more than 10 dimensions rather than continuing to work on variants of Metropolis and Gibbs that don’t scale.

        • Interesting.

          The linguists I studied with in the 1980s had moved into semiotics, which had a few sub-fields, but mostly they seemed to listen to each other.

          Guess I was very lucky.

  4. They cover 15 topics. The majority — 12 by my count — are on topics that are not directly related to NHST. Yet, that’s all you see? It’s pretty sad that your knee-jerk reaction against significance testing and related methods is so strong that you are unable to see anything valuable in the content.

    • I agree that Andrew might’ve pointed out how nice the first few presentations were.

      The problem is that inference is where the rubber hits the road in applied statistics. So if you provide the “wrong” advice there (what is wrong is a matter of some debate here, so different people will have different attitudes), you’ll be leading people down the garden path for applications.

      • I think it’s also debatable where the ‘rubber hits the road’ in applied statistics. I know several data scientists and most of their work doesn’t even touch inference. Between Andrew’s work and the ascendancy of data science, the days of NHST are numbered.

        • They largely fit models and make predictions. Your model either does a good job out of sample or it doesn’t. No one cares about p-values or model coefficients for that matter.

        • This sounds like Bob and Chicken may be using different definitions of “inference”. I use it to mean, “Trying to infer something about a population using data from a sample”. So statistical inference can be done by methods other than p-values.

        • If you object to my using “infer” in my definition as being circular, try this:

          “Trying to obtain information about a population by using data from a sample.”

        • I think the key is in the phrase ‘No one cares about…model coefficients for that matter’.

          Traditional statistical inference is usually about parameter inference, e.g. determining and interpreting model coefficients, etc.

          Yes you can call prediction ‘predictive inference’ or whatever, but despite people’s best efforts to reconcile them, there is often a tradeoff between this type of inference and traditional parameter inference. Even Andrew appears to _largely_ use ‘predictive checks’ to help ensure the validity of parameter inferences (and typically after he does parameter inference).

          ‘Data scientists’, ‘machine learning’ etc tend to care more about predictive inference and try as much as possible to jump straight to this with minimal detours. Parameter ‘inference’ is just a by-product.

          I think they pretty clearly have a different ‘inference’ culture, if you must call it that, hence (erm) Chicken’s comment.

    • Good catch. Unfortunately, I have seen this definition of probability often (including in dictionaries).

      Exploring a little more: Following the About link and then the one to the “creator” of the site, it says he is “a first year master student at Stanford University studying Computational and Mathematical Engineering”. In other words, the “creator” may have some know-how about creating visualizations, but it appears that he doesn’t have a good understanding of probability and statistics. So a well-intended but misguided effort.

      • I assume this stems from the colloquial use of the “likelihood” of something as being short for “how likely” that thing is, i.e. its relative frequency.

        Yes, this is not what is meant by statistical likelihood. I think the distinction would only matter for someone who needs to understand what a likelihood function is, and this page appears to be aimed at an “intro to stats” audience. The author should have picked a different title; I’m guessing he didn’t want to call the demo “probability”, since it is one of three demos under a broader “probability” header.

      • Martha:

        Regarding your last sentence: that’s why in my post above, I wanted to separate the delivery method from the content. Lots of people were sending me this link and telling me how great it is. And I do think it’s great that there’s a better visual interface out there. I hope they can now connect with people who have useful content. You have to start somewhere, and I have no problem with an approach that starts with the interface and then adds quality content later.

        • Did you also not like the studies that led up to inference? It’s very very hard to present this stuff simply, but when I said I thought people should learn basic measure theory a while ago on the blog, everyone thought I was nuts. What’s the right level to teach at? And for whom?

        • Not to answer for Bob, but pretty sure he meant something nowhere near this technical.

          Did not know Jeff wrote such a book. The preface suggests it’s less technical but still adequate coverage compared with Billingsley’s Probability and Measure, which is what grad students at Univ of Toronto used to have to survive to remain in the (at least PhD) program. My sense is 99.999% of the material is focused on subtle continuity issues. I think somewhere in Billingsley it’s even said that everything to do with finite sets is trivial.

          On the other hand, that was where (outside any course) I picked up the connection between characteristic functions and the Fast Fourier Transform (a hopeful but impractical way to learn about the distributions of some functions of random variables): https://stats.stackexchange.com/questions/44209/deconvolution-with-fourier-transform-or-characteristic-function/44221#44221

          So the challenge is that some useful stuff is hidden in really technical books (and, for weak convergence and characteristic functions, in later chapters that courses don’t get to), and who do you trust to find that stuff, pull it out, and put it in a less challenging format?
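          For what it’s worth, the characteristic-function/FFT connection mentioned above can be shown in a few lines of R; the two distributions here are arbitrary, and the point is only that the pmf of a sum of independent discrete variables is a convolution, which the FFT turns into a pointwise product:

            ## pmf of X + Y via FFT convolution (distributions chosen arbitrarily).
            p <- dbinom(0:10, size = 10, prob = 0.3)   # X ~ Binomial(10, 0.3)
            q <- dpois(0:30, lambda = 4)               # Y ~ Poisson(4), truncated at 30

            len <- length(p) + length(q) - 1           # zero-pad to avoid wrap-around
            P <- fft(c(p, rep(0, len - length(p))))
            Q <- fft(c(q, rep(0, len - length(q))))

            pmf_sum <- Re(fft(P * Q, inverse = TRUE)) / len   # pmf of X + Y on 0:(len - 1)

            # Brute-force check by simulation, e.g. P(X + Y = 7):
            # mean(rbinom(1e5, 10, 0.3) + rpois(1e5, 4) == 7); pmf_sum[8]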

  5. I generally think this is a bad idea. I would be afraid that this type of learning fosters confidence in using statistical methods among some people who shouldn’t otherwise be doing so. As Gelman has articulated, statistics is hard, and I think this type of click-style learning will foster more end-users who shouldn’t be using statistics than it would inspire end-users to learn statistics in depth.

    There is certainly a fairly large chance that I am wrong about this, however. I would, though, be more optimistic about the same approach taken for the philosophy of science.

    • I don’t think the people who wrote this interface intended it to be a complete course in applied statistics! It thankfully wasn’t billed as “learn statistics in an hour”. How do you propose someone start learning statistics that would be less dangerous?

      I also don’t understand these reactions. It sounds like Allan Cousins and Paul are saying that a researcher should either remain completely ignorant of stats (so you don’t hurt yourself) or do a Ph.D. in statistics, but anything in between is harmful.

      • I was thinking about that after I posted. I haven’t reconciled my initial reaction with your response about all-in or nothing; there is obvious utility in having people with training in between. If I figure it out I’ll get back to you.

        I think I have strong feelings towards cool graphics because they’ve so often been used to dumb down material with no end game in sight. I’m not saying that happened in this case but that may be why in my less than sober state I was quite quick to suggest it.

        • Allan:

          From a related post http://statmodeling.stat.columbia.edu/2017/08/02/seemingly-intuitive-low-math-intros-bayes-never-seem-deliver-hoped/ the sense I got was that we have to realize that we can only get so much across with this type of material in, say, a webinar – nowhere near an adequate grasp of how statistics works that’s safe – so we need to think more about what we can get across. Maybe a sense of random variables, their distribution and conditioning (as something that is used in Bayesian statistics) – clearly marked as something to build on.

          Think I am running into something similar with regard to a shed I am building at my cottage. The online material on doing metal roofs is sometimes qualified as an entertainment video rather than a how-to. It can be very dangerous on large roofs at the second storey and higher!

        • But at least with the metal roof example, most people will have the fear of heights kick in to bring them down to reality (or gravity will do that for them)! Only a select few have enough bravado to embark on a construction project well over their head after reading a DIY article or watching a video; most usually realize at some point that they need to stop or call a professional. I’m not sure the same is true for statistics.

          By the way I’m a contractor so if you need help with that shed or proper references on construction detailing just let me know!

        • Comment left. If that doesn’t work you can always google “Enable Contracting, Ontario” and that should get you to me. Look for the email at bottom of page and replace information with allan.

          I did attend Western. But I’m pretty sure the implied contract was that once the tuition stopped flowing so did the services. Ruthless!

    • I disagree with Paul and Allan here.

      I’ve seen this reaction – that we need to confine using statistics to those who really know what they are doing. It is a well-intentioned sentiment and a natural reaction to how badly conducted many analyses are. But it is totally unrealistic. In an era of false facts, statistical reasoning is going to be used – and misused – whether you like it or not. I’m all for opening the doors as widely as possible. Let mathematical illiterates have tools that they are likely to misuse. I’m not promoting misuse nor am I advocating throwing in the towel on statistical literacy. But we should welcome the fact that many people are willing to look at evidence. We should encourage them to do so – even if they will do it poorly. And we should follow up with efforts to improve their reasoning and stress the importance of doing it better. Reacting to examples of people misusing statistical analysis by saying they should be kept away from it is a step in the wrong direction.

      • Dale wrote: “But we should welcome the fact that many people are willing to look at evidence.”

        Yes, for those who are willing. However, the majority of scientists I met were not willing; for them, statistics was a bad thing that needed to be done by clicking buttons in SPSS. These people were better off not doing any analyses at all and focusing on the theory behind it instead, or whatever.

        Otherwise the consequences will be fatal (as we already experience with the replication crisis, etc.). In fact, I turned my back on the scientific community because it happened more than once that my analyses were rejected by reviewers who clearly had a poor understanding of statistics (“Why did you not just calculate multiple t-tests?”, “There are no p-values in the manuscript”, …).

        I think your standpoint, as well as the one represented by Allan and me, has its place in this debate. The solution will most probably lie in between, something like teaching only those who are willing and providing support for the others.

        • I have my own horror stories about the difficulty (impossibility) of replication, poor incentives, poor editors, reviewers, academics, etc. The point I take issue with is your implicit belief that we can actually control the damage by only teaching to those who are willing. Much of the poor practice (particularly the intentional, not accidental, kind) is directly related to the fact that a large portion of the population is quantitatively illiterate and ready to be taken advantage of. Opening access will not solve these problems – I think we agree on that – but limiting access will not work. I would also take issue with laying the blame on easy to use software. Having to write code/program is not a protection against the problems. It will limit the perpetrators, but arguably make the offenses more serious (intentional deception, fraud, etc.). I am not bothered by silly (even stupid) accidental mistakes – that is our teaching opportunity.

        • Dale wrote: “Much of the poor practice (particularly the intentional, not accidental, kind) is directly related to the fact that a large portion of the population is quantitatively illiterate and ready to be taken advantage of.”

          I think the unintentional, accidental kind of poor practice is far more frequent and more harmful to the scientific process than the intentional kind, at least in the long run. Fraud will always exist, but it is nothing compared to the hordes of poorly trained researchers out there who plan and perform their studies and experiments on the basis of questionable results obtained by other poorly trained researchers. It is a multiplication of errors and wrong assumptions, like a house of cards that will (hopefully) collapse at some stage.

        • What’s the role of a statistical analysis of evidence when only other statisticians are allowed to be familiar, much less expert, with the methods?

        • Yes, at the “I got the gist from the headline” level, which is obviously inappropriate for professional work. I’m of the mind that the way to solve problems stemming from ignorance is to reduce ignorance, not actively increase it.

          Practically, I’m not even sure how your proposal gets off the ground. Researchers become technicians who run experiments designed and analyzed by statisticians who don’t understand the domain and don’t know how to actually conduct an experiment. If the statistician does know the domain, they know it at a fairly naive level, missing critical nuances and complexities of the problem (ask me how I know).

          Or maybe, in the best case, you have someone with a high level of understanding of both the domain and statistics, at least enough to know that the problems on both sides are hard and complex. Sometimes, maybe often, that person decides to call on someone with more expertise in one thing or another, but even knowing when to ask for help arises from *more* understanding of the problem, not less.

  6. Part of this discussion (the part that’s just harping on how they shouldn’t be trying to explain this stuff) is just off the rails. Part of the culture of the internet is that you post unfinished things and expect feedback. They have links to their team, a contact form, and a Twitter account you could use to tell them about their mistakes; that’s where the action is. Sure, their definition of a p-value and their focus on it are a little off, but most stats departments teach undergrads the same thing and don’t want your feedback, so this group is miles ahead.

  7. I am not a statistician, but I do follow this blog FWIW. I’m disappointed in some of the comments, starting and especially with Andrew’s, which I feel are needlessly negative.

    I’m not contesting the content of Andrew’s (or anyone else’s) comments, just the tone. I would have preferred if Andrew and others had instead complimented the authors on some outstanding work in visualizing statistics and offered some suggestions on how to make the site even better.

    • Vladimir:

      See the concluding paragraph of my above post. It expresses exactly what I think. Perhaps you or someone could write a filter that would take the content of this blog and run it through a post-processor that adds the phrases “In my opinion and I could be wrong,” “No offense,” and “I really appreciate everything you’ve done” at random through the text.

    • P.S. In all seriousness, I was a bit reactive in that post. Some people pointed me to that website saying how great it was, and my reaction was that it had good graphics but the statistical inference content was no good. On the other hand, if people had pointed me to that website saying how bad it was, my reaction might well have been that, sure, the inference content was no good, but the graphics were great. In any case, I stand by my overall reaction that I’d like to see the high-quality graphics attached to some more modern content. As to “suggestions on how to make the site even better”: my quick suggestion, no joke, would be to remove the inference material entirely. Deciding what to replace it with, that would be more work. There’s a reason that I’ve only taught intro statistics once in the past fifteen years.

        • This subject seems to come up repeatedly – being nice and avoiding unnecessarily making someone unhappy versus being direct about enabling someone to do better research.

          Its a value/trade-off that will differ among us.

        • See Daniel Dennett’s advice on how to give criticism:

          How to compose a successful critical commentary:

          1. You should attempt to re-express your target’s position so clearly, vividly, and fairly that your target says, “Thanks, I wish I’d thought of putting it that way.”

          2. You should list any points of agreement (especially if they are not matters of general or widespread agreement).

          3. You should mention anything you have learned from your target.

          4. Only then are you permitted to say so much as a word of rebuttal or criticism.

          See: https://www.brainpickings.org/2014/03/28/daniel-dennett-rapoport-rules-criticism/

          I’m not sure what Dennett meant by (1)—he’s a philosopher and they go slowly through deep water. I’m not sure (1) would’ve been helpful. Andrew did (2), though it was rather perfunctory. He might’ve mentioned it would’ve been an A project in a statistical graphics and communication class (or maybe it would’ve flunked for mentioning p-values—no idea how Andrew grades). For (3), that’d be either nothing if we’re looking at the content or maybe something for just the graphics. As for (4), well, that’s what everyone’s objecting to.

          One of the reasons I like building tools is that generally people say “thank you” rather than picking apart minor details, though we do get our share of that for Stan.

        • This is a little off topic, but Dennett’s #1 is about ensuring one is not constructing a strawman (a particularly acute problem in philosophy). It’s also a call to stop and carefully think through the target’s position, especially from the target’s perspective.

        • (1) is a little hard to do in this case because there isn’t really an argument being made, except implicitly that “this helps students learn”; but that’s not the issue Andrew is interested in. He is interested in “what should students learn,” which is more of an implicit argument of the website.
          It’s actually the normal structure for scholarly discussion in fields where there are debates about theory.

  8. I like their demonstrations of coin flipping and dice rolling and estimation of pi from “throwing darts”.

    They illustrate frequentism so well, since we get randomness (or pseudo-randomness in this case, or, to be more exact, pseudo-randomness that doesn’t fail standard tests for non-randomness) and they show, quite easily, fairly rapid convergence of the long-term frequencies that can be relied upon.

    I’d recommend the authors of the site cite “Mathematical Theory of Probability and Statistics” and “Probability, Statistics and Truth” by Richard von Mises for extra background on the foundations of frequentism.

    Thanks,
    Justin

    • I do, too. In fact, a lot of people like these examples so much that they’re the canonical way to start all these topics (e.g., see the Wikipedia page on Monte Carlo methods).

      I really like Bulmer’s intro to stats if you want a simple intro to hypothesis testing (it’s an inexpensive Dover book). And he starts exactly with the convergence of coin tosses and dice rolls (with some biased dice, no less). And it’s a nice intro to the two basic notions of probability without too much math.

      There’s nothing frequentist per se about the central limit theorem—it’s also a theorem for Bayesians! We use it all the time in Monte Carlo calculations, for example. What makes you frequentist is what you don’t allow—probability statements about things with single values, like parameters. It’s like the patent system that way—a patent doesn’t give you the right to manufacture a product (doing so may infringe earlier patents, for example), rather it lets you prevent other people from using your intellectual property.
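      For instance, here’s the “darts” estimate of pi from the demos above with a CLT-based Monte Carlo standard error attached; this is just the standard construction, nothing specific to the website (a minimal sketch):

        ## Monte Carlo "darts" estimate of pi, with a CLT-based standard error.
        set.seed(42)
        n <- 1e6
        x <- runif(n); y <- runif(n)
        inside <- (x^2 + y^2) <= 1          # dart lands inside the quarter circle

        pi_hat <- 4 * mean(inside)
        mc_se  <- 4 * sd(inside) / sqrt(n)  # CLT: the estimate is approximately normal

        c(estimate = pi_hat, std_error = mc_se)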

      • CLT might be for Bayesians, or used by Bayesians, but I don’t think it is Bayesian in nature. There is a Bayesian CLT, I guess: something like, under some reasonable assumptions, the posterior distribution would be approximately normally distributed; that would fit the bill. But I would only trust it if it had good frequentist properties. Of course if n is large enough maybe the Bayesian and frequentist analyses would somewhat agree anyway, if not on the parameters then at least on the decisions made from them.

        -Justin

        • Perhaps breaking the rules Bob posted above – the link does seem to be about numerical integration using smart Monte Carlo – doing math approximation by computer rather than inference about empirical phenomena.

          But if we roughly recall Geyer’s CLT – it’s the true likelihood not being that different from one based on a Normal data generating model that is indexed by the sample mean and variance. So he avoids any consideration of a sampling distribution.

        • The link was to show that even though standard Monte Carlo methods for numerical integration are purely frequentist (using them to calculate Bayesian expectations doesn’t make them Bayesian) it seems people have also explored a “Bayesian” approach. I have not read the paper, to be honest.

          (I don’t understand your second remark. I don’t know if it’s related to the frequentist character of the Central Limit Theorem and Monte Carlo methods or if it goes in a different direction.)

        • Thanks.

          > Monte Carlo methods for numerical integration are purely frequentist
          To me that is just approximation – to be frequentist _inference_ – it would need to be an estimate or test of something empirical.

          > I don’t know if it’s related to the frequentist character of the Central Limit Theorem
          It was, but I was wrong thinking it gave a Bayesian CLT – it actually ignores the CLT altogether: https://projecteuclid.org/download/pdfview_1/euclid.imsc/1379942045

          Here are notes apparently from NormalDeviate claiming there is a Bayesian central limit theorem – http://www.stat.cmu.edu/~larry/=stat705/Lecture14.pdf

        • Monte Carlo gives an estimate for the true integral based on a finite random sample. The estimate will vary with a sample and converge as the sample becomes arbitrarily large. This corresponds to at least one brand of frequentist inference imo.

          Just replace an unknown mathematical object – the integral – with an unknown physical theory object or quantity and the idea is the same, and pretty convincing to me.

          (Less convincing to me is Neyman-style ‘error probabilities’ based on an assumed model and _decisions_ under this model. So yes, there are a few flavours of frequentist inference imo, and some are more appealing than others. )

        • Keith’s point is that the integral isn’t empirical; it’s logically defined as a definite single number. MCMC or IID sampling using an RNG are computing methods to get close to this number. The RNG is stress-tested on sequences of trillions of numbers, and typically has proofs of desirable properties such as long periodicity.

          When replacing a pure mathematical construction with an empirical scientific process, nothing like the strong tests of randomness that the RNG underwent is available to you to make the assumption that your sample of crops or asthma patients or fish eggs or whatever has properties that make the math of random sequences plausible for calculation, at least not until you have samples of sufficient size that might allow some mild test of randomness to operate. Around a few thousand would be the minimum effective size for that type of test. You’ll need to learn the empirical distribution and then test for autocorrelation and stationarity and so forth.

        • This discussion reminds me of chemistry lab work at university where we plotted some curve on paper, cut it out with scissors, and weighed it on a precision scale to get the AUC. I guess kids these days use a Snapchat filter or something.

        • I use Tai’s method for numerical integration.

          And yes the empirical study of objects from the messy real world is usually much messier than the empirical study of mathematical objects (and is presumably further covered in…the subject of statistics and experimental design etc), but is the distinction really that clear as a matter of principle?

          Do you not think mathematics can be investigated empirically?

        • > the distinction really that clear as a matter of principle?
          I think it is – mathematical objects are defined by abstract representations and these representations are directly worked upon (analytically or numerically approximated), so it’s deduction (even if the approximation does use random devices).

          Induction is about learning about reality, which can only be fallibly represented, and so the representations we work on are not the ones we’re interested in, and we have to worry about that a lot.

          To me statistical inference of any stripe is learning about empirical reality not mathematical abstract representations.

        • ojm: I don’t think “investigating a mathematical relationship empirically” really uses the same sense of the word “empirically”. It’s like “statistically significant” vs “practically significant”.

          In science, empirically means collecting data about the world using some protocol. In math, empirically means generating concrete examples of certain abstractly defined sets and looking at them. In math, the set is itself a well defined object and its relationship to the more general thing is well defined. In science, not so much. I mean, we can study 22 southern German asthma patients found by recruitment via posters in the lobby of a local clinic, but what is the biochemical relationship between them and say “all the people who today, or in the future, will have pulmonary problems due to allergies and need treatment?”

          When you generate an iid random sample for an integral, at least you know you’re going to get numbers from inside the high probability region of the space. When you recruit 22 asthma patients, maybe one of them just has Munchausen syndrome, and 3 of them were ex chemical plant employees whose lungs were damaged by exposure to an industrial chemical rarely used outside Southern Germany.

        • OK I suppose we clearly differ on this. I think it makes perfect sense to investigate mathematics empirically/experimentally.

          On the wiki page for experimental mathematics:

          https://en.m.wikipedia.org/wiki/Experimental_mathematics

          You’ll find the following quote from Paul Halmos:

          > Mathematics is not a deductive science—that’s a cliché. When you try to prove a theorem, you don’t just list the hypotheses, and then start to reason. What you do is trial and error, experimentation, guesswork. You want to find out what the facts are, and what you do is in that respect similar to what a laboratory technician does.

        • ojm:

          I agree with that – as Peirce put it, experiments performed on representations rather than lab materials – but the distinction with math experiments is that the representation and what it represents are one and the same thing.

          Daniel is saying pretty much the same thing, I think.

        • ojm: I think this is a consequence of the fact that humans are the mathematicians. Of course, there are similarities in the psychological techniques we have to come to grips with mathematics and to come to grips with say the behavior of cells in an organ, but this is just saying that both mathematicians and biologists are human, it isn’t saying that the underlying phenomenon of say the primality of certain very large integers and the rate of metabolism of ethanol in liver cells of person X are fundamentally the same kind of thing.

          I’m going to define “scientific empiricism” as the process of guessing an approximate answer to a question about a process that occurs in the universe and then causing molecules to carry out a time-evolution in such a way that humans agree that such a process has occurred, and then checking the final result against the prediction.

          I’ll define “mathematical empiricism” as the process of guessing something about a logical process and then carrying out formal calculations to determine whether the guess meets certain logical requirements for an answer or approximate answer to a more detailed formula.

          Both involve guess and check… but one involves checking against the specific evolution operator of the state of the universe, and the other involves checking against an abstract symbolic formula.

          The *process* is similar (“guess and check”), but only one of them provides information about the way atoms and clumps of atoms actually do behave.

          So, I fully agree with you and Halmos that an empirical type process occurs in Mathematics. I just disagree that we should lump it with the empirical process of science, because I think the distinction where physical experiment is the final arbiter is very important. This gets me for example when people run many thousand dimensional dynamic global circulation models on computers and claim to know what the future climate will be, or the like. What you know, is what the behavior of a certain complicated mathematical formula is. What you *don’t* know is whether that behavior mimics the behavior of the universe over those time-scales.

        • > I use Tai’s method for numerical integration.

          That’s a good illustration of the risks of teaching less statistics in order to protect people from themselves. They will rediscover things poorly anyway when they need them.

        • As soon as someone comes along and tells us the exact time evolution operator and state of the universe at a fixed point in time, and provides us with a computer in a different dimension that can simulate it very quickly, then the two would become identical! We could empirically study the time evolution operator of the universe through guessing and checking against the precise mathematical description of that operator as a computer program in another dimension where things go faster and storage is vastly larger. Of course, this symmetry leads people to say things like “we’re already living in a simulation” but whatever. Until that day comes, scientific empiricism is a very special case, and can only be investigated through experiment on atoms and things not on formulas.

        • I wonder to what extent the Bayesian/Frequentist divide tracks the Rationalist/Empiricist divide?

          Anyway, I personally take Monte Carlo to belong to the general frequentist umbrella, in the sense of estimating an unknown quantity based on a function of a finite number of measurements of that quantity, and where the function is assumed, proven, thought to, or hoped to converge to the target as the number of measurements becomes arbitrarily large.

          The paper Carlos links to shows at least _some_ prominent Bayesians agree, though I suppose we shouldn’t generalise this finite sample to the full population so readily ;-)

        • Re
          > This gets me for example when people run many thousand dimensional dynamic global circulation models on computers and claim to know what the future climate will be, or the like. What you know, is what the behavior of a certain complicated mathematical formula is. What you *don’t* know is whether that behavior mimics the behavior of the universe over those time-scales.

          That’s very much in line with my position. I’m generally more and more inclined towards the view that mathematical models are not/should not necessarily be considered literal representations of the world.

          I’m just saying that you can investigate the behaviour of a complex math model in much the same way you can investigate the behaviour of the ‘real’ world.

        • Daniel Lakeland said, “I’ll define “mathematical empiricism” as the process of guessing something about a logical process and then carrying out formal calculations to determine whether the guess meets certain logical requirements for an answer or approximate answer to a more detailed formula.

          Both involve guess and check… but one involves checking against the specific evolution operator of the state of the universe, and the other involves checking against an abstract symbolic formula.”

          Speaking from the perspective of a pure mathematician (as Halmos was), I do agree with ojm’s quote from Halmos, but don’t think that “carrying out formal calculations to determine whether the guess meets certain logical requirements for an answer or approximate answer to a more detailed formula,” and “the other involves checking against an abstract symbolic formula” capture what Halmos was describing.

          Instead, I think that what Halmos was saying was more like: You want to start with assumptions A and get, by logical reasoning, to conclusion B. But you can’t expect to proceed in a straightforward manner. You have to try something, with the understanding that it might not get you to B. But it might get you to C, and then you might try to get from C to B. But that might not work, so you might then try to look for D that will indeed get you to B, and try to get from A to D. Or you might try to find a counterexample to your conjecture: a case where A is true but B is not. That might give you insight — maybe you need to add more conditions to A, or maybe you might need to modify B. But maybe trying to construct a counterexample will give you insight in how to get from A to B; or etc., etc.

        • Martha, yes that is more or less the psychological process of “figuring out” what is going on, and it applies to how we figure out what is going on in biology, or ecology, or chemistry or whatever as well… but when it comes to determining whether you’ve succeeded, in pure math it’s down to whether or not you could in theory write down a long sequence of symbols which follow an allowable formal system for symbolic logic and lead to a valid proof. I mean, this is the logician’s ideal. Obviously in practice we usually use less than truly formal proofs; we suggest in a more understandable way what the symbolic expression would look like. But the arbiter of truth is whether such a formal symbolic formula that follows valid rules of construction does exist.

          Whereas, the arbiter of “truth” in say biology or physics or ecology is whether or not we consistently get some results out of running physical experiments on lumps of atoms. And here the meanings of “get some results” and “experiments” and “lumps of atoms” are all pretty wobbly compared to that theoretical long lambda calculus or first order predicate calculus formula.

          So, how you get to the formula is sure very human, very wishy washy, try and see, what the heck, maybe it’s this way, or maybe it’s that way… but in the end, you have something that in theory you could run through a proof validation algorithm. Not so in empirical sciences.

        • ‘Monte Carlo is a purely frequentist procedure…[and thus] fundamentally unsound’…so instead we’ll propose a Bayesian approach, starting from ‘A very convenient way of putting priors over functions is through Gaussian Processes’…sigh.

      • We have to distinguish I think between

        Frequentist statistics

        and

        Mathematical probability theory

        Mathematical probability theory is the mathematical description of certain pure number theoretical sequences. For example, we can create the set of all 10^1000 length sequences of values 0 and 1. We can then define a filter over these which selects those that are “random” according to some comprehensive test. The properties of such sequences are mathematical facts that both Bayesians and Frequentist statisticians need to agree with if they accept basic set theory etc. It’s like the statement sqrt(4) = 2; it’s not controversial. (This definition of randomness is Per Martin-Lof’s and has been proven equivalent to some by Kolmogorov; see my blog post and linked article http://models.street-artists.org/2013/03/13/per-martin-lof-and-a-definition-of-random-sequences/ )

        On the other hand Frequentist statistics assumes that the frequency property of abstract number sequences such as the above is a good model for the behavior of scientific measurements. This assumption is very questionable by Bayesians, and they in general do NOT always make this assumption (though of course sometimes could without contradiction), instead the general Bayesian assumption is that the mathematical notion of a measure over spaces is a useful model of degree of some kind of knowledge. A Bayesian distribution doesn’t in general represent the distribution of actual measurements, it represents the knowledge that we have under our assumptions about where measurements would or would not be likely to happen. p(Data | Parameters) once you plug in fixed data is a degree of agreement between prediction from a model with limited knowledge built-in and observed fact from measurement.

        With all that being said, the CLT is a mathematical fact about abstract number sequences. If we use said sequences to perform calculations (such as MCMC etc.) then the sequences will have CLT properties because all such sequences do. That need not mean that actual scientific data *is* such a sequence to the Bayesian, but the belief that actual scientific data is well approximated by a Per Martin-Lof type random sequence *is required* to believe Frequentist results.

        • To see how problematic this can be, consider the difference between the following 2 procedures:

          1) Take the heights of 100,000 people in the US, shuffle them into a random order using the Knuth / Fisher-Yates algorithm (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) driven by high-quality random numbers from something like the Mersenne Twister, and hand them out in groups of 20.

          2) Take the heights of 100,000 people, sort them into ascending order, and hand them out in groups of 20.

          Now, in method 1, because we are using 100,000 measurements and a high-quality RNG, the first group of 20 is virtually guaranteed to have an average that is only a small multiple of sd(Heights)/sqrt(20) away from the true overall average across all 100,000. So is the second group of 20, and the 3rd, and the 4th. The number of groups of 20 whose average lies a given distance from the true average is well described by the normal approximation to the sampling distribution of the mean: it decays like exp(-z^2/2), where z = (mean(group) – muTrue)/(sdTrue/sqrt(20)), so that out of the 5000 groups of 20 only around 1 or 2 of them land more than z = 3.6 away from the true mean.

          Now, in method 2 the first 20 numbers are guaranteed to be very, very low compared to the true overall average. The second set of 20 will also be very low, though guaranteed to be higher than the first 20… and so on: the first 2500 sets of 20 will ALL be low compared to the true average, and the average over all of those first 2500 sets will be low compared to the true value. In fact, you won’t get within a small multiple of sd(Heights)/sqrt(n) of the truth until you’ve seen ALMOST ALL of the sets of 20.

          Absolutely nothing about the CLT applies to sets of 20 people taken from method 2.

          The sequence in (2) is *not* a random sequence according to Martin-Lof’s criteria. It is a perfectly good sequence; it contains *exactly* the same numbers as sequence (1), but none of the properties of random sampling apply to it.

        • Also, in method 2 each set of 20 will have an extremely small standard deviation, so if you rely on standard sampling theory, looking at any given set of 20 will make you almost completely sure that every single person in the country is the same height to within 1/4 of an inch, p < 0.00001 or whatever. (A small simulation sketching both failures follows.)
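
          A minimal Python sketch of the two procedures. The “heights” here are simulated from a normal distribution rather than taken from real people, and NumPy’s default generator stands in for the Mersenne Twister; the point is the sequences, not the data.

          import numpy as np

          rng = np.random.default_rng(1)            # any high-quality RNG will do here
          heights = rng.normal(170, 10, 100_000)    # made-up "heights" in cm, for illustration only
          true_mean, true_sd = heights.mean(), heights.std()

          def group_stats(seq, k=20):
              groups = seq.reshape(-1, k)           # 5000 groups of 20
              return groups.mean(axis=1), groups.std(axis=1)

          # Method 1: shuffle, then hand out groups of 20
          m1_means, m1_sds = group_stats(rng.permutation(heights))
          # Method 2: sort ascending, then hand out groups of 20
          m2_means, m2_sds = group_stats(np.sort(heights))

          z1 = (m1_means - true_mean) / (true_sd / np.sqrt(20))
          z2 = (m2_means - true_mean) / (true_sd / np.sqrt(20))

          print("method 1, groups with |z| > 3.6:", int(np.sum(np.abs(z1) > 3.6)))  # around 1 or 2 of 5000
          print("method 2, groups with |z| > 3.6:", int(np.sum(np.abs(z2) > 3.6)))  # thousands of them
          print("method 1, median within-group sd:", float(np.median(m1_sds)))      # close to the full sd of 10
          print("method 2, median within-group sd:", float(np.median(m2_sds)))      # tiny

          With the shuffled handout the group means scatter exactly as the normal sampling-distribution story predicts; with the sorted handout the very same 100,000 numbers give group means that are mostly far from the truth and within-group standard deviations that would make a naive analyst wildly overconfident.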

        • Daniel Lakeland said,

          “On the other hand, Frequentist statistics assumes that the frequency properties of abstract number sequences such as the above are a good model for the behavior of scientific measurements. Bayesians regard this assumption as very questionable, and in general they do NOT always make it (though of course they sometimes could without contradiction); instead, the general Bayesian assumption is that the mathematical notion of a measure over a space is a useful model of a degree of some kind of knowledge. A Bayesian distribution doesn’t in general represent the distribution of actual measurements; it represents the knowledge that we have, under our assumptions, about where measurements would or would not be likely to fall. p(Data | Parameters), once you plug in the fixed data, is a degree of agreement between the prediction from a model with limited knowledge built in and the observed facts from measurement.”

          This seems to somewhat get at the issue, but not quite. Here’s an *attempt* at what I see as a better description:

          The theory of probability can be stated abstractly in terms of certain axioms, and theorems derived from the axioms. Frequentist statistics applies this theory to model how actual measurements are distributed, whereas Bayesian statistics applies this theory to model both actual measurements and also to model knowledge (or assumptions or uncertainty) about parameters that are involved in modeling measurements.

          I’m not entirely happy with the above attempt, so let me attempt to clarify a little:

          1. In both frequentist and Bayesian statistics, we model distributions of measurements. In both cases, these models involve parameters. In both cases, the parameters are unknown.

          2. Frequentist statistics uses probability theory (and the model of the distribution of the measurements) to derive a model for the distribution of estimates (calculated from actual measurements) of the parameters, then uses this model and the actual measurements to come up with “best” estimates (given the data and the model) for the parameters.

          3. In addition to modeling the probability distribution of measurements, Bayesian statistics involves modeling the probability distribution of the parameters involved in the model of measurements. This second model is based on the particular model of the measurements as well as any prior knowledge or beliefs that might be relevant. Bayesian statistics then uses the model of measurements, the model of the parameters, and probability theory to derive an “updated” (posterior) model for the parameters, and from that, if desired, an updated model for future measurements.
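
          A minimal numerical sketch of the contrast in points 2 and 3, for the simplest possible case (a normal mean with the measurement standard deviation treated as known; the data, the prior, and the conjugate normal-normal update are a standard textbook illustration, not anything specific from this thread):

          import numpy as np

          y = np.array([4.1, 5.3, 4.8, 5.9, 4.4])   # made-up measurements
          sigma = 1.0                                # measurement sd, treated as known
          n, ybar = len(y), y.mean()

          # Point 2 (frequentist): the estimator ybar has a known sampling distribution,
          # Normal(mu, sigma/sqrt(n)); use it for an estimate and a 95% confidence interval.
          se = sigma / np.sqrt(n)
          print("frequentist estimate:", ybar, "95% CI:", (ybar - 1.96 * se, ybar + 1.96 * se))

          # Point 3 (Bayesian): also put a distribution on the parameter mu itself.
          # With a Normal(mu0, tau0) prior and normal measurements, the posterior is
          # again normal (conjugate update).
          mu0, tau0 = 0.0, 10.0                      # a vague, made-up prior
          post_var = 1 / (1 / tau0**2 + n / sigma**2)
          post_mean = post_var * (mu0 / tau0**2 + n * ybar / sigma**2)
          print("posterior for mu: Normal(mean=%.3f, sd=%.3f)" % (post_mean, np.sqrt(post_var)))

          # The posterior predictive for a new measurement adds the measurement noise:
          print("posterior predictive sd for a new y:", np.sqrt(post_var + sigma**2))

          The contrast is simply that the frequentist calculation carries a distribution for the estimator ybar, while the Bayesian calculation carries a distribution for the parameter mu itself.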

        • In your part 2, “derive a model for the distribution of estimates” leans entirely on the properties of those high-complexity random sequences, and need not work at all for other sequences.

          Given the hard work that goes into designing computer random number generators, the number of failed attempts, and the massive batteries of testing they undergo (see “dieharder”: https://webhome.phy.duke.edu/~rgb/General/dieharder.php ), it seems highly suspect to claim that just any old single sample of, say, asthma patients, or counts of tree frogs in a certain region of a rainforest, or whatever, automatically has all the properties of a random number generator that are required here. Those properties are what let us treat the measurements as the prefix of a proper random sequence whose distribution we can infer by simply choosing one from the back pages of our stats textbook, so that the sample is probabilistically guaranteed to have, say, an average and standard deviation “close” to some stable parameters describing the RNG, and so that future such measurements will give similar results.

          Of course, there are cases where this idea does hold up. When you get large volumes of data, you can sub-partition it and check it against the basic idea that it is a stable, high-complexity random sequence: for example, industrial process control, where we get hundreds of measurements a day or whatever. (A small sketch of that kind of check appears at the end of this comment.)

          Still, the appropriateness of this basic idea in a one-off study of, say, acupuncture for 22 asthma patients in southern Germany is highly questionable.
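
          Here is a toy Python sketch of the sub-partitioning check mentioned above. The “process measurements” are simulated, and the check shown (comparing the scatter of batch means to sd/sqrt(batch size)) is just one crude diagnostic among many:

          import numpy as np

          rng = np.random.default_rng(0)

          # Pretend this is a long stream of process measurements arriving in time order.
          stream = rng.normal(50.0, 2.0, 10_000)

          # Sub-partition into consecutive batches and look at how the batch means scatter.
          batch = 100
          means = stream.reshape(-1, batch).mean(axis=1)
          expected_scatter = stream.std() / np.sqrt(batch)

          print("sd of batch means:", means.std())
          print("what a stable random sequence would give:", expected_scatter)
          # If the observed scatter is much larger than expected, or the batch means
          # drift over time, the "stable high-complexity random sequence" picture is
          # suspect and the textbook sampling-distribution guarantees don't apply.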

        • “Still, the appropriateness of this basic idea in a one-off study of, say, acupuncture for 22 asthma patients in southern Germany is highly questionable.”

          Yes … mainly because, among other things, there is likely not even any real attempt at probability sampling happening in those studies, since there is no well-defined population to try to randomly sample from. If you are interested in an industrial process inside one company, you really could meaningfully use an RNG to select a sample, because it’s actually possible (though in that case you might ask why sample at all?).

          I think there are two issues that get mixed up for a lot of students, researchers, and people talking about this: one is about statistics and probability theory, and the other is about research design. I use the CLT and simulations of random sampling (including random and non-random sampling of the students in the room) in my research methods course to talk about designing a sampling strategy for data collection. If you are going to study 22 asthma patients in Germany, what’s the best way to pick them? And is 22 a reasonable number for the stated purpose of the research?
