
Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .

One of the most satisfying experiences for an academic is when someone asks a question that you’ve already answered. This happened in the comments today.

Daniel Gotthardt wrote:

So for applied stat courses like for sociologists, political scientists, psychologists and maybe also for economics, what do we actually want to accomplish with our intro courses? And how would it help to include Bayesian Statistics in them?

And I was like, hey! This reminds me of a paper I published a few years ago, “Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .”

Here it is, and it begins as follows:

I was trying to draw Bert and Ernie the other day, and it was really difficult. I had pictures of them right next to me, but my drawings were just incredibly crude, more “linguistic” than “visual” in the sense that I was portraying key aspects of Bert and Ernie but in pictures that didn’t look anything like them. I know that drawing is difficult—every once in awhile, I sit for an hour to draw a scene, and it’s always a lot of work to get it to look anything like what I’m seeing—but I didn’t realize it would be so hard to draw cartoon characters!

This got me to thinking about the students in my statistics classes. . . .


  1. Daniel says:

    Thank you very much for the link. I like the paper a lot and will think about it a little bit more. I stumbled upon one of your remarks, though:

    “Unfortunately, I’m afraid that
    the top quantitative students in sociology, economics, and public
    health do not take this sort of applied class, instead following
    the fallacy that the most mathematical courses are the best for
    them. (I know about this fallacy—it was the attitude that I had
    as an undergraduate, until I stumbled into an applied statistics
    class as a senior and realized that this stuff was interesting and
    difficult.)” (p. 202)

    Do you not think knowing more about the mathematical properties does help to understand the models? It was often my mathematical background, which allowed me to be able to work with more complex models and procedures more easily than other social science students. At least I think so. But even with a minor in mathematics I sometimes have the impression not to understand enough about the basic properties. I might be a special case being actually interested in the mathematical modeling itself, though.

    In my own experience, the applied classes were quite useless if you’re not really interested in the particular topics. Of course, doing a multivariate linear regression analysis yourself is important, but doing it the third or fourth time is annoying. I had to take on more complex models on my own not to be bored to death. You need at least a focus on methods and I think that’s why your course is working well. Your Data Analysis book helped a lot to keep me interested in the social sciences and data analysis by the way.

    I think you really can’t stress enough, though, how important the deterministic part of the models is, especially interactions need much more love in university courses…

    • Daniel says:

      To be more precise: The linear regressions in the applied courses we were meant to do on our own did also not include interactions or something like adding regression parameters to watch for spurious effects or anything like it. Yes, I could do it on my own and did so, but the courses were not meant to teach this. Andrew’s regression examples in Data Analysis are far more interesting because they teach you how to improve your models instead of just “take a few theoretically plausible models, do the regression with SPSS, interpret the results and write a research report.”

  2. Obviously it’s a bit silly of me to be saying this to someone like Andrew, but I feel that it’s a mistake to not emphasize the sampling distribution of the sample mean. Not understanding this concept causes a whole lot of confusion and trouble. You can teach it without any derivations, by simply using simulation. Repeatedly sample from a normal or some other distribution, and compute and store the means in a vector and examine its properties. E.g., you can show by simulation the relationship between SE, SD, and sample size.
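A minimal sketch of the simulation described in this comment, using only the Python standard library. The function name `simulate_sample_means`, the choice of a normal distribution with standard deviation 2, the sample sizes, and the number of replications are all illustrative assumptions, not anything specified by the commenter.

```python
import math
import random
import statistics

def simulate_sample_means(n, n_reps=2000, mu=0.0, sd=1.0, seed=0):
    """Draw n_reps samples of size n from Normal(mu, sd); return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean([rng.gauss(mu, sd) for _ in range(n)])
        for _ in range(n_reps)
    ]

sd = 2.0  # assumed population standard deviation
simulated_se = {}
for n in (10, 40, 160):
    means = simulate_sample_means(n, sd=sd)
    # The SD of the stored means is the simulated standard error of the mean.
    simulated_se[n] = statistics.stdev(means)
    theory = sd / math.sqrt(n)
    print(f"n={n:4d}  simulated SE={simulated_se[n]:.3f}  sd/sqrt(n)={theory:.3f}")
```

Quadrupling the sample size should roughly halve the simulated standard error, which is the relationship between SE, SD, and sample size that the comment refers to.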

    • Andrew says:


      Instead of first “deriving” the sampling distribution of the sample mean, and then describing the sampling distribution of the estimated regression coefficient as being similar, I’d prefer to first discuss the sampling distribution of the estimated regression coefficient, and then show how the sample mean is a special case of a regression on a constant term.

      That is, I think it makes sense to start with regression and then consider the mean as a special case. This seems counterintuitive from a mathematical perspective, but from the statistical perspective I think it makes sense to start with regression, which is more interesting and more grabbable.
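The "mean as a special case of regression" point can be checked numerically. The sketch below is an illustration under stated assumptions: the data values are made up, and `ols_single_predictor` is a hypothetical helper that fits a least-squares line through one predictor with no separate intercept. When the predictor is a column of ones, the fitted coefficient is the sample mean and its standard error is the usual sd/sqrt(n).

```python
import math
import statistics

def ols_single_predictor(x, y):
    """Least-squares fit of y = beta * x + error; returns (beta, standard error of beta)."""
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    residuals = [yi - beta * xi for xi, yi in zip(x, y)]
    dof = len(y) - 1  # one estimated coefficient
    sigma_hat = math.sqrt(sum(r * r for r in residuals) / dof)
    return beta, sigma_hat / math.sqrt(sxx)

y = [4.1, 5.3, 3.8, 6.0, 5.2, 4.7]   # hypothetical data
ones = [1.0] * len(y)                # the "constant term" predictor

beta, se_beta = ols_single_predictor(ones, y)
print(f"regression on a constant: beta={beta:.4f}, se={se_beta:.4f}")
print(f"sample mean and sd/sqrt(n): {statistics.fmean(y):.4f}, "
      f"{statistics.stdev(y) / math.sqrt(len(y)):.4f}")
```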

  3. Nony says:

    I’ve taken classes in DOE in school and in industry. Got a lot of good insights that helped me solve problems. I don’t think the theoretical abstractions of stats theory were of any relevance to me. And sure don’t think some dick size contest of “mindsets” of Bayesian/frequentist would have helped me.

    I think you people really miss the picture. Work more to help people remove non-representativeness from surveying. And not in some theoretical way, but really crafty practical ways. Heck, help people who need to make decisions and lack time/money to do gold standard work to do the best they can.

    Yeah…I’m sure there’s a need for people off on the side to argue about linear regressions versus something else. But for what people with no knowledge need? Nurses and engineers and businessmen? They don’t need to be exposed to your Protestant heresies.

    • Entsophy says:


      In Newton’s time only a handful of people were able to read Newton’s works and even when they did there were almost no real problems they could actually solve with it.

      Jump ahead a couple of hundred years, and Classical Mechanics is regularly taught to thousands of Engineers. They’re able to master it in about a year and able to use it effectively on vastly more real problems than Newton ever could have.

      That difference occurred because someone took the time to get all that theoretical stuff right and really understand it. They then used that understanding to both dramatically improve and dramatically simplify the subject. It’s extremely likely in my opinion that Statistics will follow the same path.

      There is nothing less practical in the long run than pragmatists.

      • K? O'Rourke says:

        Because pragmatists (or at least pragmaticists) purposefully “understand what the model is saying” and “know what their models are doing” and do their best to continuously get them less wrong.

        Andrew, here seems to be suggesting skills are first needed (before grasping concepts) but not sure if/why these need to be developed first on classical approaches (“we introduce Bayes by stealth”). Perhaps similar reasons to advantage of doing regression before means as better to engage interest of students?

    • Entsophy says:

      Also, are you aware that basically everyone here spends a lot more time on applied work than theoretical?

      • Nony says:


        Those are all good points.

        1. If you ever get it figured out enough so that it is a no brainer that it helps people and is easy (like um…correcting for Coriolis with offshore gunfire support?), then that’s good. I guess, have fun and try to work out useful approaches. Just don’t kill the students until you have it down.

        2. I’m glad y’all are applied. I would not be able to hang with the math/discussion if it weren’t.


        Still…my sympathy is much more with people who do “X” and just need some basic stats learning in their bag of tricks. And could care less about differing religions fighting each other. Like people who want to make money or put ordnance on target or cure cancer or find submarines or stuff like that. You know. The operational art.

        • Entsophy says:


          Let me try to convince you that the difference between Bayes and Frequentists isn’t just a difference of interpretation, but has very practical consequences.

          Take probably the oldest application of statistics to science, that of measuring a parameter mu subject to measurement errors.

          In the standard view, the likelihood of errors represents the histogram of a long string of such errors. If those errors don’t occur with the proper frequency (approximately) then the statistics is wrong.

          An alternate view is that the likelihood is merely a way of describing how well we can pin down or locate the actual errors in the data. It’s a way of codifying the uncertainty in our knowledge of those particular errors (all future errors that might be seen being irrelevant). In fact, even the histogram of the errors actually in the data is almost entirely irrelevant.

          These are two very different criteria. The latter is often satisfied, while the former is hardly ever satisfied in real physical systems. The consequence of the latter understanding is that we can get accurate interval estimates for mu sometimes even when the likelihood looks nothing like the frequency of errors! That is a huge practical benefit, which in the right hands is an immensely powerful tool.

          The details can be found here: with a cleaner explanation here

          Note the mathematical details are trivial. It’s not that Frequentists couldn’t have seen the point mathematically, it’s that they never thought to even check those details because of their philosophy.

          Statistics is in an unusual situation that way. If the problems with statistics could have been fixed by either pragmatists or mathematicians, they would have been fixed eons ago. There has been an entire army of superb practitioners and mathematicians focused on statistics over the years.

          The obstacles to progress in statistics are almost entirely conceptual.

          • Nony says:

            I honestly didn’t get any of that. What I have learned in classes was to assume a normal distribution. Maybe if you need to be super fancy, do a test for normality if someone is going to be a jerk. Or watch out for bimodal issues or floors or ceilings. But in general. Figure it’s normal. It’s like throwing darts at a bullseye. Do enough replications so your average is close enough to the true value. Lather, rinse, repeat.

            Move on to other interesting parts of the problem. Go have a beer.

            • Anonymous says:

              Nony, your point of view is so much more divorced from the real world than anything you’re criticizing.

            • Entsophy says:

              “Go have a beer”

              I don’t drink alcohol anymore. I only drink whiskey and limit it to special occasions like “breakfast”. Lunch sometimes too. Dinner quite a bit as well now that I think about it.

            • Entsophy says:

              How about this then. This is as simple as I can make it.

              You’re trying to weigh someone. They weigh 200lbs. You take three measurements and get 199, 201, 202. So the errors are -1,1,2. Now you don’t know the real weight (200lbs) or the errors. All you see is the three measurements 199, 201, 202.

              But you know one other thing. You know that the machine you used to weigh the person doesn’t give errors bigger/smaller than about 3lbs or so. Given that information you make a list of all possible sets of errors that could have been in the data. Maybe something like this:


              and so on. Note that the actual errors are in the list somewhere.

              Now you do the simplest most intuitive thing imaginable. You look at all the possible errors and you see what weight they each imply. From that you construct an interval like 198lbs < weight < 203lbs which is consistent with almost every one of the possible errors in your list.

              You don't know what the actual errors were (-1,1,2 in this case), but you do know that they're in the list somewhere. Since almost every set of errors in the list implies the weight is between 198< weight <203 then that's strong reason to believe the person does weigh between 198lbs and 203lbs.

              The probability calculation we do for this problem is just an incredibly slick way of carrying this process out. To do it though, you don't need a likelihood P(error) which represents the frequency of errors. In fact, the P(error) needed can be, and almost always is, very different from the frequency of errors.

              This is so simple I really don't understand why have trouble with it. Moreover, we really do know things like “errors bigger/smaller than about 3lbs or so” in real problems, which stands in stark contrast to the Frequentist requirement that we know the histogram of an infinite string of future errors that will never be made.
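One way to make the enumeration in this comment concrete is sketched below. It rests on assumptions that are not in the comment: errors are restricted to whole pounds between -3 and 3, every such error triple is listed once, and the weight "implied" by a triple is taken to be the mean measurement minus the mean error. Exact fractions are used so the interval check is not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import product

measurements = [199, 201, 202]   # the three readings from the comment
max_abs_error = 3                # "about 3lbs or so", read as an absolute bound
# Assumption: only whole-pound errors are listed, to keep the enumeration small.
error_grid = range(-max_abs_error, max_abs_error + 1)

n = len(measurements)
mean_measurement = Fraction(sum(measurements), n)

# Assumption: each candidate error set implies weight = mean(measurements) - mean(errors).
implied_weights = [
    mean_measurement - Fraction(sum(errors), n)
    for errors in product(error_grid, repeat=n)
]

lower, upper = 198, 203
total = len(implied_weights)
inside = sum(1 for w in implied_weights if lower < w < upper)
print(f"{inside} of {total} candidate error sets ({inside / total:.1%}) "
      f"imply {lower} < weight < {upper}")
```

Nothing here requires the listed error sets to match the long-run frequency of errors from the scale; the list only encodes the bound on how large an error can be.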

              • Entsophy says:

                I meant “bigger in absolute value than about 3lbs or so”

              • K? O'Rourke says:

                > I really don’t understand why have trouble with it

                The history suggests some smart people did.

                From my probably never cited paper META-ANALYTICAL THEMES IN THE HISTORY OF STATISTICS: 1700 TO 1938, PAKISTAN JOURNAL OF STATISTICS. 2002.

                Probability models were being used to represent the uncertainty of observations caused by measurement error by the late 1700’s. Laplace, as did Simpson earlier in 1755, decided to re-write these probability models not as probability that an observation equaled “the true value plus some error” but as simply the truth plus the “probability of some error”. That is, they focused not on the observations themselves but on errors made in the observations – for instance, the differences between recorded observations of the position of the body being observed and its actual position (the truth). The probability distributions they considered for the errors did not involve the unknown truth.

                When Gauss approached the problem of combining observations he used Laplace’s form of Bayes’s theorem but directly in terms of O – V [obs – true value] and supposing all values of V [true value] were equally likely a priori.

              • Nony says:

                Just take the average as your estimate. You’re making this too complicated.

              • Anonymous says:

                @entsophy – you’re wasting your time. if this guy thinks that questioning normality assumptions amounts to “being a jerk”, he’s either trolling or hopeless.

              • Rahul says:

                @K? O’Rourke:

                How’d you choose that journal, I’m curious.

              • Corey says:

                Nony: Actually, Ent’s trying to explain why “just take the average” works. Without understanding that, it’s difficult to generalize to other cases.

                But maybe you’d rather drink beer…

              • K? O'Rourke says:


                I was invited to present a paper in Pakistan and the funding fell through.

                And all the statistical journals I had sent earlier versions to – straight out rejected them flat ;-)

                Not even picked up on my google scholar page

  4. Rahul says:

    I’ve a tangential remark about the subject choice: My feeling is that political science, sociology, public health, education, economics are not the best subjects to show off Bayesian tools to an introductory audience.

    I rather love the Bayesian applications from areas like AI / Machine Learning / Pattern Recognition / Robotics / Process Control etc. In my (possibly biased) opinion the gems of Bayesian models are in Electrical Engineering / Software Engineering & allied areas.

    Do others share my prejudice?

    • Andrew says:


      My current favorite example for introducing Bayesian inference is spell checking. What I like about this example is not that it “sells” Bayes—at this point, I have enough examples of this sort, where Bayesian methods solve problems that were not easily solved in other ways—but rather because, as I wrote above, it demonstrates how Bayesian data analysis works in the context of probabilities (both the “likelihood” and the “prior”) that were constructed from data. I like having an intro example that goes beyond simple data models such as the binomial and simple prior information such as “p has to be somewhere between 0 and 1.”
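A toy sketch of how the spell-checking calculation combines a prior and a likelihood. All numbers below are hypothetical placeholders, not the values from any real word-frequency database or typing-error model, and `posterior` is an illustrative helper name. The prior stands in for word frequencies in a large corpus, and the likelihood stands in for the probability of typing "radom" given each intended word.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of candidates: normalize prior * likelihood."""
    unnormalized = {word: prior[word] * likelihood[word] for word in prior}
    total = sum(unnormalized.values())
    return {word: value / total for word, value in unnormalized.items()}

typed = "radom"

# Hypothetical prior: relative frequency of each candidate intended word.
prior = {"random": 0.95, "radon": 0.049, "radom": 0.001}

# Hypothetical likelihood: P(typed == "radom" | intended word).
likelihood = {"random": 0.002, "radon": 0.0001, "radom": 0.95}

post = posterior(prior, likelihood)
best = max(post, key=post.get)
for word, prob in sorted(post.items(), key=lambda item: -item[1]):
    print(f"P({word!r} | typed {typed!r}) = {prob:.3f}")
```

With these made-up inputs the common word "random" ends up most probable even though "radom" matches the typed string exactly, which shows the prior and the likelihood pulling in different directions.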

      • Rahul says:

        Yep. Love that too. That’s the sort of examples I love Bayesian methods for.

        OTOH, I’m not terribly fond of the examples I’ve seen from political science, sociology etc. They often seem sort of contrived & unconvincing. Or it seems like the author is using Bayes because it allows him the flexibility to draw the particular conclusion he already has in mind.

      • Fernando says:


        Regarding the spell checking example. I see how you get the prior from distribution of words in a large database but this assumes there are no spelling errors in said database, no? I assume you want the distribution of correctly spelled words P(theta), not the spelled + misspelled words P(y).

        But how do we know whether the words are correctly spelled? Does that not depend on the — unobserved — word the writer wanted to type? This echoes with statistical inference being about making probability statements about unknown quantities from known data.

        • Rahul says:

          Someone commenting on this blog (was it Bob Carpenter? Not sure) has reminded me that Bayesian Spam filtering is not really Bayesian. I never understood the arguments why; but in case that’s true and Bayesian spam filtering is indeed not Bayesian; is Bayesian spell checking truly Bayesian?

          Why? Aren’t these two applications fairly similar?

          • The only thing I can think about for why spam filters “aren’t really Bayesian” is that they typically assume a likelihood function built on the idea that a message is an IID sample of individual words, which feels like a fairly frequentist thing to do. In other words (pun… intended?), they don’t incorporate a lot of information that might be possible to include.
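The "IID sample of individual words" assumption mentioned here can be sketched as a naive Bayes log-odds calculation. The word probabilities, the prior, and the function name `log_odds_spam` are hypothetical, chosen only to show the structure. Because the likelihood is a product over independent words, reordering the message leaves the result unchanged, which is one kind of information the model ignores.

```python
import math

# Hypothetical per-word probabilities under each class.
p_word_given_spam = {"free": 0.05, "money": 0.04, "meeting": 0.001, "report": 0.002}
p_word_given_ham = {"free": 0.005, "money": 0.004, "meeting": 0.03, "report": 0.02}

def log_odds_spam(words, prior_spam=0.5):
    """Log posterior odds of spam vs. not spam, treating words as independent draws."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for word in words:
        # Words missing from either table are skipped in this sketch.
        if word in p_word_given_spam and word in p_word_given_ham:
            log_odds += math.log(p_word_given_spam[word] / p_word_given_ham[word])
    return log_odds

message = ["free", "money", "report"]
shuffled = ["report", "free", "money"]
print(f"log odds (original order): {log_odds_spam(message):.3f}")
print(f"log odds (shuffled order): {log_odds_spam(shuffled):.3f}")
```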

            • K? O'Rourke says:


              You may like to read the comments to Bayes Estimates for the Linear Model. D. V. Lindley and A. F. M. Smith

              And if in a hurry, just David Cox’s comment.

              Still the spell checking example is a nice introductory example – although it avoids? concerns about what the prior represents. At some point that needs to be addressed?

  5. Mark Patterson says:

    This is really a comment on the post from yesterday — those comments were closed, so I figured this is as good a place as any. I went to a useR group last night that included a presentation on the RGoogleDocs package (for R) which establishes a nice way of connecting to the Google Documents API. I remember quite a while back you had a really cool post on using Google Forms to poll your students before class. I realized it would be really easy (using RGoogleDocs) to set up a Google Form for the candy-weighing demonstration, with an R script already written for analysis.. this might end up being too much work to move from *crude* histograms to *really awesome ggplot* histograms, but the general routine seems worth investigating.
