Stan without frontiers, Bayes without tears

[cat picture]

This recent comment thread reminds me of a question that comes up from time to time, which is how to teach Bayesian statistics to students who aren’t comfortable with calculus. For continuous models, probabilities are integrals. And in just about every example except the one at 47:16 of this video, there are multiple parameters, so probabilities are multiple integrals.

So how to teach this to the vast majority of statistics users who can’t easily do multivariate calculus?

I dunno, but I don’t know that this has anything in particular to do with Bayes. Think about classical statistics, at least the sort that gets used in political science. Linear regression requires multivariate calculus too (or some pretty slick algebra or geometry) to get that least-squares solution. Not to mention the interpretation of the standard error. And then there’s logistic regression. Going further we move to popular machine learning methods which are really gonna seem like nothing more than black boxes. Kidz today all wanna do deep learning or random forests or whatever. And that’s fine. But no way are most of them learning the math behind it.

Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works.

So, in keeping with this attitude, teach Stan. Students set up the model, they push the button, they get the answers. No integrals required. Yes, you have to work with posterior simulations so there is integration implicitly—the conceptual load is not zero—but I think (hope?) that this approach of using simulations to manage uncertainty is easier and more direct than expressing everything in terms of integrals.
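
Here is a minimal sketch of what "working with posterior simulations" can look like. It is in Python rather than Stan, and it assumes a made-up survey (14 yeses out of 20) with a flat Beta(1, 1) prior, so the posterior can be simulated directly; with Stan the draws would come from the fitted model instead. All names and numbers are hypothetical.

```python
import random

random.seed(1)

# Hypothetical data: 14 successes in 20 trials, flat Beta(1, 1) prior.
y, n = 14, 20
n_draws = 10_000

# Posterior simulations. With this conjugate prior the posterior is
# Beta(1 + y, 1 + n - y), so we can draw from it directly.
draws = [random.betavariate(1 + y, 1 + n - y) for _ in range(n_draws)]

# Every summary is a computation on the draws; no integral is written down.
posterior_mean = sum(draws) / n_draws
prob_above_half = sum(d > 0.5 for d in draws) / n_draws
sorted_draws = sorted(draws)
interval_95 = (sorted_draws[int(0.025 * n_draws)],
               sorted_draws[int(0.975 * n_draws)])

print(f"posterior mean:  {posterior_mean:.2f}")
print(f"Pr(theta > 0.5): {prob_above_half:.2f}")
print(f"95% interval:    ({interval_95[0]:.2f}, {interval_95[1]:.2f})")
```

The integration is still there, hidden in the averaging, but the student only has to count and sort.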

But it’s not just model fitting, it’s also model building and model checking. Cross validation, graphics, etc. You need less mathematical sophistication to evaluate a method than to construct it.

About ten years ago I wrote an article, “Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .” After briefly talking about a course that uses the BDA book and assumes that students know calculus, I continued:

My applied regression and multilevel modeling class has no derivatives and no integrals—it actually has less math than a standard regression class, since I also avoid matrix algebra as much as possible! What it does have is programming, and this is an area where many of the students need lots of practice. The course is Bayesian in that all inference is implicitly about the posterior distribution. There are no null hypotheses and alternative hypotheses, no Type 1 and Type 2 errors, no rejection regions and confidence coverage.

It’s my impression that most applied statistics classes don’t get into confidence coverage etc., but they can still mislead students by giving the impression that those classical principles are somehow fundamental. My class is different because I don’t pretend in that way. Instead I consider a Bayesian approach as foundational, and I teach students how to work with simulations.

My article continues:

Instead, the course is all about models, understanding the models, estimating parameters in the models, and making predictions. . . . Beyond programming and simulation, probably the Number 1 message I send in my applied statistics class is to focus on the deterministic part of the model rather than the error term. . . .

Even a simple model such as y = a + b*x + error is not so simple if x is not centered near zero. And then there are interaction models—these are incredibly important and so hard to understand until you’ve drawn some lines on paper. We draw lots of these lines, by hand and on the computer. I think of this as Bayesian as well: Bayesian inference is conditional on the model, so you have to understand what the model is saying.
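
To make the centering point concrete, here is a small sketch with made-up data (hypothetical heights in cm and weights in kg) and a hand-rolled least-squares fit. With raw x, the intercept is the predicted y at x = 0, which is far outside the data; after centering, it is the predicted y at the average x. The slope does not change.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up data, for illustration only.
height = [150.0, 160.0, 165.0, 172.0, 180.0, 188.0]
weight = [52.0, 58.0, 63.0, 70.0, 77.0, 85.0]

# Raw predictor: the intercept is the predicted weight at height 0.
a_raw, b_raw = fit_line(height, weight)

# Centered predictor: the intercept is the predicted weight at average height.
mean_height = sum(height) / len(height)
height_c = [h - mean_height for h in height]
a_ctr, b_ctr = fit_line(height_c, weight)

print(f"raw:      a = {a_raw:.1f}, b = {b_raw:.3f}")
print(f"centered: a = {a_ctr:.1f}, b = {b_ctr:.3f}")
```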

58 thoughts on “Stan without frontiers, Bayes without tears”

  1. I so so SO agree with this approach. Bayesian statistics isn’t about “statistics” (random sampling, sampling distribution of the mean, etc), it’s about a different meaning for the equal sign.

    y = a + b * x + error

    means

    (y – a – b*x) ~ error_distribution()
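
One way to read that line in code: for candidate values of a and b, compute the residuals and ask how plausible they are under the error distribution. The sketch below assumes a normal error distribution with a known scale sigma and uses made-up data; the function name and the numbers are illustrative only.

```python
import math

def log_plausibility(a, b, sigma, x, y):
    """Sum of normal log densities of the residuals y - a - b*x."""
    total = 0.0
    for xi, yi in zip(x, y):
        resid = yi - a - b * xi
        total += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                  - resid ** 2 / (2 * sigma ** 2))
    return total

# Made-up data, roughly y = 1 + 2*x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

good = log_plausibility(1.0, 2.0, 0.5, x, y)  # residuals near zero
bad = log_plausibility(0.0, 0.0, 0.5, x, y)   # residuals far from zero

print(f"a=1, b=2: {good:.1f}")
print(f"a=0, b=0: {bad:.1f}")
```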

    • “Bayesian statistics isn’t about “statistics” (random sampling, sampling distribution of the mean, etc), it’s about a different meaning for the equal sign.”

      But in the background, there’s also a question/confusion of the meaning of “error”, that occurs in frequentist statistics as well.

  2. As a grad student with only a Cal 1 background, I agree. This is what made McElreath’s Statistical Rethinking text so accessible. He emphasized model building, simulation, and plotting. He emphasized programming, which I’ve found immensely helpful. And of course, the unifying framework of regression–so much easier to navigate than the t-test/ANOVA paradigm. And McElreath left plenty of breadcrumbs for when I’ll inevitably need to go deeper. Gelman & Hill was also pretty good at this and I’m excited to see what y’all have in store for the second edition.

    • Oh, and of course, the youtube lectures. McElreath’s online lectures are a godsend. Andrew’s online presentations on methodology have given me a lot of food for thought, too. The proportion of cat videos to instructional material is imbalanced. I suspect we applied guys would benefit immensely if more of y’all experienced teachers who frequent this blog would record and upload your lectures and presentations.

    • Glad it was so helpful. My approach grew out of teaching population biology and anthropology PhD students, who had mostly forgotten calculus. I did a lot of teaching experiments until I found something that worked.

      Also, the 1st ed of Gelman and Nolan helped me think my way in. It was a huge help.

    • +1

      I have been pretty sold on the philosophical advantages of Bayesian methods over classical methods for a while (thanks in large part to this blog), but my math background is not great. Statistical Rethinking was exactly the book I needed, and the youtube lectures are a perfect complement to the text.

  3. Interesting. It seems like everyone is saying it’s not as important to follow the math or estimation as I originally thought. I’ve been playing around more with simple models in Stan. I still have the feeling that until I have a more rigorous understanding of how to choose appropriate priors, how conjugate priors/improper priors work, and how sampling from the posterior works, it won’t click. Probably it will take more practice problems (which BDA3 makes extremely nice with solutions and code examples online!). Understanding parameters as random variables is still weird.

    Granted, to your other example, I didn’t feel comfortable using a neural net until I was able to code one up from scratch in Python. Which I guess most people don’t bother learning, but seems important to me.

    • Hi Natasha, one thing that might help is that rather than think of parameters as ‘random variables’ think about modeling your uncertainty about parameters using probability distributions.

      • Specifically, to really GET the Bayesian modeling viewpoint, change your view of probability from “randomness” to “plausibility”.

        In a regression problem y = a + b*x + error, if someone told you the “right” values for a,b how plausible is it that y – a – b*x = error would be near zero? How about near 1? near 300? near -3?

        Draw a curve that describes relative plausibility of different outcomes with the most probable outcome at 1 and anything less than that below 1. Call this curve f(x).

        Now, instead of determining the scale of f(x) by the fact that the maximum value = 1, re-normalize the whole thing so that the total plausibility = 1… Do this by dividing the whole curve by Z = integrate(f(x),x,-inf,inf). Call this curve f(x)/Z = p(x) the probability density.

        There’s nothing in here about “how often” anything.
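
The renormalizing step above can be sketched numerically. This assumes a bell-shaped plausibility curve with its peak value of 1 at zero (my choice for illustration, any nonnegative curve with finite area would do) and a plain trapezoid rule standing in for integrate(f(x),x,-inf,inf).

```python
import math

def f(x):
    """Unnormalized plausibility curve with maximum value 1 at x = 0."""
    return math.exp(-0.5 * (x / 2.0) ** 2)

def integrate(g, lo, hi, n=20_000):
    """Trapezoid rule on [lo, hi] with n equal steps."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h

# f is essentially zero beyond +/- 40, so this finite range
# stands in for (-inf, inf).
Z = integrate(f, -40.0, 40.0)

def p(x):
    """Probability density: the same curve, rescaled so total plausibility is 1."""
    return f(x) / Z

total_plausibility = integrate(p, -40.0, 40.0)

print(f"Z = {Z:.4f}")
print(f"total plausibility after rescaling = {total_plausibility:.6f}")
```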

    • NatashaRostova, I’m in social science, too. If I could do undergrad over again, I’d definitely go through Cal II or III. But then again, I’d also take some computer science courses. I can only imagine how much more efficient my research would have been over the past few years had I had the benefit of those foundations.

      • Hah, yeah. I finished undergrad in 2011, at which point (at least in my experience), everyone suddenly woke out of their post-crisis stupor and said “oh shit, yeah go learn how to code and do math.” So since then I’ve spent a substantive portion of my ‘free time’ trying to self-teach these foundations. I haven’t done that badly I guess, but it often feels like a never ending grind. “Just one more chapter of this probability theory textbook and I’ll finally get it!” I try not to compare myself too much to people who did take those courses in their undergrad. There will always be someone who started a year earlier than me, or is smarter than I am etc.

        • Sigh, that is about me too. It just happened that I worked on different problems and didn’t focus enough on coding and math. That was a strategic mistake, and I am still catching up. As Natasha.

  4. I think this has some great advantages, but it has disadvantages as well. I remember back in graduate school (at the dawn of the modern computer age) arguing forcefully that all of these closed-form solutions and existence proofs needed to ensure mathematical purity of results were unnecessary (do we *really* know that utility correspondences are upper hemicontinuous? Some of us even doubt that there are such things as utility functions…). Just write the problem down and go out and maximize… if the maximization is unbounded, you’ve either found a utility pump or you made a mistake somewhere.

    On the other hand, while *arid* mathematics is useless (and deciding what is arid and what isn’t can sometimes be useful) solving every problem individually will (or at least might) cause you to miss out on fundamental underlying principles which unify reality… and that’s where the learning is.

    A (fairly) simple example: using a normal DGP makes no sense where the underlying objects are discrete, but most of the time it makes little difference – treating a large population integral quantity as if it were continuous is accurate enough. But every once in a while, continuity matters, and the implications of a model might actually change radically (picture something like a knapsack problem.) Without the math, you’re just staring at a collection of peculiar results and you can’t tell what actually generated them.

    All that said, for teaching at the undergraduate level I’m firmly in Andrew’s corner: if you require too much math to introduce a subject, you simply ensure that people won’t learn anything at all… that *can’t* be better.

  5. This is exactly the right approach for undergraduates. At Westminster College (Salt Lake City, UT), we ran a May Term course (only one month!) in which we used Stan together with Kruschke’s text to impart the basics of Bayesian statistics. The students didn’t come out of the experience as experts, of course, but we hope that we whetted their appetite so that they’d want to learn more. That’s the true benefit of presenting this kind of material in a relatively math-free way: it gets people excited about the subject. That’s the most successful learning outcome I can imagine!

  6. I also use the “learn to drive” analogy. Most drivers have no idea how the car works but they are able to drive it without issues. That level of statistical literacy is what we should strive for in the general population.

    In terms of math, I don’t think the mere presence of equations is bad. What is harmful is when equations and mathematical logic substitute for understanding and intuition. I love applied statistics because you can’t succeed by pulling out a formula sheet and pressing the button. The math gives you a nominal standard, and it is the aberration from the norm that generates the insights.

    • I will accept your point, to a point. What I observe in reality is a bit more complicated:

      1. People with moderate understanding of basic statistical modeling and some understanding of some complex econometric functions are given automated ensemble methods with a bunch of defaults.

      2. The problem I see with this is that there is very little ability to understand the outcomes of this process and literally no understanding of how to determine whether it is a sensible model or where to make changes if it is not. It is assumed sensible by the mere fact that it produced parameter estimates.

      3. I am not saying this has to be the case, but I find it to be the case far more often than the person who is highly skilled at a wide number of modelling methods including distributional functions and data simulation who simply does not like to code (though clearly those types of people do exist).

      • Curious:

        I understand your concerns. But, for better or worse, things are going in the opposite directions, with the increasing popularity of nonparametric methods, typically called machine learning. These methods are essentially impossible for most students to understand—they’re much more complicated than linear regressions or generalized linear models.

        • It’s even worse than what Andrew described. First, modeling has been reduced to a programming exercise. Second, people believe they have “all the data.” Third, “big data” makes it ok to ignore biases and missing not at random.

          At a recent talk, I showed a DATA -> OUTPUT -> ACTION framework, and observed that hard skills are more important in the first arrow and soft skills in the second (although both are needed throughout). An audience member asked: if he’s only interested in “data science,” can he focus on the first arrow and only hard skills?

        • Kaiser,

          I don’t see why there shouldn’t be a division of labor in situations with a lot of unwieldy data that requires a lot of preparation.

          Though, I suppose the danger is that a hard skill data specialist as such, over time, may not possess the proper insight to adequately determine what data should and should not be included and what data should and should not be reduced or summarized and how it should best be recoded, etc. People who typically analyze high level data often do not understand the value in including data at as low a level as possible for modeling and insight.

        • I do agree with you that teaching the basics like lm and glm before going into deep learning and random forests would be a lot wiser because too many people now do complicated models without really understanding them and this is a major problem.

          On the other hand though, trees are way simpler than linear models. They are never mentioned in statistics courses because of historical reasons (they come mostly from CS literature) but they are such a neat and simple structure and then random forest (or BART) is just a natural extension of simple trees. I really feel like we should teach stats students trees before they learn regression.
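
To show how little machinery a tree needs, here is a minimal sketch of the simplest case: a regression tree with a single split (a "stump"), fit by trying every split point and keeping the one with the smallest squared error. The data and function names are made up for illustration, and the code assumes at least two distinct x values. A full tree repeats the same search within each side of the split.

```python
def fit_stump(x, y):
    """Find the split on x that minimizes squared error
    when each side predicts its own mean."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((yi - left_mean) ** 2 for yi in left)
               + sum((yi - right_mean) ** 2 for yi in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict(stump, x_new):
    """Predict the mean of whichever side of the split x_new falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x_new < threshold else right_mean

# Made-up data with a jump between x = 4 and x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

stump = fit_stump(x, y)
print("split at x <", stump[0])
print("prediction at x = 2:", predict(stump, 2))
print("prediction at x = 7:", predict(stump, 7))
```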

    • A part of me really wants to recommend calculus based on nonstandard analysis. Downsides are that it’s not the usual way things are taught, upsides are that I think it really matches better with the kind of reasoning that is important for applied people.

      There are two books I can think of and I haven’t used them to teach, but they’re out there and they’re free or cheap, and you could look through them and see if any of it helps:

      Keisler’s text is available online:

      https://www.math.wisc.edu/~keisler/calc.html

      And this book is very cheap and comes in a Kindle version:

      https://www.amazon.com/Infinitesimal-Calculus-Dover-Books-Mathematics/dp/0486428869

      I’ve read the second one and I found it pretty reasonable as a way to introduce calculus ideas.

      Note that these will appeal more to you if you are more of an “algebraist” than an “analyst” but since you don’t know any calculus you’ll probably need some explanation of those ideas.

      from this: https://www.quora.com/Why-do-so-many-algebraists-hate-analysis

      “There’s a crispness to the algebraic side of things that I miss when I venture into the analytic realms. In algebra, a property either holds, or it doesn’t; whereas in analysis, the ballgame is frequently about getting ε-close. And speaking only for myself, the bits and pieces of analysis which I’ve found most fun have this crisp algebraic character to them”

      The nonstandard analysis approach essentially invents a whole bunch of numbers that lets someone who likes equals signs deal with the idea of “closeness”

      Now, Bayesian stats is a lot like that… dealing with a model in which you’d like for something to be equal, but you know that the best you can do is an approximate closeness that can’t be reduced to zero.

      In that sense, I think nonstandard approaches to calculus kind of match what you need for Bayesian stats.

      In the end though, what you need is very particular to *your* needs. so check it out, but ymmv

      • Another potentially helpful idea:

        In calculus, the two main concepts are derivative and integral.

        In the “standard” approach, each one is a notation for a *process* of getting closer and closer to some quantity.

        In the “nonstandard” approach, each one *is* a particular *object* namely the ratio of two differences or a particular sum of numbers.

        nonstandard approaches invent new infinitesimally small numbers that let you take a “plug into a formula” approach where standard calculus uses standard numbers, but has to take a “there exists a sequence of things that gets ever closer to…” approach.

        If you like to “plug in” a value to a formula, you are an algebraist ;-) if you like to think about how you could get ever closer to something by repeating something over and over again, you are an analyst.
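
A small worked example of the "object" view, using f(x) = x^2 and a nonzero infinitesimal ε. The notation is an assumption for illustration rather than taken from either book; st(·) denotes the standard part, the real number infinitely close to a finite hyperreal.

```latex
\frac{f(x+\varepsilon) - f(x)}{\varepsilon}
  = \frac{(x+\varepsilon)^2 - x^2}{\varepsilon}
  = \frac{2x\varepsilon + \varepsilon^2}{\varepsilon}
  = 2x + \varepsilon,
\qquad
f'(x) = \operatorname{st}(2x + \varepsilon) = 2x .
```

Every step is ordinary algebra on a ratio of two differences; the only new move is dropping the leftover infinitesimal at the end.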

        • The proofs in standard calculus texts do indeed depend on what you can think of as a limiting process rather than the existence of infinitesimals. But folks who use calculus (physicists, mathematicians … statisticians(?)) routinely argue correctly with infinitesimals. Making that approach rigorous requires a substantial excursion into mathematical logic. I think that’s a burden for most people.

          There’s lots of discussion on this at math.stackexchange:

          https://math.stackexchange.com/questions/51453/is-non-standard-analysis-worth-learning

          https://math.stackexchange.com/questions/1991575/why-cant-the-second-fundamental-theorem-of-calculus-be-proved-in-just-two-lines/1991585#1991585

          There are several postings on Terry Tao’s blog. Search for terry tao nonstandard analysis

        • It is possible to make it rigorous axiomatically, for example based on the Alternative Set Theory (cf. Vopenka) although I am not aware of a text in English that does that.

        • Understanding Abraham Robinson’s construction requires some deep logic, but just using the techniques correctly doesn’t. As you say, applied people routinely argue correctly with infinitesimals.

          The book I linked by Henle is simple enough to teach you derivative and integral calculus with hyperreals so that the average student could do calculus at least as reliably as if they’d learned the standard method.

          After you skim chapters 1-4 of that book for background, you get into actually using the numbers for calculus stuff like calculating integrals and whatnot, and it’s relatively straightforward.

          The first 4 chapters are not terribly long or terribly terse, so I think someone who wants to learn calculus can basically buy a $5 kindle book and arrive at a useful place in about a week of reading and thinking. Having some particular motivating questions in Probability/Stats can be a big boost for this person.

          Remember, the goal is to make someone like an undergrad econ major able to understand something like how to do seasonal adjustments using continuous functions instead of weekly-indicators, or able to understand how to set up a model for a distribution of a quantity that is only observed when it exceeds some threshold or stuff like that.

        • Thanks for the book links. I’m interested to take a look at this non-standard analysis thing.

          The week startup time sounds pretty ambitious though after glancing at the Henle book. What sort of mythical animal is it that is totally comfortable with definition-theorem-proof style so can easily digest that book, but doesn’t know any calculus?

        • The key to the one week approach is to iteratively skim. You need to understand the bigger reasons for the proofs and that means power through the details and come back several times. Read the details in detail only after you decide that you understand why you’d want to know the fact. That might be after you get to an example problem where you don’t know what to do.

        • Having thumbed through Henle’s first few chapters again. Here are some ideas and assumptions:

          1) Read all of chapter 1, it’s an introduction to the history and purpose of calculus and infinitesimals etc. It doesn’t have lots of technical content. It’s short.

          2) Skim chapter 2 very lightly. I assume you know some logic and set theory of the kind you’d have learned in say an algebra class in high school. The point of chapter 2 is to point out that there is a language to the logic of math, you can write down a series of symbols that means “for all real numbers x there exists a number y such that y = sqrt(abs(x))”. This fact about a language will be used later, but the details of the symbols are not important, and different texts use different symbols anyway.

          3) Read chapters 3 and 4 in some detail but skip proving anything yourself, the goal is to understand the basic picture of how to think about the hyperreals.

          4) Read chapters 5,6,7 try to understand the concepts about functions and the arguments given. When he argues using the technique of “transfer” (that is, that every logical sentence about the reals is also true in the hyperreals and vice versa) go back to chapter 2 and 3 and look for understanding there about the basic technique. This is why I say the whole thing is appealing to “algebraists”. If you like programming, you’ll like nonstandard analysis.

          At this point, you’ve arrived at your useful place. It should take less than a week, if you can devote an hour a day and you have at least 3 years of college prep high-school math.

          Next you’ll be able to go through the remaining chapters with a new kind of logical structure in your head, learning facts about doing integrals or dealing with infinite series etc. That will take longer. After you’ve done some of this stuff, go back to chapter 2 and 3 and read them in depth, you’ll see why they discuss abstract ideas in logic and you’ll be able to do some of the proofs more easily having figured out what to learn and why to learn it from the later chapters.

          Good luck!

        • Thanks Daniel. Do you think the Henle book, as opposed to the Robert book, is still the right thing given I have Spivak’s Calculus book (at least a lot of it) under my belt? I’m interested to see what the nonstandard analysis view of things brings to the table.

          In general, I see what you mean about nonstandard analysis providing a different sort of on-ramp to the necessary math. But I still think you are probably overestimating how quickly a rank math newbie could get up to speed… it certainly took me much longer than a week! Good math is presented in a certain style. Like any style, it takes a long time to mold one’s thoughts into that style and start extracting useful stuff.

        • If you have Spivak’s Calculus as background, I’d recommend both Henle and Alain Robert’s book. There are actually 2 main “methods” of deriving the hyperreals, the Henle book goes along a simplified route of the one created by Abraham Robinson, the Alain Robert book goes along the Internal Set Theory (IST) route created by Edward Nelson at Princeton. I think it’s useful to see both, and I personally use IST if needed because I find it intuitive.

          As for the “one week” estimate, it may be more a difference between what I mean by “a useful place” and what you think I mean by “a useful place”.

          I don’t think you can start doing serious problems in calculus in a week, but I do think you can start to get a feeling for what an infinitesimal is and why you might care to have a number system that includes them.

          On the other hand, you have to remember that I did a BS in Mathematics with a minor in Computer Science and an informal minor in Philosophy, and THEN I went into Engineering… so I may have a skewed view of what to expect.

        • The one week thing is a red herring, even if it takes a year it doesn’t matter. The point is, you can go one of two different paths.

          The path chosen around 1900 or so was epsilon-delta and is all about whether you can make things arbitrarily close to something by looking at a sequence and then going out sufficiently far in the sequence…

          The original path used by people who invented calculus for use in physics (Newton, Leibniz etc) was to work with special numbers. The problem was no one had created a consistent set of rules to define those numbers. Well, that happened in the 1960’s and again in the 1970’s and now you can do calculus by working with an algebra of special numbers…

          If you can get to the idea of why you’d want special numbers for calculus and the basics of how they work, whether it’s in a week or a year, it’s a useful tool to have for mathematical modeling.

        • Henle in chapter 3 defines a hyperreal number system and then says you can skip to chapter 4, after that the rest of chapter 3 goes into some stuff about “quasi big” sets, skip it first time through.

          Literally just read the first page or two of chapter 3 and move on, come back after you understand how to use infinitesimals from chapter 4-6 or so. It’s like the learning to drive a car analogy. Learn about wishbone suspension only when you know enough to care about high performance automobiles. The quasi-big stuff is interesting but not necessary for doing rigorous high quality mathematical modeling, it’s more like understanding how a compiler translates things from a high level language to CPU instructions.

          So, my newly revised Henle reading suggestion is
          1) Read chapter 1, it’s quick and nontechnical and motivating.

          2) Skim chapter 2 very lightly, extract the idea that mathematics is actually about formal rules for manipulating strings of symbols, a language. This fact will be used to make correspondences between proofs using real numbers and proofs using hyperreals.

          3) Chapter 3, Read 2 pages, skip when he says skip to chapter 4. Seriously, come back later.

          4) Read Ch 4 and try to understand it deeply, the basic ideas that hyperreals can describe “orders of magnitude” (a/b is infinite, or a/b is infinitesimal, a^2/b is infinitesimal… etc), this gives you the big picture of hyperreals, this is where the bulk of the first week will be spent. Try it an hour each day.

          Now you’ve arrived at your happy place where you have an idea that it could be useful to have new numbers and how to use them. Then go on to read chapter 5 and see how he uses these numbers to describe properties of functions related to continuity and continue on through the actual calculus.

  7. Also, I will be taking a course in the Fall on Bayesian data analysis, where we will be using R2WinBUGS. Still worth learning? I know you originally wrote the package, but you seem to be, to be cute, a Stan for Stan.

    • I’d say it’s useful to learn the concept of a graphical model and its connection to generative modeling. Not so useful to learn BUGS itself. There are still some problems where it’s easier to use than Stan, like many missing data problems, and problems where you just can’t use Stan, like literally modeling missing count data. But often these programs where it’s easier to write the model in BUGS than in Stan won’t fit in BUGS.

      If you must use something, at least move to JAGS, which is much more robust than WinBUGS and also portable to platforms other than Windows.

  8. That explains why I never understood what lme4 did from Gelman and Hill. You can’t define the max marginal likelihood estimator without calculus because you need to marginalize out the lower level parameters (using calculus!), optimize the higher-level parameters over the marginal, then plug back in and optimize the lower-level parameters again. It’s hard to even understand optimization without a bit of calculus.

    In fact, it’s calculus all the way down. In Bayesian stats, everything’s a posterior expectation. Parameter estimate? Expectation of a parameter. Event probability? Expectation of an indicator function. Prediction for unobserved quantity? Expectation of a posterior predictive quantity. What’s an expectation? An integral over a density.
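
Here is a rough sketch of that last point: the same expectation computed once as an integral over a density and once as an average over draws. It assumes a hypothetical posterior that happens to be Normal(1, 2), so both routes are easy to check; the function names are made up for illustration.

```python
import math
import random

random.seed(2)
MU, SIGMA = 1.0, 2.0  # hypothetical posterior: Normal(mean 1, sd 2)

def density(theta):
    """Normal(MU, SIGMA) probability density."""
    z = (theta - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2 * math.pi))

def expectation_by_integral(g, lo=-30.0, hi=30.0, n=60_000):
    """Midpoint-rule approximation of the integral of g(theta) * density(theta)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * h
        total += g(theta) * density(theta)
    return total * h

draws = [random.gauss(MU, SIGMA) for _ in range(50_000)]

def expectation_by_draws(g):
    """Average of g over simulated draws from the same distribution."""
    return sum(g(d) for d in draws) / len(draws)

def identity(theta):
    return theta  # parameter estimate: expectation of the parameter

def is_positive(theta):
    return 1.0 if theta > 0 else 0.0  # event probability: expectation of an indicator

mean_integral = expectation_by_integral(identity)
mean_draws = expectation_by_draws(identity)
prob_integral = expectation_by_integral(is_positive)
prob_draws = expectation_by_draws(is_positive)

print(f"E[theta]:      integral {mean_integral:.3f}, draws {mean_draws:.3f}")
print(f"Pr(theta > 0): integral {prob_integral:.3f}, draws {prob_draws:.3f}")
```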

    More fundamentally, everything in continuous stats is an integral or derivative. A probability density function is just the derivative of a continuous cumulative distribution function.
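
A quick numerical check of the density-as-derivative statement, using the standard normal as the example. The slope of the CDF is approximated by a central finite difference; the step size h is my choice for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    """Standard normal probability density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def slope_of_cdf(x, h=1e-5):
    """Central finite difference: (F(x + h) - F(x - h)) / (2h)."""
    return (normal_cdf(x + h) - normal_cdf(x - h)) / (2.0 * h)

for x in (-1.0, 0.0, 0.5, 2.0):
    print(f"x = {x:5.1f}  slope of CDF = {slope_of_cdf(x):.6f}  pdf = {normal_pdf(x):.6f}")
```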

    You don’t need to learn to solve these integrals analytically—Stan will do that for you. But I think it helps to understand at least at a Calculus I and Calculus II level what it is you’re computing. And if you want to do MCMC, then probably Calc III so that you know about sequences and series, because MCMC reduces calculating integrals to series.

    P.S. I loved Gelman and Hill. Probably wouldn’t have ever wrapped my head around stats without the combination of BUGS and Gelman and Hill. I’m not saying you can’t start pre-calc, the same way people often start econ or physics pre-calc. But if you actually want to understand what you’re doing, calculus is going to rear its beautiful head.

  9. I know that many people, especially those in applied fields, do not have great mathematical backgrounds and do not have the luxury or even the opportunity to rectify that. I also know that many of those fields are desperately in need of better statistics, and hence that the immediate solution seems to be statistical presentations that avoid calculus altogether. But here’s the thing: continuous statistics (as opposed to statistics over discrete spaces) literally cannot exist without calculus and so that dream will forever be outside of our reach.

    Every continuous statistical calculation is, by definition, an integral. Probability densities, for example, are not well-defined outside of an integral sign. In particular, _there are no Gaussians without calculus_!

    First and foremost, this means that if you don’t know calculus then you shouldn’t be developing or implementing your own statistical algorithms. Without knowing calculus you don’t have any idea what operations are well-posed and which are ill-posed and hence will inevitably end up building something fragile and prone to error. If pushed I’d even go so far as to say that you’d need basic measure theory, in particular its behavior in high-dimensional spaces, but let’s just stick to calculus for now.

    Even assuming software like Stan to handle all of the computations, you still have to have an idea of what you can calculate and what you can’t. In other words, what questions can you even ask in statistics? Then there’s the problem of understanding distributions which serve as the atoms of generative modeling. Some of this can be pattern-matched by example or learned from proof-by-authority, but is that really understanding?

    The only way to build up a solid foundation is to try to convey the conceptual basics of densities as objects to be integrated and statistical queries answered by expectations of prior and posterior distributions or likelihoods. Not that it’s easy to identify the right decomposition of the concepts from the technicalities, of course, but there has been some progress towards this end. Ultimately, however, this is just a really sneaky way of _teaching calculus_!

    So the question here shouldn’t be “can we teach statistics without calculus?”; the question should be “can we stop sucking at teaching calculus?”

    • Having come from only a social science training background, I’ve taught myself all the math I know. What I have noticed though, which I think is true, is that understanding the logic and algorithmic operations of integration is sufficient to use applied methods.

      That is to say, I probably couldn’t solve any sort of complicated integration question if you posed it to me now and gave me a pen and paper. But I’d understand exactly what the question is asking, and what the answer is doing.

      So far, in my auto-didactic quest to learn Bayesian methods, this seems to be sufficient (as with everything quantitative though, being marginally better at math would only make life easier).

      • Neither could I! I think the conceptual part’s important though. I’m talking about understanding the difference between probability mass and probability density and why the former is an integral over the latter. And about understanding how expectations are weighted averages over densities. And about how (Markov chain) Monte Carlo methods can solve integrals.
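
The three ideas in that paragraph can be made concrete with very little code. What follows is only a minimal sketch in plain Python, assuming a standard normal example; the helper names (`normal_pdf`, `riemann_mass`, `monte_carlo_mass`, `monte_carlo_expectation`) are invented for illustration and are not from Stan or any library. It shows a density value larger than 1, a probability mass computed as an integral of the density, and the same mass and an expectation estimated by averaging over simulated draws.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x (a height, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Exact probability mass below x, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def riemann_mass(a, b, n=10_000):
    """Mass on [a, b] as a midpoint-rule integral of the standard normal density."""
    width = (b - a) / n
    return sum(normal_pdf(a + (i + 0.5) * width) for i in range(n)) * width

def monte_carlo_mass(a, b, draws=200_000, seed=1):
    """The same mass, estimated as the fraction of simulated draws in [a, b]."""
    rng = random.Random(seed)
    return sum(a <= rng.gauss(0.0, 1.0) <= b for _ in range(draws)) / draws

def monte_carlo_expectation(g, draws=200_000, seed=2):
    """E[g(X)] for X ~ Normal(0, 1), estimated as a plain average over draws."""
    rng = random.Random(seed)
    return sum(g(rng.gauss(0.0, 1.0)) for _ in range(draws)) / draws

# A density can exceed 1; a probability mass cannot.
density_at_mode = normal_pdf(0.0, sigma=0.1)

exact = normal_cdf(1.0) - normal_cdf(-1.0)   # mass within one sd of the mean
by_integral = riemann_mass(-1.0, 1.0)
by_simulation = monte_carlo_mass(-1.0, 1.0)
second_moment = monte_carlo_expectation(lambda x: x * x)  # true value is 1

print(density_at_mode, exact, by_integral, by_simulation, second_moment)
```

The integral and the simulation agree with the exact answer to within their respective errors, which is the sense in which Monte Carlo "solves" the integral without anyone doing it by hand.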

        I agree that it really comes down to getting better at teaching at least the basics of calculus and linear algebra. I found that doing stats was immensely helpful in that it provided some concrete motivation for learning calc and linear algebra. I was a pure math major as an undergrad who mainly concentrated on logic and set theory. I did Lebesgue integration and topology, so I was all set for measure theory, but I couldn’t remember the chain rule (building an autodiff system really helps drive home differential calc!). Similarly I did abstract algebra and Galois theory, but never learned about determinants (Ben laughed at the first code I wrote for Stan for multivariate densities as it just literally followed the textbook—I had no idea that you shouldn’t apply inverses in numerical linear algebra). I found thinking about covariance matrices really drove home rotations and scalings; then Jacobians for changes of variables really helped understand the role of determinants. So learning stats along with calc can help with understanding both.
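
The remark about not applying inverses can be illustrated with a small sketch. This is not the Stan code being described, just one common way to evaluate a multivariate normal log density in plain Python, assuming a symmetric positive-definite covariance matrix given as a list of lists. The function names are hypothetical. A Cholesky factor gives both the log determinant and the quadratic form, so no matrix inverse is ever formed.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def mvn_logpdf(y, mu, sigma):
    """Multivariate normal log density using a Cholesky factor of sigma."""
    L = cholesky(sigma)
    z = forward_solve(L, [yi - mi for yi, mi in zip(y, mu)])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))  # log |sigma|
    quad = sum(zi * zi for zi in z)          # (y - mu)^T sigma^{-1} (y - mu)
    return -0.5 * (len(y) * math.log(2.0 * math.pi) + log_det + quad)

sigma = [[2.0, 0.6], [0.6, 1.0]]
lp = mvn_logpdf([1.0, -0.5], [0.0, 0.0], sigma)
print(lp)
```

The determinant shows up as the product of the squared diagonal entries of the factor, which is one concrete way to see its role as a volume scaling.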

      • +1 on everything upthread (Michael, Natasha, and Bob)

        My experience with math classes is that they hit the worst possible balance between pure math and applications. I learned to like math only later on my own by doing one or the other. Digging into pure math for its own sake, learning how to feel comfortable doing proofs, epsilon-delta limits, etc, was its own sort of intellectual motivation. Then, on the flip side, just as motivating was doing on-the-job learning about differential equations because I had a system I cared about solving/simulating, and probability densities I cared about integrating, and matrix calculations that made my simulations easier to do.

        Contrast that with the math classes (that I took) that give thin justification, if any, for, say, integration by parts. Then throw some tricky-looking arbitrary function next to an integral sign as a puzzle.

        Of course, it’s easy to point out the flaw and harder to think up how to do it well. Dan Meyer seems to be on the right track: https://vimeo.com/163821742 (tl;dr make the math work itself more interesting by “developing the question”. see 18:20 for an interesting example). In the world of applications, I like the style of the first part of deeplearningbook.org. Very readable primers on the linear algebra, probability, numerical methods they use in the rest of the book. Not in an appendix, but at the front.

    • “Can we stop sucking at teaching calculus?” pretty much sums up the question I think.

      My own approach when I started to really need high level analysis ideas (such as to understand Lagrangian mechanics or whatever) was to throw out everything I learned in calculus classes, and use my math-major knowledge to read up on nonstandard analysis and adopt that. It worked for me, but we’re talking about people who haven’t even taken Calc 1 so what worked for me is irrelevant.

      The question is, would scientific consumers of statistics be better off if they took a semester of calculus based on nonstandard analysis, such as the approaches in Keisler or Henle, as linked in my comment above? http://statmodeling.stat.columbia.edu/2017/04/24/stan-without-frontiers-bayes-without-tears/#comment-471847

      It’s an empirical question and requires a big grant and a randomized controlled trial. ;-)

    • I don’t know if general calculus teaching can get any better but I had many science classes where profs basically said “well, this would all make more sense with calculus but I’m not going to teach that.” We could at least try teaching calculus in science classes (other than physics, they seem to do ok).

  10. Stan is making me return to calculus. Folks in my field (ecology generally, fisheries specifically) have really glommed onto JAGS for custom probability models in the past few years. You get to be lazy with JAGS/BUGS and work with the conditional data likelihood. My full-time job these days is one multi-state mark recapture model after another – combining continuous and discrete data. Stan has allowed me to tackle some problems that were just too computationally intensive to do with JAGS, but it’s definitely harder to come up with the likelihood for some of these models.

    JAGS is still very very useful for the majority of folks in my field. The reason is that it is so much simpler to visualize and think through these problems as one conditional step after another. Having to work with the marginal in Stan is certainly more challenging. I’m mathematically inclined, but still feel like I’m doing the doggie paddle in an ocean when it comes to integration for some of these models. I think you have to start with the conditional model in teaching and learning though, mostly because it more closely matches intuition. From there Stan doesn’t free you from calculus. If anything JAGS lulls you into thinking you can live without it, until you encounter a problem that breaks JAGS and you find yourself having to break out old coffee-stained theory notebooks to try to remember how to do a convolution and solve for the determinant of the Jacobian.
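
As a one-dimensional illustration of the Jacobian bookkeeping mentioned above, here is a minimal sketch in plain Python. It is not taken from any mark-recapture model; it assumes the simplest possible change of variables, Y = exp(X) with X standard normal, and the function names are made up for the example. The density of Y is the density of X at log(y) times the Jacobian term 1/y; leaving that term out gives the wrong probability.

```python
import math
import random

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lognormal_pdf(y):
    """Density of Y = exp(X): p_X(log y) times the Jacobian |d log(y) / dy| = 1/y."""
    return normal_pdf(math.log(y)) / y

def lognormal_pdf_no_jacobian(y):
    """The tempting but wrong version that forgets the Jacobian term."""
    return normal_pdf(math.log(y))

def integrate(f, a, b, n=20_000):
    """Midpoint-rule integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

a, b = 0.5, 2.0
exact = normal_cdf(math.log(b)) - normal_cdf(math.log(a))   # P(a <= Y <= b)
with_jacobian = integrate(lognormal_pdf, a, b)
without_jacobian = integrate(lognormal_pdf_no_jacobian, a, b)

# Simulation check: transform draws of X and count how many land in [a, b].
rng = random.Random(10)
draws = 200_000
simulated = sum(a <= math.exp(rng.gauss(0.0, 1.0)) <= b for _ in range(draws)) / draws

print(exact, with_jacobian, without_jacobian, simulated)
```

In several dimensions the factor 1/y becomes the absolute determinant of the Jacobian matrix of the inverse transform, but the logic is the same.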

    One downside to all this calculus: it gets hard describing to the lay people that I love what I do for a living. The old Dunkin’ Donuts commercial from the ’80s is a helpful metaphor: “Time to make the n-dimensional tori!”
    https://www.youtube.com/watch?v=petqFm94osQ

    • We’ve been told that Stan would be challenging for people in ecology, but we’re actually seeing a lot of uptake. Sure, you have to marginalize out the latent states in your movement HMM, but when you do it mixes much much better, so you can actually get a result. You’ll find that the math you need to marginalize is in the 1980s literature back when they were doing optimization.

      I think it’s helpful to think of the models in terms of the discrete parameterization. I always write the discrete parameterization down, then do the marginalization. Given that the marginalization is over discrete parameters, it never involves calculus, just some algebra to keep everything computationally stable on the log scale. The algebra can get hairy if you generalize to something like HMMs; then you need the forward algorithm, a kind of dynamic programming algorithm, to compute the log density.
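
The recipe in that paragraph can be sketched in a few lines. This is a plain-Python illustration, not Stan code, and the function names and the toy numbers are assumptions made up for the example. `brute_force_log_likelihood` writes down the discrete parameterization literally and sums over every hidden-state path; `hmm_log_likelihood` gets the same answer with the forward algorithm, using log-sum-exp to stay stable on the log scale.

```python
import math
from itertools import product

def log_sum_exp(xs):
    """log(sum(exp(x))) computed stably by factoring out the maximum."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def hmm_log_likelihood(log_init, log_trans, log_emit):
    """Forward algorithm: log p(y) with the hidden states summed out.

    log_init[k]     = log p(z_1 = k)
    log_trans[j][k] = log p(z_t = k | z_{t-1} = j)
    log_emit[t][k]  = log p(y_t | z_t = k)
    """
    K = len(log_init)
    alpha = [log_init[k] + log_emit[0][k] for k in range(K)]
    for t in range(1, len(log_emit)):
        alpha = [
            log_sum_exp([alpha[j] + log_trans[j][k] for j in range(K)])
            + log_emit[t][k]
            for k in range(K)
        ]
    return log_sum_exp(alpha)

def brute_force_log_likelihood(log_init, log_trans, log_emit):
    """Same quantity by summing over all K**T hidden paths (tiny problems only)."""
    K, T = len(log_init), len(log_emit)
    terms = []
    for path in product(range(K), repeat=T):
        lp = log_init[path[0]] + log_emit[0][path[0]]
        for t in range(1, T):
            lp += log_trans[path[t - 1]][path[t]] + log_emit[t][path[t]]
        terms.append(lp)
    return log_sum_exp(terms)

def logs(rows):
    """Elementwise log of a list of probability rows."""
    return [[math.log(p) for p in row] for row in rows]

# Toy two-state example; all numbers are illustrative.
log_init = [math.log(0.6), math.log(0.4)]
log_trans = logs([[0.7, 0.3], [0.2, 0.8]])
log_emit = logs([[0.9, 0.2], [0.8, 0.3], [0.1, 0.7], [0.2, 0.6]])

forward = hmm_log_likelihood(log_init, log_trans, log_emit)
brute = brute_force_log_likelihood(log_init, log_trans, log_emit)
print(forward, brute)
```

The brute-force sum grows as K to the power T, while the forward pass costs on the order of T times K squared, which is why the dynamic programming version is the one you would actually use inside a model.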

      We want people to think about Bayesian models generatively—generate the parameters from the priors, generate the data from the parameters. (Acyclic) directed graphical modeling, as in BUGS, usually forces you to do that (unless you start using the zeros trick in BUGS/JAGS). I’d find it even easier in BUGS if you declared the types. I like that we force people to declare data vs. parameters, which is implicit in BUGS and only determined at runtime. It does make some missing data problems harder and also makes it impossible to reuse the same program for different inferences over the same joint probability model.

      • This is one of the things I find really charming about my PhD field: sooner or later ecologists will take up pretty much any method as long as it’s useful. It would be nice for people to make the road less bumpy and just teach the stuff up front.

  11. From Bob: “I think it’s helpful to think of the models in terms of the discrete parameterization” and “We want people to think about Bayesian models generatively—generate the parameters from the priors, generate the data from the parameters”

    That’s the two-stage simulation, ABC, McElreath’s demos, etc., which I always thought of as just stage 1 of learning, where simulations take you towards continuity. Adding importance sampling (re-weighting the prior by the likelihood to get the posterior) then gives an opportunity to distinguish density and distribution functions and to see, for instance, how the density can be a good approximation for the distribution function for observations.
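
A minimal sketch of both ideas, in plain Python, under assumptions chosen only to keep it checkable: a uniform prior on a binomial proportion and made-up data (7 successes in 10 trials). The function names are hypothetical. The first function is the two-stage simulation with exact matching (ABC with zero tolerance, which is only feasible because the data are discrete); the second re-weights prior draws by the likelihood.

```python
import random

def two_stage_posterior_draws(y_obs, n, draws, rng):
    """Draw theta from the prior, simulate data, keep theta if data match."""
    kept = []
    for _ in range(draws):
        theta = rng.random()                                  # stage 1: prior draw
        y_sim = sum(rng.random() < theta for _ in range(n))   # stage 2: fake data
        if y_sim == y_obs:
            kept.append(theta)
    return kept

def importance_posterior_mean(y_obs, n, draws, rng):
    """Posterior mean from prior draws re-weighted by the likelihood."""
    thetas = [rng.random() for _ in range(draws)]
    # The binomial coefficient is constant in theta and cancels in the ratio.
    weights = [t ** y_obs * (1.0 - t) ** (n - y_obs) for t in thetas]
    return sum(w * t for w, t in zip(weights, thetas)) / sum(weights)

rng = random.Random(2017)
y_obs, n = 7, 10

kept = two_stage_posterior_draws(y_obs, n, 100_000, rng)
rejection_mean = sum(kept) / len(kept)
importance_mean = importance_posterior_mean(y_obs, n, 100_000, rng)

# With a uniform prior the posterior is Beta(y + 1, n - y + 1).
exact_mean = (y_obs + 1) / (n + 2)

print(len(kept), rejection_mean, importance_mean, exact_mean)
```

Rejection throws away most of the draws, while importance sampling keeps them all and lets the likelihood, evaluated as a density or mass function, do the work; that contrast is one way into the distinction the comment is pointing at.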

    The idea is not to avoid calculus forever, but rather to introduce it at the right stage with the right motivations for doing statistics.

    I had only intro calculus for social science when I went into statistics (fortunately we did do proofs) and a lot of effort was spent learning math over many years. I was able to grasp a lot about statistics while this was going on because I learned simulation really early on. Perhaps that is what kept me going. But the math folks will likely need, if taken _up front_, will likely take most people about two years of full-time study, which is not feasible for most (it wasn’t for me). Also, most of that math won’t be helpful, but which parts will be is very uncertain.

    Perhaps more importantly, it’s not calculus that is needed but ways to represent and work with continuity in high dimensions, given current computational resources, to address variability and uncertainty less wrongly (I think this is Daniel’s point).

    But overall, most undergraduates have some vague sense of regression, and starting with that, getting them to do things with it, and then bringing in simulation-based views of what’s going on in regression is likely the way to go.

  12. This is an interesting discussion. After many years, I will be teaching our 300-level prob and stats course again next year, and I am looking for a textbook to use. It is normally a fairly typical prob and stat course mostly for applied math and engineering majors, with a sprinkling of pure math, math ed and assorted natural sciences majors. Most often it is taught using Jay L. Devore, Probability and Statistics for Engineering and the Sciences. I would like to inject some Bayesian point of view into the course. Does anybody have a recommendation for a textbook to use?

  13. This is the thing with computer science: what you learned 5 years ago in CS is now obsolete. People are always after the next big thing.

    Now that statistics crossed with computer science equals machine learning, you can’t avoid studying machine learning.

    Bayesian statistics was hot; they even used it to search for missing airplanes.

    Today, it’s deep learning, even machine learning is out.

    About the math you mentioned: sometimes you can’t explain the math. For instance, what’s the mathematical proof behind convolutional neural networks? Nobody knows. Of course, I get your point about the math foundation and I agree with it.
