Some might quibble that this is the definition of probability favored by frequentists; that a Bayesian-who-shall-not-be-cited in BDA tried to provide a foundation for probability in which those characteristics were theorems; that the myriad proofs of Cox’s theorem have been slightly controversial among commentators on this very blog; and that the move from discrete probability to continuous probability has been disputed even among Bayesians.

But as long as “the ultimate proof [that probabilities may be a reasonable approach to summarizing uncertainty] is in the success of the applications (13)” by some unspecified criterion, that is foundational enough. Probably.

]]>Functional Data Analysis by Ramsay et al. has been in my statistics wish list on Amazon for a couple of months now. Maybe I’ll add their Dynamics book to the priority list (itself a list of 70+ books).

I’ll have to wait for the prices to come down though. In case readers on this blog aren’t aware, prices on Amazon swing wildly. Sometimes books in my wish lists start at $200+ and on some random days I can grab them for $25. There’s a website that actually tracks some of the more popular items! (https://ca.camelcamelcamel.com/) ….but I usually just check the wish lists daily (not that it would be too difficult to write a script in R to check for you)

For the curious, Gelman’s BDA could have been purchased on Jan 18th for $10 less than its current price (https://ca.camelcamelcamel.com/Bayesian-Analysis-Third-Andrew-Gelman/product/1439840954), which is pretty stable, actually. The rarer the book and the lower the demand, the wilder the swings, I’ve found.

]]>If you replace the word probability with frequency in your statement, then sure, the BDA examples are useful demonstrations of what it means to *measure frequency*.

Which gets at Bob’s point that the foundations in BDA are more the foundations of applied Bayesian statistics than the foundations of probability.

]]>Most of the discussions I’ve seen regarding the foundations of probability have had little to nothing on the idea that probability is a measurement. I think the examples in chapter 1 of BDA are useful for demonstrating what it means to *measure* probability (as opposed to simply *defining* it) in practice.

Mill does a great job of laying out what I think of as the philosophical foundations of probability as used in Bayesian statistics. He covers roughly the same ground as section 1.5 of *BDA*.

I think of the mathematical foundation of probability as the kinds of topics covered in probability theory textbooks, like events, measure, expectations, basic laws of probability, the central limit theorem, etc.

You presuppose knowledge of mathematical probability in *BDA*. The examples you cite provide examples of inference, going from prior and likelihood to posterior. I don’t see how they relate to the foundations of probability.

Foundations of the theory of applied (Bayesian) statistics, perhaps?

]]>1. If I didn’t know you so well, I’d think you were kidding recommending people read John Stuart Mill!

2. I didn’t recommend the first chapter of BDA as an *introduction* to probability, I recommended it (in particular the three examples mentioned above, which are in sections 1.4, 1.6, and 1.7) for *foundations* of probability. Just about any intro probability book will cover the math of random variables. But, sure, I guess I should’ve recommended that too.

I do think the first chapter of *BDA* is a great intro to the principles of Bayesian stats for someone who’s already fluent in math stats. The bullets in the second paragraph already assume you know what a conditional distribution is! Then the very first topic you discuss assumes randomness by scare-quoting the word “random”. Then you go on to exchangeability of densities without ever defining what a density is. Rather than defining what a density (or mass function) is, your first mention is in a block called “Probability notation” where you explain that p(.|.) denotes a conditional probability density and p(.) a marginal density without ever defining these.

For an introduction, it’s very very confusing to overload p(.|.) and p(.) the way you do (for every density) and also overloading random variables (traditionally capital letters) and bound variables (traditionally lower-case letters). I understand why you do this now, though I found it almost impossibly confusing when first trying to understand it. At the time I was trying to read it, I knew what a pdf, pmf, and cdf were, but didn’t know anything about random variable notation, so the whole thing was just frustratingly opaque.
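To make the contrast concrete, here’s a hypothetical side-by-side (my own rendering of the traditional convention, not anything taken from *BDA* itself):

```latex
% Traditional math-stats notation distinguishes the random variable
% (capital) from the bound variable (lower case) and subscripts each
% density by its variable; BDA overloads p(.) and lower case throughout.
\begin{align*}
\text{traditional:} \quad & p_{Y \mid \Theta}(y \mid \theta),
                            \qquad p_{\Theta}(\theta) \\
\text{BDA-style:}   \quad & p(y \mid \theta),
                            \qquad\quad\;\; p(\theta)
\end{align*}
```

In the BDA-style line, which density you mean is carried entirely by the names of the arguments, which is exactly the overloading that tripped me up.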

I’m actually trying to write an intro to probability theory right now based on proper definitions, but not a lot of theory. It’s what I wanted when I was first trying to learn the material. I didn’t need 300 pages of fluff and digressions into frequentist stats like you find in most introductions to probability theory in math stats textbooks.

]]>Tarantola’s Inverse Problems book is interesting though a little idiosyncratic.

]]>There are some, depending on what you are looking for, but I’m still a bit dissatisfied with what I’ve seen.

You might find Data Assimilation: A Mathematical Approach interesting.

There are a few others around. Another interesting one from a different direction is ‘Nonlinear time series analysis’ by Kantz which uses tools from dynamical systems theory to do data analysis (rather than fitting prespecified models to data).

Somewhat in-between, and recent, is Dynamic Data Analysis by Ramsay and Hooker but I haven’t read it properly yet.

]]>Thanks, Aki. That was my understanding, from reading the papers, so glad I didn’t misunderstand too much. I still need to sit down and simulate in context of data I understand, so I can get some intuition.

Your biggest problem is mine as well. My book warns the reader about those things, but I fear it also encourages misuse. I thought about having an example with explicit LOO for whole clusters (hospitals) in the 2nd edition, to emphasize the issue.

Re time series, I have ecology colleagues who resist *IC because they often want to predict outside the range of past data, making the whole prediction task rather less well defined. Also, they are less interested in total predictive accuracy than avoiding extirpation/extinction.

]]>You might find Data Assimilation: A Mathematical Approach interesting –

http://www.springer.com/gp/book/9783319203249

There are a few others around. Another interesting one from a different direction is ‘Nonlinear time series analysis’ by Kantz which uses tools from dynamical systems theory to do data analysis (rather than fitting prespecified models to data):

https://www.cambridge.org/core/books/nonlinear-time-series-analysis/519783E4E8A2C3DCD4641E42765309C7

Somewhat in-between, and recent, is Dynamic Data Analysis by Ramsay and Hooker but I haven’t read it properly yet:

]]>Are you familiar with any texts dealing with differential equations in probabilistic terms? I don’t deal with them often anymore, but I am curious. I don’t recall ever coming across a text dealing with it.

I stumbled across this last week: http://andrewgelman.com/2014/04/29/bayesian-uncertainty-quantification-differential-equations/

But haven’t gotten into it yet.

]]>https://www.buzzfeed.com/golianopoulos/fuck-that-gator?utm_term=.vpVxVAZk2#.li12EmGrq

]]>https://www.youtube.com/playlist?list=PLDcUM9US4XdMdZOhJWJJD4mDBMnbTWw_z

]]>The biggest problem for me with *IC is that usually the story about the balance between fit and complexity is emphasized, the connection to the predictive task is forgotten, and it’s misused, for example, for hierarchical models (predicting for a new hospital rather than for a new patient) and time series (not taking into account that it’s easy to predict one missing observation in the middle of the time series).
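As a toy illustration of the hierarchical case (hypothetical simulated data, plain means instead of a fitted Bayesian model): leaving out one patient and leaving out a whole hospital measure very different predictive tasks.

```python
# Toy sketch: leave-one-patient-out vs leave-one-hospital-out error.
# Hypothetical data: 5 hospitals with their own effects, 20 patients each.
import numpy as np

rng = np.random.default_rng(1)
hospitals = np.repeat(np.arange(5), 20)        # hospital id per patient
effects = rng.normal(0.0, 2.0, size=5)         # hospital-level effects
y = effects[hospitals] + rng.normal(0.0, 1.0, size=100)

def loo_mse_patients():
    # Leave out one patient; predict with that hospital's other patients.
    errs = []
    for i in range(len(y)):
        mask = hospitals == hospitals[i]
        mask[i] = False
        errs.append((y[i] - y[mask].mean()) ** 2)
    return float(np.mean(errs))

def loo_mse_hospitals():
    # Leave out a whole hospital; predict with the grand mean of the rest.
    errs = []
    for h in range(5):
        held_out = hospitals == h
        errs.append(np.mean((y[held_out] - y[~held_out].mean()) ** 2))
    return float(np.mean(errs))

print(loo_mse_patients(), loo_mse_hospitals())
```

Leave-one-patient-out error stays near the observation noise, while leave-one-hospital-out error also has to absorb the between-hospital variation, so an IC tuned to the first task says nothing reliable about the second.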

]]>I think of Waic as an approximation to Loo. Once Aki explained that to me, I found it more difficult to recommend Waic to people. I think it’s a lot better than Dic, though. I don’t really have a problem with people using Waic, I just find it awkward to recommend.

I agree that Aic is an excellent starting point, especially if people in the audience have already heard about it! To me the key step in explaining any of these things is to step away from the idea that there is some sort of Platonic “information criterion” out there to be discovered, and instead to consider all of these as methods for estimating out-of-sample pointwise prediction error. From there, it’s clear that Aic gives such an estimate in certain simple settings, that Loo is a reasonable general approach, and that approximations such as Waic or Psis-Loo can be useful for computing fast and stable estimates.
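As a toy sketch of that equivalence (hypothetical data, a normal model with known scale and flat prior, so Aic’s simple setting applies), the Aic deviance and exact leave-one-out deviance land nearly on top of each other:

```python
# Toy sketch: Aic and exact LOO both estimate out-of-sample prediction
# error (on the deviance scale) for a normal model with known sigma.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=200)
sigma = 1.0   # treated as known; only the mean is estimated, so k = 1

# Aic: -2 * log-likelihood at the MLE + 2 * (number of parameters)
mle = y.mean()
aic = -2.0 * norm.logpdf(y, mle, sigma).sum() + 2.0 * 1

# Exact leave-one-out: refit the mean without y[i], then score y[i]
n = len(y)
loo_lpd = sum(
    norm.logpdf(y[i], np.delete(y, i).mean(), sigma) for i in range(n)
)
loo_deviance = -2.0 * loo_lpd

print(aic, loo_deviance)   # nearly identical in this simple setting
```

The point isn’t the numbers themselves but that both are estimates of the same out-of-sample quantity; Loo just keeps making sense in settings where the Aic derivation breaks down.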

]]>“The true data-generating process” is, as best we know, quantum physics, and quantum physics isn’t even close to any of the models used for anything much. For example, Navier-Stokes is as close as damn it to “the true data-generating process” compared to, say, predator-prey models in ecology, which are in turn as close as damn it to real compared to ideal rational-actor models in economics, which are again as close as damn it compared to thick-arms-and-voting-patterns models…

I take the “M-open” setting to be one where we have little reason to commit to the idea that we’ve circumscribed all the models we’d be interested in comparing as of today, whereas when we really do think that one of the models in question is a very good model, we can approximate things as “M-closed”. The truly “M-closed” setting never exists, but there are settings where one of our models is an approximation that is as close as we care to spend the effort to get.

]]>From a teaching perspective, I need to start with AIC, because most biologists are familiar with it. It feels like a burden of presentation.

]]>– Spell checking (data on word frequencies and typing errors, plus theoretical expectations that inference from a given database would be relevant in new cases)

– Football point spreads (data on point spreads and game outcomes, plus economic theory that point spreads should give unbiased predictions of score differentials)

– Record linkage (data on matches and non-matches, plus whatever theory it took to construct the algorithm that was used to create the uncalibrated scores)

In each of these cases the model is imperfect, and that’s part of probability in the real world too.

]]>I think Waic is kind of ok too, but Aki has a point that the best justification for Waic is as an approximation for Loo, and in that case I prefer Loo as the argument for it is more straightforward. Aic is fine as a starting point for linear models with flat priors, but that’s about it. And Dic is of historical interest as an intermediate step that happened along the way to our current understanding.

Bob:

Prediction is fine but it can be important to predict in new scenarios that are different from past scenarios. For example, using Mister P to generalize to a population that is different from the survey data at hand. Or using a differential equation model in pharmacology to make predictions for dosing scenarios that are different from those in the experimental data.

It’s fine with me when people say that prediction is the only problem in statistics that matters—as long as they recognize that some sort of modeling (or generalization or regularization or whatever you want to call it) is necessary to make predictions for new cases, in the very important scenario where the observed data are not a random sample from the population of interest.

Hmmm, maybe that needs its own blog post?

]]>- Mill, John Stuart. 1882.
*A System of Logic: Ratiocinative and Inductive*. Eighth Edition. Franklin Square, New York: Harper & Brothers, Publishers.

Specifically Part III, Chapter 18. I think he does a brilliant job laying out the issues of “subjective” probability. I particularly like this quote:

We must remember that the probability of an event is not a quality of the event itself, but a mere name for the degree of ground which we, or some one else, have for expecting it. … Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence.

He discusses several notions of probability, including classical notions of equiprobable events.

As I’ve said before, I also believe some familiarity with measure theory is necessary to understand modern statistics (in the same way that some familiarity with manifolds is required to understand modern physics). It’s not absolutely necessary, but it sure simplifies things. I don’t know good references for this. Ash’s short book on probability is OK, but you don’t really need that level of formalism. Texts like DeGroot and Schervish cover most of the material, but it’s very very long (hundreds of pages to get through).

]]>When the inferential goal is prediction, leave-one-out cross-validation makes a lot of sense, as it’s a proxy for having a true held-out test set (like actually using your model to predict tomorrow’s weather).

An alternative (though asymptotically similar) method for measuring prediction that I really like is outlined in this paper:

- Gneiting, Balabdaoui, and Raftery. 2007. Probabilistic forecasts, calibration and sharpness.
*JRSS B*.

I use an informal version of calibration and sharpness to compare models in my repeated binary trials case study.
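A toy version of the calibration idea (hypothetical forecasts, not the case study’s code): the probability integral transform (PIT) of outcomes under a calibrated forecast distribution should look uniform on [0, 1], and departures from uniformity diagnose miscalibration.

```python
# Toy sketch of a PIT calibration check in the spirit of Gneiting et al.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
truth = rng.normal(0.0, 1.0, size=2000)   # outcomes actually from N(0, 1)

# Calibrated forecaster: its predictive distribution matches the truth.
pit_good = norm.cdf(truth, loc=0.0, scale=1.0)

# Overdispersed (underconfident) forecaster: predicts N(0, 2).
pit_wide = norm.cdf(truth, loc=0.0, scale=2.0)

# Uniform(0, 1) has variance 1/12; the overdispersed forecaster's PIT
# values pile up near 0.5 and so have smaller variance.
print(np.var(pit_good), np.var(pit_wide))
```

Sharpness is then the separate question of how concentrated the predictive distributions are, subject to this calibration check passing.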

]]>Re LOO: As you know, not everyone is yet convinced that LOO is an advance over other metrics. I haven’t made up my mind. Will give you the chance to convince me in Helsinki sometime.

]]>Be warned, I guess, that they take an unabashedly ‘subjectivist’* and operational/predictive stance, where parameter inference is a secondary, special limiting case of predictive inference (and hence limited to parameters definable as large-sample functions of observables, i.e. identifiable parameters?).

Which means, for example, I’m not sure how well the approach fits with e.g. ODE-style models derived from other considerations of the sort familiar to scientists, engineers and applied mathematicians (if that’s what you’re after). Though Andrew seems to be shifting towards a predictivist stance while also working with ODE models and things too. So stay tuned, I guess?

They also discuss a bit the whole M-open and M-closed issue raised in Aki’s comment above (e.g. from one of his links: “The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit”). Though as far as I remember, they don’t really give much advice for dealing with M-openness…

(*but see Andrew and Christian’s recent ‘Beyond…’ paper)

]]>For BDA3 there are some R demos (including RStan and RStanARM demos) at https://github.com/avehtari/BDA_R_demos

Stan case studies are also excellent. A few more and it would be possible to use only them as course material :)

]]>Thanks again for pointing me in this direction!

]]>I’m sure others will provide amazing recommendations for Bayes/Stan so I’ll try to help out with some of the related components:

Principles of Applied Statistics (2011), Cox and Donnelly

An Accidental Statistician: The Life and Memories of George E.P. Box (2013), G.E.B

Introduction to Scientific Programming and Simulation Using R (2014), Jones et al.

An Introduction to Statistical Learning: with Applications in R, James et al.

A Concrete Approach to Mathematical Modelling (1995), Mike Mesterton-Gibbons

Modelling with Differential and Difference Equations (2004), Fulford et al.

Statistical Models in Engineering (1994), Hahn & Shapiro

A Paul Meehl Reader: Essays on the Practice of Scientific Psychology (2006), Edited by Waller et al.

]]>https://www.youtube.com/user/mikelwrnc/playlists

Content in all is R, tidyverse, & Stan, with slightly different model types across the courses. The 6001 one is probably the best of the three. ]]>

The first link is dead.

Thanks,

Phil

]]>