“I would argue that an undergrad education probably doesn’t give enough perspective to do all of this, even though the basic mathematical tools are there. You need to be comfortable building things from scratch and dealing with people in intense situations. I’m not sure how to train someone for the latter…”

]]>I think seriously don’t – but perhaps I should do a post to clarify why I think that.

Briefly, I think it’s just the mistaken sense that getting the sampling distribution solves statistical inference (Efron/Tibshirani preferring the use of bootstrap ideas and Cox/Fraser preferring the use of higher order asymptotics).

When dealing with realistic problems (e.g. nuisance parameters, estimates that depend on higher moments, parameter spaces that are non-linear, etc.) it can be very hard to get actual coverage anywhere near supposed coverage. Yes, thoughtful experienced experts can get it right.

Efron once wrote (around 2000) that most statisticians don’t. That included me, in that when the percentile intervals were _similar_ to the BCA intervals in applications, I (mis)thought they had no advantage (i.e. I forgot to think about the distribution of intervals over repeated applications).

Most times I remember checking out bootstrap work done by colleagues, it was not valid according to technical papers written by experts.

Also, Efron did write about the bootstrap as being a Bayesian approach that automatically/implicitly defined a _reasonable_ non-informative prior.

]]>Thanks for laying out what you mean by “measure theory” — my understanding of that phrase is in line with Chris and Juho’s comments.

But if we’re talking about teaching how to apply statistics, we also need to give examples of how this applies to real world problems. So my question is: If you were teaching statistics, how would you illustrate your definition in a context such as studying the effect of a drug on a certain population of patients?

]]>Learning how to do this process, which is just what I would call “science”, is what PhD candidates in research are learning how to do, in principle. And I think it’s no mistake that 1) a PhD takes a long time to complete, and 2) it evidently doesn’t appropriately cover all the steps anyway.

That is to say, while I agree with the general idea of an education focused on these issues, I can’t see that an undergrad major would be enough space to do justice to it. Current trends in what the data science community cares about teaching aren’t encouraging, either. Either you get a research scientist who understands “data gathering” pretty deeply and mostly ad-hocs the analysis, or you get a data scientist who has the analysis down but thinks “data gathering” means finding the right URL and being critical about the codebook.

]]>min d_1(theta,0)

s.t. d_2(y,f(theta)) <= delta

for any distances d_1, d_2. The problem is then parameterised by delta. Or use Lagrange multipliers.

As for whether things are coded directly or not: it depends. Here, using Python/Matlab, you code your own model, choose your metrics and use in-built optimisation routines (e.g. scipy.optimize etc).
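As a minimal sketch of that workflow, here is the constrained problem handed to a built-in routine. Everything specific is an assumption made for illustration: the linear model f(theta) = X @ theta, squared Euclidean distance for both d_1 and d_2, the synthetic data, and the choice of SLSQP as the solver.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical toy model f(theta) = X @ theta with synthetic data y.
X = rng.normal(size=(30, 3))
theta_true = np.array([1.0, 0.0, -2.0])
y = X @ theta_true + 0.1 * rng.normal(size=30)

def d1(theta):
    """Distance of the parameters from zero (squared Euclidean norm)."""
    return np.sum(theta ** 2)

def d2(theta):
    """Distance between data and model output (sum of squared residuals)."""
    return np.sum((y - X @ theta) ** 2)

# The unpenalised least-squares fit gives the smallest achievable misfit,
# and a feasible starting point for any delta above that misfit.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
best_misfit = d2(theta_ls)

def solve(delta):
    """Minimise d1(theta) subject to d2(theta) <= delta."""
    constraint = {"type": "ineq", "fun": lambda th: delta - d2(th)}
    result = minimize(d1, x0=theta_ls, method="SLSQP", constraints=[constraint])
    return result.x

# The problem is parameterised by delta: sweep a few values.
deltas = best_misfit * np.array([1.5, 3.0, 10.0])
solutions = [solve(delta) for delta in deltas]
for delta, theta in zip(deltas, solutions):
    print(f"delta={delta:.3f}  d1={d1(theta):.3f}  d2={d2(theta):.3f}")
```

Loosening delta lets the parameters shrink further towards zero at the cost of a worse fit, which is the trade-off being explored.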

(We also teach computational/numerical methods where they code the methods from scratch and project-based design papers where they code heuristic optimisation/search algorithms themselves tailored to specific problems).

]]>https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox

https://en.wikipedia.org/wiki/Hairy_ball_theorem

I used to work in computational linguistics, which was split between linguistics departments and computer science departments. I always thought it was just linguistics done by people who knew more math and computer science. Because of the computational burden and the need to develop algorithms, it had a home in computer science. And because there were applications, like search and speech recognition. I think that’s also why machine learning is in computer science: there are some hard computational problems and somebody needed to do them. And the work needed to do them looked like systems and algorithms, so it looked like computer science. I just think it should be part of statistics. Just because we use computers to get the answer doesn’t make it computer science (and I’m a computer scientist by training more than anything else!). Physics uses a ton of computing, but there aren’t computational physicists in computer science departments. Why not? I think it’s because physicists are better at doing their own computing than statisticians.

]]>The stuff I think of as “measure theory” is all that stuff about “measurable functions” and “measurable sets” and how the individual points have probability 0 and only certain sets even have probability, and the probability of an event is no longer the sum of the probabilities of the elements in it. All that is pretty unintuitive and requires you to adopt a point of view that is, in my opinion, contrary to what is useful in model building. Hence, I like the nonstandard number of balls in an urn view of mathematical probability. It corresponds to a direct idealization of what is actually done (namely, using RNGs to give you IEEE floats and the like).

The next thing I think is of concern is that teaching mathematical probability theory tends to move you away from other interpretations of probability and cement a “one true interpretation” view. For example the Jaynesian plausibility of the truth view has nothing to do with balls and urns so to speak. So, I don’t exactly disagree with you about the importance of having some mathematical knowledge, but I think some subtle thinking needs to go into how to convey useful mathematical concepts without having them shut-out useful model-building ways of thinking. The math can have many different meanings attached to it, and how you teach the math seems to affect how well people are able to transport the math across to the meaning.

]]>A *sample space* is a set $\Omega$.

An *event* is a subset of the sample space, $A \subseteq \Omega$. Let $\mathcal{E}$ be the collection of events (there are a few reasonable closure conditions such as that $\Omega$ is an event and we have complements and countable unions; we almost never need to worry about these details).

A *probability measure* is a function $\mathrm{Pr}: \mathcal{E} \rightarrow [0, 1]$ satisfying some basic axioms (like positivity and that disjoint events have additive probabilities; these details are all obvious).

Oh, and a random variable is a function $X: \Omega \rightarrow \mathbb{R}$ with cumulative distribution function $F_X(x) = \mathrm{Pr}[X \leq x]$.

If you want to get fancy, you can define a pdf as the derivative of a continuous cdf.

You can’t tell me you don’t know all this.
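For concreteness, here is a small sketch of those definitions in the finite case, where every subset of the sample space is an event and the closure conditions are automatic. The two-dice sample space and the function names are illustrative assumptions, not anything taken from a textbook or library.

```python
from fractions import Fraction
from itertools import product

# Sample space: all outcomes of rolling two fair six-sided dice (illustrative choice).
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability measure: an event is a set of outcomes, each outcome weighs 1/36."""
    return Fraction(len(set(event) & set(omega)), len(omega))

def X(outcome):
    """Random variable: a function from outcomes to numbers (here, the sum of the dice)."""
    return outcome[0] + outcome[1]

def cdf(x):
    """Cumulative distribution function: probability of the event {X <= x}."""
    return prob({w for w in omega if X(w) <= x})

# Disjoint events have additive probabilities.
sum_is_2 = {w for w in omega if X(w) == 2}
sum_is_3 = {w for w in omega if X(w) == 3}
print(prob(sum_is_2 | sum_is_3), prob(sum_is_2) + prob(sum_is_3))
print(cdf(7))
```

The continuous case is where the closure conditions start to matter; this sketch only shows the shape of the definitions.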

]]>I spend almost all of my time on ‘mechanistic’ modelling via ODEs, PDEs, reaction networks etc. I recently taught one lecture on estimation for some ODE models we’ve been deriving.

Given the time constraints I taught them about a) distances for measuring model fit to data, b) measures of model complexity and c) how to explore the trade-off between a and b. Basically, penalised nonlinear regression (or multiobjective optimisation), with discussion of the role of penalties and ways of choosing/exploring them. No randomness required.

I would call this ‘parameter estimation’.
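A rough sketch of what a), b) and c) can look like for a small ODE model follows. All specifics are assumptions for illustration: the logistic ODE, the made-up deterministic measurement offsets, the squared-error misfit, the squared-norm complexity measure, and Nelder-Mead as the optimiser.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def simulate(theta, t_obs, x0=0.1):
    """Solve the assumed logistic ODE dx/dt = r*x*(1 - x/K) at the observation times."""
    r, K = theta
    sol = solve_ivp(lambda t, x: r * x * (1.0 - x / K),
                    (t_obs[0], t_obs[-1]), [x0], t_eval=t_obs,
                    rtol=1e-8, atol=1e-10)
    return sol.y[0]

# Synthetic "data": model output plus fixed, made-up offsets (no random numbers).
t_obs = np.linspace(0.0, 10.0, 21)
theta_true = np.array([1.0, 2.0])
offsets = 0.05 * np.sin(7.0 * np.arange(t_obs.size))
y = simulate(theta_true, t_obs) + offsets

def misfit(theta):
    """a) distance between model output and data."""
    return np.sum((simulate(theta, t_obs) - y) ** 2)

def complexity(theta):
    """b) a simple complexity measure: squared distance of the parameters from zero."""
    return np.sum(theta ** 2)

def fit(lam, start):
    """Minimise misfit + lam * complexity over log-parameters, so r and K stay positive."""
    objective = lambda phi: misfit(np.exp(phi)) + lam * complexity(np.exp(phi))
    res = minimize(objective, np.log(start), method="Nelder-Mead",
                   options={"xatol": 1e-6, "fatol": 1e-10, "maxiter": 2000})
    return np.exp(res.x)

# c) sweep the penalty weight and record the trade-off, warm-starting each fit.
lambdas = [0.0, 0.1, 1.0, 10.0]
fits, start = [], np.array([1.0, 1.0])
for lam in lambdas:
    start = fit(lam, start)
    fits.append(start)
    print(f"lambda={lam:5.1f}  misfit={misfit(start):.4f}  complexity={complexity(start):.4f}")
```

Plotting misfit against complexity across the sweep gives the trade-off curve to discuss with students.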

In terms of _data analysis_ I would just teach how to explore variability of useful statistics via bootstrap/resampling.

]]>“But once upon a time we didn’t have statistics departments, they were math. We didn’t have computer science departments, they were — what? I.E.? Math?”

I’m not sure this is entirely true. Yes, once upon a time we didn’t have statistics departments, but my impression is that statistics was surfacing in a variety of academic departments (as well as non-academic environments) — yes, math was one of them, but statistics was developed in various ways by economists, demographers, astronomers, biologists, industrial engineers, and others.

“The world needs academic faculty aimed directly at that phenomenon, and I’d be willing to bet it will get them, in large part because rich parents and donors will see what I see.”

The university from which I am now retired did not have a statistics department until after I retired, and I was involved in the move to create one, so can offer some interesting comments on how a statistics department might arise.

1. In preparation for co-writing a proposal for a statistics department (not the first attempt anyone had made at this — a previous one a decade or so earlier had failed), I looked at the “sister institutions” that my university usually compared itself to. I found that of the nine or so, Texas and Indiana were the only ones without a statistics department. These were also the only universities in the group that did not have any of the following: A school of medicine, school of public health, or school of agriculture. I think this was not coincidental, but that the existence of such entities was usually crucial in making the case for a statistics department.

2. At the time of the proposal, the Deans involved were of the opinion that getting the OK for a department at that time was not likely, but that a proposal for a Division of Statistics had a better chance of being accepted, so that’s what we proposed.

3. Coincidentally, there was also an interest within Natural Sciences for a unit (separate from the Department of Computer Sciences) dedicated to Scientific Computation.

4. The Dean decided to combine both initiatives, and proposed a Division of Statistics and Scientific Computation, which was approved.

5. After several years, that Division was “upgraded” to a Department of Statistics and Data Sciences.

Bottom line: The establishment of a new department can involve a variety of factors and may vary from place to place.

]]>I personally would like to see people teach deterministic type model building first… ODEs and algebraic models and so forth. Teach people the importance of dimensionless quantities, and segue from choice of dimensionless scales to choice of priors. After all, most of dimensional analysis is about choosing scales so that things are O(1) and that goes reasonably well towards then putting a weighting function over the quantity.

Then, once you have a weighting function over the quantities… talk about how to find out more from collecting data and comparing predictions to the data under different possible values of the unknowns. How much weight should we give to parameter values that cause predictions to vary highly from measured? How about if predictions are close to measured? How do we measure “close”?

I wouldn’t even once mention the idea of “random”, just different weights to be given to different values of theoretical quantities, and different weights to be given to different degrees of difference between prediction and measured quantities.

Once those ideas are cemented, we can move on to how to compute with these weights… at that point you could introduce several computing techniques: resampling, ABC, HMC/MCMC, etc.
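Before any of those techniques, the weighting idea can be computed directly on a grid. The sketch below assumes a ball dropped from rest with no drag, made-up measurement offsets, and Gaussian-shaped weighting functions with hand-picked scales; all of these are illustrative choices, not part of the proposal above.

```python
import numpy as np

# Hypothetical measurements: distance fallen (metres) at a few times (seconds).
t = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
measured = 0.5 * 9.8 * t ** 2 + np.array([0.004, -0.006, 0.003, 0.008, -0.005])

# Candidate values of the unknown quantity g, in m/s^2.
g_grid = np.linspace(8.0, 12.0, 401)

# Weighting function over g before seeing data: centred on 9.8, width 0.5 (assumed).
prior_weight = np.exp(-0.5 * ((g_grid - 9.8) / 0.5) ** 2)

# Deterministic model prediction for every candidate g: distance = g * t^2 / 2.
predicted = 0.5 * g_grid[:, None] * t[None, :] ** 2

# How "close" is prediction to measurement? Sum of squared differences,
# down-weighted on an assumed measurement scale of 1 cm.
mismatch = np.sum((predicted - measured) ** 2, axis=1)
scale = 0.01
fit_weight = np.exp(-0.5 * mismatch / scale ** 2)

# Combine the two weightings and normalise so the weights sum to one.
weight = prior_weight * fit_weight
weight /= weight.sum()

g_best = g_grid[np.argmax(weight)]
print(f"most heavily weighted g: {g_best:.2f} m/s^2")
```

No random numbers are involved anywhere: the result is simply a set of weights over candidate values of g.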

]]>I don’t think I know any measure theory. Or maybe I do, and I just don’t know that it’s called that. The most theoretical things I’ve ever done are the appendix in BDA (which was my reconstruction of arguments that I’d heard before but not ever seen formally explained; actually it turns out it was all in De Groot’s book, but I didn’t know that at the time) and my 0.234 paper (but Gareth took over the writing of that one so the proof ended up being written in some sort of mathematical code that I couldn’t really follow; my own proof was much more long-winded and I’m sure was less rigorous).

]]>It was actually your comment of August 2, and Shravan’s and Andrew’s comments, that highlight for me the utility of situating stats courses in a broader, less rushed curriculum about data gathering, analysis, and use. Making the math concrete.

Your comment underscores that the adjacent academic departments will always “do their own thing” better and at higher research levels than the type of engineering department I’m describing (and may be better at avoiding what you view as fads, if I understand your scare quotes around “machine learning”). Sure, that’s a given. But once upon a time we didn’t have statistics departments, they were math. We didn’t have computer science departments, they were — what? I.E.? Math? We once didn’t have area studies or urban studies or epidemiology. Fields and departments seem to arise when faculty start saying, gee, a lot of our students could benefit from looking at the picture in this adjacent way, but we just don’t have the time in our degree sequence, and it would be a shame to sacrifice all the cool stuff our senior undergrads get into.

I think I see you all talking that way. And I know your students are growing up in a world uniquely — uniquely, mind! — awash in “data.” The world needs academic faculty aimed directly at that phenomenon, and I’d be willing to bet it will get them, in large part because rich parents and donors will see what I see.

]]>How much measure theory do you need for that? Unless understanding multivariate calculus means that you know enough measure theory, even if you’ve never heard of a sigma-algebra.

> Trying to write about stats these days without measure theory would be like trying to write about gravitation without manifolds. Sure, it’s possible, but you’ll be left out of most of the conversations.

That’s a different question. It’s useful for multiple reasons but it should not be an end in itself. And it doesn’t necessarily help students who are not interested in pursuing an academic career in this field.

I’ll quote a passage from Jaynes’ book, the whole appendix B “mathematical formalities and style” is worth reading (there are a couple of pages on measure theory in particular): http://www.hep.fsu.edu/~wahl/phy5846/statistics/jaynes/pdf/cappb8.pdf

“If anyone wants to concentrate his attention on infinite sets, measure theory, and mathematical pathology in general, he has every right to do so. And he need not justify this by pointing to useful applications or apologize for the lack of them; as was noted long ago, abstract mathematics is worth knowing for its own sake.

But others in turn have equal rights. If we choose to concentrate on those aspects of mathematics which are useful in real problems and which enable us to carry out the important substantive calculations correctly – but which the mathematical pathologists never get around to – we feel free to do so without apology.”

OK students, you might have heard about this thing called ‘gravity’ but we won’t cover that until grad school after you’ve done analysis on manifolds.

(PS funnily enough I actually do think manifolds should be introduced much earlier in standard curricula…)

]]>I don’t think we need a new major (but then I don’t think we need “machine learning” or “uncertainty quantification” either). I think we need to be teaching all these things in statistics as well as visualization and communication and serious computing.

Creating new faculty at a university is always an uphill battle. Departments jealously guard tenure lines. Data science has been a boon to both computer science and statistics departments in terms of faculty growth, but I don’t think either wants to see data science departments. As far as I know, Columbia’s Data Science Institute doesn’t have any faculty tenure lines.

]]>I only meant enough calculus to understand what limits, derivatives, and integrals are, and enough measure theory to understand event probabilities, random variables, and the related notions of expectation and covariance, as well as the key notion of concentration of measure (this last bit is the thing most people miss that is critical for understanding Bayesian posteriors). That amount of math stats and measure theory is *presupposed* by BDA. The very first hierarchical modeling example in chapter 5 assumes you can compute a multivariate Jacobian and match moments of a beta distribution. On the other hand, Gelman and Hill manage to mostly sidestep measure theory, though they provide a hand-waving definition of random variables (balls in an urn [aka sample space] with values written on them for the random variables—they don’t mention you need an uncountable number of balls!).

Trying to write about stats these days without measure theory would be like trying to write about gravitation without manifolds. Sure, it’s possible, but you’ll be left out of most of the conversations.

]]>Yes, I agree.

> I work on problems where least squares and maximum likelihood estimates don’t work well.

Me too. The main problems with these are a) lack of robustness and b) problems with non-identifiable models.

In many ways I do think it is a good way to separate estimation theory from uncertainty of estimation (this is something I’ve changed my mind on). In this sense, statistics in the bootstrap-style is just about uncertainty of an estimator determined via other means.

Regularisation theory is one way of thinking about estimation theory. Bayes is another. They have some overlap but are not identical. Increasingly I prefer the mathematical foundations of the former to the latter. In particular, I think Bayes still has serious issues with a) and b) above.

]]>Things you will find in BDA: how to do Bayesian statistics at an advanced level.

Things you will not find in BDA: measure theory.

You can do essentially all of probability using nonstandard analysis and finite probability spaces. Edward Nelson’s PDF is a must-have:

]]>1. Models are just approximations of reality and involve choices by the modeler.

2. Bayesian inference is inductive inference—it shows how to update knowledge with uncertainty in the light of evidence with uncertainty. Bayesian probabilities model that uncertainty.

3. Statistics is counterfactual—we model past events probabilistically as if they might have turned out otherwise, even if they only have one value.

]]>Bootstrap is great but the hard part is coming up with the estimate to bootstrap. That is, you have data y, estimate theta_hat(y), bootstrap distribution p(y_boot|y), and bootstrap distribution of the estimate, theta_hat(y_boot), induced by p(y_boot|y). The literature on the bootstrap is all about coming up with a reasonable p(y_boot|y) and also figuring out what to do with the induced distribution on theta_hat(y_boot). That’s all fine, but I think the real action is in deciding the estimator theta_hat(). I work on problems where least squares and maximum likelihood estimates don’t work well. You can take a bad estimate and bootstrap all you want, without the problem getting fixed. Again, this is not a slam on the bootstrap, just a reminder that bootstrap requires an externally supplied theta_hat().
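To make the pieces concrete, here is a minimal sketch of the nonparametric bootstrap with each ingredient labelled. The synthetic data, the choice of the median as theta_hat(), and the use of a plain percentile interval are all assumptions for illustration; the percentile interval is only the simplest thing one can do with the induced distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data y (hypothetical: 200 draws from a skewed distribution).
y = rng.exponential(scale=2.0, size=200)

def theta_hat(sample):
    """The externally supplied estimator; the median is just an illustrative choice."""
    return np.median(sample)

def bootstrap_estimates(data, estimator, n_boot=2000, rng=rng):
    """Draw y_boot from p(y_boot | y) by resampling with replacement, then re-estimate."""
    n = len(data)
    idx = rng.integers(0, n, size=(n_boot, n))
    return np.array([estimator(data[row]) for row in idx])

# Induced distribution of theta_hat(y_boot).
boot = bootstrap_estimates(y, theta_hat)

# One simple summary of it: a 95% percentile interval.
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"theta_hat(y) = {theta_hat(y):.3f}, percentile interval = ({lo:.3f}, {hi:.3f})")
```

Swapping in a different `theta_hat` changes nothing else in the code, which is the point: the bootstrap only quantifies variability of whatever estimator it is handed.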

]]>‘Statistics of Parameter Estimates: A Concrete Example’ by Aguilar et al. They include both Bayesian and frequentist methods (the frequentist approach is based on Philip Stark’s work in inverse problems). See http://epubs.siam.org/doi/abs/10.1137/130929230

As might be expected by the cynical, none of the methods actually work _especially_ well considering the example is so simple. But they all work OK ish. The culprit is likely model misspecification (e.g. the assumed drag law) and/or measurement issues. The confidence intervals seem to do the best though, being more conservative.

]]>http://models.street-artists.org/2013/04/18/dropping-the-ball-paper-experiments-in-dynamics-and-inference-part-2/ , http://models.street-artists.org/2013/04/26/1719/ and associated additional posts.

Now, in this experiment, the parameter g for gravitational acceleration is a meaningful thing: it’s known to be somewhere close to 9.80 m/s^2 wherever you are on the earth, and there are tables for various cities, so it would be totally unreasonable to infer from dropping balls that g = 7.1 or g = 12.16 is a reasonable value.

Further, the aerodynamic 2*radius should be somewhere close-ish to the measured diameter of the ball, it has to be that order of magnitude. If we measure a ball at 7.5cm diameter and we get from our model that it falls as if it were a perfect sphere of diameter 0.1 cm we know we’ve gone wrong somewhere…

So, if your starting point is a physicsy mathematical model of some real stuff happening, the prior is often a very concrete idea of about what the different parameters should be because they have semantics associated to them… they mean something concrete. And, the idea of a sampling distribution of repeated trials under identical conditions… not so much. What does it even mean to have identical conditions? What you consider to be “identical” will strongly affect what you consider to be the “sampling distribution”, so the “sampling distribution” is a really abstract thing depending on some subtleties such that different groups of students would see potentially very different distributions when they try identical repetitions of certain experiments.

On the other hand, I do agree with you about constructive procedures and their role in understanding models. It’s no good to worry about abstract properties of mathematical objects… like the integrability of the indicator function on the irrationals… a totally meaningless thing for applied math. You can’t compute the decimal expansion of even *one* irrational.

]]>I am very pro math. But ‘mathematicising’ is not, to me, the same thing as ‘modelling’ unless the latter is so broad as to mean little.

Why is ABC a seemingly attractive way to teach Bayes? I would agree it’s because it presents a constructive _procedure_ for carrying out Bayes updates instead of just an axiomatic or algebraic characterisation. The procedure itself seems more concrete even than the idea of Bayes updates.
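The procedure is short enough to write out. This is a minimal sketch of rejection ABC, assuming a uniform prior, a binomial data model, hypothetical observed data, and exact matching as the acceptance rule (with exact matching on discrete data no approximation is involved).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical observed data: 7 successes in 20 trials.
n_trials, observed = 20, 7

# Step 1: draw candidate parameter values from the prior (uniform on [0, 1]).
n_draws = 100_000
theta = rng.uniform(0.0, 1.0, size=n_draws)

# Step 2: simulate a dataset from the model for each candidate.
simulated = rng.binomial(n_trials, theta)

# Step 3: keep only the candidates whose simulated data match the observed data.
accepted = theta[simulated == observed]

print(f"accepted {accepted.size} of {n_draws} draws")
print(f"mean of accepted draws: {accepted.mean():.3f}")
```

The accepted draws can then be compared with the algebraic answer, here a Beta(8, 14) distribution with mean 8/22, to see that the procedure and the formula agree.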

Of course you can and should analyse this procedure to see where it takes you. For example, you can characterise the real numbers algebraically, and you can characterise them via a construction procedure using the rationals. It’s nice to see that they agree.

On the other hand, constructive mathematics (for example) offers a different view on mathematical concepts that some students find helpful. For example, I recently-ish had a student returning to university mathematics after dropping out when younger in part due to difficulties with concepts like limits. I explained these ideas to them in terms of procedures and that seemed to help significantly.

In terms of introducing difficult concepts I think:

– Introduce a concrete problem to be solved or a clear goal

– Introduce a constructive procedure for tackling this specific problem, taking you through a series of reasonable and understandable steps

– Demonstrate the good properties of the chosen procedure for tackling the problem

– Discuss what can go wrong and the limitations of the procedure. E.g. using counterexamples.

The last two points are very important, of course, and can make all the difference.

To paraphrase one of Andrew’s least favourite people – we don’t _understand_ mathematics, but we learn how to _do_ it.

Now, back to Bootstrap.

Bootstrap is similar to ABC – its most direct interpretation is as a procedure. You can also analyse its properties as a procedure: where does this process ‘take you’?

One benefit, as I tried to argue, is that the ‘series of reasonable and understandable steps’ are simpler than ABC because it starts with a concrete given – the dataset – and not an abstract unknown – the parameter/prior over a parameter.

Bootstrap is direct, ABC/rejection is like the contrapositive. They may be equivalent (I would argue against in this case tbh) but one direction is more intuitive.

]]>