In an article catchily entitled, “I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?”, Meng and Xie write:

Possibly, but more likely you are merely a victim of conventional wisdom. More data or better models by no means guarantee better estimators (e.g., with a smaller mean squared error), when you are not following probabilistically principled methods such as MLE (for large samples) or Bayesian approaches. Estimating equations are particularly vulnerable in this regard, almost a necessary price for their robustness. These points will be demonstrated via common tasks of estimating regression parameters and correlations, under simple models such as bivariate normal and ARCH(1). Some general strategies for detecting and avoiding such pitfalls are suggested, including checking for self-efficiency (Meng, 1994, Statistical Science) and adopting a guiding working model.

Using the example of estimating the autocorrelation ρ under a stationary AR(1) model, we also demonstrate the interaction between model assumptions and observation structures in seeking additional information, as the sampling interval s increases. Furthermore, for a given sample size, the optimal s for minimizing the asymptotic variance of ρ.hat.MLE is s = 1 if and only if ρ^2 ≤ 1/3; beyond that region the optimal s increases at the rate of log^(−1)(ρ^(−2)) as ρ approaches a unit root, as does the gain in efficiency relative to using s = 1. A practical implication of this result is that the so-called “non-informative” Jeffreys prior can be far from non-informative even for stationary time series models, because here it converges rapidly to a point mass at a unit root as s increases. Our overall emphasis is that intuition and conventional wisdom need to be examined via critical thinking and theoretical verification before they can be trusted fully.
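The ρ² ≤ 1/3 threshold in the abstract can be reproduced with a back-of-the-envelope calculation. The sketch below is not taken from the paper; it assumes that observations taken s steps apart again follow an AR(1) with coefficient ρ^s, that the MLE of that coefficient has asymptotic variance (1 − ρ^(2s))/n, and that the delta method carries this back to ρ. The function names and the search cap `max_s` are made up for illustration.

```python
import math

def log_asymptotic_variance(rho: float, s: int, n: int = 1) -> float:
    """Log of the delta-method variance of rho-hat with sampling interval s.

    Assumptions (not from the paper): the s-spaced series is AR(1) with
    coefficient phi = rho**s, Var(phi-hat) ~ (1 - phi**2) / n, and
    rho-hat = phi-hat**(1/s), so Var(rho-hat) ~ Var(phi-hat) / (s * rho**(s-1))**2.
    Working on the log scale avoids overflow for large s. Requires 0 < rho < 1.
    """
    log_rho = math.log(rho)
    phi_sq = math.exp(2 * s * log_rho)  # rho**(2s); underflows harmlessly to 0.0
    return (math.log1p(-phi_sq) - 2 * math.log(s)
            - 2 * (s - 1) * log_rho - math.log(n))

def best_interval(rho: float, max_s: int = 200) -> int:
    """Integer sampling interval in 1..max_s with the smallest variance."""
    return min(range(1, max_s + 1), key=lambda s: log_asymptotic_variance(rho, s))

if __name__ == "__main__":
    for rho in (0.5, 0.6, 0.9, 0.99):
        print(rho, best_interval(rho))
```

Under these assumptions the ratio of the s = 2 variance to the s = 1 variance is (1 + ρ²)/(4ρ²), which drops below 1 exactly when ρ² > 1/3, in line with the statement quoted above.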

I’m very sympathetic to the argument that we have to be careful when imputing general statistical properties of a method based on past successes. I’m reminded of my (friendly) disputes with Adrian Raftery on Bayesian model selection. As Don Rubin and I wrote, “Raftery implies that the model with higher BIC will be expected to yield better out-of-sample predictions than any other model being compared. This implication is not generally true; there is no general result, either applied or theoretical, that implies this.” My guess is that Raftery was reasoning by analogy: he had a method derived from certain statistical principles and he just assumed that it would have other desirable properties. But, as Meng and Xie say, “intuition and conventional wisdom need to be examined via critical thinking and theoretical verification.”

The abstract of your paper reminds me of my Deep Thought paper. In addition, the world of time series is full of models that don’t make sense but are considered to be standard and acceptable. I think a key issue here is that econometricians are always afraid of cheating (also called “specification searches”). They distrust the idea of statistical data-based model-building (instead of what they prefer, which is a priori model building based on economic theory, or else fully nonparametric non-theoretically-based models). The statistical tradition of building a model using data with some theoretical support is not so popular in econometrics, as they worry that data-based modeling will violate statistical principles. I think this is why we often see economists running regression with data “straight out of the box” with minimal transformations of variables. Transforming is an opportunity to cheat. Similarly, they like AR or ARMA models with automatically-chosen lags because such models are objective and require no human input.

Here’s an example of a simple theoretically-based Bayesian model outperforming a default AR model. Cavan’s work was no surprise to the ecologists, but the time-series statisticians just couldn’t accept it. My impression was that they felt that the AR model was the game to be played and that it was cheating for a model to be built based on the structure of the problem.

Where’s the article? I like how they start with the “Possibly…” caveat.

It should be noted that “specification searches” is not what economists call them. Very confusingly for statisticians, we call it “data mining.”

> reasoning by analogy: he had a method derived from certain statistical principles and he just assumed that it would have other desirable properties.

I think this sort of thing can often happen by deduction as well, where deductions are made given a model (or representation) and are taken (implicitly assumed) to matter importantly for the representation.

My pet ones are assuming continuity or uncountable models – see here http://radfordneal.wordpress.com/2008/08/09/inconsistent-maximum-likelihood-estimation-an-ordinary-example/ (see comment 19 and what seems agreement afterwards)

or here http://normaldeviate.wordpress.com/2012/06/26/self-repairing-bayesian-inference/#comments

Such models are terribly convenient and not to be avoided but caution exercised about what implications are claimed to matter importantly in applications.

Oops, should have been

“to matter importantly for what is being represented (by the representation)”

This is fairly off-topic, but something that has bugged me for a while is the claim, which pops up subtly from time to time, that there’s no good theoretical reason to favor simpler models over more complex models, assuming both perform equally well according to some accuracy measure of interest on the data at hand.

I am curious what you think about some of the results from PAC Learning, developed by Leslie Valiant. I’ve often heard philosophers of science argue that this is a pretty strong support for practical validity of something like Occam’s razor, and that specifically it gives you a real reason to expect simpler models to have less generalization error.

In a course I took last year [1], we discussed this and it did not seem simple to me at all. The neat stuff on Kolmogorov complexity and the universal prior makes me really want to believe that minimizing complexity-related criteria (like BIC, say) is well-supported. But at the same time good arguments can be made in the other direction too.

When I feel confused by something, I try to write a paper about it. Below [2] is a link to the paper that my colleague Miguel Aljacen and I wrote about information geometry of Boosting in machine learning and how this might bolster Occam’s Razor type things. It’s super amateur, but maybe it is interesting.

There was also a fairly comprehensive and deeply technical master’s thesis by De Wolf [3], in which he addresses a lot of the reasons we might think Occam’s Razor and minimal-complexity explanations are favored (and tons of neat stuff on innateness of language).

As I see it, these things are deeply related to practical questions and I for one do think that various proxies for complexity often give useful info about the importance of “simple” answers over “complex” ones. At the same time, I have always thought your comments about inflating the number of parameters or levels in a model to get better accuracy are extremely useful and on track. In most human statistical endeavors, we’re not even close to having models that adequately explain the data, and the simplicity constraints only apply *after* you have several competing models that genuinely equally explain the current data. In real life we can often tolerate much more complexity because the simplest adequately-accounting-for-the-data model is probably far more complex than all the models we’ve yet imagined. But even while I believe this, I still think minimizing complexity proxies is informative, especially if you do have reason to suspect your current slew of competing models more or less equally account for the data.

[1]

[2] (This won’t be up too much longer as I just graduated. I’ll move it onto my blog sometime soon).

[3]

Not sure why the links did not come through, but they should be:

[1] (http://philtcs.wordpress.com/2011/11/03/class-8-occams-razor-the-universal-prior-and-pac-learning/)

[2] (http://homepages.cwi.nl/~rdewolf/publ/philosophy/phthesis.pdf)

[3] (http://homepages.cwi.nl/~rdewolf/publ/philosophy/phthesis.pdf)

If they don’t come through again and you’re interested enough, I guess let me know so you can put them in a comment to make sure they show up.

Ah, really sorry about this. The second link is supposed to be: (http://people.seas.harvard.edu/~ely/ThingsThatStartWithB.pdf)

On the topic of rethinking asymptotics: last week the following was published on arXiv:

http://arxiv.org/abs/1206.4762

“Asymptotics of Maximum Likelihood without the LLN or CLT or Sample Size Going to Infinity”

by

Charles J. Geyer

The summary reads:

If the log likelihood is approximately quadratic with constant Hessian, then the maximum likelihood estimator (MLE) is approximately normally distributed. No other assumptions are required. We do not need independent and identically distributed data. We do not need the law of large numbers (LLN) or the central limit theorem (CLT). We do not need sample size going to infinity or anything going to infinity. Presented here is a combination of Le Cam style theory involving local asymptotic normality (LAN) and local asymptotic mixed normality (LAMN) and Cramér style theory involving derivatives and Fisher information. The main tool is convergence in law of the log likelihood function and its derivatives considered as random elements of a Polish space of continuous functions with the metric of uniform convergence on compact sets. We obtain results for both one-step-Newton estimators and Newton-iterated-to-convergence estimators.

Kjetil: Thanks for posting this.

K O’Rourke: Thanks!

I have for a long time been telling people that to see if a normal approximation is OK, they should just plot the log-likelihood and see if it is quadratic. I guess I was thinking of this in a Bayesian way: if the log-likelihood is quadratic, and the prior is (almost …) uniform, then the posterior is normal… Ought to be true in a frequentist sense as well …. Geyer’s paper confirms this!

One way of looking at it: take a Poisson iid sample. The log-likelihood is

l(\theta) = -n\theta + n\bar{x}\log(\theta) + \text{constant}, \qquad \theta > 0,

which is not quadratic for any n, but it is smooth, so we can Taylor-approximate it with two terms, and for n large the maximum will be close to the true \theta, and the quadratic approximation can be good enough *in that vicinity*. But if the true \theta is too close to zero, that breaks down …
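For anyone who wants to check this numerically rather than by eye, here is a minimal sketch. It compares the Poisson log-likelihood with its second-order Taylor expansion around the MLE \hat\theta = \bar{x}, over a window of plus or minus two standard errors. The window width, the grid size and the function names are arbitrary choices for illustration, not anything from Geyer’s paper.

```python
import math

def poisson_loglik(theta: float, n: int, xbar: float) -> float:
    """Poisson log-likelihood up to an additive constant."""
    return -n * theta + n * xbar * math.log(theta)

def quadratic_approx(theta: float, n: int, xbar: float) -> float:
    """Second-order Taylor expansion around the MLE theta_hat = xbar.

    The first derivative vanishes at xbar and the second derivative is -n / xbar.
    """
    return poisson_loglik(xbar, n, xbar) - 0.5 * (n / xbar) * (theta - xbar) ** 2

def max_gap(n: int, xbar: float, width: float = 2.0, points: int = 201) -> float:
    """Largest |log-likelihood - quadratic| within `width` standard errors of the MLE."""
    se = math.sqrt(xbar / n)                      # from the curvature at the MLE
    lo = max(xbar - width * se, 1e-6 * xbar)      # keep theta strictly positive
    hi = xbar + width * se
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    return max(abs(poisson_loglik(t, n, xbar) - quadratic_approx(t, n, xbar))
               for t in grid)

if __name__ == "__main__":
    for n, xbar in ((100, 5.0), (1000, 0.05), (100, 0.05)):
        print(n, xbar, round(max_gap(n, xbar), 3))
```

In this parameterization the gap depends only on the total count n\bar{x}, which is one way to see both halves of the remark above: a larger n helps, and a \theta near zero hurts.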

Kjetil: That is what I was trying to put into pictures with these plots http://andrewgelman.com/2011/05/missed_friday_t/

But I find many things interesting about Geyer’s paper, the first of which is where he published it: as he indicates, it would be hard to get this sort of paper into most statistics journals.

I think it is fair to say there is something of a taboo against being informal (non-rigorous) in statistics journals and even in the discipline itself: we are supposed to be better (more analytical) than that (than those who just do computations). Most have spent years or decades assembling and sharpening their mathematical tool kits (as Don Rubin used to refer to these skills), and it is likely that they have a very (investigator) biased evaluation of the indispensable value of these kits, and even some resentment and mistrust of those who might make these kits dispensable.

Unfortunately being fully analytical in statistics (so far) has been insurmountable and an asymptotic dodge seems unavoidable though hardly satisfactory in theory even if often OK in applications. (Geyer suggests approximation could replace this asymptotic dodge and would not be so bad in theory and would be more flexibly and assuredly applied in practice. This is certainly consistent with something David Cox often said in seminars (Geyer was not claiming originality) – that in a given application a Monte-Carlo investigation for a relevant range of parameter values could easily displace/confirm the asymptotics for just those parameter values. Now I had thought of using importance sampling to interpolate and extrapolate those parameter values that had been simulated – seems Geyer did this already in the 1990s, so I can just read about it.)

More generally (and of greatest interest to me) there is the balance between being analytic and being computational when one wants to _apply_ statistics to problems. For instance, in the late 1980s my Biostats supervisor blasted me for doing a simulation study to confirm power, with the words “a professional statistician ought not to have to stoop to simulations to determine study power”. Today, almost everyone outside of classroom exercises will use some simulation to get power under a variety of assumptions about trial design and conduct that are yet to be (easily) represented analytically. The computational advantages over analytical representations are undeniable here.
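As a concrete picture of what such a power simulation looks like, here is a minimal sketch for a two-arm trial with normal outcomes and random dropout. Everything specific in it is a made-up assumption for illustration: the effect size, the dropout mechanism (independent, completely at random), and the large-sample z-test used for the comparison.

```python
import math
import random
import statistics

def simulate_power(n_per_arm: int, effect: float, sd: float = 1.0,
                   dropout: float = 0.0, n_sims: int = 2000, seed: int = 1) -> float:
    """Monte Carlo estimate of power for a two-sided 5% large-sample z-test.

    Hypothetical design: each randomized participant independently drops out
    with probability `dropout`; only completers are analyzed.
    """
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.975)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)
                   if rng.random() >= dropout]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_arm)
                   if rng.random() >= dropout]
        if len(control) < 2 or len(treated) < 2:
            continue  # too few completers to test; counted as a non-rejection
        se = math.sqrt(statistics.variance(control) / len(control)
                       + statistics.variance(treated) / len(treated))
        z = (statistics.fmean(treated) - statistics.fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims

if __name__ == "__main__":
    print(simulate_power(64, 0.5))               # no dropout
    print(simulate_power(64, 0.5, dropout=0.3))  # 30% dropout in each arm
```

The no-dropout case has a textbook answer that the simulation can be checked against; the point is that adding dropout, or anything messier, costs one extra line rather than a new derivation.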

Geyer seems to be placing his bets much more on the computational, positing that computer Monte-Carlo (what he calls the parametric bootstrap) will trump algebra and even computerized algebra in at least some areas of the statistics discipline.

Doubly unfortunately, without being fully analytical you cannot rule out being wrong. For instance, if one multiplies two truly quadratic likelihoods the product is quadratic, but multiplying two likelihoods that look almost indistinguishable from quadratic can give a product that is multimodal (this actually happened in my thesis). I’ll need to read Geyer’s paper more carefully to see if this case is excluded from his claims.
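A toy version of this is easy to set up, reading “quadratic” as a statement about the log-likelihood. The sketch below is not the example from the thesis; it uses Student-t shaped log-likelihoods with 30 degrees of freedom as a stand-in for curves that look close to quadratic near their modes, and counts the local maxima of the sum of two of them on a grid. The function names, the degrees of freedom and the grid are assumptions chosen for illustration.

```python
import math

def t_loglik(theta: float, center: float, nu: float = 30.0) -> float:
    """Student-t shaped log-likelihood in theta, up to an additive constant."""
    return -0.5 * (nu + 1.0) * math.log1p((theta - center) ** 2 / nu)

def quad_loglik(theta: float, center: float) -> float:
    """Exactly quadratic log-likelihood with unit curvature."""
    return -0.5 * (theta - center) ** 2

def count_modes(loglik, center_a: float, center_b: float,
                half_width: float = 10.0, step: float = 0.01) -> int:
    """Count strict local maxima of loglik(., a) + loglik(., b) on a grid."""
    steps = int(round(half_width / step))
    grid = [i * step for i in range(-steps, steps + 1)]
    total = [loglik(t, center_a) + loglik(t, center_b) for t in grid]
    return sum(1 for i in range(1, len(total) - 1)
               if total[i - 1] < total[i] > total[i + 1])

if __name__ == "__main__":
    print(count_modes(quad_loglik, -6.0, 6.0))  # sum of two quadratics
    print(count_modes(t_loglik, -3.0, 3.0))     # t-shaped, centers fairly close
    print(count_modes(t_loglik, -6.0, 6.0))     # t-shaped, centers far apart
```

For two t-shaped curves centered at ±a, a little algebra suggests the sum turns bimodal once a² exceeds the degrees of freedom, so the problem only shows up when the two sources disagree strongly, which is exactly when a plot of each one separately would look harmless.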

So you need both analytical and computational; it’s just a question of getting the right balance. Perhaps one might say: analytical understanding should not be the enemy of visualizing or explaining in concrete terms.