Cosma Shalizi writes:

Kevin Kelly has an interesting take on Ockham’s razor, which is basically that it helps you converge to the truth faster than methods which add unnecessary complexities let you do. I think his clearest paper about it is this, though sadly it looks like he removed the cartoons he had in the draft versions.

I took a look. Here’s the abstract:

Explaining the connection, if any, between simplicity and truth is among the deepest problems facing the philosophy of science, statistics, and machine learning. Say that an efficient truth-finding method minimizes worst-case costs en route to converging to the true answer to a theory choice problem. Let the costs considered include the number of times a false answer is selected, the number of times opinion is reversed, and the times at which the reversals occur. It is demonstrated that (1) always choosing the simplest theory compatible with experience and (2) hanging onto it while it remains simplest is both necessary and sufficient for efficiency.
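To make the cost accounting in the abstract concrete, here is a minimal toy sketch, not Kelly's formal framework. It assumes a made-up "count the effects" problem: each observation is 1 if a new effect appears and 0 otherwise, and the question is how many effects exist in total. The functions `ockham`, `eager`, and `run`, the `patience` parameter, and the example stream are all hypothetical names and data chosen for illustration. The sketch tallies two of the costs the abstract mentions (false answers and reversals of opinion) on a single stream, whereas Kelly's result concerns worst-case costs.

```python
def ockham(history):
    """Simplest answer compatible with experience: exactly the effects seen so far."""
    return sum(history)


def eager(history, patience=2):
    """Hypothetical non-Ockham method: bet on one extra, unseen effect,
    and fall back to the observed count only after `patience` quiet steps."""
    seen = sum(history)
    quiet = 0  # consecutive observations with no new effect, counted from the end
    for obs in reversed(history):
        if obs == 1:
            break
        quiet += 1
    return seen if quiet >= patience else seen + 1


def run(method, stream, truth):
    """Apply a method after each observation and tally its costs."""
    guesses = [method(stream[:t]) for t in range(1, len(stream) + 1)]
    retractions = sum(a != b for a, b in zip(guesses, guesses[1:]))  # reversals of opinion
    errors = sum(g != truth for g in guesses)  # times a false answer is selected
    return guesses, retractions, errors


# Illustrative stream: two effects appear, at steps 2 and 5.
stream = [0, 1, 0, 0, 1, 0, 0, 0]
truth = 2

for method in (ockham, eager):
    guesses, retractions, errors = run(method, stream, truth)
    print(method.__name__, guesses, "retractions:", retractions, "errors:", errors)
```

On this stream both methods end up at the right answer, but the eager method reverses itself more often, because every bet on an unseen effect has to be taken back before it can converge. One stream proves nothing about worst cases; it only shows what is being counted.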

This is fine, but I don’t see it applying in the sorts of problems I work on, in which “converging on the true answer” requires increasingly complicated models as more data arrive. To put it another way, I don’t work on “theory choice problems,” and I’m invariably selecting “false answers.”

P.S. I’m not saying this to mock Kelly’s paper; I can imagine this can be useful in some settings, just maybe not in problems such as mine where I would like my models to be more, not less, inclusive.

You say:

"converging on the true answer requires increasingly complicated models as more data arrive"

Kelly says:

"always choosing the simplest theory compatible with experience and hanging onto it while it remains simplest is necessary for efficiency" [my emphasis, and I've omitted text without inserting ellipses]

It doesn't seem to me that you're contradicting Kelly.

On the other hand, I'm not sure from the excerpts how Kelly's take on Ockham is new.

Derek: Perhaps Kelly and I don't differ much in practice. The key difference is that Kelly says, "always choosing the simplest theory compatible with experience," whereas I want the most complicated theory I can fit stably. As my ability to build and fit models increases, I'd like to fit more and more complicated models. For me, simplicity in modeling is not a goal, only a compromise. (I seek simplicity in displaying inferences, but that's another story.)

I see 'compatible with experience' as 'compatible with the sample size'.

And let's not forget Tufte's recommendation (from his second book, Envisioning Information): "simplicity of reading derives from the context of detailed and complex information, properly arranged. A most unconventional design strategy is revealed: to clarify, add detail."

Ockham is good for theory choice, but one needs an entirely different principle for theory creation, something imaginative.

I'm afraid that the problem you might have with Kelly's abstract is a category error resulting from a lack of context, unless you explicitly take a competing justification as a defeater. That is, his theory is couched within a larger program of methodological reliabilism based on algorithmic learning theory and interesting correspondences between notions of reliability (logical guarantees to converge with certainty; limiting, delta-epsilon-style convergence; and gradual convergence) and what is known as the Borel hierarchy in point-set topology. (Interested readers can see his project page for the hairy details.)

The theory is prior to standard statistical assumptions (IID, finite additivity, etc.), and is aimed at legitimizing Ockham's razor in the face of competing justifications, as part of a wider methodological and philosophical program. There are any number of practical contexts where stronger assumptions and the associated loss functions (if you will) will obviate a strictly Ockham approach. In short, it is not meant as a categorical methodological imperative, but as a justification of a component of theory selection and, iterated indefinitely, scientific discovery.

Full disclosure: I was a student of Kelly's, and wrote the woefully inadequate Wikipedia entry on his approach to formal epistemology (computational epistemology).

This might be the paper Cosma was thinking of. Also, this has some nice criticisms of the alternative justifications.

John: From the abstract of the paper you cite:

This is interesting but it does not correspond to any science or engineering research that I've ever done. I could see it being useful in the context of some set of procedures that are used in machine learning or whatever; it just doesn't fit in to my understanding of research or scientific problem solving.

Andrew: I appreciate your point, but this is not what the theory is for. A Turing machine is not much like actual physical computers, but it describes the ideal limitations of them, nonetheless. Similarly, Kelly's model of scientific inquiry does not resemble science in the wild, but it strives to establish the ideal limits of inference (and its dual, success) in a mathematically disciplined way. Arguably, both are normative in this way. It helps to know that a problem is incomputable, or that you cannot have a logically reliable means of inference for an empirical problem (particularly in ML, as you suspect). Conversely, there may be a reliable method, which is also good to know.

In my view, to get to the point where it even roughly describes the richness and challenges of actual scientific practice, additional (and familiar) assumptions need to be made about the nature of the evidence, relevant possibilities, practical costs, etc. But, again, this is not its primary aim. Another way to see this is to note that, generally, Kelly's arguments trade on worst-case analyses to provide methodological justifications via logical assurances under minimal background assumptions. This is what you do when you are trying to create things like complexity or computability classes, not describe scientific practice in relevant detail.

This sort of thing is not everyone's cup of tea, but a Carnot engine should not be faulted because it does not correspond in important ways to what you have under your hood.

Anyway, I wanted to take this opportunity to thank you for keeping such an informative and interesting blog. I'm a long-time lurker and first-time poster.

John: Yes, I see what you mean, in general terms. I still don't see the point of this particular theory, which sets a goal of simplicity. I agree that simple theories are good if you're a physicist, maybe not so much if you're a political scientist.

This might help to explain my position.

Andrew: I don't want to belabor my point, but, as far as I understand him, Kelly does not think simplicity is a goal or intrinsic epistemic good, rather it is a means. As Malcolm Forster puts it, it's as simple as:

Method A is better than method B at achieving X.

Scientist Y wants X. Therefore, scientist Y ought to use method A.

Simplicity, according to Kelly's arguments, is a means to a minimal guarantee on a kind of convergence. We may take it or leave it. There are of course many other features of a method we are interested in, like speedy convergence, monotone convergence, greater inclusiveness for potential models, etc.

Thanks for the link; it greatly clarifies your point about unthinking acceptance of simplicity as a scientific goal. Now it has me thinking about explanations for this difference between the methods of social and physical scientists. Undoubtedly, the analytic tradition Kelly comes from has strongly favored physical science methods and goals. The modeling needs there are quite different than, say, when figuring out which factors influence voting patterns. I can see where casting a wide net for potential variables of influence is needed. If they are nulled, then they are culled. Something like that, anyway. I can see how this may motivate the use of something like Bayesian Model Averaging (BMA) instead of BIC or AIC. Then again, in this domain, I am way out of my depth and would gladly defer to the understanding of a practicing social scientist.

Finally, I must admit that there are many potentially relevant details in statistical model selection and the background information in the domains of interest that complicate any cartoon of the situation that I could possibly draw. Worse, I am being terribly imprecise by not defining important terms, referencing the appropriate proofs, and recognizing specific ways in which stronger assumptions may modify methodological recommendations. I know just enough to get myself into some trouble.

Best Regards.

John:

Thanks for the discussion.

As discussed in chapter 6 of Bayesian Data Analysis, I much prefer continuous model expansion to discrete model averaging (the so-called BMA).

Regarding your other point, my impression from glancing at Kelly's articles was that he is interested in "theory choice problems" and "finding the true answer to a scientific question when the answers are theories or models." Neither of these really fits the social-science research that I do most of the time.

That said, I can well believe that his ideas and methods could be useful. After all, I use the normal distribution all the time, and I don't think that's correct either.

Andrew: Thank you. This discussion has really motivated me to move some of your books (esp. Bayesian Data Analysis) off of my 'to read' list and into my hands!

John, thanks so much for the links and well-written explanations. Do you lurk at overcomingbias.com or lesswrong.com? Your comments would be greatly appreciated there.

Michael: You are too kind. I do lurk on both of those sites, but my extended engagement on either site would likely be as an interested critic: I am skeptical of many of the strong claims made by philosophical Bayesians, Libertarians, Transhumanists, Futurists and Futarchists, though I am fascinated by all of these positions.

Would a professional statistician care to comment on the possible relationship of the Razor to things like the Akaike Information Criterion (AIC), the corresponding Bayesian Information Criterion (BIC, which I believe is essentially equivalent), and notions like divergence measures to the problem of model selection? Sure, in the narrow sense there's Likelihood, but that may have problems.

Thanks.

Jan: I would not call myself a professional statistician, though I do work in Business Intelligence and other applied areas. However, I would refer you to Forster's "The New Science of Simplicity" from A. Zellner, H. A. Keuzenkamp, and M. McAleer's Simplicity, Inference and Modelling. Forster does an admirable job of summarizing AIC and BIC, while defending AIC in particular. They are not, by the way, equivalent: though they both discount as a function of free parameters, their scoring rules and their asymptotic behavior are quite distinct, for instance.

Thanks much, John. I'll have a careful look.

An additional thought/question, prompted by John Taylor's use of the words "asymptotic behavior": Suppose the number of observations is constrained to be two million or more. How does that change approaches or emphasis within curve fitting and model construction, given typical human standards for knowing and certainty? Are things just done the Same Old Way?
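On the AIC/BIC point raised above, a small sketch may help show why the two criteria are not equivalent and how sample size enters. It uses the standard textbook definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters, n the number of observations, and ln L the maximized log-likelihood. The function names and the sample sizes are illustrative choices, not anything taken from Forster or Kelly.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: penalty of 2 per free parameter."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: penalty of ln(n) per free parameter."""
    return k * math.log(n) - 2 * log_likelihood


# Cost of adding ONE free parameter, holding the log-likelihood fixed at 0.
# The extra parameter is "worth it" only if it raises the log-likelihood
# by more than half of this penalty.
for n in (8, 100, 2_000_000):
    print(f"n={n:>9}: AIC penalty = {aic(0.0, 1):.2f}, BIC penalty = {bic(0.0, 1, n):.2f}")
```

The AIC penalty per parameter stays at 2 regardless of n, while the BIC penalty grows as ln(n) and exceeds 2 once n is 8 or more. At two million observations it is roughly 14.5 per parameter, so the two criteria can disagree sharply about how much complexity a large data set licenses.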