The median often feels like an ad hoc calculation rather than an aspect of a statistical model, but it does in fact correspond to one. Last week, Risi and Pannagadatta at the Columbia machine learning journal club reminded me that the sum of absolute deviations of the data from a point (the L1 norm of the residuals) is minimized when that point is the median. The L1 norm and the Laplace distribution are closely related (elaborated below), so the median effectively corresponds to the maximum-likelihood estimate of the mu parameter of the Laplace distribution.
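
The claim can be checked numerically with a minimal sketch. The function names, the grid of candidate locations, and the fixed scale b = 1 are all assumptions made for illustration. Because the Laplace negative log-likelihood in mu is the L1 loss divided by b plus a constant, both criteria should pick the same point.

```python
import math
import statistics

data = [1, 2, 10]  # the toy data set used in the post

def sum_abs_dev(m, xs):
    """L1 loss: total absolute deviation of xs from the point m."""
    return sum(abs(x - m) for x in xs)

def laplace_neg_log_lik(mu, b, xs):
    """Negative log-likelihood of i.i.d. Laplace(mu, b) observations."""
    return len(xs) * math.log(2 * b) + sum_abs_dev(mu, xs) / b

# Candidate locations 0.00, 0.01, ..., 12.00 (grid chosen for illustration).
grid = [i / 100 for i in range(1201)]

# Point minimizing the L1 loss, and point maximizing the Laplace likelihood.
l1_minimizer = min(grid, key=lambda m: sum_abs_dev(m, data))
laplace_mle = min(grid, key=lambda m: laplace_neg_log_lik(m, 1.0, data))

print(l1_minimizer, laplace_mle, statistics.median(data))
```

The scale b only rescales and shifts the objective, so any positive value gives the same minimizer.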

For example, let’s assume a very flat prior on the Laplace parameters and the data [1,2,10]. The posterior distribution of mu, as obtained through WinBUGS, is shown in the histogram below. The MAP peak is at about 2, and we can see that it’s hard to estimate the median with so little data:
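
A rough grid approximation to this posterior, as a sketch rather than a reproduction of the WinBUGS run: the flat prior is assumed to be uniform over mu in [-5, 20] and over the scale b in (0, 50], and both ranges and step sizes are choices made here, not taken from the original analysis.

```python
import math

data = [1, 2, 10]

def log_lik(mu, b):
    """Log-likelihood of i.i.d. Laplace(mu, b) observations."""
    return -len(data) * math.log(2 * b) - sum(abs(x - mu) for x in data) / b

# Assumed flat-prior support (hypothetical ranges, for illustration only).
mus = [i / 20 for i in range(-100, 401)]  # mu from -5.00 to 20.00, step 0.05
bs = [j / 10 for j in range(1, 501)]      # b from 0.1 to 50.0, step 0.1

# With flat priors the joint posterior is proportional to the likelihood;
# summing over b approximates the marginal posterior of mu.
post = [sum(math.exp(log_lik(mu, b)) for b in bs) for mu in mus]
total = sum(post)
post = [p / total for p in post]

mu_map = mus[post.index(max(post))]

# Posterior mass within 0.5 of the mode, as a crude measure of concentration.
mass_near_mode = sum(p for mu, p in zip(mus, post) if abs(mu - mu_map) <= 0.5)

print("MAP of mu:", mu_map)
print("posterior mass within 0.5 of the MAP:", round(mass_near_mode, 3))
```

Under these assumptions the mode sits at the sample median, while only a small share of the posterior mass lies close to it, which is the sense in which three points say little about the median.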

I am sure that someone has thought of this before, but it’s easier to reinvent it than to track it down. Let me now elaborate on the connection between norms and distributions.

One of the hottest topics of the past few years has been the realization that L1 norms often work better than L2 norms on cross-validated benchmarks. One of the underlying reasons is greater resistance to outliers. There is an intriguing connection between norms and distributions that should be better known than it is: the link between popular norms and popular distributions can be established through Jaynes’ maximum entropy principle.

Inder Jeet Taneja’s book draft has a nice survey of the results. If you fix the upper and lower boundary and maximize the entropy, you get the uniform distribution. If you fix the mean and the expected L2 norm (d^2) between the mean and the distribution, maximizing the entropy gives the Gaussian. If you instead fix the expected L1 norm (|d|) between the mean and the distribution, maximizing the entropy gives the Laplace (also referred to as the Double Exponential). Moreover, the log(1+d^2) norm yields the Cauchy distribution, a special case of the standard heavy-tailed Student distribution.
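
The common pattern behind these results can be sketched as a standard Lagrange-multiplier argument. The symbols below (p, phi, lambda, c, Z, sigma, b) are notation introduced here for illustration, not taken from Taneja's draft.

```latex
% Maximize the entropy subject to normalization and one moment constraint:
\max_{p}\; -\int p(x)\,\log p(x)\,dx
\quad\text{s.t.}\quad
\int p(x)\,dx = 1,
\qquad
\int p(x)\,\phi(x)\,dx = c .

% Setting the functional derivative of the Lagrangian to zero gives
p(x) = \frac{1}{Z(\lambda)}\,\exp\bigl(-\lambda\,\phi(x)\bigr),
\qquad
Z(\lambda) = \int \exp\bigl(-\lambda\,\phi(x)\bigr)\,dx .

% Writing d = x - \mu, the three choices of \phi discussed above give
\phi = d^{2}:\quad
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
       \exp\!\Bigl(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\Bigr),
\qquad \lambda = \frac{1}{2\sigma^{2}}

\phi = |d|:\quad
p(x) = \frac{1}{2b}\,\exp\!\Bigl(-\frac{|x-\mu|}{b}\Bigr),
\qquad \lambda = \frac{1}{b}

\phi = \log(1+d^{2}):\quad
p(x) \propto \bigl(1+(x-\mu)^{2}\bigr)^{-\lambda},
\qquad \text{Cauchy when } \lambda = 1 .
```

In each case the multiplier lambda is fixed by the value c of the constraint, for example c = sigma^2 for the Gaussian and c = b for the Laplace.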

In that sense, when people play with loss functions, they are essentially also playing with the probability distributions entailed by those loss functions. When they use L1 or L2 regularization for regression, they are picking Laplace or Gaussian priors, respectively, for the parameters. The popularity of L1 stems primarily from the realization that the Laplace prior does better than the Gaussian prior on many benchmarks. I wonder if the log(1+d^2) norms will generate as many papers as L1, or whether statisticians will migrate from Student to Laplace.
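
The correspondence between penalties and priors follows from writing out the MAP estimate. This is a sketch in notation chosen here (beta for the regression parameters, tau and b for the prior scales), assuming independent zero-centered priors on each coefficient.

```latex
% MAP estimation maximizes log-likelihood plus log-prior:
\hat\beta_{\mathrm{MAP}}
  = \arg\max_{\beta}\;\bigl[\log p(y \mid \beta) + \log p(\beta)\bigr]
  = \arg\min_{\beta}\;\bigl[-\log p(y \mid \beta) - \log p(\beta)\bigr].

% Gaussian prior, \beta_j \sim \mathcal{N}(0,\tau^{2}): an L2 penalty
-\log p(\beta) = \frac{1}{2\tau^{2}} \sum_{j} \beta_j^{2} + \text{const}

% Laplace prior, \beta_j \sim \mathrm{Laplace}(0,b): an L1 penalty
-\log p(\beta) = \frac{1}{b} \sum_{j} |\beta_j| + \text{const}
```

The regularization strength therefore plays the role of an inverse prior scale: a tighter prior (smaller tau or b) means a heavier penalty.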

You might also find an older perspective of some interest (the way Gauss came up with the Normal distribution in the first place):

@article{Keynes,
  author  = {Keynes, J. M.},
  title   = {The principal averages and the laws of error which lead to them},
  journal = {Journal of the Royal Statistical Society},
  volume  = {74},
  year    = {1911},
  pages   = {322--331}
}

Keith O'Rourke

The L1-norm has some interesting new applications in statistical approaches to sparse function approximation, essentially due to concentration of measure. See compressive sampling.

Keith – many thanks for this reference! It feels good to have reinvented something Keynes himself thought useful enough to publish.

Ambitwistor – indeed, we have used Laplace distribution in lossless image compression too – a discrete approximation to it was used in the LOCO algorithm which was taken to Mars by the rovers Spirit and Opportunity.

Aleks,

Regarding the connections between maximum entropy, Bayes, and Laplacian and other distributions, see my little article from 1992.

Regarding the idea of statisticians moving away from the normal distribution, I'd recommend the t family rather than the Laplace. The t just makes more sense to me, although it (as well as the Laplace) can give multimodal posterior densities; I still have to think more about this. Crudely speaking, multimodal densities correspond to discrete aspects of the model, which in this case corresponds to the mixture-of-normal-data-and-outliers interpretation of the t distribution.

Aleks,

As pointed out by Ambitwistor, the L1 norm has been shown to be equivalent to L0 in the context of sparse approximation. In other words, L1 is successful because it does something else (L0). In light of this, why do you think the log(1+d^2) norms would have any success?

Igor.

Igor, we have done a number of experiments and t-norms yield higher predictive accuracy than either L1 or L2. In that sense, t-distribution is a better prior than Laplace or Normal. BTW, nice blog!

Aleks,

Any preprints/publications on your experiments?

Igor.

Andrew,

So this 41-page article + review was really a misunderstanding?

Your last sentence states:

"Knowledge that an image is nearly black is strong prior information that is not included in the basic maximum entropy estimate."

In short, if the authors had used a Laplace prior for the ME method, this method would have looked like L1 regularization, or am I missing something?

Igor.

Igor,

I wouldn't say that the 41-page article was completely a misunderstanding, but I do think they put in a lot of work to show something using classical methods that could be seen much more directly by considering the regularizations as Bayesian priors. I was a little disappointed that in their rejoinder they didn't seem to seriously grapple with my comment.

That said, I might not have had whatever insight was presented in my comment had I not read their article first. Sometimes someone has to do things the hard way first, then the simple synthesis comes later.

Andrew,

I was being facetious. But considering the date (1992), the fact that they begin to talk about wavelets at the end is by itself history of science in the making. I am glad you pointed it out; it was a very nice read.

Thanks,

Igor.