Median often feels like an ad hoc calculation, not like an aspect of a statistical model. But in fact, median actually corresponds to a model. Last week, Risi and Pannagadatta at the Columbia machine learning journal club reminded me that L1 norm of the data is minimized at the point of the median. But the L1 norm and Laplace distribution are closely related (to be elaborated later), and the median effectively corresponds to the mu parameter of the Laplace distribution.
For example, let’s assume a very flat prior on the Laplace parameters and data [1,2,10]. The posterior distribution of mu, as obtained through WinBUGS is shown in the histogram below. The MAP peak is at about 2, and we can see that it’s hard to estimate the median with so little data:
I am sure that someone has thought of this before, but it’s easier to reinvent it than to track it down. Let me now elaborate on the connection between norms and distributions.
In the past few years one of the hottest topics is the realization that the L1 norms often work better on cross-validated benchmarks than L2 norms. One of the underlying reasons is greater resistance against outliers. There is an intriguing connection between norms and distributions that should be better known than it is. The connection between popular norms and popular distributions can be established through Jaynes’ maximum entropy principle.
Inder Jeet Taneja’s book draft has a nice survey of the results: if you fix the upper and lower boundary, and maximize entropy, you’ll get the uniform distribution. If you fix the mean and the expected L2 norm (d^2) between the mean and the distribution, maximizing the entropy you’ll get the Gaussian. If you fix the expected L1 norm (|d|) between the mean and the distribution, maximizing the entropy you’ll get the Laplace (also referred to as Double Exponential). Moreover, log(1+d^2) norm will yield the Cauchy distribution – a special case of the standard heavy-tailed Student distribution.
In that sense, when people play with loss functions, they are essentially also playing with probability distributions that are entailed by the loss functions. When they use L1 or L2 regularization for regression, they are picking either Gaussian or Laplace priors for the parameters. The reason for the popularity has been primarily the realization that Laplace prior is better than Gaussian prior on many benchmarks. I wonder if the log(1+d^2) norms will generate as many papers as L1, or whether statisticians will migrate from Student to Laplace.