Random matrices in the news

From 2010:

Mark Buchanan wrote a cover article for the New Scientist on random matrices, a heretofore obscure area of probability theory that his headline writer characterizes as “the deep law that shapes our reality.”

It’s interesting stuff, and he gets into some statistical applications at the end, so I’ll give you my take on it.

But first, some background.

About two hundred years ago, the mathematician/physicist Laplace discovered what is now called the central limit theorem, which is that, under certain conditions, the average of a large number of small random variables has an approximate normal (bell-shaped) distribution. A bit over 100 years ago, social scientists such as Galton applied this theorem to all sorts of biological and social phenomena. The central limit theorem, in its generality, is also important for the information it indirectly conveys when it fails.
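
Here is a quick simulation sketch of what the theorem says (my own illustration, with an arbitrary skewed input distribution): averages of many small random quantities come out looking normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Average many small, skewed (exponential) random variables, many times over.
n_vars, n_reps = 1000, 10_000
averages = rng.exponential(scale=1.0, size=(n_reps, n_vars)).mean(axis=1)

# The averages behave like a normal distribution: negligible skewness,
# and about 95% of them fall within 1.96 standard deviations of the mean.
z = (averages - averages.mean()) / averages.std()
print("skewness of the averages:", round(np.mean(z**3), 3))
print("share within 1.96 sd:   ", round(np.mean(np.abs(z) < 1.96), 3))
```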

For example, the distribution of the heights of adult men or women is nicely bell-shaped, but the distribution of the heights of all adults has a different, more spread-out distribution. This is because your height is the sum of many small factors and one large factor—your sex. The conditions of the theorem are that no single factor (or small number of factors) should be important on its own. For another example, it has long been observed that incomes do not follow a bell-shaped curve, even on the logarithmic scale. Nor do sizes of cities and many other social phenomena. These “power-law curves,” which don’t fit the central limit theorem, have motivated social scientists such as Herbert Simon to come up with processes more complicated than simple averaging (for example, models in which the rich get richer).
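
And a companion sketch of the failure mode (an invented height model, not real data): add one large binary factor to the many small ones and the pooled distribution is wider and flatter than a single bell curve.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Invented numbers: many small factors plus one large factor (sex).
sex = rng.integers(0, 2, size=n)
small_factors = rng.normal(0.0, 1.0, size=(n, 50)).sum(axis=1)   # sd of about 7 cm
height_cm = np.where(sex == 1, 175.0, 162.0) + small_factors

# Within each sex the curve is bell-shaped; pooled, it is much more spread out.
print("sd within one sex:", round(height_cm[sex == 1].std(), 1))
print("sd pooled:        ", round(height_cm.std(), 1))
```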

The central limit theorem is an example of an attractor—a mathematical model that appears as a limit as sample size gets large. The key feature of an attractor is that it destroys information. Think of it as being like a funnel: all sorts of things can come in, but a single thing—the bell-shaped curve—comes out. (Or, for other models, such as that used to describe the distribution of incomes, the attractor might be a power-law distribution.) The beauty of an attractor is that, if you believe the model, it can be used to explain an observed pattern without needing to know the details of its components. Thus, for example, we can see that the heights of men or of women have bell-shaped distributions, without knowing the details of the many small genetic and environmental influences on height.

Now to random matrices.

A random matrix is an array of numbers, where each number is drawn from some specified probability distribution. You can compute the eigenvalues of a square matrix—that’s a set of numbers summarizing the structure of the matrix—and they will have a probability distribution that is induced by the probability distribution of the individual elements of the matrix. Over the past few decades, mathematicians such as Alan Edelman have performed computer simulations and proved theorems deriving the distribution of the eigenvalues of a random matrix, as the dimension of the matrix becomes large.
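
Here is a small sketch of that computation (my own, using a symmetric matrix with independent standard-normal entries): the eigenvalues fill a predictable range, roughly plus or minus twice the square root of the dimension, as the semicircle law says they should.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# A random symmetric ("Wigner-type") matrix with independent normal entries.
a = rng.normal(size=(n, n))
m = (a + a.T) / np.sqrt(2)

eigs = np.linalg.eigvalsh(m)   # real eigenvalues of the symmetric matrix

# The semicircle law puts essentially all eigenvalues in about [-2*sqrt(n), 2*sqrt(n)].
print("smallest, largest eigenvalue:", round(eigs.min(), 1), round(eigs.max(), 1))
print("predicted edges:             ", round(-2 * np.sqrt(n), 1), round(2 * np.sqrt(n), 1))
```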

It appears that the eigenvalue distribution is an attractor. That is, for a broad range of different input models (distributions of the random matrices), you get the same output—the same eigenvalue distribution—as the dimension of the matrix becomes large. This is interesting, and it’s hard to prove. (At least, it seemed hard to prove the last time I looked at it, about 20 years ago, and I’m sure that it’s even harder to make advances in the field today!)
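
A sketch of that attractor behavior (again my own illustration): fill one matrix with Gaussian entries and another with plus-or-minus-one coin flips, and the two eigenvalue distributions come out essentially the same.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

def scaled_spectrum(entries):
    """Eigenvalues of the symmetrized matrix, rescaled so the bulk sits in [-2, 2]."""
    m = (entries + entries.T) / np.sqrt(2)
    return np.linalg.eigvalsh(m) / np.sqrt(n)

gaussian  = scaled_spectrum(rng.normal(size=(n, n)))
coin_flip = scaled_spectrum(rng.choice([-1.0, 1.0], size=(n, n)))

# Different inputs, same output: the two histograms are nearly identical.
bins = np.linspace(-2.5, 2.5, 21)
h_gauss, _ = np.histogram(gaussian,  bins=bins, density=True)
h_coin,  _ = np.histogram(coin_flip, bins=bins, density=True)
print("largest difference between the two eigenvalue histograms:",
      round(np.abs(h_gauss - h_coin).max(), 3))
```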

Now, to return to the news article. If the eigenvalue distribution is an attractor, this means that a lot of physical and social phenomena which can be modeled by eigenvalues (including, apparently, quantum energy levels and some properties of statistical tests) might have a common structure. Just as, at a similar level, we see the normal distribution and related functions in all sorts of unusual places.

Consider this quote from Buchanan’s article:

Recently, for example, physicist Ferdinand Kuemmeth and colleagues at Harvard University used it to predict the energy levels of electrons in the gold nanoparticles they had constructed. Traditional theories suggest that such energy levels should be influenced by a bewildering range of factors, including the precise shape and size of the nanoparticle and the relative position of the atoms, which is considered to be more or less random. Nevertheless, Kuemmeth’s team found that random matrix theory described the measured levels very accurately.

That’s what an attractor is all about: different inputs, same output.

Thus, I don’t quite understand this quote:

Random matrix theory has got mathematicians like Percy Deift of New York University imagining that there might be more general patterns there too. “This kind of thinking isn’t common in mathematics,” he notes. “Mathematicians tend to think that each of their problems has its own special, distinguishing features. But in recent years we have begun to see that problems from diverse areas, often with no discernible connections, all behave in a very similar way.”

This doesn’t seem like such a surprise to me—it seems very much in the tradition of mathematical modeling. But maybe there’s something I’m missing here.

Finally, Buchanan turns to social science:

An economist may sift through hundreds of data sets looking for something to explain changes in inflation – perhaps oil futures, interest rates or industrial inventories. Businesses such as Amazon.com rely on similar techniques to spot patterns in buyer behaviour and help direct their advertising.

While random matrix theory suggests that this is a promising approach, it also points to hidden dangers. As more and more complex data is collected, the number of variables being studied grows, and the number of apparent correlations between them grows even faster. With enough variables to test, it becomes almost certain that you will detect correlations that look significant, even if they aren’t. . . . even if these variables are all fluctuating randomly, the largest observed correlation will be large enough to seem significant.

This is well known. The new idea is that mathematical theory might enable the distribution of these correlations to be understood for a general range of cases. That’s interesting but doesn’t alter the basic statistical ideas.
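
To make the point concrete, here is a little simulation (mine, with pure noise standing in for economic indicators): even when nothing is related to anything, the largest correlation you find grows with the number of variables you screen, and with 100 observations anything above about 0.2 already looks “significant.”

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs = 100                       # say, 100 periods of data
outcome = rng.normal(size=n_obs)  # e.g. changes in inflation; here pure noise

for n_predictors in (10, 100, 1000):
    predictors = rng.normal(size=(n_predictors, n_obs))   # unrelated "indicators"
    corrs = np.corrcoef(np.vstack([outcome, predictors]))[0, 1:]
    print(n_predictors, "predictors -> largest |correlation|:",
          round(np.abs(corrs).max(), 2))
```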

Beyond this, I think there’s a flaw in the idea that statistics (or econometrics) proceeds by blindly looking at the correlations among all variables. In my experience, it makes more sense to fit a hierarchical model, using structure in the economic indexes rather than just throwing them all in as predictors. We are in fact studying the properties of hierarchical models when the number of cases and variables becomes large, and it’s a hard problem. Maybe the ideas from random matrix theory will be relevant here too.

Buchanan writes:

In recent years, some economists have begun to express doubts over predictions made from huge volumes of data, but they are in the minority. Most embrace the idea that more measurements mean better predictive abilities. That might be an illusion, and random matrix theory could be the tool to separate what is real and what is not.

I’m with most economists here: I think that, on average, more measurements do mean better predictive abilities! Maybe not if you are only allowed to look at correlations and least-squares regressions, but if you can model with more structure, then, yes, more information should be better.

Update (2014):

I wonder how things are going in this field. Is the stable distribution of eigenvalues taking over the world?

9 thoughts on “Random matrices in the news”

  1. Random matrix theory is popular in financial applications to the extent that PCA is popular and it provides guidance on how to choose the number of factors during PCA. I’m not sure if it’s any more popular than it was back in 2010. I always tended to view it as a quirk resulting from the nature of calculating eigenvalues from random matrices more than some deep natural phenomenon.

    RMT can be tricky with covariance matrices, especially where some of the variables have vastly different variances. I recall it making the most sense to use with correlation matrices, since the most useful results (like Marcenko-Pastur) depend on all the values on the diagonal of the matrix being the same. For instance, if you apply it to the covariance matrix of equity returns and the change in the 3-month government yield, then a factor exposed almost solely to the yield will likely have the lowest variance. Thus, when selecting the top N factors, you’ll miss it. You could argue that RMT can also select factors with the smallest variance, but I’m not sure how often most people do that. If you did it with the correlation matrix instead, the yield factor would be near the top, most likely. Also, if you set the variances all the same for the smallest eigenvalues, then the yield in the above example might have a variance that’s way too high. It’s almost like you need to simulate from the distribution of the covariance matrix and estimate your own distribution of the eigenvalues for each covariance matrix. (A rough sketch of the correlation-matrix version appears at the end of this comment.)

    I also never understood why there isn’t more emphasis on an AIC/BIC approach to selecting the number of factors.

    As a result, I’ve sort of become dissatisfied with the PCA/RMT approach. I’ve been learning more about hierarchical methods (and Stan) partly in response to it.
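
    A minimal sketch of the correlation-matrix idea mentioned above (made-up returns with a few genuine common factors; an illustration, not a recommendation): keep only the eigenvalues of the sample correlation matrix that exceed the Marcenko-Pastur upper edge, (1 + sqrt(p/n))^2, which is roughly the largest eigenvalue pure noise would produce.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs, n_assets = 500, 50

# Made-up returns: three genuine common factors plus idiosyncratic noise.
factors = rng.normal(size=(n_obs, 3))
loadings = rng.normal(size=(3, n_assets))
returns = factors @ loadings + rng.normal(scale=2.0, size=(n_obs, n_assets))

corr = np.corrcoef(returns, rowvar=False)
eigs = np.linalg.eigvalsh(corr)[::-1]          # eigenvalues, largest first

# Marcenko-Pastur upper edge for the correlation matrix of pure noise.
q = n_assets / n_obs
mp_upper = (1 + np.sqrt(q)) ** 2
print("Marcenko-Pastur upper edge:", round(mp_upper, 2))
print("eigenvalues above the edge (factors to keep):", int(np.sum(eigs > mp_upper)))
```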

  2. The math/physics community uses the name “KPZ Universality class” to discuss these ideas. The distribution of the largest eigenvalue is typically a Tracy-Widom distribution. I’m not very informed about real world applications of these ideas, but a great deal of progress is being made right now by probabilists on showing broader and broader families of stochastic processes belong to this class. Some notable names include Alexei Borodin, Ivan Corwin, Martin Hairer and Jeremy Quastel. Expect a Fields medal to be awarded for work on these problems in the near future (possibly this year).

    I’d like to expand a little on the Deift quote. The analogy I like to draw with the Central Limit Theorem is the following: the CLT says any “reasonable” sum of n random variables has Gaussian fluctuations of order n^(1/2). This was first proved for independent fair coin flips by DeMoivre in the early 1700’s and was extended to some other distributions during the 19th century; a precise definition of “reasonable” wasn’t really hammered out in full detail until the 1920’s. KPZ universality says any “reasonable” interlacing process with parameter n has Tracy-Widom fluctuations of order n^(1/3). As it stands, we have some special cases. The first was the largest eigenvalue of certain random matrices (1980’s), later extended to all “reasonable” random matrix models. In the CLT analogy, think of these as sums of Bernoulli random variables. We are now proving for the first time that other random processes (with no clear underlying random matrix interpretation) exhibit the same behavior. The first, TASEP, was proved in 2009. We still don’t know what should be considered a reasonable interlacing process, but we are making tons of headway in that regard. (A small simulation of the random-matrix case appears at the end of this comment.)

    • Sorry, I forgot to get to the point of why Deift found this so surprising. Right now, one of the largest classes of models that are provably KPZ are coming out of random processes produced using standard objects in representation theory. The proofs rely heavily on extracting asymptotics from functions that are typically thought of as purely algebraic. These are the sort of connections mathematicians live for!
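
    A small simulation of the random-matrix case (my own sketch, not carefully tuned): center the largest eigenvalue of a symmetric Gaussian matrix at its typical value 2*sqrt(n) and magnify the fluctuations by n^(1/6); the result should look roughly Tracy-Widom (mean near -1.2, sd near 1.3, skewed to the left) rather than Gaussian around zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_reps = 400, 200

scaled = np.empty(n_reps)
for r in range(n_reps):
    a = rng.normal(size=(n, n))
    m = (a + a.T) / np.sqrt(2)                # GOE-type symmetric matrix
    lam_max = np.linalg.eigvalsh(m)[-1]       # largest eigenvalue
    # Typical value is 2*sqrt(n); the fluctuations live on the n^(-1/6) scale.
    scaled[r] = (lam_max - 2 * np.sqrt(n)) * n ** (1 / 6)

# For large n these should be close to the Tracy-Widom (GOE) distribution.
print("mean of scaled fluctuations:", round(scaled.mean(), 2))
print("sd of scaled fluctuations:  ", round(scaled.std(), 2))
```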

  3. “The central limit theorem is an example of an attractor—a mathematical model that appears as a limit as sample size gets large. The key feature of an attractor is that it destroys information.”

    That gets it exactly right. The fact that many inputs yield the same final result makes the result robust to some extent. It also means that you’ll see the same result in many physically unrelated systems. Inevitably someone notices that and starts claiming they’ve found the “secret to unraveling the inner mysteries of the universe”. These kinds of fads seem to crop up at regular intervals. Mandelbrot’s power-laws-are-everywhere stuff is a similar example.

    Of course the exact opposite is true. Since lots of different physical realities or inputs lead to the same “attractor”, observing that “attractor” tells you very little about the physical reality. Far from unlocking the secrets of the universe, it obscures them.

  4. This is probably way off, but is there a relationship between “the number of variables being studied grows, and the number of apparent correlations between them grows even faster” and Stein’s paradox?

    • Stein’s paradox might be most clearly understood as doing regression with only partial information: forgetting Galton’s lesson of regression to the mean (i.e. the two regression lines, y~x != x~y), and then implicitly noticing that the partial information only becomes adequate to improve estimates when you have three or more estimates, without realizing why.
      (But the math needed to see this was really hard. A small numerical sketch of the shrinkage effect follows this thread.)

      • I’m intrigued by the connection between Stein’s paradox and regression to the mean. Can you point me to a citation or writeup somewhere?
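
    Not the requested reference, but here is a small simulation of the shrinkage effect under discussion (made-up means, shrinking toward zero for simplicity): with three or more means, the James-Stein estimator has lower total squared error than the raw estimates.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n_reps = 10, 20_000

# True means and noisy observations x_i ~ Normal(theta_i, 1).
theta = rng.normal(scale=2.0, size=p)
x = theta + rng.normal(size=(n_reps, p))

# James-Stein: shrink each vector of raw estimates toward zero.
shrink = 1 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x

print("total squared error, raw estimates:",
      round(np.mean(np.sum((x - theta) ** 2, axis=1)), 2))
print("total squared error, James-Stein:  ",
      round(np.mean(np.sum((js - theta) ** 2, axis=1)), 2))
```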

  5. Wrt the “hard problem” of predictive modeling using massive numbers of candidate predictors, as is commonly found in applied situations mining RDBMS warehouse information combined with unstructured text and image data, here are a few thoughts:

    There’s a consensus kluge solution emerging on the practitioner side, and today’s kluge is tomorrow’s standard practice. This one involves a straightforward extension of Breiman’s random forests to, e.g., logistic regression, with millions of snapshot mini-regressions run in parallel in a matter of a few hours, yielding an approximate ensemble answer (a toy sketch of the idea appears at the end of this comment). This kluge is similar to Cleveland or Wickham’s D&R and Scott’s approach with MCMC using URLs as the unit of analysis, but with different assumptions insofar as none of them invoke Breiman, though that may just be nitpicking.

    An important consideration here is the existence of a unique solution, if one is even possible beyond the theoretical, when modeling in the presence of hundreds of thousands, if not millions, of candidate predictors. Some consider unique solutions to be chimerae in the presence of more than a few dozen predictors, and they’re probably not wrong. In this instance, theory as expressed by explicit hypotheses would be inherent in the data gathered. A careful statement of those hypotheses, as would be done in a dissertation, would seem sort of silly and feckless.

    Other important concerns result from the consequences of collating, linking and sampling massive, yet still incomplete, data from multiple sources, a common practice these days. Simple random sampling would almost certainly be fatal to any inherent dependence structure — variance destroying — just as would be true if it were used by the above kluge. Bill Winkler, Jeremy Wu and Xiao-li Meng have all written insightful blogs and papers that address some of these questions. We can safely conclude from them that massive amounts of incomplete information do not, in and of themselves, lead to a corresponding improvement in the accuracy of any resulting answer. If anything, it’s the opposite, especially with the relatively high proportion of inaccurate, missing or bad data that’s out there. In point of fact, anyone asserting that massive quantities of Web information amount to a census (and many have) is not only naive but profoundly misguided. Just consider the FTC’s recent audit of credit bureau data, which found that 20% of their roughly 220 million records for individual consumers had mistakes of some magnitude, while more than 5% of consumers had mistakes significant enough to impact their credit ratings. It’s even worse on the digital side, where, e.g., fraudulent click-through rates are in the 30% range, many consumers block their cookies, and data aggregators ignore important considerations in rolling data up, to name just a few issues.

    Yet another concern lies in the instructive example of price elasticity models, where there is a very real need to ensure the independence of the price parameter(s). In this instance, any dependence of pricing on other model parameters will degrade the model’s ability to detect the effects and impact of manipulating price alone. This, too, is not a problem with a simple solution, and the kluge outlined above will almost certainly violate it.

    A final concern in this laundry list involves a fundamental premise of statistical modeling in its drive towards Occam’s Razor-like simplicity and reduction down to a relative handful of parameters in a “final” predictive model. Practitioners are finding that there are times when clients will demand — repeat, demand — leveraging ALL of the available information, even if that means millions of predictors. This can throw Occam out the window.

    As a variable reduction and selection method, PCA doesn’t work well, if at all, with massive numbers of inputs. While it’s possible to imagine MIC-based component analyses, that seems far off on the horizon. Nor does the LASSO seem suitable to challenges of this magnitude. One possible answer is to simply retain the parameters from the snapshot regressions, passing later data iterations through them for scoring. This is not as easy as it sounds.

    Another answer is associative memory machine learning techniques rooted in astatistical theories such as Kolmogorov Complexity. This approach is currently employed by Paul Hoffman at Saffron Technology in the Bay Area. Saffron just announced a new round of funding that will enable additional research and expansion —

    ww2.saffrontech.com/press_seriesb

    Yet a third approach would be to employ SVMs, or support vector machines, as the predictive engine. SVMs do model on all of the predictors thrown into their hopper, so to speak. While hierarchical SVMs seem possible, is anyone working on this? In addition, currently available for-profit SVM packages such as KXEN don’t support more than 10,000 or so predictors. But it is possible to imagine the existence of SVMs in the wild that would.

    Then there is Haldane’s proposal, developed in his address entitled “The Dog and the Frisbee”, which is a challenge to BIS-type (Bank for International Settlements) modeling of the regulatory requirements stipulating the capital reserves of large investment banks — a topic that’s seen much scrutiny in the wake of the recent Downturn. His view is that the regulatory environment and business landscape today for any financial entity is of such unimaginably great complexity that statistical models simply are not able to capture all of the relevant effects. Given that premise, he prescribes summing and leveraging simple indicators as a valid proxy for the more complex algorithmic approaches that the Basel III rules prescribe. Of course, his theorizing is qualitative in nature, as he is quite short on statistical or mathematical motivations and lemmas for doing so. This does not invalidate testing the approach.

    What would be great in any review and consideration of the many questions and issues in predictive modeling with massive numbers of candidate predictors is the possibility of head-to-head comparisons of these broad modeling avenues and approaches — as well as many other possible approaches and solutions that might surface but have not been discussed here. Admittedly, it is hard, if not impossible, to imagine a single research group taking on a challenge of this magnitude. No one would want to do it alone.

    Where it might work is in the context of a collaborative Kaggle- or CrowdAnalytics-like competition on the same data with a blind holdout. By developing carefully spelled-out a priori boundaries containing multiple goals, both theoretical and applied — e.g., retaining any hierarchical structures and dependencies, total elapsed time to solution, stability and efficiency of the algorithm, complete transparency wrt the methodology employed, and so on — a clear set of assumptions, constraints and goals could be specified. But specifying this set of criteria alone could occupy a panel of cross-discipline experts for quite some time.

    The simplest metric for success is creating a blind, statistical “holdout” of distinct calibration and validation data sets. Another success metric could be prospective — say, accuracy 6 months ahead — if the dependent variable(s) can be related to known and publicly available sets of indexes or behavior, an approach used by Didier Sornette in his predictions of financial bubbles. The reason for this is that by creating multiple outcomes and goals, the chances of the collaboration becoming just a “shoot-out” and matter of gaming an algorithm to max out a single KPI such as holdout accuracy are minimized. This is the single most important omission and weakness of the comparable Kaggle-like competitions to date. The lack of transparency in the resulting algorithms is another.

    I’m willing to bet Hoffman would be up for it…and entities like KXEN or their proxies in the wild could probably be coaxed in as well. A Kaggle-like approach would eliminate the need for criteria for admission…but almost by definition there couldn’t be too many contenders, since the constraints in terms of software and hardware alone seem formidable — that is, if petabytes of data with millions of predictors form the base analytic dataset. It could bring together the machine learning, mathematical, statistical and quantitative social science communities, among others, which now seem to be in direct competition if recent KDD and LinkedIn posts, an ASA statement and some of Xiao-li Meng’s comments are any indication. And if someone puts up a prize…

    I will confess to ignorance as to whether such a collaboration already exists in some form out there (Kaggles are not yet up to these standards and don’t count). If it does, then everything I’m writing is redundant.

    Regardless, this seems like a doable, interesting and useful initiative that might address many of the burning questions on the frontiers of today’s applied practice. And let’s face it, applied practice currently seems populated mostly with W. James’ “blooming, buzzing confusion” — multiple contending, conflicting assertions or, worse, the many uber-confident smoke-and-mirrors claims deliberately aimed at driving growth by leveraging uncertainty and innumeracy. If some of this arbitrage in ignorance can be pulled back, then the day might be considered won…at least for the time being.
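
    Returning to the kluge sketched in the first paragraph of this comment, here is a toy version (tiny synthetic data standing in for the millions-of-predictors setting; an illustration only): fit many small logistic regressions, each on a random slice of rows and a random handful of columns, and average their predicted probabilities into an ensemble.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n_obs, n_pred = 2000, 200            # stand-in for a far larger problem

# Synthetic data in which only the first five columns actually matter.
X = rng.normal(size=(n_obs, n_pred))
logit = X[:, :5] @ np.array([1.0, -1.0, 0.5, 0.5, -0.5])
y = (rng.random(n_obs) < 1 / (1 + np.exp(-logit))).astype(int)

# "Snapshot" mini-regressions: random rows, random columns, fit independently
# (and therefore trivially in parallel), then averaged into an ensemble.
n_models, rows_per, cols_per = 200, 500, 20
prob = np.zeros(n_obs)
for _ in range(n_models):
    rows = rng.choice(n_obs, size=rows_per, replace=False)
    cols = rng.choice(n_pred, size=cols_per, replace=False)
    model = LogisticRegression(max_iter=1000).fit(X[np.ix_(rows, cols)], y[rows])
    prob += model.predict_proba(X[:, cols])[:, 1]
prob /= n_models

# In-sample accuracy, just to show the mechanics (a real test would hold data out).
print("ensemble accuracy:", round(np.mean((prob > 0.5) == y), 3))
```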
