Where do I learn about log_sum_exp, log1p, lccdf, and other numerical analysis tricks?

Richard McElreath inquires:

I was helping a colleague recently fix his MATLAB code by using log_sum_exp and log1m tricks. The natural question he had was, “where do you learn this stuff?”

I checked Numerical Recipes, but the statistical parts are actually pretty thin (at least in my 1994 edition).

Do you know of any books/papers that describe these techniques?

I’d love to hear this blog’s answers to these questions.

I replied that I learned numerical analysis “on the street” through HMM implementations. HMMs are also a good introduction to the kind of dynamic programming technique I used for that Poisson-binomial implementation we discussed (which we’ll build into Stan one of these days—it’ll be a fun project for someone). Then I picked up the rest through a hodge-podge of case-based learning.

“Numerical analysis” is the name of the field and of the textbooks where you’ll learn log_sum_exp and log1p and complementary cdfs, and learn how 0 is so very different from 1 (the smallest double-precision floating point value greater than zero is around 10^-308, whereas the largest double-precision value less than 1 is about 1 – 10^-16), which is rather relevant for statistical computation. You’ll also learn about catastrophic cancellation (which is what makes naive variance calculations so unstable) and things like the stable Welford algorithm for calculating variance, which also has the nice property of working as a streaming accumulator (i.e., it never needs to hold the full data set in memory). I don’t know which books are good, but there are lots of web sites and course materials you can try.
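
Here’s a minimal R sketch of a couple of those tricks (my own toy versions to show the idea, not Stan’s C++ implementations):

    # log_sum_exp: compute log(sum(exp(x))) without overflow or underflow
    # by pulling the max out before exponentiating.
    log_sum_exp <- function(x) {
      m <- max(x)
      m + log(sum(exp(x - m)))
    }
    log_sum_exp(c(-1000, -1001))  # -999.69; the naive log(sum(exp(...))) gives -Inf

    # log1p(x) = log(1 + x) without rounding tiny x away against the 1.
    log(1 + 1e-20)  # 0: the 1e-20 vanished when added to 1
    log1p(1e-20)    # 1e-20, correct to machine precision

    # Welford's streaming mean/variance: one pass, no stored data, and no
    # catastrophic cancellation from the naive sum(x^2) - n * mean(x)^2 formula.
    welford <- function(x) {
      n <- 0; mean <- 0; M2 <- 0
      for (xi in x) {
        n <- n + 1
        delta <- xi - mean
        mean <- mean + delta / n
        M2 <- M2 + delta * (xi - mean)
      }
      c(mean = mean, var = M2 / (n - 1))
    }
    welford(rnorm(1e5, 10, 2))  # mean near 10, variance near 4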

The more advanced versions of this are about matrices and how to maintain the stability of iterative algorithms. Things like pivoted LL^t (Cholesky) decompositions and how to do stable matrix division (factor and solve rather than forming an explicit inverse). A lot of that is also about memory: blocked algorithms that respect the cache so the work gets done efficiently. A decent matrix multiplier will be more than an order of magnitude faster than a naive implementation on large matrices.
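
As a toy R illustration of the matrix point (a sketch, nothing like an optimized BLAS/LAPACK routine), here is the factor-and-solve idea:

    set.seed(1)
    n <- 500
    X <- matrix(rnorm(n * n), n, n)
    A <- crossprod(X)      # symmetric positive-definite, somewhat ill-conditioned
    b <- rnorm(n)

    x_inv   <- solve(A) %*% b   # explicit inverse: extra work, extra rounding error
    x_solve <- solve(A, b)      # LU with partial pivoting, solved directly
    U <- chol(A)                # Cholesky: A = t(U) %*% U for SPD matrices
    x_chol  <- backsolve(U, forwardsolve(t(U), b))

    # Residuals: the explicit-inverse route is usually the worst of the three.
    max(abs(A %*% x_inv - b))
    max(abs(A %*% x_solve - b))
    max(abs(A %*% x_chol - b))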

“Algorithms and data structures” is the CS class where you learn about things like dynamic programming (e.g., how to calculate HMM likelihoods, fast Fourier transforms, and matrix multiplication ordering).
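
For concreteness, here’s a bare-bones R sketch of the dynamic programming behind an HMM likelihood (the forward algorithm), done on the log scale with the log_sum_exp helper from above (repeated so the snippet stands alone); the inputs at the end are made up purely for illustration:

    log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }

    # log_pi: log initial state probabilities (length K)
    # log_A:  log transition matrix, log_A[j, k] = log P(z_t = k | z_{t-1} = j)
    # log_B:  log emission densities, log_B[k, t] = log p(y_t | z_t = k)
    hmm_log_lik <- function(log_pi, log_A, log_B) {
      K <- length(log_pi)
      n_obs <- ncol(log_B)
      alpha <- log_pi + log_B[, 1]          # log P(z_1 = k, y_1)
      for (t in 2:n_obs) {
        alpha <- sapply(1:K, function(k) log_sum_exp(alpha + log_A[, k])) +
          log_B[, t]
      }
      log_sum_exp(alpha)                    # log p(y_1, ..., y_n)
    }

    # Tiny made-up example: two states, three observations.
    log_pi <- log(c(0.5, 0.5))
    log_A  <- log(matrix(c(0.9, 0.1,
                           0.2, 0.8), 2, 2, byrow = TRUE))
    log_B  <- log(matrix(c(0.3, 0.6, 0.1,
                           0.7, 0.2, 0.4), 2, 3, byrow = TRUE))
    hmm_log_lik(log_pi, log_A, log_B)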

Algorithms class won’t typically get into the low-level caching and branch prediction you need to know to build something like Stan efficiently. For that, you need to start diving into the compiler literature and the generated assembly and machine code. I can highly recommend Agner Fog’s overviews on C++ optimization—they’re free and cover most of what you need to know to start thinking about writing efficient C++ (or Fortran—the game’s a bit different with statically typed functional languages like ML).

The 1D integrator in Stan (it will probably land in 2.19—there are a few kinks to work out in Ben Bales’s math lib code) passes the integrand both the value x and its complement, the distance from x to the nearest boundary of integration. Ben Goodrich helped a lot, as usual, with these complicated numerical things. The result is an integrator with enough precision to integrate the beta distribution between 0 and 1 (the trick is the asymptote at 1).
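
The reason for passing the complement rather than recomputing it is that double precision can’t represent the gap once you subtract near 1. A toy R illustration (this is the idea, not the integrator’s actual interface):

    x <- 1 - 1e-18
    1 - x               # 0: subtracting from 1 threw away the 1e-18
    xc <- 1e-18         # the complement, carried along separately
    # In a beta(0.5, 0.5) kernel, the (1 - x)^(-0.5) factor near the boundary:
    (1 - x)^(0.5 - 1)   # Inf: the asymptote at 1 blows up numerically
    xc^(0.5 - 1)        # 1e9, finite and accurate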

Integration in general is another advanced numerical analysis field with tons of great references on error accumulation. Leimkuhler and Reich is the usual intro reference that’s specific to Hamiltonian systems; we use the leapfrog (Störmer-Verlet) integrator for NUTS, and that book has a nice analysis of it. We’re looking now into some implicit algorithms to deal with “stiff” systems, which force relatively simple explicit algorithms like Runge-Kutta to take step sizes so small as to be impractical; we already offer implicit solvers within Stan for dynamics modeling (the _bdf integrators). Hairer et al. is the more mathematically advanced reference for integrators. There are tons of great course notes and applied mathematics books out there for implementing Euler, implicit Euler, Runge-Kutta, Adams-Moulton, implicit midpoint, etc., all of which have different error and symplecticity properties that tie directly into implementing efficient Hamiltonian dynamics. Yi Zhang at Metrum is now working on improving our underlying algorithms and adding partial differential equation solvers. Now I have a whole new class of algorithms to learn.
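
For a sense of what the simplest of these looks like, here is a bare-bones R sketch of the leapfrog (Störmer-Verlet) scheme for a Hamiltonian H(q, p) = U(q) + p^2/2, applied to a harmonic oscillator; it shows the scheme itself, not Stan’s implementation:

    # One leapfrog trajectory: half step on momentum, full step on position,
    # half step on momentum, repeated n_steps times.
    leapfrog <- function(grad_U, q, p, eps, n_steps) {
      for (s in 1:n_steps) {
        p <- p - 0.5 * eps * grad_U(q)
        q <- q + eps * p
        p <- p - 0.5 * eps * grad_U(q)
      }
      list(q = q, p = p)
    }

    grad_U <- function(q) q   # gradient of U(q) = q^2 / 2 (harmonic oscillator)
    out <- leapfrog(grad_U, q = 1, p = 0, eps = 0.1, n_steps = 1000)
    # The Hamiltonian stays near its starting value of 0.5 instead of drifting,
    # which is the symplectic behavior that matters for HMC/NUTS.
    out$q^2 / 2 + out$p^2 / 2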

So much for my getting away from Columbia after I “learned statistics”. I should at least record the half-lecture I do on this topic for Andrew’s stats communication class (the other half of my class session focuses on writing API specs). I figure it’s communicating with the computer and communicating with users, but at least one student per year walks out in disgust at my stretching the topic so broadly as to include this computer-sciency stuff.

20 thoughts on “Where do I learn about log_sum_exp, log1p, lccdf, and other numerical analysis tricks?”

  1. I can’t really point to a specific book or class where I learned these types of things. Mainly I have picked them up in newsgroups (showing my age), mailing lists, and other forums, often by posting my code and having someone else point out that some other function gives better properties. So my big suggestion is: don’t be afraid to share what you have done, and when someone points out something you could do better, don’t get offended; learn from it.

    I also learn (or relearn) from working through other people’s examples. The Stan case studies are one great resource: I like working through them, and sometimes I realize that my approach would have been different, so I try to work out why our approaches differ (thought plus trial and error). If my way is better, then I have something I can share; if my way is worse, then I have an opportunity to learn; if they are about the same, then at least I have another potential tool in my toolbox to consider another time. (Me learning is the most common outcome, but starting off willing to accept any of the outcomes is my strategy.)

  2. I have found some of the tricks mentioned in the post by chance, in subforums and the like. Of course this only happens when you are lucky enough to use the “right” keywords. An example is the log-sum-exp trick, which I found by accident. Once I knew the name of the trick, I found out it was also described in Kevin Murphy’s book, which is in general highly recommended for various reasons (and a DeGroot Prize winner!) –> https://mitpress.mit.edu/books/machine-learning-1

  3. Apparently like some others, I learned about log-sum-exp and similar tricks by hacking on HMMs. These days I recommend reading John Cook’s blog posts at https://www.johndcook.com/blog/. And if you can find a copy, “The Engineering of Numerical Software,” by Webb Miller, is a good introduction to numeric integration and function approximation.

  4. I learned the basics so long ago I don’t remember where, probably my undergrad numerical algorithms class. Since then my learning has been driven by paranoia and testing: either I’ll stress-test the code and find places where it blows up, or I’ll be consciously looking for ways things could go wrong. Then, once I’ve identified possible problems, I end up doing a bunch of algebra (or having Mathematica do it) to see if I can find a better way to express the computation, look for fast-converging series expansions in the problematic region, or I’ll do a web search for solutions. Being familiar with Taylor series expansions is essential.

    I’m currently implementing a robust mean-parameterized truncated exponential distribution in Stan, and these numeric issues are popping up all over the place. One thing I’ve found that Stan needs (I had to implement it myself) is log1mexpn(x) = log(1 – exp(-x)), which has issues for *both* large and small positive values of x.
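
    For reference, one standard way to compute it stably (a sketch along the lines of Mächler’s log1mexp note, not necessarily the exact code I ended up with) is to switch between expm1 and log1p at x = log 2:

      log1mexpn <- function(x) {
        ifelse(x <= log(2),
               log(-expm1(-x)),   # small x: 1 - exp(-x) cancels badly, so use expm1
               log1p(-exp(-x)))   # large x: exp(-x) is tiny, so use log1p
      }
      log1mexpn(1e-10)   # about -23.03 = log(1e-10); the naive form loses several digits
      log1mexpn(50)      # about -1.9e-22; the naive form returns exactly 0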

      • Ok, for dumb people like me: if R thinks sqrt(2)*sqrt(2) – 2 is not equal to zero, what does that imply? What are the broader, practical ramifications? That is, when I’m driving down Route 66, which fauna should I be looking out for lest they run in front of my headlights at 2 AM and cause me to crash?

        Speakers should front-load, as Dale Carnegie advised, the “why you should listen to what I’m about to tell you,” rather than leaving it up to listeners to stay interested long enough to wait for the punch line. Just sayin’. Or maybe I missed it, in which case I blame myself for being dumb.

        • Numerical analysis is mostly about doing long computations and having the results come out without being nonsense. Consider, for example, solving a differential equation by breaking time up into milliseconds. After a few seconds of simulation, if each timestep had a numerical error biased high, the predicted result would be meaningless, dominated by accumulated error. Any single timestep’s calculation might have small enough error, but put thousands of them together and you lose.

        • > sqrt(2) * sqrt(2) - 2
          [1] 4.440892e-16
          

          It means that if you write code like 2 == sqrt(2) * sqrt(2) and expect the result to be true, you’re going to be very disappointed:

          > 2 == sqrt(2) * sqrt(2)
          [1] FALSE
          

          You’ll also be disappointed if you write code like X %*% t(X) and expect the result to be exactly symmetric. Or write something like 1 - 1e-20 and expect a result other than 1.

          What it really means is that whenever you do floating point, you have to worry about loss of precision and be very careful with comparison.
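
          In R, the practical upshot is to compare with a tolerance rather than with == (all.equal’s default tolerance is about 1.5e-8):

          > isTRUE(all.equal(2, sqrt(2) * sqrt(2)))
          [1] TRUE
          > abs(sqrt(2) * sqrt(2) - 2) < 1e-12
          [1] TRUE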

          As to front loading, the point of the post was to get an answer to Richard McElreath’s question. I wasn’t trying to explain floating point to the masses. I put the question in the title and then led with it. I don’t quite see how I could’ve front loaded it any more, but I’m always open to suggestions.

  5. I discovered Welford’s algorithm in Donald E. Knuth’s The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 3rd edn. (Boston: Addison-Wesley), p. 232.
