Flatten your abs with this new statistical approach to quadrature

Philipp Hennig, Michael Osborne, and Mark Girolami write:

We deliver a call to arms for probabilistic numerical methods: algorithms for numerical tasks, including linear algebra, integration, optimization and solving differential equations, that return uncertainties in their calculations. . . . We describe how several seminal classic numerical methods can be interpreted naturally as probabilistic inference. We then show that the probabilistic view suggests new algorithms that can flexibly be adapted to suit application specifics, while delivering improved empirical performance. We provide concrete illustrations of the benefits of probabilistic numeric algorithms on real scientific problems from astrometry and astronomical imaging, while highlighting open problems with these new algorithms. Finally, we describe how probabilistic numerical methods provide a coherent framework for identifying the uncertainty in calculations performed with a combination of numerical algorithms (e.g. both numerical optimisers and differential equation solvers), potentially allowing the diagnosis (and control) of error sources in computations.

“A call to arms,” huh? That sounds very British indeed. In any case, this all seems very important, the idea of treating computing problems as statistical problems. Worth a look.

40 thoughts on “Flatten your abs with this new statistical approach to quadrature”

  1. From the introduction:

    Probability theory does not rest on the notion of randomness (aleatory uncertainty), but extends to quantifying epistemic uncertainty, arising solely from missing information [they cite a book titled ‘The Geometry of Random Fields’].

    I find this statement compelling, though I don’t think that I understand it.

      • The right viewpoint makes a difference. If you think probabilities are modeling a mysterious force called “randomness”, and that this force needs to be present to get results, then you immediately face the fact that “random number” generators are completely deterministic. You wind up tying yourself in knots so badly that all progress stops. You’re trying to rely on intuition to make progress, but the intuition is severely crippled.

        With probabilities as models of uncertainties resulting from incomplete information, things look very different. “High probability” outcomes are the ones least sensitive to the unknowns. Those are the outcomes which happen almost no matter what the unknowns turn out to be.

        So not only is it immediately obvious how deterministic algorithms can be used with probabilities (the algorithm is generating one possibility and almost all possibilities lead to the ‘high probability’ outcome), but it’s clear how to generalize this far beyond anything “random variable” intuition can take you.
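
        A minimal sketch of that insensitivity idea, with a toy “unknown” small enough to enumerate completely (the 20-bit string and the 40–60% band are just illustrative choices):

        # Probability as "fraction of possibilities": no randomness involved anywhere.
        # The unknown is a 20-bit string; count exactly how many of the 2**20 possible
        # strings produce the outcome "between 40% and 60% of the bits are 1".
        from math import comb

        n = 20
        favorable = sum(comb(n, k) for k in range(8, 13))   # 8..12 ones out of 20
        print(f"fraction of possibilities in the 40-60% band: {favorable / 2**n:.3f}")
        # ~0.74 here; with n = 1000 bits the same band covers essentially every possibility,
        # which is what makes that outcome "high probability" no matter what the bits are.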

        • Anonymous wrote:

          If you think probabilities are modeling a mysterious force called “randomness”, and that this force needs to be present to get results, then you immediately face the fact that “random number” generators are completely deterministic. You wind up tying yourself in knots so badly that all progress stops. You’re trying to rely on intuition to make progress, but the intuition is severely crippled.

          ——————————-
          I have seen lots of simulations of wireless systems (and have written a few) that assume that signals are corrupted by additive white gaussian noise.

          The underlying physics models the noise as random. See https://en.wikipedia.org/wiki/Additive_white_Gaussian_noise.

          But, using a pseudo-random number generator often gives results that match measurements.

          Not always, though. Showing how old I am, I recall when some of my coworkers were getting weird results from such a simulation. They were using the library routine for Gaussian random numbers on an IBM 1130. That library routine added 12 variates that were uniform on [0,1] and then subtracted 6. By a CLT argument, the result was something that looked like a Gaussian—except in the tails. But the tails are what matter when you are simulating a communications system with a bit error rate of 10^-5 or so.
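
          Here is a minimal sketch of that routine’s trick in Python, not the actual 1130 library code (the seed and sample size are just illustrative); it shows how the sum-of-12-uniforms tail comes up short right around the region where a 10^-5 error rate lives:

          import random
          from math import erf, sqrt

          def approx_gauss(rng):
              # The old trick: 12 U(0,1) variates summed, minus 6. Mean 0, variance 1,
              # but the support is limited to [-6, 6] and the tails die off too fast.
              return sum(rng.random() for _ in range(12)) - 6.0

          rng = random.Random(1130)          # seed chosen only for reproducibility
          n, threshold = 2_000_000, 4.0

          tail_count = sum(approx_gauss(rng) > threshold for _ in range(n))
          true_tail = 0.5 * (1.0 - erf(threshold / sqrt(2.0)))   # P(Z > 4) for a real Gaussian

          print(f"12-uniform rule: P(X > {threshold}) ~ {tail_count / n:.1e}")
          print(f"true Gaussian:   P(Z > {threshold}) = {true_tail:.1e}")
          # The approximate tail comes out several times thinner than the true one,
          # which is exactly the region that matters at bit error rates around 1e-5.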

          Bob

        • These days, if you really want “random”, aren’t there pluggable gizmos that generate it for you via thermal noise, radioactive decay, or other physical modes? Didn’t some Intel chips have an onboard way of doing “true” randomness via thermal noise of the on-chip temperature sensor?

          Though I’m not sure whether even these systems become “pseudo-random” when they amplify their native randomness sources to satisfy random-bit-hungry consumer applications.

          i.e. Is the “true” randomness source providing only the seeds for the pseudo random generator or the entire random sequence?
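
          A minimal sketch of the two arrangements that question contrasts, in Python (here the OS entropy pool stands in for whatever hardware source feeds it, which is an assumption about the platform):

          import os
          import random

          # (a) The "true" source supplies only the seed; everything downstream is a
          #     deterministic pseudo-random expansion (Mersenne Twister in this case).
          seed = int.from_bytes(os.urandom(16), "big")
          prng = random.Random(seed)
          seeded_stream = [prng.random() for _ in range(5)]

          # (b) Every draw goes back to the OS entropy source (SystemRandom wraps
          #     os.urandom), so that source supplies the entire sequence, not just the seed.
          sysrand = random.SystemRandom()
          per_draw_stream = [sysrand.random() for _ in range(5)]

          print("seeded PRNG:        ", seeded_stream)
          print("per-draw OS entropy:", per_draw_stream)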

        • > But, using a pseudo-random number generator often gives results that match measurements.
          Probabilities as models are _just_ representations, and if they emulate the measurements then they are currently doing well enough; that is all that is required (see the quote below).

          I think the crucial issue here (as in some of the authors’ comments read so far) is that “care must be taken to ensure well-calibrated priors and models”, i.e. do the models emulate the true errors adequately?

          From Peirce, around 1880, in The Collected Papers of Charles Sanders Peirce, electronic edition, Volume 1: Principles of Philosophy.

          Transcribed from: Peirce, Charles Sanders (1839–1914), The Collected Papers of Charles Sanders Peirce, Cambridge, MA: Harvard University Press, 1931–1935, Volumes 1–6, edited by Charles Hartshorne and Paul Weiss.
          Section XVI.

          §16. Reasoning from Samples

          94. That this [reasoning from a sample taken at random to the whole lot sampled] does justify induction is a mathematical proposition beyond dispute. It has been objected that the sampling cannot be random in this sense. But this is an idea which flies far away from the plain facts. Thirty throws of a die constitute an approximately random sample of all the throws of that die; and that the randomness should be approximate is all that is required.

        • Here is a basic point of principle: once a specific sequence of numbers is given, it can’t possibly make a difference anymore how they were generated.

          Or, stated another way: if a specific sequence of numbers is useful for some purpose, then its usefulness has to be the result of the properties of those specific numbers, and not a result of how the sequence was created.

        • That’s simply not true. What the numbers represent matters to substantive interpretation, as do the assumptions about the data generating process.

        • Anonymous,

          To be a bit more specific. If you are dealing with numbers that represent nothing but themselves, your statement is acceptable. However, when the numbers represent features of the real world, then understanding the data generating process, the assumptions underlying it, and their consistency or inconsistency with the mathematical assumptions of the modeling is of essential importance to make substantive inferences from the results of the math.

        • There’s a scene in the Grapes of Wrath involving a slot machine with a deterministic schedule which the owner keeps track of, so he can be virtually assured of never having to pay out.

        • ‘once a specific sequence of numbers is given, it can’t possibly make a difference anymore how they were generated’

          This statement strikes me as a bit troubling (and non-Bayesian). If you have an extremely large sequence of numbers encapsulated in a function called cryptographically_secure_prg(), and then a helpful person over at the NSA says he made some improvements to it… what kinds of inferences do you make? You could try to evaluate every number in the new sequence and satisfy yourself re: its security, but (to put a lower bound on it) that will take you longer than your lifespan using the fastest available computer. You could try a lot of other approaches, but you may not detect any weakness. You could, alternatively, make the inference that, given the NSA’s involvement, the PRG is no longer bulletproof — that’s a Bayesian approach, oriented toward the generator rather than the sequence, to a very difficult but quite realistic problem.

          Or maybe the point here is more subtle, and you’re suggesting that P = NP? Either way, I find the statement above to be ok as a first cut, but ultimately I think it over-reaches.

          I also worry that someone would say “once a specific sequence of [numbered doors] is given, it can’t possibly make a difference anymore how they were generated”. Enter the Monty Hall problem.

        • >”I recall when some of my coworkers were getting weird results from such a simulation. They were using the library routine for gaussian random numbers on a IBM 1130.”

          Probably RANDU, supposedly one of the worst PRNGs ever devised:

          >”its very name RANDU is enough to bring dismay into the eyes and stomachs of many computer scientists!”

          >”One of us recalls producing a ‘random’ plot with only 11 planes, and being told by his computer center’s programming consultant that he had misused the random number generator: ‘We guarantee that each number is random individually, but we don’t guarantee that more than one of them is random.’ Figure that out.”
          https://en.wikiquote.org/wiki/RANDU
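
          For the curious, a minimal sketch of RANDU itself and the defect behind that “planes” anecdote (seed and sample size are arbitrary): every consecutive triple satisfies an exact linear relation modulo 2^31, which is why 3-D plots of its output collapse onto a handful of parallel planes.

          def randu(seed, n):
              # RANDU: x_{k+1} = 65539 * x_k mod 2**31, with an odd seed.
              x, out = seed, []
              for _ in range(n):
                  x = (65539 * x) % 2**31
                  out.append(x)
              return out

          xs = randu(1, 30_000)

          # Because 65539 = 2**16 + 3, every triple obeys
          #   x_{k+2} = 6*x_{k+1} - 9*x_k  (mod 2**31),
          # so consecutive triples fall on at most 15 parallel planes in the unit cube.
          violations = sum((xs[k + 2] - 6 * xs[k + 1] + 9 * xs[k]) % 2**31 != 0
                           for k in range(len(xs) - 2))
          print("triples violating the plane relation:", violations)   # prints 0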

      • While PT:LOS is perfect for “epistemic uncertainty arising solely from missing information”, it’s not clear that its framework is a good one in which to explore uncertainty in computation. Probability theory as an extension of logic presupposes infinite computational resources, in the sense that the logical consequences of any prior information are assumed to be available. Let me state that another way: in the Cox-Jaynes setting, if prior information X implies A, then Pr(A|X) must be equal to 1 — even if the computational resources necessary to prove that X => A exceed those available but are nonetheless finite.

        To get around this you can do what everyone does, which is to say, you can treat the computation being numerically approximated as a black box and model it as a random process. Strictly speaking, this violates the desideratum of the Cox-Jaynes system that states that relevant information must not be discarded. Treating the computation as uncertain is a kluge — it works, but the nature of the hack means that you’re not actually using probability theory to model *your* uncertainty given (all of) *your* prior information.

        This is a pretty important subject because uncertainty about the output of a computation is the reason we need to do any specific computation. A proper mathematical model of this kind of uncertainty doesn’t yet exist, but some work toward it is here.

        • But in which sense is “data collection” different from computation? If we want to know how many people would vote today for candidate X, we could claim that the information is out there – we just haven’t computed it yet (evaluated the voting function at every point and summed it). To perfectly compute the total (a bog-standard summation problem given the function values) is typically far too costly and error-prone. Hence, we apply a randomized algorithm (sampling and averaging) and get an estimate of the number. From the point of view of probability calculus and statistical inference, it is irrelevant whether the numbers are a result of asking a number out of a person or out of a call to a function. Some black boxes are artificial. Others are living and breathing.
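
          A minimal sketch of that framing (the “electorate” here is just a made-up 0/1 array standing in for the expensive-to-evaluate voting function; the sizes and proportions are arbitrary):

          # "Data collection" as randomized summation: estimate sum(f(i)) over the whole
          # population by sampling, instead of evaluating the voting function everywhere.
          import random

          random.seed(0)
          N = 1_000_000
          electorate = [1 if random.random() < 0.52 else 0 for _ in range(N)]  # stand-in for f(i)

          n = 2_000
          sample = [electorate[random.randrange(N)] for _ in range(n)]         # "ask" n people
          p_hat = sum(sample) / n
          se = (p_hat * (1 - p_hat) / n) ** 0.5

          print(f"estimated total for X: {p_hat * N:,.0f} +/- {2 * se * N:,.0f}")
          print(f"exact total (the 'perfect computation'): {sum(electorate):,}")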

          On a related note, there is a long history of using statistical methods in lieu of perfect computation. There is a textbook name for that: randomized algorithms. Sorting, searching, optimization, summation, decision problems (primality testing being a famous example): you name it, basically any computational task has examples of successful randomized algorithms applied to it (mostly frequentist, including MCMC, but some Bayesian). Sometimes randomization guarantees correct answers with high probability given a number of samples, sometimes bounded error, sometimes error bounded in probability or “good” expected computational cost (randomized quicksort being an example). All can be seen as standard statistical inference problems where the data happen to be generated by a computer instead of a natural system.

          Probabilistic numerics is another great step forward in this tradition.

        • “From the point of view of probability calculus and statistical inference, it is irrelevant whether the numbers are a result of asking a number out of a person or out of a call to a function.”

          I assume you are distinguishing between statistical and substantive inferences. Because in terms of drawing conclusions about the world, there is a very big difference whether the data generating process comes from the brain of a human, the brain of a rat, or a highly calibrated machine.

        • Sure, in the same sense that econometrics, biostats, etc. are different, but share common statistical inference and modeling principles.

          I just don’t see the point of mystifying something like the outcome of algorithms as if they require a new epistemology (they don’t). This leads to long discussions, like https://xianblog.wordpress.com/2012/10/12/estimating-a-constant-not-really/, which may be interesting and informative but complicate the problem unnecessarily. Of course, statistical modeling of algorithms still requires a lot of attention to detail and subject knowledge, just like econometrics, biostats and so on.

        • I don’t see it as a kluge at all. You basically never model *your* uncertainty, it’s always the uncertainty implied by some K, where K is not your brain’s thoughts, but rather a statement in a specific mathematical model/structure.

          But setting that aside, there are two pieces here. First, you have a range of uncertainty due to lack of information. Second, you perform a kind of sensitivity analysis over that “range”, and anything insensitive to the exact value within that “range” is called “highly probable”.

          The latter step makes perfect sense even if the “range” isn’t a real range of uncertainty, but is merely an enlarged range of possibilities due to laziness (or lack of computational resources), where in principle you know enough to narrow things down further. You could give it a different name than “uncertainty” if it makes you feel better psychologically, but the mathematics and final numbers will be exactly the same.

          How about calling it the range of “willful ignorance” rather than “uncertainty”.

        • They will only be exactly the same if there is no difference in the mathematics chosen, driven by the types of uncertainty or ignorance, etc. If we somehow believe that the same mathematics applied to tolerances in manufacturing is adequate for the ignorance generated by self-report assessments of often oversimplified constructs, then we are very likely not thinking deeply enough about the underlying data generating process.

        • “if there is no difference in the mathematics chosen driven by the types of uncertainty or ignorance”

          The “mathematics chosen” can be of infinite varieties. Or at least as different as the variety of experiences we wish to model. Uncertainties are always manipulated with the good old rules of probability theory though.

        • “exactly the same” referred to a single specific model where “uncertainty” might be more accurately called “willful ignorance”. You might get some psychological benefit from calling it by a more accurate name, but doing so doesn’t change the model.

        • One of the premises of the Cox-Jaynes construction is that if prior information X implies proposition A, then the plausibility of A given X is as high as possible. If we toss that aside and are still able to come up with useful results, we can’t ascribe our successes to our adherence to the Cox-Jaynes framework. Most likely there is some further elaboration of the framework that is able to handle bounded reasoning and reduces to Cox-Jaynes in the limit of unbounded computational resources.

        • Anonymous, I get that this is what you think is going on. I disagree: I see it more like being willing to assign the proposition “in chess, Black can force a draw” a probability other than zero or one, even though every method that we might want to use to come up with such a probability fully encodes all the rules of chess, and the truth/falsity of the proposition in question is a logical consequence of those rules.

        • Corey, I get that too. I disagree: if you want to assign a nonzero probability to something based on K, then you’re saying K can allow it to happen. If K doesn’t allow it, then that’s like starting a theorem off with “assume A and not A”.

          In your specific example, chess is not just the rules. It’s rules + circumstances. “Circumstances” could be something like “start from this position” or “only consider N moves deep”. You can keep the rules, but change the circumstances so that the combination no longer implies the proposition you’re interested in.

        • “Start from this position” is a circumstance of the chess game; “only consider N moves deep” is a bounded-reasoning constraint. My contention is precisely that bounded-reasoning constraints cannot be treated as prior information in the Cox-Jaynes framework.

          I guess we’ll have to agree to disagree.

        • Corey, you can call it whatever you want, but (1) it clearly can be done (and almost certainly is done) and (2) it doesn’t take us away from Cox-Jaynes. In fact, Jaynes discusses this (in a sense) in that mysterious Chapter 18, “The Ap distribution”, which I’ve never seen anyone talk about, but which is the motherlode of unexplored possibilities.

          Basically, to compute Pr{Black Wins|starting position, rules of chess} we want to make a list of every possible game and return p=fraction in which Black wins. Our inability to compute the list however introduces an additional uncertainty. It’s not however an uncertainty in the game that changes p directly. It’s an uncertainty that changes the Ap distribution. In a sense we can only estimate p rather than calculate it.

          Chapter 18 of Jaynes’s book doesn’t require an extension of the formalism. It explicitly uses the sum/product rule just as before. It’s dealing with uncertainty just as before, in the same ways as before (albeit in a deeper space).

          As has happened thousands of times before, the problem here isn’t with the basic equations of probability theory or their application. The problem is the statement “Pr{Black Wins|starting position, rules of the game}” contains some hidden conditions.

          P.S. Frequentists are famously stupid for confusing “frequency of occurrence” with probability. You’re making a similar but far more subtle mistake. The classical (and far more general) idea is to think of probability as the fraction of cases favorable to the total number of cases. But although this is often the case (just as probabilities are sometimes numerically equal to frequencies), it’s not really true in general.

          The fraction of games won by Black may be determined by the rules, but the probability of Black winning is a function of our state of knowledge. We typically don’t know every consequence of a set of axioms once the axioms are stated. If we did, the mathematics profession wouldn’t exist.
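
          A minimal sketch of the “estimate p rather than calculate it” point, with a flat prior over p in the spirit of the Ap distribution; the playout function is a placeholder (a biased coin) standing in for actually sampling legal games under the stated rules and circumstances:

          # Estimate the fraction of favorable cases when enumerating them is hopeless.
          import random

          random.seed(42)

          def one_random_game():
              # Placeholder: in the real problem this would play one legal game to
              # completion and report whether Black wins; here a biased coin stands in.
              return random.random() < 0.3

          n = 10_000
          wins = sum(one_random_game() for _ in range(n))

          # Uncertainty about p itself: with a flat prior, the posterior for p is
          # Beta(wins + 1, n - wins + 1).
          mean = (wins + 1) / (n + 2)
          sd = (mean * (1 - mean) / (n + 3)) ** 0.5
          print(f"estimated p = {mean:.3f} +/- {sd:.3f}  (from {n} playouts, not a full enumeration)")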

        • While I agree with you that bounded reasoning cannot be treated as prior information, I think it may be a different question as to whether bounded reasoning can be incorporated into a likelihood. It seems to me that the statement:

          “Given my knowledge K and a limited ability to compute, I place high probability on seeing data from portion X of the space of outcomes”

          is sort of like saying “given my knowledge K’ which is not quite as much knowledge as K”…

          so p(Data | K’) can stand in for p(Data | K)

          From a theoretical perspective, what we SHOULD do is compute with K, but from a practical perspective provided that K’ produces a p(Data | K’) which covers the high probability region of p(Data | K) it seems like it will work to give valid if slightly less informative inference.

        • Daniel, chess is a formal system, and it’s formal systems in general (and Turing-complete ones in particular) that are in the back of my mind. A formal system that is designed to do probabilistic reasoning about itself (that is, an AI that needs to reason about which propositions it is likely to be able to prove) is going to need to take bounded reasoning constraints into account as part of the formal system (and it’s going to somehow need to deal with the undefinability of truth in formal systems and also avoid ascribing probability one to its own consistency, or else it will Gödel itself to death). Relative to that, our current (fairly satisfactory) methods of dealing with uncertainty in computation via probabilistic numerics are a kluge.

        • Anonymous:

          We agree that “Uncertainties are always manipulated with the good old rules of probability theory”. That said, how we communicate about that information and about the types of uncertainties is important for current researchers, but likely more important for future researchers. When students are taught that a data generating process is equal to ‘signal’ plus ‘random noise’, we end up with discipline-specific researchers blissfully unaware that the ‘random noise’ is not actually random, and that a fruitful line of research might be to explore the aspects of the uncertainty that are causally related to the outcome and simultaneously directly distorting the signal.

        • Anonymous said: “When students are taught that a data generating process is equal to ‘signal’ plus ‘random noise’ we end up with discipline specific researchers blissfully unaware that the ‘random noise’ is not actually random”

          When students (or others) are told that something is ‘signal’ plus ‘random noise’, they may make all kinds of interpretations of what constitutes “signal”, what constitutes “noise”, and what constitutes “random”.

        • There are two issues as I see them:

          1) Inadequacy of the model
          2) Inadequacy of the computation

          By far, (1) is the more interesting one. If you make an ODE to model a physical process, you will have to provide a function that models the forcing, and you will have to provide some coefficients, and you will have to believe that the ODE is a complete model of the thing… and none of those are ever available in their entirety.

          So, if the forcing function is f(x) + g(x) where f(x) is “large” and g(x) is “small but unknown” and some coefficient is c(x) which is “nearly the constant C, but does change a bit” then it would be useful to have ODE solvers that could give you in one shot some kind of distribution over the possible outcomes given some kind of model of g(x) and c(x)… that might not be what is going on here, but I can imagine cases where what’s going on here approximates some such thing.
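
          A brute-force sketch of that wish, propagating a made-up model of g(x) and c(x) through a standard solver (probabilistic numerics aims to deliver something like this “in one shot”; this is just the naive outer loop, and every distribution and shape below is an assumption chosen for illustration):

          import numpy as np
          from scipy.integrate import solve_ivp

          rng = np.random.default_rng(0)

          def f(t):                        # the "large", known part of the forcing
              return np.sin(t)

          finals = []
          for _ in range(200):
              a = rng.normal(0.0, 0.05)    # amplitude of the "small but unknown" g(x)
              c = rng.normal(1.0, 0.02)    # coefficient "nearly the constant C"

              def rhs(t, y, a=a, c=c):
                  g = a * np.cos(3 * t)    # one made-up family of shapes for g(x)
                  return -c * y + f(t) + g

              sol = solve_ivp(rhs, (0.0, 10.0), [0.0], rtol=1e-8)
              finals.append(sol.y[0, -1])

          print(f"y(10): mean {np.mean(finals):.4f}, sd {np.std(finals):.4f} over 200 draws")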

  2. The discussion in the comments is great epistemologically, pedagogically & philosophically. But in practice, when you get down to the dirty, applied, implementation details, what exactly are these authors asking for?

    It is indeed a clear & loud call to arms, yes, but who’s the enemy? Is this a solution in search of a problem?
