Using black-box machine learning predictions as inputs to a Bayesian analysis

Following up on this discussion [Designing an animal-like brain: black-box “deep learning algorithms” to solve problems, with an (approximately) Bayesian “consciousness” or “executive functioning organ” that attempts to make sense of all these inferences], Mike Betancourt writes:

I’m not sure AI (or machine learning) + Bayesian wrapper would address the points raised in the paper. In particular, one of the big concepts that they’re pushing is that the brain builds generative/causal models of the world (they do a lot based on simple physics models) and then uses those models to make predictions outside the scope of the data it has previously seen. True out-of-sample performance is still a big problem in AI (they’re trying to make the training data big enough to make “out of sample” an irrelevant concept, but that’ll never really happen), and these kinds of latent generative/causal models would go a long way toward improving that. Adding a Bayesian wrapper could identify limitations of an AI, but I don’t see how it could move toward this kind of latent generative/causal construction.

If you wanted to incorporate these AI algorithms into a Bayesian framework then I think it’s much more effective to treat the algorithms as further steps in data reduction. For example, train some neural net, treat the outputs of the net as the actual measurement, and then add the trained neural net to your likelihood. This is my advice when people want to/have to use machine learning algorithms but also want to quantify systematic uncertainties.
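
For concreteness, a toy version of that two-stage setup might look like the following. This is only a sketch: the data, the gradient-boosting stand-in for the neural net, and the PyMC model are all invented for illustration.

    import numpy as np
    import pymc as pm
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 5))                      # raw inputs (made up)
    y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=2000)

    # Stage 1: black-box fit; its prediction z = f(x) is the "reduced measurement."
    ml = GradientBoostingRegressor().fit(X[:1500], y[:1500])
    z = ml.predict(X[1500:])
    y_obs = y[1500:]

    # Stage 2: a simple parametric likelihood with z as the predictor, so the
    # calibration of the black box (alpha, beta, sigma) is itself estimated.
    with pm.Model():
        alpha = pm.Normal("alpha", 0, 1)
        beta = pm.Normal("beta", 1, 1)
        sigma = pm.HalfNormal("sigma", 1)
        pm.Normal("y", mu=alpha + beta * z, sigma=sigma, observed=y_obs)
        idata = pm.sample(1000, tune=1000)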

My response: yes, I give that advice too, and I’ve used this method in consulting problems. Recently we had a pleasant example in which we started by using the output from the so-called machine learning as a predictor, then we fit a parametric model to the machine-learning fit, and now we’re transitioning toward modeling the raw data. Some interesting general lessons here, I think. In particular, machine-learning-type methods tend to be crap at extrapolation and can have weird flat behavior near the edge of the data. So in this case when we went to the parametric model, we excluded some of the machine-learning predictions in the bad zone as they were messing us up.
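
To see the flat-edge problem and the bad-zone exclusion in a toy setting (a random forest standing in for the black box; the cutoffs are arbitrary):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=500)
    y = 2.0 * x + rng.normal(scale=1.0, size=500)       # true relation is linear

    rf = RandomForestRegressor().fit(x[:, None], y)
    x_grid = np.linspace(-2, 12, 141)
    yhat = rf.predict(x_grid[:, None])                   # goes flat below 0 and above 10

    # Parametric stage: keep only predictions comfortably inside the training range.
    ok = (x_grid > 1) & (x_grid < 9)
    slope, intercept = np.polyfit(x_grid[ok], yhat[ok], deg=1)
    print(slope, intercept)    # near 2 and 0; fitting on the full grid drags both away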

Betancourt adds:

It could also be spun into a neurological narrative. As in our sensory organs and lower brain functions operate as AI, reducing raw inputs into more abstract/informative features from which the brain can then go all Bayesian and build the generative/causal models advocated in the paper.

32 thoughts on “Using black-box machine learning predictions as inputs to a Bayesian analysis”

  1. Andrew said:

    “In particular, machine-learning-type methods tend to be crap at extrapolation and can have weird flat behavior near the edge of the data.”

    Which is not really a surprise, since each of the members of the hidden layer(s) amounts to a parameter whose value has to be tuned. IOW, it’s a setup for overfitting if there’s enough data. And in a high-dimensional situation, there won’t be much data near the edges, so the fits will probably be poor there.

    I wonder if, in real animals, the brain has a method for reducing the number of elements in the hidden layers to reduce these issues for particular cases.

  2. Not exactly the same thing, but this reminds me of Platt’s (1999) approach to providing probabilistic output for support vector machines: don’t use the hard classification; instead take the underlying decision value and enter it into a logistic regression.
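
    For concreteness, that recipe looks roughly like this (a scikit-learn sketch with made-up data; SVC also has a built-in probability option, but the point here is the two-stage structure):

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(2)
        X = rng.normal(size=(500, 4))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)

        X_fit, y_fit = X[:300], y[:300]          # fit the SVM here
        X_cal, y_cal = X[300:], y[300:]          # held-out data for the calibration step

        svm = SVC(kernel="rbf").fit(X_fit, y_fit)
        d = svm.decision_function(X_cal)          # real-valued margin, not the 0/1 label
        platt = LogisticRegression().fit(d.reshape(-1, 1), y_cal)
        probs = platt.predict_proba(d.reshape(-1, 1))[:, 1]   # calibrated probabilities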

    • That was common back in the 90s. It was the advice I got at Bell Labs, so it’s what I put into one of my first papers on stats:

      http://www.speech.cs.cmu.edu/Courses/11716/2003/chu-carroll-carpenter-1998.dialogue_management.pdf

      We really couldn’t see the forest for the trees: rather than just saying we’re conditioning high-dimensional predictors with a rank-reduced SVD and then using logistic regression, there’s a pile of tortured math and even more roundabout justification.

      The real contribution of that paper was disambiguation. Jennifer went on to become one of the managers of the Watson project at IBM (the cool Jeopardy playing natural language system, not the later branding effort), which chained similar approaches together.

      • So weird how small the world is. I first became aware of Jennifer Chu-Carroll because I followed the math blog of her husband Mark, but I didn’t realize what sort of work she’d done. Later on I discovered that both she and my childhood friend David Buchanan had left the Watson project at the same time to work at Elemental Cognition (a start-up spun off of Bridgewater Associates) — and now I discover she’s worked with you on NLP.

  3. I have been wondering how to quantify uncertainty in predictions from machine learning algorithms.

    Are you saying that you got some prediction values, y_hat, from a machine learning model, y ~ x_1 + … + x_n?

    And that you then fit a parametric model, y ~ y_hat, to quantify the uncertainty?

  4. I find this combination of techniques exceedingly useful when I have a lot of data on an indicator that informs me about the outcome of interest but relatively sparse data about the outcome itself. A prime example of this is estimating construction costs, where you have 10x-20x as many estimates as you do actual project costs.

      • You need to separate cost estimation from bid preparation. They are fundamentally different tasks.

        When estimating construction costs, your goal should always be to estimate as closely as possible the future costs for the project as specified (as they will occur, so that financing can be properly incorporated into the bid).

        Once the estimating component is complete you can look at your bid and decide how you would like to prepare it for submission. This is the point where you look at the documents and ask yourself some questions. For example, did the architect/engineer forget to specify a component of the work that will need to be completed in order to facilitate construction as detailed? If so, do we want to reduce our base bid by the projected excess profit on that extra so we are more likely to get the job? The answer to this last question will depend on your relationship with the design team and the owner; if the designer is usually agreeable to admitting to the owner when they misspecify something, then you have a higher likelihood of that extra coming through. Or you could have a situation where the designer is not agreeable to admitting to faults on the face of it, but is okay with certifying additional work (not related to misspecification) where you can recoup those costs. Alternatively, if you think there will be little interest from the bidders for whatever reason, you may forget this idea entirely and add even more to the base bid!

        If there are penalty clauses for work not billed, as is sometimes the case for government projects relying on grant money, you may want to ask yourself how you can front-load the contract so that the billings are met without violating the covenants of the CCDC contract.

        These are only some of the questions that you’ll review prior to bid submission. Very few of them have anything to do with actual estimation of construction costs as specified. They are more like applied game theory.

        • The statement “more like applied game theory” resonates seriously with me. But my impression is that the part where you “estimate as closely as possible the future costs” comes in is such a small part of the whole game-theoretic component that it becomes highly subservient. Even the people who ought to know better are under such heavy political-wrangling pressure that their estimates reflect the politics more than reality. Also, anything written is discoverable in lawsuits, so you can never produce a document that is accurate if it reveals something counter to your game-theoretic goal of having every document tell the story you’ll need to tell when you get sued, etc.

          Also, though a Bayesian statistical model with real-world uncertainties would be a great thing for good decision making, it seriously threatens an industry of special-purpose people who carry out essentially by-hand algorithmic computations that are considered “best practices” in the “cost estimating field,” and so it doesn’t get accepted as a method, because doing so would put the people who need to accept it out of business.

          The same thing goes for construction scheduling. The PERT method was developed in the 1950s or so for the Polaris missile project. A well-thought-out PERT chart for the construction of an office building would probably have 8 to 20 nodes. A typical chart I’ve seen has 2230 nodes and includes things like “install water faucets on 3rd floor, 8 hours.” The reason for the PERT chart is to prove to some jury that you did your job the way you were told to do it, and so it must be the fault of the general contractor / plumbing subcontractor / framing subcontractor / site manager / architect or whoever it is that you’re pointing the finger at. It has nothing to do with projecting an actual completion date with uncertainty.

          I may be biased by the fact that my experience is in forensics, where the lawsuit is already a foregone conclusion, but I think the number of construction lawsuits divided by the number of construction projects is pretty close to 1.

        • Daniel:

          The cost estimate is the origin of all other bidding tactics and decision making. That doesn’t necessarily mean that the estimate is developed through a rigorous process prior to moving on to decision making. Most contractors / estimators operate on poorly calibrated mental models, even if they hide behind a layer of false objectivity because their post-hoc estimate calculations indicate that when you add some numbers together you get a total that’s remarkably similar to where they thought they’d be before they even started inputting numbers into the Excel spreadsheet. See the end of this post for more on mental models and when they can be useful.

          PERT does have some uses, but it is most useful for identifying the critical path or pairs of critical paths (DuPont came out with an actual “critical path method,” but both identify these paths). This ends up being very useful at the site-logistics level (once the bid is awarded) and not as much prior to that (during estimating). The reason is that unless there is an unusual aspect to the site logistics that would directly affect the cost, the use of averages of past prices/labour productivity/etc. to develop a cost estimate will take those site-logistics issues into account, and thus they are not really a concern of the estimator. It’s what allows estimators to have no clue about how to run a project and yet furnish fairly accurate estimates! It’s not nearly optimal in my view, but it works for some.

          To touch on your point about statistical models replacing estimators…absolutely. But there are some caveats here. Only in textbooks are there many types of estimating and “best practices” for construction cost estimating. In practice, there is only one method that enjoys widespread use: the prediction of future costs based on the average labour productivity and average costs observed for similar work in the past. Estimators essentially estimate the mean of each line item and unwittingly employ the law of large numbers to have pluses and minuses average out over a given job or a given portfolio of jobs (this actually works out surprisingly well except for when things become highly correlated, which happens when things start to go way south). When put in those terms, it’s hard not to think about replacing an estimating office with a machine learning algorithm.

          But that’s not really the caveat. The caveat is that calibrated human estimators are and will forever remain a vital part of the process. If an estimator is serious about their craft then they need to calibrate themselves. This provides a check on the model if you are using model-based estimates: if your calibrated estimate agrees with the modelling estimate, you are probably on good footing; if it doesn’t, then you can investigate why and revise accordingly. The estimator should give the calibrated estimate (usually in the form of a 90% credible interval) before the computer model returns a value, so as not to anchor themselves on the modelling estimate.

          If that’s not really clear it may be because the idea of a calibrated estimator is something that isn’t widely adopted in my industry. I first came across the idea in How To Measure Anything by Doug Hubbard for which I am forever in his debt.

          The process is as follows: the estimator takes a series of tests that require them to furnish a 90% credible interval for individual questions of the form “what is the circumference of the Earth?” If they are properly calibrated, then 9 times out of 10 the answer will be in the interval. Intervals will be wider or narrower depending on the question being asked and depending on the estimator. For example, if a question asked me to estimate the number of minor hockey players in Toronto, I may have a narrower 90% CI than, say, someone from Egypt who doesn’t know the prevalence of hockey in Toronto. But if we’re both calibrated then 90% of the time the answer will fall within our respective intervals. We do this with many questions and tests. If at the end we find that our intervals aren’t wide enough (e.g. only 4/10 answers fall within my “90%” credible intervals) then we are overconfident and need to re-test and recalibrate until we hit that 9/10 mark, give or take.

          Once this is achieved we’re calibrated and can apply the same principle to estimating construction costs. Before inputting a number into an Excel spreadsheet or running the ridge regression in R, take a look at the bid documents and jot down your 90% credible interval for the costs.

          Rinse and repeat with 100s or 1000s of estimates/projects. As you get more confident in estimating the particular projects your firm goes after, your credible interval will narrow (well, it will if you actually review the CIs you’ve recorded in the past). In my experience, if you actually commit to this idea you will eventually be able to go to the site with the bid documents, pick up the dirt, and give a 90% CI that’s within 10-20% of the final cost estimate. This is deeply reassuring (yes, there are issues of anchoring, but I would rather be calibrated and anchored than anchored at the end to an estimate I have no checks on…you can also mitigate anchoring by having one calibrated estimator produce the CI and another estimate the costs without knowledge of the CI).

          You can use this procedure as a check on the computer modeling estimates that we’re transitioning towards; it helps identify when they get a little out of whack. This is why a calibrated estimator is important. But you really only need one of them if your models / data are sufficient.
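
          The bookkeeping for that coverage check is trivial, by the way; something like this (all numbers invented):

              import numpy as np

              lower  = np.array([0.9, 1.4, 0.5, 2.1, 0.8])   # recorded 90% CI lower bounds ($M)
              upper  = np.array([1.3, 2.0, 0.8, 3.0, 1.2])   # recorded 90% CI upper bounds
              actual = np.array([1.1, 2.2, 0.7, 2.4, 1.5])   # final costs once the jobs closed out

              coverage = np.mean((actual >= lower) & (actual <= upper))
              print(f"coverage: {coverage:.0%}")   # 60% here: well under 90%, so the intervals are too narrow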

          We have to be mindful, as we are quickly hijacking this thread! I own a construction business and I could write an encyclopedic text on how my industry and related industries are living in the dark ages (designers, code agencies, etc. are no better and are often far worse). I’d be happy to continue it elsewhere though; email me if you want to talk more about the legal / political side. I think hashing out the estimating stuff here is cool as it’s a statistical topic after all.

        • We can talk politics and construction industry specifics and all that jazz over email (sounds interesting), but from a statistical perspective I’d like to mention the following here because I think it’s right on topic. The issue is whether you want to rethink the idea of “calibration” in cost estimation. To see why, consider a project that will ultimately cost 1.25 billion dollars to build; we know this because I just made up the problem, but the people doing the estimates don’t.

          Estimator C is Calibrated: he estimates the cost of the project at 500M to 2.5B, and he knows that to get calibrated estimates he’ll need this extremely wide interval, because he’s tested himself on many estimation problems and has an intuitive model of the frequency of errors that requires this wide estimate, given that this particular project has lots of tricky issues.

          Estimator B is Bayesian: he works through a calculation involving a Bayesian model for costs, using historical data and a description of the construction project’s components, and comes up with 1.1 to 1.2 billion dollars. Furthermore, in 100 other projects he never once gets the actual value inside his interval. But, on the other hand, the actual value is NEVER more than 5% of the mean value outside the interval, and it tends to fall 50-50 either above or below.

          Which estimator’s results are “better”?

          A much better way to think about the goodness of estimates is this: over time, how much loss (or gain) do we accumulate making decisions using the estimates? If you’re using Bayesian decision theory with real-world cost calculations, this accumulated gain is far more important to you than how often your estimated interval did or did not cover the correct value (the probability of financial ruin would also be of interest).

          For example, suppose that we will ultimately make money on our airport or whatever so long as construction costs can be kept below 1.7 billion. The coverage estimate includes values up to 2.5 billion. How do we decide whether to do the project or not? If we aren’t willing to take the risk of a possible 0.8 billion dollar loss based on the 2.5 billion upper end… we don’t do the project. But the project would have netted us 1.7 - 1.25 = 0.45 billion. If we use the Bayesian estimate, we may ultimately decide that we’re going to make, say, 0.55 billion, and wind up “only” making 0.45, but we’re $450 million ahead of the company with the calibrated estimate that decided to pass.
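
          Spelling out that toy arithmetic (all of these numbers are invented, as noted above):

              breakeven = 1.7    # $B: the project makes money if costs stay below this
              true_cost = 1.25   # $B: what it will actually cost (unknown to both estimators)

              calibrated_upper = 2.5   # C's interval tops out above breakeven, so C's firm passes
              bayes_mean = 1.15        # B's tight (if mis-covering) estimate, so B's firm builds

              gain_if_built = breakeven - true_cost   # 0.45
              gain_c_firm = 0.0                       # passed on the project
              gain_b_firm = gain_if_built             # 0.45, despite the interval never covering the truth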

          If I reinterpret your “calibration” goal as a desire to have good priors then I think it makes much more sense. Yes, if your Bayesian model’s posterior puts your estimate way outside the prior range… you should look carefully at your model. But I still think that ultimately calibration is the wrong way to think about what makes priors “good.” Something like “small accumulated decision loss” is a better way to think about it.

        • Daniel, you are distracting me from actual estimating. But since you’ve suggested I rethink my way of doing things I am compelled to respond! In all seriousness, I do have bids to get to but I’ll leave a short comment for now.

          I can tell right now we agree but need a little discussion to get there.

          You are contrasting two estimators that are calibrated in distinctly different ways: one based on a credible interval and one on the mean (which is harder to do in practice, so good luck with that). Note how you have framed this question: estimator C provides a wide range, which would suggest the project is not typical of something bid previously (as you note). If it so happens that the point estimate that is subsequently churned out from the usual ritualistic crank of estimating average costs falls within this credible interval, that is a good sign; but given the range, the agreement between the interval and the point estimate doesn’t provide great confidence in the point estimate. It corroborates it to some extent, but further investigation is needed. In the case of an atypical project the calibrated CI should be viewed more as a lower/upper-bound methodology than as a defining interval with which agreement provides supreme confidence. That said, it is infinitely more useful than saying “I have no freaking clue” when asked the question “what will this thing cost?”, for reasons that don’t necessarily have to do with the bid preparation itself (bonding requirements, etc.; I don’t have time at the moment to get into those).

          On the other hand, if the calibrated estimator has a narrow range pre-estimate and the point estimate falls into that range then you’re probably pretty happy with that result and won’t lose too much sleep over it.

          What range you feel comfortable with defining as narrow enough is up to you as the head estimator / owner. Presumably this will be based on a loss function taking into account current projects being bid, capital of the firm, and projects in hand.

        • “Even the people who ought to know better receive such heavy political wrangling pressures that their estimates reflect the politics more than reality”. I worked on one project where my boss estimated the cost (very accurately), but management insisted that it was too high and that it needed to be cut more or less in half. This project was a big one for our small company. So the project was duly bid at the lower cost.

          When we were done, it turned out that the original estimate had been correct within 10%. But our team got dinged during reviews for years because we kept coming in over the contract cost.

        • Yeah, that’s the kind of stuff I was referring to, and it happens in industries outside construction as well. Feynman described it at NASA in terms of something like: “the boss wants to prove that the Shuttle missions have less than a 1e-6 chance of blowing up, so figure out what numbers you have to put into the estimates to come up with 1e-6”

          and the result was that some bolt they used thousands of had to have a 1e-13 chance of failure to achieve the desired result, so somewhere there was a document in which it said probability of failure of bolt = 0.0000000000001.

          (Note: I’m making these numbers up, but that was the gist of it.)

          Of course when he asked the engineers they’d say things like “well about 1 in 10 of our unmanned rockets fail, but we put a bunch more effort into the manned missions so it’s probably around 1 in 100” which is more or less correct.

        • Tom:

          Been there. Sucks.

          It’s an even better feeling when you bid a project at what you think the true cost is + reasonable markup and still come 40-50% under the next lowest bidder! Those are sleepless nights.

          Daniel:

          What Tom describes is relatively rare among individual firms. But there are a lot of firms so on any given bid, especially if it’s public, the low bidder is likely to be the one that permitted a similar decision analysis into the final bid.

          This is problematic because your cost estimates aren’t ever (except sometimes internally) compared to true costs; they are compared against what other bids are coming in at. So it is entirely possible for you to be extremely accurate in estimating construction costs but be told your numbers are too high (or dinged during reviews for future work, as Tom mentions). That places downwards forces on base bid; where once you get the contract, you need to figure out how in god’s green earth you’re going to make money. This feeds into the modus operandi that I mentioned in my email to you.

          But you still need to have somewhat reasonable / reliable estimates of true costs. You start here and work downwards.

        • Right, as I say, having experience on the forensic back end, what I saw is that lots of construction projects are done by the low bidder whose estimates are unrelated to anything realistic… So you’re right, the other firms probably had a better handle on things. However, it does seem to me that there’s a strong incentive to get the bid using a ridiculously low cost estimate and then change-order the project until you make money. A little game theory that disincentivized this heavily would, I think, go a long way toward improving public works bidding, for example. I do know that there are various attempts to do this by offering bonuses for coming in under time and under budget, etc. But then, there are also the SF Bay Bridge project (over budget by something like 5 to 25x the base estimate, depending on what you consider the base estimate) and so forth to keep gaming the system alive and healthy.

        • I mentioned in my second post that it’s an entirely rational strategy to reduce the base bid by the amount of excess profit you believe will come through by way of change orders. This isn’t “gaming the system.” It’s what is more often than not required to get a job, if only to make your cost of capital.

          Note that not all jobs have the same probability of extras coming through. A proper contractor will have a model, again usually an informal mental model, of how much is likely to come by way of these extras and with what probability (it depends on the specification, design team, owner, etc.). From these probabilities a reduction to the base bid can be made.

          This is all a very rational way to approach bidding given the constraints of the system….which usually entail an owner who won’t go ahead with the work if the full freight is shown up front, and a design team that isn’t paid enough to properly specify the project. This is especially true of Public Works projects which are notoriously underspecified.

          A further step is to apply a loss function over the portfolio of projects to make sure one accurately anticipates the risk that any given bet or series of bets (which is what contract bids are) won’t pay off, and how that may affect the solvency of the enterprise.

          …almost no one in my industry would use any of these terms though. So don’t try to talk to contractors about loss functions and probabilities!

        • The fact that it’s required to be in the business doesn’t make it “not gaming the system.” The fact is, construction, particularly public works, is an industry where gaming the system is the *primary* activity, and building stuff is just the means to get into the game.

          I fully agree with you with “This is all a very rational way to approach bidding given the constraints of the system….which usually entail an owner who won’t go ahead with the work if the full freight is shown up front, and a design team that isn’t paid enough to properly specify the project. This is especially true of Public Works projects which are notoriously underspecified. ”

          However, just because it’s rational and predicted by game theory doesn’t mean it’s… shall we say… the best of all logically possible worlds. ;-) There are MANY projects that get done which SHOULD NOT be (in other words, the owner was absolutely right not to be willing to do the project for the full freight), such as the CA high speed rail project, which will clearly cost more than just giving everyone who ever winds up riding that thing a free airplane ticket. Or the Kansai airport.

        • “That places downwards forces on base bid; where once you get the contract, you need to figure out how in god’s green earth you’re going to make money.”

          The contract I referred to was a US Navy contract for equipment critical to nuclear submarines. The way to make money on these contracts, it was said, was on ECRs (Engineering Change Requests). We could never make that work out, though.

        • Oops! I meant “So the project was duly bid at the *higher* cost.” Hah! I confused myself there: our bid really was for about 1/2 of our estimated cost, the way I originally wrote it.

  5. I had a brief look at Yarin Gal’s thesis, which seems to be reaching for Bayesian deep learning by making explicit probability assumptions to replace random dropout – or that was my impression anyway. Has anyone had a close look at this work, or perhaps some of the recent stuff here https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/ ?
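
    For concreteness, the dropout-at-test-time idea looks roughly like this (a PyTorch sketch of my own; the network, data, and numbers are placeholders): keep dropout switched on at prediction time and treat the spread over repeated forward passes as a rough predictive uncertainty.

        import torch
        import torch.nn as nn

        net = nn.Sequential(
            nn.Linear(10, 50), nn.ReLU(), nn.Dropout(p=0.2),
            nn.Linear(50, 50), nn.ReLU(), nn.Dropout(p=0.2),
            nn.Linear(50, 1),
        )
        # ... train net as usual ...

        x_new = torch.randn(100, 10)   # hypothetical new inputs
        net.train()                    # leaves the dropout layers active at prediction time
        with torch.no_grad():
            draws = torch.stack([net(x_new) for _ in range(200)])   # 200 stochastic passes
        pred_mean = draws.mean(dim=0)  # point prediction
        pred_sd = draws.std(dim=0)     # spread across passes as a crude predictive uncertainty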

    Also, machine learning predictions as inputs to a non-Bayesian analysis seems reminiscent of Efron and Tibshirani’s pre-validation approach, where they fit the ML model for each subset in turn, using data other than that subset, then fit additional covariates in each subset, and then put all the subsets together, completely pooling the covariates as if they were common. But they can’t be common, as the ML fit was different in each subset.

    It was a nice example of representists versus propertyists http://statmodeling.stat.columbia.edu/2017/04/19/representists-versus-propertyists-rabbitducks-good/ They, being propertyists, had worked out that pre-validation had bad properties, while I worked through the representation and showed it to be contradictory (completely pooling parameters as if they were common when they were in fact different).
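
    For reference, the pre-validation recipe looks roughly like this as I understand it (a scikit-learn sketch with invented data; cross_val_predict plays the role of fitting the ML on data other than each subset):

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_predict

        rng = np.random.default_rng(4)
        X_omics = rng.normal(size=(200, 50))               # high-dimensional predictors
        x_clin = rng.normal(size=200)                      # a conventional covariate
        y = (X_omics[:, 0] + 0.5 * x_clin + rng.normal(size=200) > 0).astype(int)

        # "Pre-validated" predictor: each subject's value comes from a model fit on
        # the other folds, so it is not fit to that subject's own outcome.
        prevalid = cross_val_predict(
            GradientBoostingClassifier(), X_omics, y, cv=5, method="predict_proba"
        )[:, 1]

        # The pre-validated predictor is then entered alongside the ordinary
        # covariate in one pooled model, as if it were a common covariate.
        final = LogisticRegression().fit(np.column_stack([prevalid, x_clin]), y)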
