Chasing the noise in industrial A/B testing: what to do when all the low-hanging fruit have been picked?

Commenting on this post on the “80% power” lie, Roger Bohn writes:

The low power problem bugged me so much in the semiconductor industry that I wrote 2 papers about it around 1995. Variability estimates come naturally from routine manufacturing statistics, which in semicon were tracked carefully because they are economically important. The sample size is determined by how many production lots (e.g. 24 wafers each) you are willing to run in the experiment – each lot adds to the cost.

What I found was that small process improvements were almost impossible to detect, using the then-standard experimental methods. For example, if an experiment has a genuine yield impact of 0.2 percent, that can be worth a few million dollars. (A semiconductor fabrication facility produced at that time roughly $1 to $5 billion of output per year.) But a change of that size was lost in the noise. Only when the true effect rose into the 1% or higher range was there much hope of detecting it. (And a 1% yield change, from a single experiment, would be spectacular.)
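To see why a 0.2% effect disappears while a 1% effect is within reach, here is a minimal sketch of the usual two-sample power calculation under a normal approximation. The lot-to-lot standard deviation of 2 yield points and the 20 lots per arm are assumptions chosen for illustration, not figures from the comment, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_sample(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test on lot yields."""
    se = sd * sqrt(2.0 / n_per_arm)      # SE of the difference in mean yield
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability of landing in either rejection region.
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def lots_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Lots needed in each arm to reach the requested power."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return ceil(2 * ((z_a + z_b) * sd / effect) ** 2)


ASSUMED_LOT_SD = 2.0   # yield points; hypothetical, not from the source
ASSUMED_LOTS = 20      # lots per arm; hypothetical

for effect in (0.2, 1.0):
    p = power_two_sample(effect, ASSUMED_LOT_SD, ASSUMED_LOTS)
    n = lots_per_arm(effect, ASSUMED_LOT_SD)
    print(f"effect={effect} points: power with {ASSUMED_LOTS} lots/arm = {p:.2f}, "
          f"lots/arm for 80% power = {n}")
```

Because the required sample size scales with the inverse square of the effect, a fivefold smaller effect needs roughly 25 times as many lots, whatever noise level is assumed.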

Yet semicon engineers were running these experiments all the time, and often acting on the results. What was going on? One conclusion was that most good experiments were “short loop” trials, meaning that the wafers did not go all the way through the process. For example, you could run an experiment on a single mask layer, and then measure the effect on manufacturing tolerances. (Not the right terminology in semicon, but that is what they are called elsewhere.) In this way, the only noise was from the single mask layer. Such an experiment would not tell you the impact on yields, but an engineering model could estimate the relationship between tolerances ===> yields. Now, small changes were detectable with reasonable sample sizes.
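The short-loop argument can be sketched numerically. The example below assumes that independent process steps add noise in variance, and that a linear engineering model (yield change = slope × tolerance change) links the intermediate measurement to yield. All numbers, units, and function names are hypothetical.

```python
from math import sqrt


def full_flow_sd(step_sds):
    """SD of end-of-line yield when independent step noises add in variance."""
    return sqrt(sum(sd ** 2 for sd in step_sds))


def propagate_to_yield(tol_effect, tol_se, slope, slope_se=0.0):
    """First-order (delta-method) yield effect and its SE.

    Assumes yield_change = slope * tolerance_change and that the errors in
    the measured tolerance effect and in the slope are independent.
    """
    yield_effect = slope * tol_effect
    yield_se = sqrt((slope * tol_se) ** 2 + (tol_effect * slope_se) ** 2)
    return yield_effect, yield_se


# Hypothetical process: 16 steps, each adding 0.5 yield points of noise.
step_sds = [0.5] * 16
total_sd = full_flow_sd(step_sds)

# Sample size scales with variance, so isolating one step cuts the
# required number of lots by (total_sd / step_sd) ** 2.
reduction = (total_sd / step_sds[0]) ** 2
print(f"full-flow SD = {total_sd}, sample-size reduction = {reduction}x")

# Hypothetical short-loop result: tolerance tightened by 0.4 units (SE 0.1),
# with a model slope of -0.5 yield points per unit (SE 0.1).
effect, se = propagate_to_yield(tol_effect=-0.4, tol_se=0.1,
                                slope=-0.5, slope_se=0.1)
print(f"implied yield effect = {effect:.2f} +/- {se:.3f} points")
```

The `slope_se` term is where the loss of fidelity shows up: the less certain the engineering model, the more of the precision gained in the short loop is given back when translating to yield.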

This relates to noise-chasing in A/B testing, to the failure of null hypothesis significance testing when studying incremental changes (and what to do about it), and to our recent discussions about how to do medical trials using precise measurements of relevant intermediate outcomes.

9 thoughts on “Chasing the noise in industrial A/B testing: what to do when all the low-hanging fruit have been picked?”

  1. I didn’t think of this. Is NHST what happened to Intel? Did they A/B test themselves into a corner?

    Overall, Intel had a stellar quarter, but it originally promised that it would deliver the 10nm process back in 2015. After several delays, the company assured that it would deliver 10nm processors to market in 2017. That was further refined to the second half of this year.

    On the earnings call today, Intel announced that it had delayed high-volume 10nm production to an unspecified time in 2019.

    https://www.tomshardware.com/news/intel-cpu-10nm-earnings-amd,36967.html

  2. @Andrew: Even noisy measurements add information. If I were running a semiconductor manufacturing line, I would be happy to make a change that has a confidence interval of (-0.9% to 1.1%). The burden of action is overcome by mere evidence, not solid proof. IMO. Thoughts?
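One way to make the "mere evidence" argument concrete is to convert the interval into a probability that the change helps. The sketch below assumes the interval is a 95% interval from a normal sampling distribution and uses a flat prior, which is a strong assumption: a prior concentrated near zero would pull the estimate toward no effect. The function names are hypothetical.

```python
from statistics import NormalDist


def summarize_interval(lower, upper, level=0.95):
    """Recover the point estimate and SE implied by a symmetric normal interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    mean = (lower + upper) / 2
    se = (upper - lower) / (2 * z)
    return mean, se


def prob_positive(mean, se):
    """P(effect > 0) under a flat prior and a normal likelihood."""
    return 1 - NormalDist(mean, se).cdf(0.0)


mean, se = summarize_interval(-0.9, 1.1)
print(f"estimate = {mean:.2f} points, SE = {se:.2f}, "
      f"P(effect > 0) = {prob_positive(mean, se):.2f}")
```

Under these assumptions the change is somewhat more likely to help than to hurt, so whether to act depends on the cost of switching and on how bad the downside would be, not on statistical significance.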

    @Anoneuoid: No, 10 nm is a very difficult thing to get right. Intel is spending like $12B a year on R&D. Moore’s law is over. EUV is years behind schedule. We’ve hit ceilings on lithography wavelength, NA, and k factor. As a result, improvements in transistor density are coming from double, triple, quadruple patterning. Electromigration is more and more of a problem as wires get thinner. These are real problems. It’s not from noisy A/B testing. (Source: I almost worked in Intel R&D on electromigration at the 10 nm node.)

  3. “it relates to our recent discussions about how to do medical trials using precise measurements of relevant intermediate outcomes.”

    Measuring intermediate outcomes, instead of final outcomes, may be the single most powerful experimental tool in many fields. Juran talked about it (“cutting a window on the process”) 50 years ago. But it is very much a two-headed axe, as the problems in medicine clarify.

    The benefits include:
    1. Lower noise, therefore better S/N.
    2. MUCH faster information flow time (time from proposing an experiment to analyzing the result). In semicon, 10X improvements are easy. In AIDS research, measuring how fast patients would die took years, while measuring HIV levels took a few months.
    3. Lower cost, at least variable costs and often also fixed costs.

    BUT the downside is lower fidelity. Unless/until you have an excellent physical/biological model of a phenomenon, you cannot be sure that your intermediate variable is truly predicting the final outcome that matters. In engineering, once a good intermediate measure is identified and validated, the rate of technical progress can improve by an order of magnitude due to the 3 effects above.

    My MD friends suggest that all RCTs of long-term medication should measure *overall* mortality as one of the outcomes. Without measuring mortality, the RCT is a gamble that long-term effects in the body are minor. An unfortunate example was the diabetes drug Rosiglitazone/Avandia. This requirement is a paradox, though, because measuring mortality is very slow and very noisy.
    My layman’s conclusion is that medical progress in the future will be based on Bayesian gambles.
    “But we’ve grabbed all the low-hanging fruit. In medicine, public health, social science, and policy analysis we are studying smaller and smaller effects. These effects can still be important in aggregate, but each individual effect is small.”

    • I say this “low hanging fruit” idea is totally off. When I did medical research I wanted to create a model of what I thought was going on. I needed to know various stuff like:

      number of various receptors per cell, size of the cells, how many of these cells are there, what is the turnover rate of these cells, concentration of various ligands, permeability of the cell membrane to these ligands, number/concentration of secondary messengers and signal transducers within the cell, how long it takes for the signal to be transduced, how long it takes for the various feedbacks to kick in, final reaction rates. Then all of that for multiple “types” of cell, how adhesion to various substrates affects any of it, etc.

      Guess what? Pretty much none of that was available. The stuff that was available was questionable since it was only reported by one group in one specific circumstance, etc.

      The reality is we know almost nothing, and what is “known” is highly suspect. There are tons of low hanging fruit to be grabbed. Medical research needs a Tycho Brahe to actually collect usable data so we can figure out what’s going on rather than look for “effects”.

  4. As with many issues surrounding designed experiments, George Box figured this out. He developed the Evolutionary Operation (EVOP) approach to generating and detecting small improvements in production processes.

    In short, by systematically changing your factors inside the allowed production envelope, and running each combination for a few weeks (or more), you can detect the small improvements without creating scrap.

    https://en.wikipedia.org/wiki/EVOP
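The EVOP idea described above can be sketched as a small simulation: a 2x2 factorial is run repeatedly around the current operating point, and the main-effect estimates are averaged over cycles so that their standard error shrinks as 1/sqrt(cycles). The two factors, the effect sizes, and the noise level are hypothetical, and this is a simplified illustration rather than Box's full procedure (it omits the centre point and the interaction estimate).

```python
import random
from math import sqrt
from statistics import mean

# Coded factor levels (-1 = low, +1 = high) for two hypothetical factors A and B.
DESIGN = [(-1, -1), (1, -1), (-1, 1), (1, 1)]


def run_evop(true_effects, noise_sd, cycles, seed=0):
    """Simulate `cycles` passes over the 2x2 design; return averaged main effects."""
    rng = random.Random(seed)
    eff_a, eff_b = [], []
    for _ in range(cycles):
        # Response at each design point; a main effect is mean(high) - mean(low).
        y = [true_effects[0] * a / 2 + true_effects[1] * b / 2
             + rng.gauss(0, noise_sd) for a, b in DESIGN]
        eff_a.append((y[1] + y[3]) / 2 - (y[0] + y[2]) / 2)
        eff_b.append((y[2] + y[3]) / 2 - (y[0] + y[1]) / 2)
    return mean(eff_a), mean(eff_b)


def effect_se(noise_sd, cycles):
    """SE of an averaged main effect after `cycles` passes over the design."""
    return noise_sd / sqrt(cycles)


TRUE_EFFECTS = (0.2, 0.1)   # hypothetical yield gains, in points
NOISE_SD = 2.0              # hypothetical run-to-run noise, in points

for cycles in (4, 100, 2500):
    est_a, est_b = run_evop(TRUE_EFFECTS, NOISE_SD, cycles, seed=1)
    print(f"cycles={cycles}: A={est_a:+.3f}, B={est_b:+.3f}, "
          f"SE={effect_se(NOISE_SD, cycles):.3f}")
```

Because every run stays inside the allowed production envelope, the extra cycles cost time rather than scrap, which is what makes the large number of repetitions affordable.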

  5. This is a common problem in the tech industry as well. It’s often impossible to detect the effect of an experiment on the variable we ultimately care about (e.g. revenue or retention). That’s because said variable is often too far removed from the treatment, and intermediate steps can add a lot of noise. Better to measure change closer to where the treatment is applied and, as Roger Bohn suggests, use a model to estimate the final impact on the variable we care about.

  6. Would this apply to A/B testing in website design? Many small changes that are far removed from the point of purchase/decision (e.g., an about box that explains a field in a filter) are very difficult to connect to the traditional add-to-cart rate but can marginally improve the customer experience.
