I disagree with Geoff Hinton regarding “glorified autocomplete”

Computer scientist and “godfather of AI” Geoff Hinton says this about chatbots:

“People say, It’s just glorified autocomplete . . . Now, let’s analyze that. Suppose you want to be really good at predicting the next word. If you want to be really good, you have to understand what’s being said. That’s the only way. So by training something to be really good at predicting the next word, you’re actually forcing it to understand. Yes, it’s ‘autocomplete’—but you didn’t think through what it means to have a really good autocomplete.”

This got me thinking about what I do at work, for example in a research meeting. I spend a lot of time doing “glorified autocomplete” in the style of a well-trained chatbot: someone describes a problem, I listen, it reminds me of a related issue I’ve thought about before, and I respond. I’m acting as a sort of FAQ, but more like a chatbot than a FAQ in that the people talking with me don’t need to navigate the FAQ to find the answer most relevant to them; I do that myself and give a response.

I do that sort of thing a lot in meetings, and it can work well; indeed, I often think this sort of shallow, associative response can be more effective than whatever I’d get from a direct attack on the problem in question. After all, the people I’m talking with have already thought for a while about whatever it is they’re working on, and my initial thoughts may well be in the wrong direction, or else they’re in the right direction but just retrace my collaborators’ past ideas. From the other direction, my shallow thoughts can be useful in bringing in insights from problems that these collaborators have not thought about much before. Nonspecific suggestions on multilevel modeling or statistical graphics or simulation or whatever can really help!

At some point, though, I’ll typically have to bite the bullet and think hard, not necessarily reaching full understanding in the sense of mentally embedding the problem at hand into a coherent schema or logical framework, but still going through whatever steps of logical reasoning that I can. This feels different than autocomplete; it requires an additional level of focus. Often I need to consciously “flip the switch,” as it were, to turn on that focus and think rigorously. Other times, I’m doing autocomplete and either come to a sticking point or encounter an interesting idea, and this causes me to stop and think.

It’s almost like the difference between jogging and running. I can jog and jog and jog, thinking about all sorts of things and not feeling like I’m expending much effort, my legs pretty much move up and down of their own accord . . . but then if I need to run, that takes concentration.

Here’s another example. Yesterday I participated in the methods colloquium in our political science department. It was Don Green and me and a bunch of students, and the structure was that Don asked me questions, I responded with various statistics-related and social-science-related musings and stories, students followed up with questions, I responded with more stories, etc. Kinda like the way things go here on the blog, but spoken rather than typed. Anyway, the point is that most of my responses were a sort of autocomplete—not in a word-by-word chatbot style, more at a larger level of chunkiness, for example something would remind me of a story, and then I’d just insert the story into my conversation—but still at this shallow, pleasant level. Mellow conversation with no intellectual or social strain. But then, every once in a while, I’d pull up short and have some new thought, some juxtaposition that had never occurred to me before, and I’d need to think things through.

This also happens when I give prepared talks. My prepared talks are not super-well prepared—this is on purpose, as I find that too much preparation can inhibit flow. In any case, I’ll often find myself stopping and pausing to reconsider something or other. Even when describing something I’ve done before, there are times when I feel the need to think it all through logically, as if for the first time. I noticed something similar when I saw my sister give a talk once: she had the same habit of pausing to work things out from first principles. I don’t see this behavior in every academic talk, though; different people have different styles of presentation.

This seems related to models of associative and logical reasoning in psychology. As a complete non-expert in that area, I’ll turn to Wikipedia:

The foundations of dual process theory likely come from William James. He believed that there were two different kinds of thinking: associative and true reasoning. . . . images and thoughts would come to mind of past experiences, providing ideas of comparison or abstractions. He claimed that associative knowledge was only from past experiences describing it as “only reproductive”. James believed that true reasoning could enable overcoming “unprecedented situations” . . .

That sounds about right!

After describing various other theories from the past hundred years or so, Wikipedia continues:

Daniel Kahneman provided further interpretation by differentiating the two styles of processing more, calling them intuition and reasoning in 2003. Intuition (or system 1), similar to associative reasoning, was determined to be fast and automatic, usually with strong emotional bonds included in the reasoning process. Kahneman said that this kind of reasoning was based on formed habits and very difficult to change or manipulate. Reasoning (or system 2) was slower and much more volatile, being subject to conscious judgments and attitudes.

This sounds a bit different from what I was talking about above. When I’m doing “glorified autocomplete” thinking, I’m still thinking—this isn’t automatic and barely conscious behavior along the lines of driving to work along a route I’ve taken a hundred times before—I’m just thinking in a shallow way, trying to “autocomplete” the answer. It’s pattern-matching more than it is logical reasoning.

P.S. Just to be clear, I have a lot of respect for Hinton’s work; indeed, Aki and I included Hinton’s work in our brief review of 10 pathbreaking research articles during the past 50 years of statistics and machine learning. Also, I’m not trying to make a hardcore, AI-can’t-think argument. Although not myself a user of large language models, I respect Bob Carpenter’s respect for them.

I think that where Hinton got things wrong in the quote that led off this post was not in his characterization of chatbots, but rather in his assumptions about human thinking, in not distinguishing autocomplete-like associative reasoning from logical thinking. Maybe Hinton’s problem in understanding this is that he’s just too logical! At work, I do a lot of what seems like autocomplete—and, as I wrote above, I think it’s useful—but if I had more discipline, maybe I’d think more logically and carefully all the time. It could well be that Hinton has that habit or inclination to always be in focus. If Hinton does not have consistent personal experience of shallow, autocomplete-like thinking, he might not recognize it as something different, in which case he could be giving the chatbot credit for something it’s not doing.

Come to think of it, one thing that impresses me about Bob is that, when he’s working, he seems to always be in focus. I’ll be in a meeting, just coasting along, and Bob will interrupt someone to ask for clarification, and I suddenly realize that Bob absolutely demands understanding. He seems to have no interest in participating in a research meeting in a shallow way. I guess we just have different styles. It’s my impression that the vast majority of researchers are like me, just coasting on the surface most of the time (for some people, all of the time!), while Bob, and maybe Geoff Hinton, is one of the exceptions.

P.P.S. Sometimes we really want to be doing shallow, auto-complete-style thinking. For example, if we’re writing a play and want to simulate how some characters might interact. Or just as a way of casting the intellectual net more widely. When I’m in a research meeting and I free-associate, it might not help immediately solve the problem at hand, but it can bring in connections that will be helpful later. So I’m not knocking auto-complete; I’m just disagreeing with Hinton’s statement that “by training something to be really good at predicting the next word, you’re actually forcing it to understand.” As a person who does a lot of useful associative reasoning and also a bit of logical understanding, I think they’re different, both in how they feel and also in what they do.

P.P.P.S. Lots more discussion in comments; you might want to start here.

P.P.P.P.S. One more thing . . . actually, it might deserve its own post, but for now I’ll put it here: So far, it might seem like I’m denigrating associative thinking, or “acting like a chatbot,” or whatever it might be called. Indeed, I admire Bob Carpenter for doing very little of this at work! The general idea is that acting like a chatbot can be useful—I really can help lots of people solve their problems in that way, also every day I can write these blog posts that entertain and inform tens of thousands of people—but it’s not quite the same as focused thinking.

That’s all true (or, I should say, that’s my strong impression), but there’s more to it than that. As discussed in my comment linked to just above, “acting like a chatbot” is not “autocomplete” at all, indeed in some ways it’s kind of the opposite. Locally it’s kind of like autocomplete in that the sentences flow smoothly; I’m not suddenly jumping to completely unrelated topics—but when I do this associative or chatbot-like writing or talking, it can lead to all sorts of interesting places. I shuffle the deck and new hands come up. That’s one of the joys of “acting like a chatbot” and one reason I’ve been doing it for decades, long before chatbots ever existed! Walk along forking paths, and who knows where you’ll turn up! And all of you blog commenters (ok, most of you) play helpful roles in moving these discussions along.

For how many iterations should we run Markov chain Monte Carlo?

So, Charles and I wrote this article. It’s for the forthcoming Handbook of Markov Chain Monte Carlo, 2nd edition. We still have an opportunity to revise it, so if you see anything wrong or misleading, or if there are other issues you think we should mention, please let us know right here in comments!

Remember, it’s just one chapter of an entire book—it’s intended to be an updated version of my chapter with Kenny, Inference from Simulations and Monitoring Convergence, from the first edition. For our new chapter, Charles liked the snappier, “How many iterations,” title.

In any case, our chapter does not have to cover MCMC, just inference and monitoring convergence. But I think Charles is right that, once you think about inference and monitoring convergence, you want to step back and consider design, in particular how many chains to run, where to start them, and how many iterations to run per chain.

Again, all comments will be appreciated.

SIMD, memory locality, vectorization, and branch point prediction

The title of this post lists the four most important considerations for performant code these days (late 2023).

SIMD

GPUs can do a lot of compute in parallel. High-end ($15K to $30K) GPUs like the ones used in big tech perform thousands of operations in parallel (50 teraflops for the H100). The catch is that they want all of those operations to be done in lockstep on different data. This is called single-instruction, multiple-data (SIMD) execution. Matrix operations, as used by today’s neural networks, are easy to code with SIMD.

GPUs cannot do 1000s of different things at once. This makes it challenging to write recursive algorithms like the no-U-turn sampler (NUTS) and is one of the reasons people like Matt Hoffman (developer of NUTS) have turned to generalized HMC.

GPUs can do different things in sequence if you keep memory on the GPU (in kernel). This is how deep nets can sequence feedforward, convolution, attention, activation, and GLM layers. Steve Bronder is working on keeping generated Stan GPU code in kernel.

Memory locality

Memory is slow. Really slow. The time it takes to fetch a fresh value from random access memory (RAM) is on the order of 100–200 times slower than the time it takes to execute a primitive arithmetic operation. On a CPU. The problem just gets worse with GPUs.

Modern CPUs are equipped with several levels of cache. For instance, a consumer-grade 8-core CPU like an Intel i9 might have a 32 MB L3 cache shared among all the cores, plus a 1 MB L2 cache and an 80 KB L1 cache per core. When memory is read in from RAM, it gets read in blocks, containing not only the value you requested but other values near it. This gets pulled first into L3 cache, then into L2 cache, then into L1 cache, then it gets into registers on the CPU. This means that if you have laid an array x out contiguously in memory and you have read x[n], then it is really cheap to read x[n + 1], because it’s either in your L1 cache already or being pipelined there. If your code accesses non-local pieces of memory, then you wind up getting cache misses. The farther out the miss goes (L1, L2, L3, or all the way to RAM), the longer the CPU has to wait to get the values it needs.

One way to see this in practice is to consider matrix layout in memory. If we use column major order, then each column is contiguous in memory and the columns are laid out one after the other. This makes it much more efficient to traverse the matrix by first looping over the columns, then looping over the rows. Getting the order wrong can be an order of magnitude penalty or more. Matrix libraries will do more efficient block-based transposes so this doesn’t bite users writing naive code.
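To make the loop-order point concrete, here is a small Python/NumPy sketch (my own illustration, not from the post). NumPy defaults to row-major (C) storage, so the roles are flipped relative to the column-major layout described above: the row-wise traversal walks contiguous memory while the column-wise traversal strides across cache lines, and on a typical machine the second one is noticeably slower.

import time
import numpy as np

# NumPy arrays are row-major (C order) by default, so rows are contiguous in
# memory; Eigen and Fortran default to column-major, where the roles flip.
a = np.random.rand(4000, 4000)

def traverse_rows(m):
    # inner sum walks contiguous memory: mostly cache hits
    return sum(m[i, :].sum() for i in range(m.shape[0]))

def traverse_cols(m):
    # inner sum strides across rows: each element lands on a different cache line
    return sum(m[:, j].sum() for j in range(m.shape[1]))

for f in (traverse_rows, traverse_cols):
    start = time.perf_counter()
    f(a)
    print(f.__name__, round(time.perf_counter() - start, 3), "seconds")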

The bottom line here is that even if you have 8 cores, you can’t run 8 parallel chains of MCMC as fast as you can run 1. On my Xeon desktop with 8 cores, I can run 4 chains in parallel, followed by another 4 in parallel, in the same amount of time as I can run 8 in parallel. As a bonus, my fan doesn’t whine as loudly. The slowdown with 8 parallel chains is not due to the CPUs being busy; it’s because the asynchronous execution causes a bottleneck in the cache. This can be overcome with hard work by restructuring parallel code to be more cache sensitive, but it’s a deep dive.

Performant code often recomputes the value of a function if its operands are in cache, in order to reduce the memory pressure that would arise from storing the value. Or it explicitly reorders operations to be lazy in order to support recomputation. Stan does this to prioritize scalability over efficiency (i.e., it recomputes values, which means fewer memory fetches but more operations).

Vectorization

Modern CPUs also provide vector (SIMD) instructions, for example AVX and SSE on Intel chips, and C++ compilers at high levels of optimization will exploit them if the right flags are enabled. This way, a CPU core can do on the order of 8 arithmetic operations at once. Writing loops in blocks of 8 so they can exploit CPU vectorization is critical for performant code. The good news is that calling underlying matrix libraries like Eigen or BLAS will do that for you. The bad news is that naive hand-written loops will be slow by comparison, so if you roll your own loops in C++ you have to arrange them for vectorization yourself if you want performant code.
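As a rough illustration (mine, not the post’s), the sketch below compares a scalar Python loop with a single call that dispatches to an optimized BLAS kernel using the CPU’s vector units. The comparison exaggerates the gap, because interpreter overhead piles on top of the missing vectorization, but the moral matches the text: call the optimized library rather than hand-rolling the loop.

import time
import numpy as np

x = np.random.rand(5_000_000)
y = np.random.rand(5_000_000)

def dot_loop(a, b):
    # scalar loop: no chance to use the AVX/SSE lanes
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_vectorized(a, b):
    # dispatches to a BLAS dot product compiled to use the CPU's vector units
    return float(np.dot(a, b))

for f in (dot_loop, dot_vectorized):
    start = time.perf_counter()
    f(x, y)
    print(f.__name__, round(time.perf_counter() - start, 3), "seconds")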

Another unexpected property of modern CPUs for numerical computing is that integer operations are pretty much free. The CPUs have separate integer and floating point units and with most numerical computing, there is far less pressure on integer arithmetic. So you can see things like adding integer arithmetic to a loop without slowing it down.

Branch-point prediction

When the CPU executes a conditional such as the compiled form of

if (A) then B else C;

it will predict whether A will evaluate to true or false. If it predicts true, then the operations in B will begin to execute “optimistically” at the same time as A. If A does evaluate to true, we have a head start. If A evaluates to false, then we have a branch-point misprediction. We have to backtrack, flush the results from optimistic evaluation of B, fetch the instructions for C, then continue. This is very very slow because of memory contention and because it breaks the data pipeline. And it’s double trouble for GPUs. Stan includes suggestions (pragmas) to the compiler as to which branch is more likely in our tight memory management code for automatic differentiation.

Conclusion

The takeaway message is that for efficient code, our main concerns are memory locality, branch-point prediction, and vectorization. With GPUs, we further have to worry about SIMD. Good luck!

“Modeling Social Behavior”: Paul Smaldino’s cool new textbook on agent-based modeling

Paul Smaldino is a psychology professor who is perhaps best known for his paper with Richard McElreath from a few years ago, “The Natural Selection of Bad Science,” which presents a sort of agent-based model that reproduces the growth in the publication of junk science that we’ve seen in recent decades.

Since then, it seems that Smaldino has been doing a lot of research and teaching on agent-based models in social science more generally, and he just came out with a book, “Modeling Social Behavior: Mathematical and Agent-Based Models of Social Dynamics and Cultural Evolution.” The book has social science, it has code, it has graphs—it’s got everything.

It’s an old-school textbook with modern materials, and I hope it’s taught in thousands of classes and sells a zillion copies.

There’s just one thing that bothers me. The book is entertainingly written and bursting with ideas, and it does a great job of raising concerns about the models it’s simulating, rather than acting like everything’s already known. My concern is that nobody reads books anymore. If I think about students taking a class in agent-based modeling and using this book, it’s hard for me to picture most of them actually reading the book. They’ll start with the homework assignments and then flip through the book to try to figure out what they need. That’s how people read nonfiction books nowadays, which I guess is one reason that books, even those I like, are typically repetitive and low on content. Readers don’t want the book to offer a delightful reading experience, so authors don’t deliver it, and then readers don’t expect it, etc.

To be clear: this is a textbook, not a trade book. It’s a readable and entertaining book in the way that Regression and Other Stories is a readable and entertaining book, not in the way that Guns, Germs, and Steel is. Still, within the framework of being a social science methods book, it’s entertaining and thought-provoking. Also, I like it as a methods book because it’s focused on models rather than on statistical inference. We tried to get a similar feel with A Quantitative Tour of Social Sciences but with less success.

So it kinda makes me sad to see this effort of care put into a book that probably very few students will read from paragraph to paragraph. I think things were different 50 years ago: back then, there wasn’t anything online to read, you’d buy a textbook and it was in front of you so you’d read it. On the plus side, readers can now go in and make the graphs themselves—I assume that Smaldino has a website somewhere with all the necessary code—so there’s that.

P.S. In the preface, Smaldino is “grateful to all the modelers whose work has inspired this book’s chapters . . . particularly want to acknowledge the debt owed to the work of,” and then he lists 16 names, one of which is . . . Albert-László Barabási!

Huh?? Is this the same Albert-László Barabási who said that scientific citations are worth $100,000 each? I guess he did some good stuff too? Maybe this is worthy of an agent-based model of its own.

Bayes factors evaluate priors, cross validations evaluate posteriors

I’ve written this explanation on the board often enough that I thought I’d put it in a blog post.

Bayes factors

Bayes factors compare the data density (sometimes called the “evidence”) of one model against another. Suppose we have two Bayesian models for data y, one model p_1(\theta_1, y) with parameters \theta_1 and a second model p_2(\theta_2, y) with parameters \theta_2.

The Bayes factor is defined to be the ratio of the marginal probability density of the data in the two models,

\textrm{BF}_{1,2} = p_1(y) \, / \, p_2(y),

where we have

p_1(y) = \mathbb{E}[p_1(y \mid \Theta_1)] \ = \ \int p_1(y \mid \theta_1) \cdot p_1(\theta_1) \, \textrm{d}\theta_1

and

p_2(y) = \mathbb{E}[p_2(y \mid \Theta_2)] \ = \ \int p_2(y \mid \theta_2) \cdot p_2(\theta_2) \, \textrm{d}\theta_2.

The distributions p_1(y) and p_2(y) are known as prior predictive distributions because they integrate the likelihood over the prior.
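As a concrete (if naive) sketch of what these integrals mean, here is a Monte Carlo estimate of log p(y) for a toy model with y_i ~ normal(theta, 1) and a normal(0, prior_sd) prior on theta, computed by averaging the likelihood over prior draws. The model, seed, and sample sizes are my own choices for illustration; in practice this estimator has high variance, and one would use something like bridge sampling instead.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(0.7, 1.0, size=50)   # data simulated with theta = 0.7, sigma = 1

def log_evidence_mc(y, prior_sd, sigma=1.0, draws=100_000):
    # naive Monte Carlo estimate of log p(y) = log E_prior[ p(y | theta) ]
    theta = rng.normal(0.0, prior_sd, size=draws)                 # prior draws
    log_lik = stats.norm.logpdf(y[:, None], theta, sigma).sum(axis=0)
    # log-mean-exp of the log likelihood over the prior draws
    return np.logaddexp.reduce(log_lik) - np.log(draws)

print(log_evidence_mc(y, prior_sd=1.0))    # normal(0, 1) prior
print(log_evidence_mc(y, prior_sd=100.0))  # normal(0, 100) prior: much lower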

There are ad hoc guidelines from Harold Jeffreys, of “uninformative” prior fame, classifying Bayes factor values as “decisive,” “very strong,” “strong,” “substantial,” “barely worth mentioning,” or “negative”; see the Wikipedia page on Bayes factors. These seem about as useful as a 5% threshold on p-values before declaring significance.

Held-out validation

Held-out validation tries to evaluate prediction after model estimation (aka training). It works by dividing the data y into two pieces, y = (y', y''), then training on y' and testing on y''. The held-out validation values are

p_1(y'' \mid y') = \mathbb{E}[p_1(y'' \mid \Theta_1) \mid y'] = \int p_1(y'' \mid \theta_1) \cdot p_1(\theta_1 \mid y') \, \textrm{d}\theta_1

and

p_2(y'' \mid y') = \mathbb{E}[p_2(y'' \mid \Theta_2) \mid y'] = \int p_2(y'' \mid \theta_2) \cdot p_2(\theta_2 \mid y') \, \textrm{d}\theta_2.

The distributions p_1(y'' \mid y') and p_2(y'' \mid y') are known as posterior predictive distributions because they integrate the likelihood over the posterior from earlier training data.
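Continuing the same toy normal model (again my own example, not from the post), the posterior predictive density of held-out data can be estimated the same way, except that the likelihood is averaged over draws from the posterior given the training data y' rather than over prior draws.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma, tau = 1.0, 1.0                       # known data sd, normal(0, tau) prior
y_train = rng.normal(0.7, sigma, size=80)   # y'
y_test = rng.normal(0.7, sigma, size=20)    # y''

# exact conjugate posterior for theta given the training data
post_prec = len(y_train) / sigma**2 + 1 / tau**2
post_mean = (y_train.sum() / sigma**2) / post_prec

# Monte Carlo estimate of log p(y'' | y') = log E[ p(y'' | theta) | y' ]
theta = rng.normal(post_mean, np.sqrt(1 / post_prec), size=50_000)
log_lik = stats.norm.logpdf(y_test[:, None], theta, sigma).sum(axis=0)
print(np.logaddexp.reduce(log_lik) - np.log(len(theta)))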

This can all be done on the log scale to compute either the log expected probability or the expected log probability (which are different because logarithms are not linear). We will use expected log probability in the next section.
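(By Jensen’s inequality, \log \mathbb{E}[p_1(y'' \mid \Theta_1) \mid y'] \geq \mathbb{E}[\log p_1(y'' \mid \Theta_1) \mid y'], so the log expected probability is never smaller than the expected log probability.)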

(Leave one out) cross validation

Suppose our data is y_1, \ldots, y_N. Leave-one-out cross validation works by successively taking y'' = y_n and y' = y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N and then averaging on the log scale.

\frac{1}{N} \sum_{n=1}^N \log\left( \strut p_1(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right)

and

\frac{1}{N} \sum_{n=1}^N \log \left( \strut p_2(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) \right).

Leave-one-out cross validation is interpretable as the expected log predictive density (ELPD) for a new data item. Estimating ELPD is (part of) the motivation for various information criteria such as AIC, DIC, and WAIC.

Conclusion and a question

The main distinction between Bayes factors and cross validation is that the former uses prior predictive distributions whereas the latter uses posterior predictive distributions. This makes Bayes factors very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and a normal(0, 100) prior has a negligible effect on the posterior if the true value is in the range (-3, 3), but it can have a huge effect on the Bayes factor.
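Here is a small worked example of that sensitivity, under my own toy setup: 100 observations from normal(theta = 0.7, 1) with a conjugate normal prior on theta, so both the marginal likelihood and the exact leave-one-out predictive densities are available in closed form. Widening the prior standard deviation from 1 to 100 knocks several nats off the log marginal likelihood, and hence changes any Bayes factor involving this model by more than an order of magnitude, while the LOO elpd barely moves.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, sigma = 100, 1.0
y = rng.normal(0.7, sigma, size=N)   # true theta = 0.7, well inside (-3, 3)

def log_marginal(y, tau):
    # exact log p(y) for y_i ~ normal(theta, sigma), theta ~ normal(0, tau)
    n = len(y)
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    return stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

def elpd_loo(y, tau):
    # exact leave-one-out log predictive density, averaged over held-out points
    lpd = []
    for n in range(len(y)):
        rest = np.delete(y, n)
        post_prec = len(rest) / sigma**2 + 1 / tau**2
        post_mean = (rest.sum() / sigma**2) / post_prec
        pred_sd = np.sqrt(sigma**2 + 1 / post_prec)
        lpd.append(stats.norm.logpdf(y[n], post_mean, pred_sd))
    return np.mean(lpd)

for tau in (1.0, 100.0):
    print("prior sd", tau, ": log p(y) =", round(log_marginal(y, tau), 2),
          ", LOO elpd =", round(elpd_loo(y, tau), 4))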

This matters because pragmatic Bayesians like Andrew Gelman tend to use weakly informative priors that determine the rough magnitude, but not the value of parameters. You can’t get good Bayes factors this way. The best way to get a good Bayes factor is to push the prior toward the posterior, which you get for free with cross validation.

My question is whether the users of Bayes factors really believe so strongly in their priors. I’ve been told that’s true of the hardcore “subjective” Bayesians, who aim for strong priors, and also the hardcore “objective” Bayesians, who try to use “uninformative” priors, but I don’t think I’ve ever met anyone who claimed to follow either approach. It’s definitely not the perspective we’ve been pushing in our “pragmatic” Bayesian approach, for instance as described in the Bayesian workflow paper. We flat out encourage people to start with weakly informative priors and then add more information if the priors turn out to be too weak for either inference or computation.

Further reading

For more detail on these methods and further examples, see Gelman et al.’s Bayesian Data Analysis, 3rd Edition, which is available free online through the link, particularly Section 7.2 (“Information criteria and cross-validation,” page 175) and Section 7.4 (“Model comparison using Bayes factors,” page 183). I’d also recommend Vehtari, Gelman, and Gabry’s paper, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Frictionless reproducibility; methods as proto-algorithms; division of labor as a characteristic of statistical methods; statistics as the science of defaults; statisticians well prepared to think about issues raised by AI; and robustness to adversarial attacks

Tian points us to this article by David Donoho, which argues that some of the rapid progress in data science and AI research in recent years has come from “frictionless reproducibility,” which he identifies with “data sharing, code sharing, and competitive challenges.” This makes sense: the flip side of the unreplicable research that has destroyed much of social psychology, policy analysis, and related fields is that when we can replicate an analysis with a press of a button using open-source software, it’s much easier to move forward.

Frictionless reproducibility

Frictionless reproducibility is a useful goal in research. It can take a while between the development of a statistical idea and its implementation in a reproducible way, and that’s ok. But it’s good to aim for that stage. The effort it takes to make a research idea reproducible is often worth it, in that getting to reproducibility typically requires a level of care and rigor beyond what is necessary just to get a paper published. One thing I’ve learned with Stan is that much is learned in the process of developing a general tool that will be used by strangers.

I think that statisticians have a special perspective for thinking about these issues, for the following reason:

Methods as proto-algorithms

As statisticians, we’re always working with “methods.” Sometimes we develop new methods or extend existing methods; sometimes we place existing methods into a larger theoretical framework; sometimes we study the properties of methods; sometimes we apply methods. Donoho and I are typical of statistics professors in having done all these things in our work.

A “method” is a sort of proto-algorithm, not quite fully algorithmic (for example, it could require choices of inputs, tuning parameters, expert inputs at certain points) but it follows some series of steps. The essence of a method is that it can be applied by others. In that sense, any method is a bridge between different humans; it’s a sort of communication among groups of people who may never meet or even directly correspond. Fisher invented logistic regression and decades later some psychometrician uses it; the method is a sort of message in a bottle.

Division of labor as a characteristic of statistical methods

There are different ways to take this perspective. One direction is to recognize that almost all statistical methods involve a division of labor. In Bayes, one agent creates the likelihood model and another agent creates the prior model. In bootstrap, one agent comes up with the estimator and another agent comes up with the bootstrapping procedure. In classical statistics, one agent creates the measurement protocol, another agent designs the experiment, and a third agent performs the analysis. In machine learning, there’s the split into training and test sets. With public surveys, one group conducts the survey and computes weights; other groups analyze the data using the weights. Etc. We discussed this general idea a few years ago here.

But that’s not the direction I want to go right here. Instead I want to consider something else, which is the way that a “method” is an establishment of a default; see here and also here.

Statistics as the science of defaults

The relevance to the current discussion is that, to the extent that defaults are a move toward automatic behavior, statisticians are in the business of automating science. That is, our methods are “successes” to the extent that they enable automatic behavior on the part of users. As we have discussed, automatic behavior is not a bad thing! When we make things automatic, users can think at the next level of abstraction. For example, push-button linear regression allows researchers to focus on the model rather than on how to solve a matrix equation, and it can even take them to the next level of abstraction and think about prediction without even thinking about the model. As teachers and users of research, we then are (rightly) concerned that lack of understanding can be a problem, but it’s hard to go back. We might as well complain that the vast majority of people drive their cars with no understanding of how those little explosions inside the engine make the car go round.

Statisticians well prepared to think about issues raised by AI

To get back to the AI issue: I think that we as statisticians are particularly well prepared to think about the issues that AI brings, because the essence of statistics is the development of tools designed to automate human thinking about models and data. Statistical methods are a sort of slow-moving AI, and it’s kind of always been our dream to automate as much of the statistics process as possible, while recognizing that for Cantorian reasons (see section 7 here) we will never be there. Given that we’re trying, to a large extent, to turn humans into machines or to routinize what has traditionally been a human behavior that has required care, knowledge, and creativity, we should have some insight into computer programs that do such things.

In some ways, we statisticians are even more qualified to think about this than computer scientists are, in that the paradigmatic action of a computer scientist is to solve a problem, whereas the paradigmatic action of a statistician is to come up with a method that will allow other people to solve their problems.

I sent the above to Jessica, who wrote:

I like the emphasis on frictionless reproducibility as a critical driver of the success in ML. Empirical ML has clearly emphasized methods for ensuring the validity of predictive performance estimates (hold out sets, common task framework etc) compared to fields that use statistical modeling to generate explanations, like social sciences, and it does seem like that has paid off.

From my perspective, there’s something else that’s been very successful as well: post-2015ish there’s been a heavy emphasis on making models robust to adversarial attack. Being able to take an arbitrary evaluation metric and incorporate it into your loss function so you’re explicitly training for it is also likely to improve things fast. We comment on this a bit in a paper we wrote last year reflecting on what, if anything, recent concerns about ML reproducibility and replicability have in common with the so-called replication crisis in social science.

I do think we are about at max hype currently in terms of perceived success of ML though, and it can be hard to tell sometimes how much the emerging evidence of success from ML research is overfit to the standard benchmarks. Obviously there have been huge improvements on certain test suites, but just this morning for instance I saw an ML researcher present a pretty compelling graph showing that the “certified robustness” of the top LLMs (GPT-3.5, GPT-4, Llama 2, etc.), when evaluated on the common datasets (ImageNet, MNIST, etc.), has not really improved much at all in the past 7-8 years. This was a line graph where each line denoted changes in robustness for different benchmarks (ImageNet, MNIST, etc.) with new methodological advances. Each point in a line represented the robustness of a deep net on that particular benchmark given whatever was considered the state of the art in robust ML at that time. The x-axis was related to time, but each tick represented a particular paper that advanced SOTA. It’s still very easy to trick LLMs into generating toxic text, leaking private data they trained on, or changing their mind based on what should be an inconsequential change to the wording of a prompt, for example.

Simple pseudocode for transformer decoding a la GPT

A number of people have asked me for this, so I’m posting it here:

[Edit: revised original draft (twice) to fix log density context variable and width of multi-head attention values.]

This is a short note that provides complete and relatively simple pseudocode for the neural network architecture behind the current crop of large language models (LLMs), the generative pretrained transformers (GPT). These are based on the notion of (multi-head) attention, followed by feedforward neural networks, stacked in a deep architecture.

I simplified the pseudocode compared to things like Karpathy’s nanoGPT repository in Python (great, but it’s tensorized and batched PyTorch code for GPU efficiency) or Phuong and Hutter’s pseudocode, which is more general and covers encoding and multiple different architectures. I also start from scratch with the basic notions of tokenization and language modeling.

I include the pseudocode for evaluating the objective function for training and the pseudocode for generating responses. The initial presentation uses single-head attention to make the attention stage clearer, with a note afterward with pseudocode to generalize to multi-head attention.
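For readers who want something runnable before opening the note, here is a minimal NumPy sketch of causal single-head self-attention, the building block the note elaborates on. It is my own toy version with made-up dimensions, not the note’s pseudocode, and it omits everything else in a transformer block (residual connections, layer norm, the feedforward sublayer, multiple heads).

import numpy as np

def single_head_attention(X, W_q, W_k, W_v):
    # X: (T, d) token embeddings; W_q, W_k, W_v: (d, d_head) projection matrices
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # (T, d_head) each
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (T, T) scaled similarities
    mask = np.triu(np.ones_like(scores), k=1) == 1  # block attention to the future
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                              # (T, d_head) attention output

rng = np.random.default_rng(0)
T, d, d_head = 5, 16, 8
X = rng.normal(size=(T, d))
out = single_head_attention(X, *(rng.normal(size=(d, d_head)) for _ in range(3)))
print(out.shape)  # (5, 8)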

I also include references to other basic presentations, including Daniel Lee’s version coded in Stan.

If this is confusing or you think I got a detail wrong, please let me know—I want to make this as clear and correct (w.r.t. GPT-2) as possible.

My SciML Webinar next week (28 Sep): Multiscale generalized Hamiltonian Monte Carlo with delayed rejection

I’m on the hook to do a SciML webinar next week:

These are organized by Keith Phuthi (who is at CMU) through the University of Michigan’s Institute for Computational Discovery and Engineering.

Sam Livingstone is moderating. I’ll be presenting joint work with Alex Barnett, Chirag Modi, Edward Roualdes, and Gilad Turok.

I’m very excited about this project as it combines a number of threads I’ve been working on with collaborators. When I did my job talk here, Leslie Greengard, our center director, asked me why we didn’t use variable step-size integrators when doing Hamiltonian Monte Carlo. I told him we’d love to do it, but didn’t know how to do it in such a way as to preserve the stationary target distribution.

Delayed rejection HMC

Then we found Antonietta Mira’s work on delayed rejection. It lets you make a second Metropolis proposal if the first one is rejected. The key here is that we can use a smaller step size for the second proposal, thus recovering from proposals that are rejected because the Hamiltonian diverged (i.e., the first-order gradient-based algorithm can’t handle regions of high curvature in the target density). There’s a bit of bookkeeping (which is frustratingly hard to write down) for the Hastings condition to ensure detailed balance. Chirag Modi, Alex Barnett, and I worked out the details, and Chirag figured out a novel twist on delayed rejection that only retries if the original acceptance probability was low. You can read about it in our paper:

This works really well and is enough that we can get proper draws from Neal’s funnel (vanilla HMC fails on this example in either the mouth or the neck of the funnel, depending on the step size). But it’s inefficient in that it retries an entire Hamiltonian trajectory, which means that if we cut the step size in half, we double the number of steps to keep the integration time constant.
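To give a feel for the delayed-rejection bookkeeping without the Hamiltonian machinery, here is a toy one-dimensional random-walk Metropolis sampler with a single smaller-step retry, using Tierney and Mira’s second-stage acceptance probability specialized to symmetric Gaussian proposals. This is my own illustrative sketch, not the delayed-rejection HMC from the paper, and all names and tuning constants are made up.

import numpy as np

def dr_rwm(log_p, x0, n_iter=5000, step=2.0, shrink=0.2, seed=0):
    # Random-walk Metropolis with one delayed-rejection retry at a smaller step.
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    log_q = lambda a, b, s: -0.5 * ((a - b) / s) ** 2 - np.log(s)  # up to a constant

    for _ in range(n_iter):
        # stage 1: big step
        y1 = x + step * rng.normal()
        a1 = min(1.0, np.exp(log_p(y1) - log_p(x)))
        if rng.uniform() < a1:
            x = y1
        else:
            # stage 2: retry with a smaller step, correcting for the first rejection
            y2 = x + shrink * step * rng.normal()
            a1_rev = min(1.0, np.exp(log_p(y1) - log_p(y2)))   # alpha1 for y2 -> y1
            if a1_rev < 1.0:
                log_num = log_p(y2) + log_q(y1, y2, step) + np.log1p(-a1_rev)
                log_den = log_p(x) + log_q(y1, x, step) + np.log1p(-a1)
                if np.log(rng.uniform()) < log_num - log_den:
                    x = y2
        chain.append(x)
    return np.array(chain)

# standard normal target; big steps often get rejected, small retries rescue them
draws = dr_rwm(lambda x: -0.5 * x**2, x0=0.0)
print(draws.mean(), draws.std())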

Radford Neal to the rescue

As we were doing this, the irrepressible Radford Neal published a breakthrough algorithm:

What he managed to do was use generalized Hamiltonian Monte Carlo (G-HMC) to build an algorithm that takes one step of HMC (like Metropolis-adjusted Langevin, but over the coupled position/momentum variables) and manages to maintain directed progress. Instead of fully resampling momentum each iteration, G-HMC resamples a new momentum value and then takes a weighted average with the existing momentum, with most of the weight on the existing momentum. Neal shows that with a series of accepted one-step HMC iterations, we can make directed progress just like HMC with longer trajectories. The trick is getting sequences of acceptances together. Usually this doesn’t work because we have to flip momentum each iteration. We can re-flip it when regenerating, to keep going in the same direction on acceptances, but with rejections we reverse momentum (this isn’t an issue with standard HMC because it fully regenerates the momentum each time). So to get directed movement this way, we would need step sizes that are too small to be efficient. What Radford figured out is that we can solve this problem by replacing the way we generate the uniform(0, 1)-distributed variates for the Metropolis accept/reject step (we compare the variate to the ratio of the density at the proposal to the density at the previous point and accept if the variate is lower). Radford realized that if we instead generate them in a sawtooth pattern (with micro-jitter for ergodicity), then when we’re at the bottom of the sawtooth, generating a sequence of values near zero, the acceptances will cluster together.
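For concreteness, here is a sketch of the textbook one-step generalized HMC scheme (Horowitz-style partial momentum refresh) on a one-dimensional Gaussian target. It is my own illustration, not Neal’s new algorithm or the delayed-rejection variant, but it shows the mechanics described above: momentum is mostly retained across iterations, an acceptance keeps the chain moving in the same direction, and a rejection flips the momentum.

import numpy as np

def ghmc(grad_log_p, log_p, q0, n_iter=10_000, eps=0.1, alpha=0.98, seed=0):
    # One-leapfrog-step generalized HMC with partial momentum refresh.
    rng = np.random.default_rng(seed)
    q, p = q0, rng.normal()
    chain = []
    for _ in range(n_iter):
        # partial refresh: mostly keep the old momentum
        p = alpha * p + np.sqrt(1 - alpha**2) * rng.normal()
        # one leapfrog step
        p_half = p + 0.5 * eps * grad_log_p(q)
        q_new = q + eps * p_half
        p_new = p_half + 0.5 * eps * grad_log_p(q_new)
        # Metropolis accept/reject on the joint density; the proposal's momentum
        # negation does not change the ratio because the kinetic energy is even in p
        log_accept = (log_p(q_new) - 0.5 * p_new**2) - (log_p(q) - 0.5 * p**2)
        if np.log(rng.uniform()) < log_accept:
            q, p = q_new, p_new   # accepted: keep moving in the same direction
        else:
            p = -p                # rejected: momentum flip reverses direction
        chain.append(q)
    return np.array(chain)

draws = ghmc(lambda q: -q, lambda q: -0.5 * q**2, q0=0.0)
print(draws.mean(), draws.std())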

Replacing Neal’s trick with delayed rejection

Enter Chirag’s and my intern, Gilad Turok (who came to us as an undergrad in applied math at Columbia). Over the summer, working with me and Chirag and Edward Roualdes (who was here as a visitor), he built and evaluated a system that replaces Neal’s trick (the sawtooth pattern of acceptance probabilities) with Mira’s trick (delayed rejection). It indeed solves the multiscale problem. It exceeded our expectations in terms of efficiency—it’s about twice as fast as our delayed rejection HMC. Going one HMC step at a time, it is able to adjust its step size within what would be a single Hamiltonian trajectory. That is, we finally have something that works roughly like a typical ODE integrator in applied math.

Matt Hoffman to the rescue

But wait, that’s not all. There’s room for another one of the great MCMC researchers to weigh in. Matt Hoffman, along with Pavel Sountsov, figured out how to take Radford’s algorithm and provide automatic adaptation for it.

What Hoffman and Sountsov manage to do is run a whole lot of parallel chains, then use information in the other chains to set tuning parameters for a given chain. In that way it’s like the Goodman and Weare affine-invariant sampler that’s used in the Python package emcee. This involves estimating the metric (posterior covariance, or just the variance in the diagonal case) and also estimating step size, which they do through a heuristic largest-eigenvalue estimate. Among the pleasant properties of their approach is that the entire setup produces a Markov chain from the very first iteration. That means we only have to do what people call “burn in” (sorry Andrew, but notice how I say other people call it that, not that they should), not set aside some number of iterations for adaptation.

Edward Roualdes has coded up Hoffman and Sountsov’s adaptation and it appears to work with delayed rejection replacing Neal’s trick.

Next for Stan?

I’m pretty optimistic that this will wind up being more efficient than NUTS and also make things like parallel adaptation and automatic stopping a whole lot easier. It should be more efficient because it doesn’t waste work—NUTS goes forward and backward in time and then subsamples along the final doubling (usually—it’s stochastic with a strong bias toward doing that). This means we “waste” the work going the wrong way in time and beyond where we finally sample. But we still have a lot of eval to do before we can replace Stan’s longstanding sampler or even provide an alternative.

My talk

The plan’s basically to expand this blog post with details and show you some results. Hope to see you there!

PhD student, PostDoc, and Research software engineering positions

Several job opportunities in beautiful Finland!

  1. Fully funded postdoc and doctoral student positions in various topics including Bayesian modeling, probabilistic programming, and workflows with me and other professors at Aalto University and the University of Helsinki, funded by the Finnish Center for Artificial Intelligence

    See more topics, how to apply, and job details like salary at fcai.fi/we-are-hiring

    You can also ask me for further details

  2. Permanent full-time research software engineer position at Aalto University. Aalto Scientific Computing is a specialized type of research support, providing high-performance computing hardware, management, research support, teaching, and training. The team works with top researchers throughout the university. All the work is open-source by default and the team takes an active part in worldwide projects.

    See more about tasks, qualifications, salary, etc. at www.aalto.fi/en/open-positions/research-software-engineer

    This could also be a great fit for someone interested in probabilistic programming. I know some of the RSE group members, and they are great, and we’ve been very happy to get their help, e.g., in developing the priorsense package.

Update on Retrodesign: R package for Type M and Type S errors

In 2000, Francis Tuerlinckx and I published our paper on Type M and Type S errors.

In 2014, John Carlin and I published a paper on the topic with a more applied focus, including an R function “retrodesign” for computing these error rates given assumptions.

In 2019, Andy Timm improved the function and put it on CRAN.

Since then, Timm and Martijn Weterings found some small problems. They fixed and updated the retrodesign package, and it’s right here on CRAN again.

I hope you find it useful.

Report on the large language model meeting at Berkeley

I’m not the only one who thinks GPT-4 is awesome. I just got back from an intense week of meetings at the Large language models and transformers workshop at Berkeley’s Simons Institute for the Theory of Computing. Thanks to Umesh Vazirani for organizing and running it so calmly.

Here are the videos of the talks.

Human feedback models

I gave a short talk Friday afternoon on models of data annotation.

  • Bob Carpenter. Softening human feedback improves classifier calibration

The step from the language model GPT-4 to the chatbot ChatGPT involves soliciting human feedback to rank potential outputs. This is typically done by converting the human feedback to a “gold” standard and retraining the baseline GPT-4 neural network.

Chris Manning (who introduced me to statistical natural language processing when we were both professors at CMU) provides a nice high-level overview of how OpenAI uses reinforcement learning with human feedback to try to align ChatGPT to the goals of being helpful, harmless, and truthful:

Chris Manning. Towards reliable use of large language models: better detection, consistency, and instruction-tuning.

Humans rank potential ChatGPT output and their feedback is used as input for a Bradley-Terry model of conversational reward. This is then used to retrain the network. Chris suggests a much simpler approach than the one they use.

While at the workshop, John Thickstun, a Stanford CS postdoc, pointed me to the following (and also filled me in on a bunch of technical details in between sessions):

Chen Cheng, Hilal Asi, and John Duchi. 2022. How many labelers do you have? A close look at gold-standard labels. arXiv.

It makes some simplifying assumptions to prove results, including the bias of majority voting. I show similar things through simulation in a case study I posted on the Stan forums a while back:

Bob Carpenter. For probabilistic prediction, full Bayes is better than point estimators.

More on that soon, when Seong Han and I finish our recent paper on annotation models.

LLMs and copyright

The highlight of the entire event for me was a captivating talk by a brilliant professor of intellectual property law at Berkeley:

Pamela Samuelson. Large language models meet copyright law.

If you’re at all interested in copyright and AI, you should watch this. She very clearly explains what copyright is and how the law treats works of artistic expression differently from works of function, and hence how it sees code and (other) artistic works differently. She also covers the basis for the cases currently being litigated. She was also masterly at handling the rather unruly crowd. I’ve never been to an event with so many interruptions by the audience members. It was almost like the audience was practicing to be DARPA program managers (a notoriously interruption-prone group).

Is ChatGPT just a stochastic parrot?

The other talk I’d encourage everyone to watch is:

Steven Piantadosi. Meaning in the age of large language models.

He goes over a lot of the cognitive science and philosophy of language necessary to understand why ChatGPT is not just a “stochastic parrot.” He focuses on the work of, wait for it…Susan Gelman (Andrew’s sister). Susan works in my favorite area of cognitive science—concept development.

I can’t recommend this one highly enough, and I’ll be curious what people get out of it. This one’s closest to my own background (my Ph.D. is joint in cognitive science/computer science and I taught semantics, philosophy of language, and psycholinguistics as well as NLP at CMU), so I’m curious how understandable it’ll be to people who haven’t studied a lot of cognitive anthropology, philosophy of mind, and cognitive development.

Sanjeev Arora gave a talk about combining skills, describing a simple combinatorial experiment of combining five “skills” from among thousands of skills (not defining what a skill was drove the audience into a frenzy of interruptions that’s quite something to watch). This is behavior that “emerged” in GPT-4 and isn’t so great in the less powerful models.

Speaking of parrots, the West Coast Stats Views blog (which Andrew often cites) is parroting mainstream chatbot FUD (fear, uncertainty, and doubt); see, e.g., Thursday tweets. I say “parrot” because the blog’s Thursday posts just point to things we used to call “tweets” before a certain someone decided to throw away a brand name that’s become a verb. The irony, of course, is that they accuse GPT of being a parrot!

Scaling laws

There were a couple of nice talks by Yasaman Bahri (DeepMind) on Understanding the origins and taxonomy of neural scaling laws and Sasha Rush (Cornell/Hugging Face) on Scaling data-constrained language models. These are important as they’re what allows you to decide how much data to use and how large a model to build for your compute budget. It’s what gave companies the incentive to invest hundreds of millions of dollars in infrastructure to fit these models. Sasha’s talk also discusses the roles researchers can take who don’t have access to big-tech compute power.

Watermarking LLMs

Scott Aaronson (UT Austin, on sabbatical at OpenAI) gave a really interesting talk:

Scott Aaronson. Watermarking of large language models.

The talk’s a masterclass in distilling a problem to math, explaining why it’s difficult, and considering several solutions and their implications. I felt smarter after watching this one.

You might also want to check out the competition from John Thickstun, Robust distortion-free watermarks for language models, which takes an encryption key-based approach.

In-context learning

“In-context learning” is what people call an LLM’s ability to be given zero or more examples and then to complete the pattern. For example, if I say “translate from French. oui: “, we get what’s called “zero-shot learning”. If I give it an example, then it’s called “one-shot learning”, for example, “translate from French. notre: our, oui: “, and so on. ChatGPT can manage all sorts of nuanced language tasks given only a few examples. It’s so good that it’s competitive with most custom solutions to these problems.

Everyone kept pointing out that in-context learning does not learn in the sense of updating model weights. Of course it doesn’t. That’s because it’s conditioning, not learning. The whole process is Markovian, returning a draw from Pr[continuation | context]. The issue is whether you can do a good job of this prediction without being AI-complete (i.e., a general artificial intelligence, whatever that means).

A whole lot of attention was given to ChatGPT’s poor performance on arithmetic problems coded as characters like “123 * 987”, with a couple of different talks taking different approaches. One trained a transformer with the actual digits and showed it could be made to do this, pointing to the problem being the encoding of math in language. Perhaps the most insightful observation is that if you give GPT access to an API (with in-context learning, no less), it can call on that API to do arithmetic and the problem goes away. The final talk on this was during the lightning sessions, where Nayoung Lee (a Ph.D. student from Wisconsin-Madison) showed that if you reverse the digits in the output (so that they are least significant first, the way we usually do arithmetic), transformers can be trained to do arithmetic very well; here’s a link to her arXiv paper, Teaching arithmetic to small transformers.

Sparks of artificial general intelligence

Yin Tat Lee kicked off the program talking about the Microsoft paper, Sparks of artificial general intelligence. If you haven’t read the paper, it’s a fascinating list of things that ChatGPT can do with relatively simple prompting.

One of the interesting aspects of Yin Tat’s talk was his description of how they treated ChatGPT (4) as an evolving black box. To me, this and a bunch of the other talks probing GPT’s abilities point out that we need much better evaluation methods.

Federated learning and task specialization

For those interested in hierarchical modeling and the idea of a foundation model that can be adapted to different tasks, Colin Raffel (UNC/Hugging Face) gave an interesting talk on federated learning and deployment:

Colin Raffel. Build an ecosystem, not a monolith.

This was not unrelated to Sasha’s talk (perhaps not surprising, as they’re both affiliated with Hugging Face). They also talk about the ecosystem of image models sprouting up around Stable Diffusion and the ability to fine-tune them using low-rank methods.

OpenAI is closed

Ilya Sutskever, chief scientist of OpenAI, gave a talk that I can best describe as adversarial. It was the only talk that filled the room to standing room only. He said he couldn’t talk about anything computational or anything related to LLMs, so he spent an hour talking about the duality between probability and compression and Kolmogorov complexity.

The fundamental role of data partitioning in predictive model validation

David Zimmerman writes:

I am a grad student in biophysics and basically a novice to Bayesian methods. I was wondering if you might be able to clarify something that is written in section 7.2 of Bayesian Data Analysis. After introducing the log pointwise predictive density as a scoring rule for probabilistic prediction, you say:

The advantage of using a pointwise measure, rather than working with the joint posterior predictive distribution … is in the connection of the pointwise calculation to cross-validation, which allows some fairly general approaches to approximation of out-of-sample fit using available data.

But would it not be possible to do k-fold cross-validation, say, with a loss function based on the joint predictive distribution over each full validation set? Can you explain why (or under what circumstances) it is preferable to use a pointwise measure rather than something based on the joint predictive?

My reply: Yes, for sure you can do k-fold cross validation. Leave-one-out (LOO) has the advantage of being automatic to implement in many models using Pareto-smoothed importance sampling, but for structured problems such as time series and spatial models, k-fold can make more sense. The reason we made such a big deal in our book about the pointwise calculation was to emphasize that predictive validation fundamentally is a process that involves partitioning the data. This aspect of predictive validation is hidden by AIC and related expressions such as DIC that work with the unpartitioned joint likelihood. When writing BDA3 we worked to come up with an improvement/replacement for DIC—the result was chapter 7 of BDA3, along with this article with Aki Vehtari and Jessica Hwang—and part of this was a struggle to manipulate the posterior simulations of the joint likelihood. At some point I realized that the partitioning was necessary, and this point struck me as important enough to emphasize when writing all this up.

And here’s Aki’s cross validation FAQ and two of his recent posts on the topic:

from 2020: More limitations of cross-validation and actionable recommendations

from 2022: Moving cross-validation from a research idea to a routine step in Bayesian data analysis

Faculty position in computation & politics at MIT

We have this tenure-track Assistant Professor position open at MIT. It is an unusual opportunity in being a shared position between the Department of Political Science and the College of Computing. (I say “unusual” compared with typical faculty lines, but by now MIT has hired faculty into several such shared positions.)

So we’re definitely inviting applications not just from social science PhDs, but also from, e.g., statisticians, mathematicians, and computer scientists:

We seek candidates whose research involves development and/or intensive use of computational and/or statistical methodologies, aimed at addressing substantive questions in political science.

Beyond advertising this specific position, perhaps this is an interesting example of the institutional forms that interdisciplinary hiring can take. Here the appointment would be in the Department of Political Science and then also within one of the relevant units of the College of Computing. And there are two search committees working together, one from the Department and one from the College. I am serving on the latter, which includes experts from all parts of the College.

[This post is by Dean Eckles.]

    The connection between junk science and sloppy data handling: Why do they go together?

    Nick Brown pointed me to a new paper, “The Impact of Incidental Environmental Factors on Vote Choice: Wind Speed is Related to More Prevention-Focused Voting,” to which his reaction was, “It makes himmicanes look plausible.” Indeed, one of the authors of this article had come up earlier on this blog as a coauthor of paper with fatally-flawed statistical analysis. So, between the general theme of this new article (“How might irrelevant events infiltrate voting decisions?”), the specific claim that wind speed has large effects, and the track record of one of the authors, I came into this in a skeptical frame of mind.

    That’s fine. Scientific papers are for everyone, not just the true believers. Skeptics are part of the audience too.

    Anyway, I took a look at the article and replied to Nick:

    The paper is a good “exercise for the reader” sort of thing to find how they managed to get all those pleasantly low p-values. It’s not as blatantly obvious as, say, the work of Daryl Bem. The funny thing is, back in 2011, lots of people thought Bem’s statistical analysis was state-of-the-art. It’s only in retrospect that his p-hacking looks about as crude as the fake photographs that fooled Arthur Conan Doyle. Figure 2 of this new paper looks so impressive! I don’t really feel like putting in the effort to figuring out exactly how the trick was done in this case . . . Do you have any ideas?

    Nick responded:

    There are some hilarious errors in the paper. For example:
    – On p. 7 of the PDF, they claim that “For Brexit, the ‘No’ option advanced by the Stronger In campaign was seen as clearly prevention-oriented (Mean (M) = 4.5, Standard Error (SE) = 0.17, t(101) = 6.05, p < 0.001) whereas the ‘Yes’ option put forward by the Vote Leave campaign was viewed as promotion-focused (M = 3.05, SE = 0.16, t(101) = 2.87, p = 0.003).” But the question was not “Do you want Brexit, Yes/No.” It was “Should the UK Remain in the EU or Leave the EU.” Hence why the pro-Brexit campaign was called “Vote Leave,” geddit? Both sides agreed before the referendum that this was fairer and clearer than Yes/No. Is “Remain” more prevention-focused than “Leave”?

    – On p. 12 of the PDF, they say “In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU.” This is again completely false. The Conservative government, including Prime Minister David Cameron, backed Remain. It’s true that a number of Conservative politicians backed Leave, and after the referendum lots of Conservatives who had backed Remain pretended that they either really meant Leave or were now fine with it, but if you put that statement, “In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU,” in front of 100 UK political scientists, not one will agree with it.

    If the authors are able to get this sort of thing wrong then I certainly don’t think any of their other analyses can be relied upon without extensive external verification.

    If you run the attached code on the data (mutatis mutandis for the directories in which the files live) you will get Figure 2 of the Mo et al. paper. Have a look at the data (the CSV file is an export of the DTA file, if you don’t use Stata) and you will see that they collected a ton of other variables. To be fair they mention these in the paper (“Additionally, we collected data on other Election Day weather indicators (i.e., cloud cover, dew point, precipitation, pressure, and temperature), as well as historical wind speeds per council area. The inclusion of other Election Day weather indicators increases our confidence that we are detecting an association between wind speed and election outcomes, and not the effect of other weather indicators that may be correlated with wind speed.”). My guess is that they went fishing and found that wind speed, as opposed to the other weather indicators that they mentioned, gave them a good story.

    Looking only at the Swiss data, I note that they also collected “Income,” “Unemployment,” “Age,” “Race” (actually the percentage of foreign-born people; I doubt if Switzerland collects “Race” data; Supplement, Table S3, page 42), “Education,” and “Rural,” and threw those into their model as well. They also collected latitude and longitude (of the centroid?) for each canton, although those didn’t make it into the analyses. Also they include “Turnout,” but for any given Swiss referendum it seems that they only had the national turnout, because this number is always the same for every “State” (canton) for any given “Election” (referendum). And the income data looks sketchy (people in Schwyz canton do not make 2.5 times what people in Zürich canton do). I think this whole process shows a degree of naivety about what “kitchen-sink” regression analyses (and more sophisticated versions thereof) can and can’t do, especially with noisy measures (such as “Precipitation” coded as 0/1). Voter turnout is positively correlated with precipitation but negatively with cloud cover, whatever that means.

    Another glaring omission is any sort of weighting by population. The most populous canton in Switzerland has a population almost 100 times the least populous, yet every canton counts equally. There is no “population” variable in the dataset, although this would have been very easy to obtain. I guess this means they avoid the ecological fallacy, up to the point where they talk about individual voting behaviour (i.e., pretty much everywhere in the article).

    Nick then came back with more:

    I found another problem, and it’s huge:

    For “Election 50”, the Humidity and Dew Point data are completely borked (“relative humidity” values around 1000 instead of around 0.6; dew point values of 0.4–0.6 instead of a Fahrenheit temperature slightly below the measured temperature in the 50–60 range). When I remove that referendum from the results, I get the attached version of Figure 2. I can’t run their Stata models, but by my interpretation of the model coefficients from the R model that went into making Figure 2, the value for the windspeed * condition interaction goes from 0.545 (SE=0.120, p=0.000006) to 0.266 (SE=0.114, p=0.02).

    So it seems to me that a very big part of the effect, for the Swiss results anyway, is being driven by this data error in the covariates.
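
    For what it’s worth, this is the kind of coding error that a quick range check on the covariates would flag before any modeling. Here is a minimal R sketch; the data frame dat and the columns election, humidity, and dew_point are hypothetical names, not the ones in the posted dataset:

        library(dplyr)

        # flag any election whose weather covariates fall outside plausible ranges
        dat %>%
          group_by(election) %>%
          summarise(
            max_humidity  = max(humidity, na.rm = TRUE),   # relative humidity should be in [0, 1]
            min_dew_point = min(dew_point, na.rm = TRUE)   # dew point should be a Fahrenheit temperature
          ) %>%
          filter(max_humidity > 1 | min_dew_point < 20)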

    And then he posted a blog with further details, along with a link to some other criticisms from Erik Gahner Larsen.

    The big question

    Why do junk science and sloppy data handling so often seem to go together? We’ve seen this a lot, for example the ovulation-and-voting and ovulation-and-clothing papers that used the wrong dates for peak fertility, the Excel error paper in economics, the gremlins paper in environmental economics, the analysis of air pollution in China, the collected work of Brian Wansink, . . . .

    What’s going on? My hypothesis is as follows. There are lots of dead ends in science, including some bad ideas and some good ideas that just don’t work out. What makes something junk science is not just that it’s studying an effect that’s too small to be detected with noisy data; it’s that the studies appear to succeed. It’s the misleading apparent success that turns a scientific dead end into junk science.

    As we’ve been aware since the classic Simmons et al. paper from 2011, researchers can and do use researcher degrees of freedom to obtain apparent strong effects from data that could well be pure noise. This effort can be done on purpose (“p-hacking”) or without the researchers realizing it (“forking paths”), or through some mixture of the two.

    The point is that, in this sort of junk science, it’s possible to get very impressive-looking results (such as Figure 2 in the above-linked article) from just about any data at all! What that means is that data quality doesn’t really matter.
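
    Here is a minimal simulation sketch in R (pure noise, nothing to do with the paper’s actual data) showing how trying many specifications and keeping the best-looking one can produce an impressive p-value out of nothing:

        set.seed(123)
        n <- 200
        X <- matrix(rnorm(n * 10), n, 10)
        colnames(X) <- paste0("X", 1:10)
        noise_data <- data.frame(y = rnorm(n), X)   # outcome and all ten predictors are pure noise

        # regress y on every pair of predictors plus their interaction, and collect the p-values
        p_values <- c()
        for (i in 1:9) {
          for (j in (i + 1):10) {
            fit <- lm(as.formula(paste0("y ~ X", i, " * X", j)), data = noise_data)
            p_values <- c(p_values, summary(fit)$coefficients[-1, 4])   # drop the intercept row
          }
        }

        min(p_values)   # with 45 models and 135 coefficients, something will look "significant"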

    If you’re studying a real effect, then you want to be really careful with your data: any noise you introduce, whether in measurement or through coding error, can be expected to attenuate your effect, making it harder to discover. When you’re doing real science you have a strong motivation to take accurate measurements and keep your data clean. Errors can still creep in, sometimes destroying a study, so I’m not saying it can’t happen. I’m just saying that the motivation is to get your data right.
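
    A toy illustration of that attenuation (made-up numbers, just to show the mechanism):

        set.seed(1)
        n <- 10000
        x <- rnorm(n)
        y <- 0.5 * x + rnorm(n)            # true slope is 0.5
        x_noisy <- x + rnorm(n, sd = 1)    # measure x with error

        coef(lm(y ~ x))["x"]                 # close to 0.5
        coef(lm(y ~ x_noisy))["x_noisy"]     # attenuated toward zero, roughly 0.25 here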

    In contrast, if you’re doing junk science, the data are not so relevant. You’ll get strong results one way or another. Indeed, there’s an advantage to not looking too closely at your data at first; that way if you don’t find the result you want, you can go through and clean things up until you reach success. I’m not saying the authors of the above-linked paper did any of that sort of thing on purpose; rather, what I’m saying is that they have no particular incentive to check their data, so from that standpoint maybe we shouldn’t be so surprised to see gross errors.

    Fully Bayesian computing: Don’t collapse the wavefunction until it’s absolutely necessary.

    Kevin Gray writes:

    In marketing research, it’s common practice to use averages of MCMC draws in Bayesian hierarchical models as estimates of individual consumer preferences.

    For example, we might conduct choice modeling among 1,500 consumers and analyze the data with an HB multinomial logit model. The means or medians of the (say) 15,000 draws for each respondent are then used as parameter estimates for each respondent. In other words, by averaging the draws for each respondent we obtain an individual-level equation for each respondent and individual-level utilities.

    Recently, there has been criticism of this practice by some marketing science people. The argument is that we can compare predictions for individuals or groups of individuals (e.g., men versus women), but not the parameters of these individuals or groups, in order to identify differences in their preferences.

    This is highly relevant because since the late 90s it has been common practice in marketing research to use these individual-level “utilities” to compare preferences (i.e., relative importance of attributes) of pre-defined groups or to cluster on the utilities with K-means (for example).

    I’m not an authority on Bayes of course, but have not heard of this practice outside of marketing research, and have long been concerned. Marketing research is not terribly rigorous…

    This all seems very standard to me and is implied by basic simulation summaries, as described for example in chapter 1 of Bayesian Data Analysis. Regarding people’s concerns: yeah, you shouldn’t first summarize simulations over people and then compare people. What you should do is compute any quantity of interest—for example, a comparison of groups of people—separately for each simulation draw, and then only at the end should you average over the simulations.
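
    In code, the difference looks like this; a minimal R sketch with made-up object names (draws is an n_draws x n_respondents matrix of posterior draws of some per-respondent parameter, and female is a logical vector over respondents), not tied to any particular HB package:

        # Don't: collapse each respondent to a point estimate, then compare groups
        est <- colMeans(draws)
        mean(est[female]) - mean(est[!female])   # a single number with no uncertainty attached

        # Do: compute the comparison within each draw, then summarize over draws
        diff_draws <- rowMeans(draws[, female, drop = FALSE]) -
          rowMeans(draws[, !female, drop = FALSE])
        mean(diff_draws)                         # posterior mean of the comparison
        quantile(diff_draws, c(0.025, 0.975))    # and its uncertainty interval

    For a simple linear comparison like this the two point estimates happen to coincide; the payoff of working draw by draw is that you keep the posterior uncertainty, and for nonlinear summaries (rankings, clustering, thresholds) the two approaches give genuinely different answers.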

    Sometimes we say: Don’t prematurely collapse the wave function.

    This is also related to the idea of probabilistic programming or, as Jouni and I called it, fully Bayesian computing. Here’s our article from 2004.

    Slides on large language models for statisticians

    I was invited by David Banks to give an introductory talk at the regional American Statistical Association meeting on large language models. Here are the slides:

    Most usefully, the slides include complete pseudocode up to but not including multi-head attention, along with an annotated bibliography of the main papers if you want to catch up. After the talk, I added a couple of slides on scaling laws and the annotated bibliography, which I didn’t have time to get to before the talk, as well as a slide describing multi-head attention, but without pseudocode.

    P.S. The meeting was yesterday at Columbia and I hadn’t been to the stats department since the pandemic started, so it felt very strange.

    P.P.S. GPT-4 helped me generate the LaTeX TikZ code to the point where I did zero searching through the docs or the web. It also generates all of my pandas and plotnine code (Python clones of R’s data frames and ggplot2) and a ton of my NumPy, SciPy, and general Python code. It can explain the techniques it uses, so I’m learning a lot, too. I almost never use Stack Overflow any more!

    Workflow for robust and efficient projection predictive inference

    Yann McLatchie, Sölvi Rögnvaldsson, Frank Weber, and I (Aki) write in a new preprint, “Robust and efficient projection predictive inference”:

    The concepts of Bayesian prediction, model comparison, and model selection have developed significantly over the last decade. As a result, the Bayesian community has witnessed a rapid growth in theoretical and applied contributions to building and selecting predictive models. Projection predictive inference in particular has shown promise to this end, finding application across a broad range of fields. It is less prone to over-fitting than naïve selection based purely on cross-validation or information criteria performance metrics, and has been known to out-perform other methods in terms of predictive performance. We survey the core concept and contemporary contributions to projection predictive inference, and present a safe, efficient, and modular workflow for prediction-oriented model selection therein. We also provide an interpretation of the projected posteriors achieved by projection predictive inference in terms of their limitations in causal settings.

    The main purpose of the paper is to present a workflow for projection predictive variable selection so that users can obtain reliable results in the least time-consuming way (sometimes there are safe shortcuts that can save an enormous amount of wall-clock and computing time). But it also discusses the use of the projected posterior in causal settings and gives some more background in general. All of these have been implemented in the projpred R package (with the most recent workflow-supporting features added by Frank, who has been doing an awesome job improving projpred in recent years). While writing the introduction to the paper, we were happy to notice that projpred is currently the most downloaded R package for Bayesian variable selection!
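
    For readers who want to try it, here is a minimal sketch of the basic workflow in R; the data frame mydata and the formula are hypothetical, and the paper and the projpred documentation describe the full workflow with all the shortcuts:

        library(rstanarm)
        library(projpred)

        # hypothetical reference model including all candidate predictors
        refm <- stan_glm(y ~ x1 + x2 + x3 + x4 + x5, data = mydata, refresh = 0)

        vs <- cv_varsel(refm)               # cross-validated search over submodels
        plot(vs, stats = "elpd")            # predictive performance vs. submodel size
        nsel <- suggest_size(vs)            # suggested number of predictors
        prj <- project(vs, nterms = nsel)   # projected posterior for the selected submodel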

    Summer School on Advanced Bayesian Methods in Belgium

    (this post is by Charles)

    This September, the Interuniversity Institute for Biostatistics and statistical Bioinformatics is holding its 5th Summer School on Advanced Bayesian Methods. The event is set to take place in Leuven, Belgium. From their webpage:

    As before, the focus is on novel Bayesian methods relevant to the applied statistician. In the fifth edition of the summer school, the following two courses will be organized in Leuven from 11 to 15 September 2023:

    The target audience of the summer school are statisticians and/or epidemiologists with a sound background in statistics, but also with a background in Bayesian methodology. In both courses, practical sessions are organized, so participants are asked to bring along their laptop with the appropriate software (to be announced) pre-installed.

    I’m happy to do a three-day workshop on Stan: we’ll have ample time to dig into a lot of interesting topics and students will have a chance to do plenty of coding.

    I’m also looking forward to the course on spatial modeling. I’ve worked quite a bit on the integrated Laplace approximation (notably its implementation in autodiff systems such as Stan), but I’ve never used the INLA package itself (or one of its wrappers), nor am I very familiar with applications in ecology. I expect this will be a very enriching experience.

    The registration deadline is July 31st.

    HIIT Research Fellow positions in Finland (up to 5 year contracts)

    This job post is by Aki

    The Helsinki Institute for Information Technology has some funding for Research Fellows, and the research topics can include Bayes, probabilistic programming, ML, AI, etc.

    HIIT Research Fellow positions support the career development of excellent advanced researchers who already have some postdoctoral research experience. While HIIT Research Fellows have a designated supervisor at University of Helsinki or Aalto, they are expected to develop their own research agenda and to gain the skills necessary to lead their own research group in the future. HIIT Research Fellows should strengthen Helsinki’s ICT research community either through collaboration or by linking ICT research with another scientific discipline. In either case, excellence and potential for impact are the primary criteria for HIIT Research Fellow funding.

    The contract period is up to five years in length.

    I (Aki) am one of the potential supervisors, so you could benefit from my help (the other professors are great, too), but as the text says, you would be an independent researcher. This is an awesome opportunity to advance your career in a lovely and lively environment between Aalto University and the University of Helsinki. I can provide further information about the research environment and working in Finland.

    The deadline is August 13th, 2023.

    See more at the HIIT webpage.