Johannes Castner writes:

Suppose there are k scientists, each with her own model (Bayesian Net) over m random variables. Then, because the space of Bayesian Nets over these m variables, with the square-root of the Jensen-Shannon Divergence as a distance metric, is a closed and bounded space, there exists one unique Bayes Net that is a mixture of the k model joint-distributions which is at equal distance to each of the k models and may be called a “consensus graph.” This consensus graph is in turn a Bayes Net, which can be updated with evidence. The first question is: What are the conditions for which, given a new bit of evidence, the updated consensus graph is exactly the same graph as the consensus graph of the updated k Bayes Nets? In other words, if we arrive at a synthetic model from k models and then update this synthetic model, under what conditions is this the same thing as if we had first updated all k models and then built a synthesis? The second question is: If these are not the same, then which of the two would be better and under what conditions, from the perspective of collective learning?

Does anyone have any thoughts on this? It all seems related to various topics of interest to me (see, for example, this presentation from 2003) but I don’t know anything about what he is talking about.

The model is alien. Wondering whether the old problem of choice aggregation was already formulated more generally than allowed by even minimalist assumptions called for an underlying network structure…

I am confused by the description of the problem.

First, say m=2 random variables. Then the possible graphs are (i) A causes B; (ii) B causes A; (iii) neither, A and B are independent. If the consensus graph is required to be a DAG (emphasis on acyclic) then I am not even sure how to represent it when one expert has in mind graph (i) and the other (ii). Part of the problem is that the terminology for graphical models is all over the place (Bayes net, Bayes belief networks, ….) so not sure what type of graphical model we are talking about.

Second, I am also confused by the final reference to “collective learning”. My understanding is that most Bayesian inference is built around an individual decision maker, not a collective. Subjective beliefs involve a subject. I think the problem needs to be set up more clearly. E.g. who is learning what from whom, and for what purposes. This has implications for the way data and priors are aggregated.

I don’t think the fact that they’re acyclic is a problem. We’re just defining a joint distribution over the variables. So if we select mixture component #1, then we have a model where A causes B, and if we select mixture component #2, then we have a model where B causes A. The mixture might assign different weights to these two possibilities, depending on how many people in the sample subscribe to each belief.

The question about “collective learning” seems to be this: If you have a bunch of models and you want to update them with new data, will you get the same answer from the following two approaches?

1. Aggregating the models into a mixture-based meta-model and then updating the metamodel

2. Updating the models individually and then aggregating them.

The models could be based on a committee of experts or just a bunch of different models from different model initializations. The Bayesian machinery for updating the models in response to new data should be the same regardless of who is doing the updating and for what purpose.
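To make the two approaches concrete, here is a minimal sketch with two made-up models over binary variables A and B. The probability tables, the equal weights, and the helper names (`mix`, `posterior_a`) are all assumptions for illustration, not anything from the original question. It conditions on B = 1 and compares the resulting beliefs about A.

```python
# Two hypothetical expert models: joint tables over binary (A, B).
p1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p2 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.2, (1, 1): 0.1}
weights = [0.5, 0.5]  # assumed equal prior weights on the two models


def mix(dists, ws):
    """Weighted mixture of distributions that share the same keys."""
    return {k: sum(w * d[k] for w, d in zip(ws, dists)) for k in dists[0]}


def posterior_a(joint, b):
    """Return (P(A | B=b), P(B=b)) from a joint table keyed by (a, b)."""
    evidence = joint[(0, b)] + joint[(1, b)]
    return {a: joint[(a, b)] / evidence for a in (0, 1)}, evidence


b_obs = 1  # the new bit of evidence: B = 1

# Approach 1: aggregate into a mixture, then update the mixture.
mix_then_update, _ = posterior_a(mix([p1, p2], weights), b_obs)

# Approach 2: update each model, then aggregate with the SAME weights.
post1, ev1 = posterior_a(p1, b_obs)
post2, ev2 = posterior_a(p2, b_obs)
update_then_mix = mix([post1, post2], weights)

# Approach 2 again, but with weights rescaled by how well each model
# predicted the evidence (w_i * P_i(B=b), renormalized).
z = weights[0] * ev1 + weights[1] * ev2
new_weights = [weights[0] * ev1 / z, weights[1] * ev2 / z]
reweighted = mix([post1, post2], new_weights)

print(mix_then_update[1], update_then_mix[1], reweighted[1])
```

In this toy case the two approaches disagree when the mixture weights are held fixed, and agree once the weights are themselves updated in proportion to each model's probability of the evidence. So at least here, whether the answers match depends on whether "aggregating" lets the weights move with the data.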

My first point is about the _graphical representation_ of the consensus graphical model (e.g. MAP graph), not a distribution over graphs (e.g. BMA). The latter is just a parameterization of an underlying non-parametric graph, which, in the example I gave, would be cyclic. Cyclicality may or may not be a problem, but a cyclical graph cannot be a consensus DAG.

I guess you have to define better what you mean by “unique Bayes Net”, and what graphical formalism you are restricting yourself to. At a minimum you need undirected edges for that representation.

Re variables: I would have thought the interest here would be on posterior edge marginals, not the variables themselves. Are you trying to learn structure, or parameters, or both?

Re collective learning: I am not sure the meaning of “aggregating” is the same in your points 1 and 2. Also, I thought you wanted to aggregate priors of experts, in which case points of view matter. My understanding is that this is the key difference between behavioral and mathematical aggregation. See Chapter 9 in “Uncertain Judgements” by O’Hagan et al. for a literature review.

PS Sorry to be nit picky but I find 90% of solving a problem is setting it up right. And 90% of setting it up right is knowing what we are talking about.

“The latter is just a parameterization of an underlying non-parametric graph, which, in the example I gave, would be cyclic. Cyclicality may or may not be a problem, but a cyclical graph cannot be a consensus DAG.”

Why would it be cyclic? Both (i) and (ii) belong to the same equivalence class of Bayesian networks, i.e. they can represent the same joint distributions. So we could always transform type (i) into type (ii) by changing the parameters, or the other way around. The choice of which one to use is arbitrary.
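A small sketch of that re-parameterization, with made-up numbers (the tables below are assumptions, chosen only for illustration): start from an A -> B factorization, compute the B -> A parameters by Bayes' rule, and check that both factorizations give the same joint distribution.

```python
# A -> B parameterization: P(A) and P(B | A). Numbers are hypothetical.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

values = (0, 1)

# Joint distribution implied by the A -> B factorization.
joint_ab = {(a, b): p_a[a] * p_b_given_a[a][b] for a in values for b in values}

# Re-express the same joint as B -> A: P(B) and P(A | B) via Bayes' rule.
p_b = {b: sum(joint_ab[(a, b)] for a in values) for b in values}
p_a_given_b = {b: {a: joint_ab[(a, b)] / p_b[b] for a in values} for b in values}

# Joint distribution implied by the B -> A factorization.
joint_ba = {(a, b): p_b[b] * p_a_given_b[b][a] for a in values for b in values}

# Largest absolute difference between the two joints (zero up to rounding).
max_gap = max(abs(joint_ab[k] - joint_ba[k]) for k in joint_ab)
print(max_gap)
```

This only shows equivalence as descriptions of a joint distribution. It says nothing about which direction, if any, is causal.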

So best way to quit smoking is to get rid of cancer…

I understand what you are saying about equivalence classes etc. but I think we are lumping various things together here:

1. The most likely causal structure (is this a “consensus graph”?);

2. A distribution over mutually exclusive causal structures (as per acyclicality, assuming we are working with DAGs and not more general formalisms). Note that I would not refer to this distribution as a graphical model per se, nor a “consensus graph”, but as a table of probabilities over graph structures. Again, semantics;

3. Or a parameterization of a single structure as a mixture model.

I think you are mixing up causal networks and Bayesian networks. Bayesian networks are just DAGs which are used to represent conditional (in)dependencies of the joint distribution of some set of random variables. Nothing else. So indeed, cancer -> smoking or smoking -> cancer. Doesn’t make a difference here.

Causal networks, on the other hand, are Bayesian networks with an explicit requirement that the relationships are causal.

Germo:

Unfortunately the terminology of Bayesian Networks, Probabilistic Networks (Cowell, Dawid, Lauritzen, Spiegelhalter), Belief networks, etc… is all over the place. That is why in my first comment I mentioned explicitly whether the consensus graph was required to be a DAG. Oh well, it was fun talking past each other ;-)

PS I often use DAG as synonymous with causal network, diagram, etc… which is not strictly correct but common in social science.

Fernando: you’re right. I somehow missed the part where they said that the aggregated model was a Bayes Net. My explanation for #1 was off the mark. Thanks for pointing it out.

Dear Fernando, “nit picky” can be very helpful for someone, like myself, who is trying to exactly and not vaguely answer some questions. However, the two graphs (A->B and B->A) are I-equivalent, which means that they specify the same exact set of conditional independence assertions (one can be expressed in terms of the other with the appropriate change in parameter values, see chapter 3, p. 76 on Bayes Nets in “Probabilistic Graphical Models” by Daphne Koller and Nir Friedman).

I think we are confusing here what we think the true model is, from what we can estimate from the data. These are two very different things.

From data alone I may not be able to give directionality to the edge connecting A and B (though some new structure learning algorithms claim to be able to do this by making some (minor?) distributional assumptions). But this does not make the model “A causes B” equivalent to “B causes A”. They are observationally equivalent, but not substantively equivalent.

Now, DAGs are a means of making statements about what we believe is true about the world, not what we can estimate from data. And if we stick with acyclicality, I can only express A causes B, or B causes A, or neither. That is, when drawing the DAG I need to have the arrow point in one direction or have no arrow at all.

(On the next page (p. 77) Koller and Friedman explain precisely that the notion of I-Equivalence is a problem for inferring a graph from an observed distribution. But my simple point is that a “consensus DAG”, as a graph drawn on a piece of paper according to some formalism, cannot be cyclical. Which is why I have trouble with the notion of a “unique consensus DAG”. Whether you can learn the true DAG from the data is another story altogether. I am still at the level of semantics and definitions; you seem to be already at the level of estimation and inference.)

Actually, the paper I’m trying to write on this is all theory and no real estimation. I’m interested in diversity of models and collective wisdom. I want to know in general, how models should be combined so that we, as a society can synthesize disparate theories and get as close as possible to understanding something about the true underlying systems. Also, I am interested in the Bayesian updating mechanism; what does it say about Bayesian inference if updating k models and then combining them leads to grossly different results than first combining models and then updating the resulting synthetic model? What is better and under what circumstances? How far do these two ways of getting results from multiple perspectives diverge, if they do diverge?

I’m no expert but I guess my first reaction is to be clear about “Collective wisdom” and “we as a society”. If “we” wants to do Bayesian inference “we” needs a prior. “We” will need to decide how to generate that prior if it is to be a “we” and not “I” prior. But how does “we” choose to aggregate a prior from its multiple “I”s? The literature on public choice is not very encouraging here.

Why not just say: I want to make the best use of all the (causal?) knowledge already out there, including experts, texts, and data, to learn about structure and parameters? Then consider what is the best way of going about this. I think this setup is more transparent, no? But maybe you have something different in mind.

So, the scenario I have in mind is a committee of independent experts who do NOT communicate with each other, but report to some decision maker, who then has to aggregate the experts’ opinions and make a decision (putting social, or public choice on hold, for the sake of not complicating things even further) …a similar scenario would be scientists writing research papers about causes and effects of global climate change and politicians, then, using these experts’ opinions to make decisions about policies. The point of this is to understand belief aggregation separately from strategic concerns etc. In other words, where social choice is all about the aggregation of preferences or utilities, what I’m trying to understand here is the aggregation of beliefs when preferences are the same, or when everyone has the same goals. Later on, I will likely concern myself with harder questions on how to aggregate beliefs when there are individual incentives to distort beliefs due to preferential differences, but I would like to first understand the problem of belief aggregation without these extra complications.

I would hesitate to describe what a politician decides on climate change policy as our “collective wisdom”. It is her wisdom even if based on all our inputs. After all, presumably she is the one that chose the aggregation mechanism. Different aggregations yield different wisdoms, and I may have different preferences over the aggregation. How did we collectively decide on that mechanism?

Yes, you can assume that we all have the exact same preference but by then perhaps it is more transparent to talk about how I, as a decision maker, aggregate information, and leave the “we” baggage behind. I think that is a separate problem, not a statistical one per se.

PS obviously if all opinions have converged to the single truth then the aggregation weights are irrelevant.

I agree that politicians don’t work the way that I described …I was talking about theory. Also, thank you for your patience with me with regards to the directionality of causation discussion; you clarified some things for me that I was indeed confused about!

So, ok, how should an ideal politician (defined as one who only bases her decisions on some aggregation of experts’ opinions and has no opinion or preference of her own) aggregate opinions if aggregation results in some form of mixture joint-distribution? Should she first mix and then update, or should she first update and then mix? What is the difference and what are the conditions under which one approach is better than the other? Note, to be more explicit this time (please excuse me for the prior confusion as to my exact questions!), this is a normative question as to how one should do it, if one is the politician and one’s only goal is to come as close to knowing the true data generation process as possible (I guess, figuring out causal directionality from evidence is out of reach and there better be full agreement on that?)!

What if all the models and collective wisdom are completely wrong yet still converge? Aren’t you just creating a wrong collective model? Better yet, where do collective ignorance, motivated reasoning, and ideology fit into your proposed model?

This is indeed possible, but then, if the real (true) system continuously spits out evidence (data), all models should not converge to something grossly wrong, if there is enough model diversity. This is where I think (what I’m trying to ultimately show) that diversity (measured as the square-root of the n-point Jensen-Shannon Divergence) will help: the more diverse the models are, the shorter lived collective ignorance should be …but I’m still far removed from that conclusion, I think. But ultimately, I think you hit the nail on the head; this question is what I’m after; I think that long lived collective ignorance is due to a lack of diversity in models (we can’t learn anything that isn’t in the convex hull of what we already believe and we are more likely to have “the truth” in this convex hull if our models are diverse).

Yes, I guess what I meant is a “consensus joint distribution” …I’m sorry for the confusion!

Johannes,

Two points. First, “closed and bounded” aren’t strong enough conditions; you need some property like convexity. Imagine if the space were a donut and the implied ‘consensus model’ was in the donut hole. I expect whatever property you need does hold.
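One way to see the convexity issue in a toy case: take two models that both say A and B are independent (the "no edge" graph) and mix them. The sketch below uses made-up marginals and hypothetical helper names (`product`, `dependence_gap`); it checks whether the 50/50 mixture still factorizes.

```python
def product(pa1, pb1):
    """Joint over binary (A, B) with A and B independent.

    pa1 = P(A=1), pb1 = P(B=1).
    """
    return {
        (a, b): (pa1 if a else 1 - pa1) * (pb1 if b else 1 - pb1)
        for a in (0, 1)
        for b in (0, 1)
    }


def dependence_gap(joint):
    """P(A=1, B=1) - P(A=1) * P(B=1); zero when A and B are independent."""
    pa1 = joint[(1, 0)] + joint[(1, 1)]
    pb1 = joint[(0, 1)] + joint[(1, 1)]
    return joint[(1, 1)] - pa1 * pb1


# Two hypothetical "no edge" models that disagree about the marginals.
m1 = product(0.9, 0.9)
m2 = product(0.1, 0.1)

# Equal-weight mixture of the two joint distributions.
mixture = {k: 0.5 * m1[k] + 0.5 * m2[k] for k in m1}

print(dependence_gap(m1), dependence_gap(m2), dependence_gap(mixture))
```

Each component has no dependence, but the mixture does, so the set of distributions that factorize over a fixed sparse graph is not convex. The mixture is still a valid joint distribution and can be written as a Bayes net over a denser graph (here, A -> B or B -> A), since a fully connected DAG can represent any joint.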

Second, I would have thought that Cosma Shalizi’s paper on Bayesian updating and possibly-misspecified models would have helpful approaches here.

That’s a really good point. It may even be related in some subtle way to Fernando’s argument about it being impossible to reconcile some groups of DAGs into a single DAG. Maybe you need this “convexity” property for that to be possible.

Thank you, I’m reading Cosma Shalizi’s paper now!

Consider a sports book at a casino. It is a series of models which create a single model of odds. That is, each bettor has a world view and all these world views generate odds. Those odds are the balancing of the book, at least in the ideal case, so the house is covered no matter what happens. A bet can move the book if it is large enough. New information comes to the book through bets, not through the information itself, meaning it is filtered through the various models. That is the ideal case, but we know that books expect certain activity so if Peyton Manning gets hurt on Thursday the book will likely anticipate the bettor changes. But that generally shouldn’t matter because the book should balance by the time of the event, which is when the game starts, the horses go to post, etc.

Using this as a mental model, I’d say the individual bettor models don’t reflect reality and the book reflects a balancing of those views which is not reality but is just that. This speaks to the importance of outcome, which gets at our conceptions of expected outcomes, null hypotheses, etc. The outcome of the game or race, etc. is like the result of a physical experiment where x happens or doesn’t. The energy levels show the peak at the right spots for that particle or not. And so on. I can’t see how one could decide which is superior, the pieces or the synthesis, without the outcome. Before that, we’re talking about how we evaluate our bets, which means we move the book or not depending on the strength of new information and the amount bet. Or as they used to say, late money coming in on the favorite, etc.

Well, in the betting or market case, it appears that the models of those who bet or buy (sell) the most have greater weights in the aggregate model. However, such an aggregation does not seem principled to me from a collective learning perspective: if individual models should be weighted differently, it seems to me that this should be based on the likelihood ratios of the models, given some data, and not on who has the most money or on who is the most aggressive bettor, which may not have anything to do with who has the most understanding. Also, the space of models should indeed be convex.

OK, Jonathan, I just re-read what you said and you are making a very interesting point that I completely missed before! This brings up the following question: After some evidence, or outcome, can we then judge which is better? …so would it make sense to take the posterior-ratio of the updated mixture and the mixture of the updated pieces?

I don’t think this will be a fruitful line of thought relative to (what I understand to be) Johannes Castner’s research question. An agent with a linear utility function will just plow all of its money into betting until the offered odds are equal to the agent’s odds. We need concave utility functions to get “risk-averse” betting; my impression is that some sort of “canonical” aggregation is being sought.

Well, Corey, in this case I’m not so interested in preferences; there is already a lot of theorizing and empirical work on that, but rather I’m interested in differences in causal beliefs when preferences are clear. Suppose that we know that all preferences are the same and trivial, but that there are differences in beliefs about some system. How can we fruitfully aggregate these beliefs, or is there one consistent way to do so? So, in other words, we know from Arrow etc. that there are problems with aggregating preferences when there are three or more choices, under unrestricted domain, with non-dictatorship, when Pareto efficiency is required and with independence of irrelevant alternatives. But what do we know about cases where all preferences are identical so that their aggregation is not a problem, but where beliefs differ about how things work, or how the outcomes that all equally desire are to be brought about?

“… there exists one unique Bayes Net …”? How so, at the very least there’s the I-equivalence class?

The description of the aggregation procedure is vague, but it seems unlikely that you aren’t losing information + degrees of freedom in the process. Why would you expect to obtain the same result? I’d be surprised if associativity _did_ hold.

Well, the I-equivalence class does not specify the parameter values, whereas all of the individual models specify exact parameter values and the unique Bayes Net, which I have called the “consensus graph” above, is a mixture of those models and as such it has parameter values associated with it, which are unique (of course, you could re-express the same graph in terms of a different structure and with the appropriate changes in parameters, but that is really the same joint distribution …so for complete accuracy I should have said that there is a unique joint distribution).

The point is that there are likely some specific circumstances (perhaps if all relations are linear and all models specify multi-variate Gaussian distributions?) where the two procedures do give the same answer and in most cases they probably do not. I also have not yet worked out the algorithm that finds the unique mixture model that has an equal distance to all k models, any hints for that? If the space is convex, closed and bounded this unique mixture-model is guaranteed to exist (right?); finding it is still another matter.
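For the simplest case, k = 2, a rough sketch of such a search is below. It is not a general algorithm: it handles only two discrete distributions, the example vectors are made up, and the function names (`kl`, `js_distance`, `equidistant_weight`) are hypothetical. It bisects on the mixture weight w until the mixture w*p1 + (1-w)*p2 is equally far, in square-root Jensen-Shannon distance, from p1 and p2. The sign of the gap differs at w = 0 and w = 1 whenever p1 and p2 differ, so a crossing exists; that there is only one crossing is assumed here rather than proved.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the two-point Jensen-Shannon divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def blend(p1, p2, w):
    """Mixture w * p1 + (1 - w) * p2."""
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]


def equidistant_weight(p1, p2, tol=1e-12):
    """Bisect for w such that the mixture is equally far from p1 and p2."""

    def gap(w):
        m = blend(p1, p2, w)
        return js_distance(m, p1) - js_distance(m, p2)

    lo, hi = 0.0, 1.0  # gap(0) > 0 and gap(1) < 0 when p1 != p2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid  # mixture still closer to p2: move weight toward p1
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical joint distributions, flattened to probability vectors.
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]

w = equidistant_weight(p1, p2)
consensus = blend(p1, p2, w)
print(w, js_distance(consensus, p1), js_distance(consensus, p2))
```

For k > 2 there are k - 1 free weights and k - 1 equal-distance constraints, and this sketch makes no claim about whether a solution exists or is unique in that case.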

I think the confusion here is what you are trying to model.

Are you trying to do _descriptive inference_ and just model the observed distribution of A and B, aggregating P_i(A,B) over i in some way? Here you are fine modeling up to an equivalence class.

Or are you trying to do _causal inference_ to infer the unobserved causal graph G_i generating the distribution P_i(A|do(B)), say? Here you need to go beyond the equivalence class.

The problem with the latter is that it is not so simple to aggregate G_i over i, as illustrated by the convexity issue referred to above. Put plainly, an animal can be a dog or a cat but not a weighted average of both.
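The descriptive versus causal distinction can be made concrete in the two-variable case. The sketch below uses made-up numbers and assumes a two-node graph with no hidden confounders. Both graphs reproduce the same observed joint, so they agree on P(A | B), but they disagree on what happens to A when B is set by intervention.

```python
# One observed joint over binary (A, B), built from hypothetical numbers.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
joint = {(a, b): p_a[a] * p_b_given_a[a][b] for a in (0, 1) for b in (0, 1)}

# Observational quantity: P(A=1 | B=1). Identical under either graph,
# because both graphs are fitted to the same joint distribution.
p_b1 = joint[(0, 1)] + joint[(1, 1)]
observational = joint[(1, 1)] / p_b1

# If the true graph is A -> B, forcing B to 1 does not affect its cause:
# P(A=1 | do(B=1)) = P(A=1).
do_if_a_causes_b = p_a[1]

# If the true graph is B -> A (two nodes, no confounders assumed),
# intervening on B acts like conditioning on it:
# P(A=1 | do(B=1)) = P(A=1 | B=1).
do_if_b_causes_a = observational

print(observational, do_if_a_causes_b, do_if_b_causes_a)
```

So experts who agree exactly on the joint distribution can still disagree about the effect of an intervention, which is the part a mixture of joint distributions does not capture.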

Well, what I’m interested in is to gain understanding of how a collective, wherein all the individuals have the same goals (for now) but potentially very different causal beliefs, can uncover (or come closer to) some true data generating process (causal relations and degrees of causation).