Covariance computations

Marc Tanguay writes,

I am a senior economist at Statistics Canada and have been involved for six years in building information systems for what we call here the System of National Accounts. This consists of an exhaustive collection of economic tables in many dimensions. The statistics concern employment, wages, hours worked, gross domestic product (GDP), sales, imports, inventories, industrial uses, saving, household expenditures, etc. These tables have many dimensions, such as industry, sector, or geographical identity. Building them involves a huge amount of information provided by dozens of sources; surveys, census information, and administrative data are the main providers.

The particularity of the System of National Accounts is that it makes use of an accounting framework. This means that any statistic describing, for example, the use of a product (demand) has a counterpart on the origin (supply) side of that product… any use of income has a counterpart on the side of the origin of that income (for example, expenditures and savings must balance with the various sources of income, such as wages, dividends, etc.).

Those global balancing constraints are used in balancing the National Accounts tables as a way to fine-tune and finalise the statistical estimates….

In other words, we have very strong priors regarding how the estimates should behave, and especially how they should behave in relation to each other… This is a consequence of the accounting framework. And indeed, most people working here develop, after a while, a good feel for what a realistic universe looks like for the statistics that we update periodically. We have strong priors, for example, about the range of values for hours worked per job, wage rates per hour worked, growth of GDP, etc.

My point of view is that our work would probably be a lot more efficient if we formalized all our priors in a complete Bayesian framework; in other words, if we changed our paradigm, which until now has remained very classical. Most of the time we double- and triple-check the data when the estimates appear to show an unreasonable change, but we do not formalize what a reasonable range is.

Now, before trying to convince my colleagues to take the road of the Bayesian paradigm, I started getting myself more familiar with the Bayesian framework since, unfortunately, as for most economists, it was only a small part of my academic background.

I am currently on my second reading of your book ‘Bayesian Data Analysis’, and I really appreciate it for its great clarity and for how it eases the introduction to the Bayesian paradigm for those, like me, who are unfamiliar with it. I have started replicating some of the examples using SAS.

I want to develop a first application of these techniques to our data: a Gibbs sampler for estimating regional jobs benchmarks. This involves combining 4 different sources of information: the census, labor force surveys, administrative data on incomes, and an establishment survey. Each of these sources has strengths and weaknesses. Some produce efficient but biased estimates, while others are consistent but less efficient…

Some are available only at low frequency, while others do not cover the whole universe.

The model I started to build is made up of an equation system indexed by 2 dimensions, industry (i) and region (g). There are 10 industry groups, 14 regions, and 2 equations per combination of region and industry (i,g): one that describes employment, another that describes wage rates.

We denote by ye(i,g) and yw(i,g) the unobserved variables for employment and wage rates.

Each equation takes the form:

y(i,g) = x(i,g) b(i,g) + e(i,g)

Here y(i,g) is a target, unobserved matrix of dimension t × 2 (t is the time-series index; the two columns are employment and wage rates), x(i,g) is a matrix of k observed covariates (each of length t) provided by the different sources of information, b(i,g) is a k-vector of parameters to be estimated, and e(i,g) is an error term that we presume to be normally distributed with mean zero, but contaminated by autocorrelation and unequal variances.
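As a concrete illustration of one cell of this system, here is a minimal simulation of a single (i,g) equation with serially correlated errors. All sizes, coefficients, and the AR(1) error structure are invented for the sketch; they are not Statistics Canada's actual data or model.

```python
import numpy as np

# Illustrative sizes: t time periods, k covariates for one (i, g) cell.
rng = np.random.default_rng(0)
t, k = 40, 3
x = rng.normal(size=(t, k))          # observed covariates from the data sources
b = np.array([1.0, -0.5, 2.0])       # k-vector of coefficients b(i, g)

# AR(1) errors: mean zero, normal, serially correlated, as described above.
rho, sigma = 0.6, 0.3
e = np.zeros(t)
e[0] = rng.normal(scale=sigma / np.sqrt(1 - rho**2))  # stationary start
for s in range(1, t):
    e[s] = rho * e[s - 1] + rng.normal(scale=sigma)

# One column of the unobserved target y(i, g), e.g. employment.
y = x @ b + e
```

The same construction, with its own b(i,g) and error process, would apply to the wage-rate column.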

We have variance estimates of E(x(i,g)) (by hot deck, bootstrap, and other methods provided by the methodology team here at Statcan), but not for all of them.

For each combination (i,g), there is a specific constraint provided by administrative data on total wages (W(i,g)). So:

ye(i,g) * yw(i,g) = W(i,g).

On the other hand, we have a regional benchmark for total jobs (J(g)).
Another constraint is therefore that the sum over the 10 industries in each region reaches the target:

J(g)=Sum(i=1 to 10) ye(i,g)

While building the Gibbs routines to address this problem, I had to solve the problem of sampling the covariance matrix, for which I assumed contemporaneous correlation of the errors across industries within a given region but no correlation across geographical identities. I also assume serial correlation between periods within a specific industry of a given region. So we do not need to build a full t × 10 × 14 matrix, but only certain blocks of it where correlation occurs, which of course reduces the size of the problem.
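The block structure described here can be sketched as follows (ignoring the time dimension for simplicity, with random positive-definite blocks standing in for the real within-region covariances): with no cross-region correlation, the full 140 × 140 contemporaneous covariance is block-diagonal, and only the 14 blocks of size 10 × 10 ever need to be stored or factored.

```python
import numpy as np
from scipy.linalg import block_diag

n_ind, n_reg = 10, 14   # 10 industries, 14 regions

# One contemporaneous covariance block per region: industries correlated
# within a region, independent across regions. Blocks here are random
# SPD matrices purely for illustration.
rng = np.random.default_rng(1)
blocks = []
for g in range(n_reg):
    a = rng.normal(size=(n_ind, n_ind))
    blocks.append(a @ a.T + n_ind * np.eye(n_ind))  # SPD by construction

# Full matrix, mostly zeros; in practice one would work block by block.
sigma = block_diag(*blocks)

# Working per block stores 14 * 10^2 = 1400 entries
# instead of 140^2 = 19600 for the dense matrix.
```

In the sampler itself, each region's block can then be updated independently, which is what makes the reduction in problem size possible.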

I was glad to find an easy way to sample the covariance matrix, as you describe in the separation strategy on page 483.

However, I have a few questions regarding this approach.

1. Having said that the determinant of R can be expressed as a quadratic function of any of its elements, |R(r)| = ar^2 + br + c, will this function always be concave (which means a < 0)?

2. Having redefined the bounds of r with this constraint (that ar^2 + br + c > 0), will the conditional distribution of r be uniform within those bounds?
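The first question can be checked numerically (a sketch, with an arbitrary 4 × 4 correlation matrix invented for the example): holding the other entries fixed, |R| is exactly quadratic in any single off-diagonal element r, and the r^2 coefficient equals minus the principal minor of R obtained by deleting the two rows and columns that r links. That minor is positive whenever the remaining submatrix is positive definite, so a < 0 and the quadratic is concave.

```python
import numpy as np

# Arbitrary correlation matrix (illustrative values only).
R = np.array([
    [1.0, 0.0, 0.3, 0.1],
    [0.0, 1.0, 0.2, 0.4],
    [0.3, 0.2, 1.0, 0.2],
    [0.1, 0.4, 0.2, 1.0],
])

# Vary the (1,2) element over a grid and record the determinant.
rs = np.linspace(-0.9, 0.9, 19)
dets = []
for r in rs:
    R[0, 1] = R[1, 0] = r
    dets.append(np.linalg.det(R))

# Fit a quadratic |R(r)| = a r^2 + b r + c; the fit is exact because the
# determinant really is quadratic in a single symmetric pair of entries.
a, b, c = np.polyfit(rs, dets, 2)

# The leading coefficient equals minus the principal minor on the
# remaining rows/columns, so a < 0 whenever that submatrix is PD.
minor = np.linalg.det(R[2:, 2:])
```

So for the concavity question: yes, as long as the rest of R stays a valid (positive-definite) correlation matrix during the Gibbs scan, the quadratic is concave and the positivity constraint gives an interval of admissible r values.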

3. My third question is about the way I should introduce the constraint into the sampling sequence. For example, suppose I sample the first 9 industries. Then, conditional on those 9 draws, the 10th industry's values of ye(i,g) and b(i,g) will be determined by the constraint J(g) = Sum(i=1 to 10) ye(i,g).

But this may cause a very strong adjustment in one single cell while the previous 9 have not been affected yet.

If I now rerun a sampling sequence while switching which industry comes last (the one that absorbs the constraint), it may cause the same problem in that new last cell.

In fact, if I impose the constraint only on the last cell while sampling the other ones, I may be systematically out of range on the last cell while not influencing the sampling of the others. I think I need another approach in which the range of sampling is updated more frequently, taking the target constraint into account between each sampling step from one cell (say industry 1) to another (say industry 2), rather than waiting for the whole universe to be covered by the sampling sequence before introducing the constraint.
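One standard alternative to absorbing the constraint in a single cell (a sketch under invented numbers, not the letter's actual model): if the conditional for the whole 10-vector ye(·,g) within a region is multivariate normal, one can draw the vector jointly and then condition on the linear constraint Sum_i ye(i,g) = J(g) using the Gaussian conditioning formula (sometimes called Matheron's rule). The adjustment is then spread over all industries in proportion to their covariances, instead of landing entirely on the last cell.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10            # industries within one region g
J = 500.0         # regional jobs benchmark J(g); value is illustrative

mu = rng.uniform(30, 70, size=n)   # unconstrained conditional mean (invented)
a = rng.normal(size=(n, n))
S = a @ a.T + n * np.eye(n)        # SPD covariance (invented)

A = np.ones((1, n))                # linear constraint: A @ ye = J

# Draw from the unconstrained N(mu, S), then project onto the constraint
# hyperplane. The projected vector is an exact draw from the conditional
# distribution of ye given A @ ye = J.
ye = rng.multivariate_normal(mu, S)
gain = S @ A.T @ np.linalg.inv(A @ S @ A.T)      # n x 1 "Kriging" gain
ye_c = ye + (gain * (J - A @ ye)).ravel()

# ye_c now sums to J exactly (up to floating point), with the correction
# shared across all 10 industries.
```

This is one way to make the constraint act on every cell at each sweep rather than on whichever industry happens to come last in the scan order.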

Here I lack experience, and I would greatly appreciate advice or a reference to a paper where a similar problem has been addressed (respecting a contemporaneous constraint within an equation system using Markov chain techniques).

My reply: It should be possible to use prior distributions that incorporate knowledge of constraints. I don’t know anything about this particular problem, but in our radon analysis we did work with a large amount of noisy, biased data and a relatively small number of more precise unbiased measurements. See here for details (also the example is discussed in the two books).

Regarding the model and computations on p. 483, my new favorite multivariate model is the scaled-inverse-Wishart distribution, which multiplies an inverse Wishart with arbitrary scales that are themselves modeled. See our new book for more information on this model, also the paper by O’Malley and Zaslavsky.

One Comment

  1. Keith O'Rourke says:

    Not what the post was about but

    "Most of the time we double and triple check the data when the estimations appear to be out of a reasonable … "

    This checking when "we don't like the result" and not checking when "we do like the result" converts random data errors into systematic errors in one's favour.

    Checking data (and models) ideally should NOT be driven by what we thought/hoped the results would be…