Fake-data simulation as a research tool

I received the following email:

I was hoping if you could take a moment to counsel me on a problem that I’m having trying to calculate correct confidence intervals (I’m actually using a bootstrap method to simulate 95%CIs). . . . [What follows is a one-page description of where the data came from and the method that was used.]

My reply:

Without following all the details, let me make a quick suggestion which is that you try simulating your entire procedure on a fake dataset in which you know the “true” answer. You can then run your procedure and see if it works there. This won’t prove anything but it will be a way of catching big problems, and it should also be helpful as a convincer to others.

If you want to carry this idea further, try to “break” your method by coming up with fake data that causes your procedure to give bad answers. This sort of simulation-and-exploration can be the first step in a deeper understanding of your method.
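As a rough illustration of both suggestions, here is a minimal sketch in Python. It is not the correspondent's procedure, which is not described here. It assumes the statistic of interest is a sample mean and that the interval is a percentile bootstrap. The function names (`bootstrap_ci`, `coverage`) and all settings (sample sizes, number of replications, the lognormal "breaking" case) are hypothetical choices made for the example.

```python
import numpy as np

def bootstrap_ci(data, n_boot=500, level=0.95, rng=None):
    """Percentile bootstrap interval for the mean (an assumed stand-in statistic)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    # Resample rows with replacement: one bootstrap sample per row.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = data[idx].mean(axis=1)
    alpha = 1.0 - level
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def coverage(sampler, truth, n, n_sims=400, seed=0):
    """Fraction of fake datasets whose interval contains the known truth."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        fake = sampler(rng, n)              # fake data with a known answer
        lo, hi = bootstrap_ci(fake, rng=rng)
        hits += int(lo <= truth <= hi)
    return hits / n_sims

# Well-behaved case: normal data, moderate sample size, true mean 2.0.
good = coverage(lambda rng, n: rng.normal(2.0, 1.0, size=n), truth=2.0, n=50)

# Attempt to "break" the method: small, heavily skewed samples.
# The mean of a lognormal(0, s) distribution is exp(s**2 / 2).
s = 1.5
bad = coverage(lambda rng, n: rng.lognormal(0.0, s, size=n),
               truth=np.exp(s**2 / 2), n=10)

print(f"coverage, normal data n=50:    {good:.2f}")
print(f"coverage, lognormal data n=10: {bad:.2f}")
```

If the first number is near the nominal 0.95 and the second is well below it, the simulation has done its job: it shows that the procedure works in one setting and gives a concrete case where it does not.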

And then I got another, unrelated email from somebody else:

I am working on a mixed treatment comparison of treatments for non-small cell lung cancer. I am doing the analysis in two parts in order to estimate treatment effects (i.e. log hazard ratios) and absolute effects (by projecting the log hazard ratios onto a baseline treatment scale parameter; the baseline treatment times to event are assumed to arise from a Weibull distribution. . . . [What follows is a one-page description of the model, which was somewhat complicated by constraints on some of the variance parameters] . . . I can get my analysis to run with constraints imposed on the treatment specific prior distributions for PFS and OS, and on the population log hazard ratios for PFS and OS. However, my problem is that the constraint does not appear to be doing anything and the results are similar to what I obtain without imposing the constraint. This is not what I expect . . .

My reply:

Sometimes the data are strong enough that essentially no information is supplied by external constraints. You can, to some extent, check how important this is for your problem by simulating some fake data from a setting similar to yours and then seeing whether your method comes close to reproducing the known truth. You can look at point estimates and also the coverage of posterior intervals.
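To make this concrete, here is a minimal sketch in Python using a deliberately simple stand-in, not the Weibull mixed-treatment model from the email. It assumes a normal mean with known standard deviation and a conjugate normal prior, and it uses a tight prior centered away from the truth to play the role of the external constraint. The function names (`posterior`, `check`) and all numerical settings are hypothetical.

```python
import numpy as np

def posterior(y, sigma, prior_mean, prior_sd):
    """Conjugate normal posterior for a mean with known data sd `sigma`."""
    prec = 1.0 / prior_sd**2 + len(y) / sigma**2
    mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / prec
    return mean, np.sqrt(1.0 / prec)

def check(n, prior_sd, true_theta=0.5, sigma=1.0, n_sims=2000, seed=1):
    """Simulate fake data from a known truth; return the average posterior
    mean and the coverage of the 95% posterior interval."""
    rng = np.random.default_rng(seed)
    hits, estimates = 0, []
    for _ in range(n_sims):
        y = rng.normal(true_theta, sigma, size=n)
        m, s = posterior(y, sigma, prior_mean=0.0, prior_sd=prior_sd)
        hits += int(m - 1.96 * s <= true_theta <= m + 1.96 * s)
        estimates.append(m)
    return float(np.mean(estimates)), hits / n_sims

# Tight prior (stand-in for a binding constraint) vs. nearly flat prior,
# at a small and a large sample size. The same seed gives the same fake data.
tight_small, cov_tight_small = check(n=5, prior_sd=0.5)
weak_small, cov_weak_small = check(n=5, prior_sd=10.0)
tight_large, cov_tight_large = check(n=500, prior_sd=0.5)
weak_large, cov_weak_large = check(n=500, prior_sd=10.0)

print(f"n=5:   tight {tight_small:.3f} (cov {cov_tight_small:.2f}), "
      f"weak {weak_small:.3f} (cov {cov_weak_small:.2f})")
print(f"n=500: tight {tight_large:.3f} (cov {cov_tight_large:.2f}), "
      f"weak {weak_large:.3f} (cov {cov_weak_large:.2f})")
```

In this toy setting the two priors give visibly different estimates at n=5 and nearly identical ones at n=500. That is the pattern to look for in a simulation of the real model: if the constrained and unconstrained fits agree on fake data of realistic size, the constraint is probably being swamped by the data rather than being ignored because of a coding error.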

One Comment

  1. maarten Buis says:

    I agree that such fake-data simulations can be extremely useful for getting a feel for how a technique works. I wrote a tutorial on how to do such simulations in Stata for last year's North American Stata Users' Group meeting. It can be downloaded from . As it is a tutorial on how to use Stata, it requires Stata to view it; there are instructions in the readme.txt in the .zip file.