
Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality

Ariel Rokem pointed me to this Python program by Bill Howe, Julia Stoyanovich, Haoyue Ping, Bernease Herman, and Matt Gee that will take your data matrix and produce a new data matrix that has the same size, shape, and general statistical properties but with none of the same actual numbers.

The use case is when you want to give your data to someone to play around with some analyses, but the data themselves are proprietary or confidential.

This is different from a scenario such as the Census, where they create a synthetic dataset that adds a little bit of noise or takes out some observations to preserve confidentiality but is otherwise intended to give the right answers. Here, there’s no aim to use the fake data to perform real applied analyses; it’s all about creating something similar that people can work with, for example to prototype their data analysis plans.

Howe et al. call their program DataSynthesizer but if it were up to me it would just be called Garbler.

Here’s the paper describing what the program actually does. The statistical procedures used in the garbling are nothing fancy, but that’s fine: just as well to keep it simple, given the simplicity of the goals. The title of their paper is “Synthetic Data for Social Good,” but I don’t really understand that last part: it seems to me that Garbler, like other statistical tools, could be used for good or bad.
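To give a sense of what "same size, shape, and general statistical properties but none of the same actual numbers" can look like in the simplest possible case, here is a toy sketch. It is not the DataSynthesizer code or its API, and it is not the procedure from the paper: the function name `garble` and the little example table are made up for illustration, and the assumption that each column can be resampled independently (numeric columns from a normal distribution with matching mean and standard deviation, other columns from their observed frequencies) is mine.

```python
import random
import statistics


def garble(rows, seed=0):
    """Toy garbler (hypothetical, not DataSynthesizer's method).

    Returns a table with the same number of rows and columns as `rows`.
    Each column is resampled independently, so relationships between
    columns are NOT preserved.
    """
    rng = random.Random(seed)
    n_rows = len(rows)
    columns = list(zip(*rows))  # transpose: list of rows -> tuple per column
    fake_columns = []
    for col in columns:
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in col
        )
        if is_numeric:
            # Draw new values from a normal with the column's mean and sd.
            # Note: integer columns come back as floats in this sketch.
            mu = statistics.mean(col)
            sigma = statistics.stdev(col) if n_rows > 1 else 0.0
            fake = [rng.gauss(mu, sigma) for _ in range(n_rows)]
        else:
            # Draw labels in proportion to how often they were observed.
            fake = rng.choices(col, k=n_rows)
        fake_columns.append(fake)
    return [list(r) for r in zip(*fake_columns)]  # transpose back to rows


# Made-up example data: age, income, and a yes/no flag.
real = [
    [34, 51000.0, "yes"],
    [29, 48000.0, "no"],
    [45, 73000.0, "yes"],
    [52, 66000.0, "no"],
    [38, 59000.0, "yes"],
]
fake = garble(real)
for row in fake:
    print(row)
```

The output has the same shape as the input and roughly similar column summaries, which is enough for someone to check that their analysis script runs. Because the columns are resampled independently, this sketch throws away any relationships between them, and it makes no confidentiality guarantee of any kind.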

6 Comments

  1. Wouter Steenbeek says:

    Sounds very similar to the synthpop R package… https://cran.r-project.org/web/packages/synthpop/index.htm

  2. Sean P Mackinnon says:

    Very cool. This has great applications for saving time when teaching statistics. For example, reusing assignments that analyze data, but garbling the dataset each semester. I know I can just simulate data, but I find they often turn out too idealized when compared to real data.

  3. Ryan says:

    I work mostly with medical data, and this should be quite useful when turning encountered errors into reproducible examples that I can post online, and turning real applications into examples/vignettes.

  4. Gabi says:

    Wakefield would be a great name for this kind of thing, but it’s taken. https://github.com/trinker/wakefield.
