Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality

Posted on May 13, 2018 9:32 AM by Andrew

Ariel Rokem pointed me to this Python program by Bill Howe, Julia Stoyanovich, Haoyue Ping, Bernease Herman, and Matt Gee that will take your data matrix and produce a new data matrix that has the same size, shape, and general statistical properties but with none of the same actual numbers.

The use case is when you want to give your data to someone to play around with some analyses, but the data themselves are proprietary or confidential.

This is different from a scenario such as the Census, where a synthetic dataset is created by adding a little bit of noise or taking out some observations to preserve confidentiality, but the result is otherwise intended to give the right answers. Here, there’s no aim to use the fake data to perform real applied analyses; it’s all about creating something similar that people can work with, for example to prototype their data analysis plans.
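To make the idea concrete, here is a minimal sketch of a toy garbler in Python. It is not the DataSynthesizer algorithm, and the function name `garble` and the example table are made up for illustration. It assumes the simplest possible approach: resample each column independently, drawing numeric columns from a normal distribution with the column's mean and standard deviation, and drawing other columns from the observed category frequencies. That keeps the size, shape, and rough marginal distributions, but it throws away any relationships between columns.

```python
import random
import statistics


def garble(rows, seed=0):
    """Return a fake table with the same shape as `rows` (a list of rows).

    Hypothetical toy garbler: every column is resampled independently,
    so cross-column relationships in the real data are not preserved.
    """
    rng = random.Random(seed)
    n = len(rows)
    columns = list(zip(*rows))  # transpose rows into columns
    fake_columns = []
    for col in columns:
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in col
        )
        if is_numeric:
            # Numeric column: draw from a normal with matching mean and sd.
            mu = statistics.mean(col)
            sigma = statistics.stdev(col) if n > 1 else 0.0
            fake_columns.append([rng.gauss(mu, sigma) for _ in range(n)])
        else:
            # Other columns: sample with replacement from the observed values,
            # which reproduces the category frequencies on average.
            fake_columns.append(rng.choices(col, k=n))
    return [list(r) for r in zip(*fake_columns)]  # transpose back to rows


# Made-up example data: age, salary, occupation.
real = [
    [34, 51000.0, "nurse"],
    [29, 48000.0, "clerk"],
    [45, 72000.0, "nurse"],
    [52, 66000.0, "teacher"],
]
fake = garble(real)
for row in fake:
    print(row)
```

Note that in this sketch the category labels themselves are reused, so a column whose labels are confidential would need extra handling, and nothing here provides a formal privacy guarantee.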

Howe et al. call their program DataSynthesizer but if it were up to me it would just be called Garbler.

Here’s the paper describing what the program actually does. The statistical procedures used in the garbling are nothing fancy, but that’s fine: just as well to keep it simple, given the simplicity of the goals. The title of their paper is “Synthetic Data for Social Good,” but I don’t really understand that last part: it seems to me that Garbler, like other statistical tools, could be used for good or bad.

6 thoughts on “Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality”

Wouter Steenbeek on May 13, 2018 at 1:16 pm said:

Sounds very similar to the synthpop R package… https://cran.r-project.org/web/packages/synthpop/index.htm

- Daniel Lakeland on May 13, 2018 at 2:31 pm said:
  
  working link:
  
  https://cran.r-project.org/web/packages/synthpop/
  
  From their link: “The key objective of generating synthetic data is to replace sensitive original values with synthetic ones causing minimal distortion of the statistical information contained in the data set.”
  
  Sounds like synthpop maybe makes more effort to “get the right answer” compared to the program Andrew is referring to.
  
  Thanks for this link, I definitely could imagine using it.
  
Sean P Mackinnon on May 13, 2018 at 2:10 pm said:

Very cool. This has great applications for saving time when teaching statistics. For example, reusing assignments that analyze data, but garbling the dataset each semester. I know I can just simulate data, but I find they often turn out too idealized when compared to real data.

- Martha (Smith) on May 13, 2018 at 6:23 pm said:
  
  Good point.
  
Ryan on May 14, 2018 at 8:15 am said:

I work mostly with medical data, and this should be quite useful when turning encountered errors into reproducible examples that I can post online, and turning real applications into examples/vignettes.

Gabi on May 14, 2018 at 5:21 pm said:

Wakefield would be a great name for this kind of thing, but it’s taken. https://github.com/trinker/wakefield.


Statistical Modeling, Causal Inference, and Social Science
