
Get the Data

At GetTheData, you can ask and answer data-related questions. Here’s a preview: [screenshot: getthedata.png]

I’m not sure a Q&A site is the best way to do this.

My pipe dream is to create a taxonomy of variables and instances, and to collect spreadsheets annotated this way. Imagine doing a search of the type “give me datasets where an instance is a person and the variables are age, gender and weight”, and out would come datasets, each one tagged with the descriptions of the variables that were held constant for the whole dataset (person_type=student, location=Columbia, time_of_study=1/1/2009, study_type=longitudinal). It would even be possible to automatically convert one variable into another if necessary (like age = time_of_measurement - time_of_birth). Maybe the dream of the Semantic Web will actually be implemented for relatively structured statistical data rather than for much fuzzier “knowledge”; just consider the difficulties of developing a universal Freebase. Wolfram|Alpha is perhaps currently the closest effort to this idea (consider comparing banana consumption between different countries), but I’m not sure how I can upload my own data or do more complicated data queries. Also, for some simple variables (like weight), the results are not very useful.
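To make the idea a little more concrete, here is a minimal Python sketch of what such an annotated catalog and search might look like. Everything in it is an assumption made for illustration: the `DatasetAnnotation` class, the `DERIVATIONS` table, the `search` function and the dataset names are hypothetical, and only the variable names and the constants in the first entry are taken from the paragraph above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetAnnotation:
    """Hypothetical annotation attached to one spreadsheet."""
    name: str
    instance: str        # what one row represents, e.g. "person"
    variables: set       # variables that vary from row to row
    constants: dict = field(default_factory=dict)  # held constant for the whole dataset


def age_in_years(row):
    """Derive age = time_of_measurement - time_of_birth, in whole years."""
    born, measured = row["time_of_birth"], row["time_of_measurement"]
    before_birthday = (measured.month, measured.day) < (born.month, born.day)
    return measured.year - born.year - before_birthday


# Hypothetical conversion rules: target variable -> (source variables it needs, function).
DERIVATIONS = {
    "age": ({"time_of_measurement", "time_of_birth"}, age_in_years),
}


def available_variables(ds):
    """Variables the dataset holds directly, plus those derivable from them."""
    derived = {target for target, (needs, _) in DERIVATIONS.items()
               if needs <= ds.variables}
    return ds.variables | derived


def search(catalog, instance, variables):
    """Return datasets whose rows are `instance` and that can supply all `variables`."""
    wanted = set(variables)
    return [ds for ds in catalog
            if ds.instance == instance and wanted <= available_variables(ds)]


# Made-up catalog entries, for illustration only.
catalog = [
    DatasetAnnotation("student_health", "person", {"age", "gender", "weight"},
                      {"person_type": "student", "location": "Columbia",
                       "time_of_study": "1/1/2009", "study_type": "longitudinal"}),
    DatasetAnnotation("clinic_visits", "person",
                      {"time_of_birth", "time_of_measurement", "gender", "weight"}),
    DatasetAnnotation("fruit_by_country", "country", {"banana_consumption"}),
]

matches = search(catalog, "person", ["age", "gender", "weight"])
print([ds.name for ds in matches])
```

The second dataset matches even though it has no age column, because age can be derived from the two time variables it does have. The hard part, of course, is getting everyone to agree on the variable names in the first place.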

I’ve talked about data tools before, as well as about Q&A sites.

6 Comments

  1. statc says:

    speaking of getting data, I was looking at the list of survivors from Titanic. http://en.wikipedia.org/wiki/List_of_Titanic_pass
    The people who survived (mostly women and children) lived very long lives (most of them into their 90s). I have contemplated several questions as to why they lived so long after such a horrible thing happened. Could we point out any statistical significance?

  2. Basil says:

    I like to read and answer questions/responses at these Q&A sites. I feel like it keeps my blend of knowledge growing.

  3. Amanda says:

    These sites are good, but I agree that it would be much more awesome to create a database of datasets which were searchable in the manner you mentioned. It could use similar search algorithms to Google in terms of relevance, but extract the specific components of the data that are relevant to the search and compile a dataset for you. I have no doubt this is something that will become available in the near (ish) future.

  4. Jeremy Miles says:

    @statc: The people who survived tended to be first class passengers (IIRC), and a first class ticket on the Titanic was extraordinarily expensive. One factor may therefore be that rich people live longer.

  5. jake hofman says:

    @aleks, @amanda: i believe infochimps is hoping to solve just this problem. check them out.

  6. Bob Carpenter says:

    This is a really hard problem because of the vagaries of the ways we encode data.

    The application Aleks wants would need to be based on either agreed-upon factors and levels (e.g., a shared "ontology" of predictors and outcomes) or a general system for database linkage.

    Linkage is really three problems. The first is just sorting out how to map the fields together. In the simplest case, we may need to link a variable named "sex" with one named "male/female".

    Second, when we've linked the fields, we need to figure out that the categorical value "male" in "male/female" corresponds to the integer 0 in "sex". This is tricky when one survey has "income" defined as five ordinal levels of pre-tax income and a second has seven ordinal levels of take-home pay. You also get issues like the year of a movie corresponding to (a) completion date, (b) first release date anywhere in the world, (c) mainstream release in the U.S., etc.

    Third, we often need to link the items. For instance, we might want to link the categorical value "Star Wars" in one database to "Star Wars IV: A New Hope" in another database. A big problem arises when the different databases use a different notion of item. For instance, one database might lump all the various aspect ratios, director's cuts and TV edits, and dubs as a single movie, where another separates them as different items. Similarly, one survey may measure household income and another individual income.

    Stanford's InfoLab group is taking a stab at this kind of general linkage in their SERF project.
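
The three linkage steps described in the last comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how SERF or any real linkage system works: the source names `survey_a` and `survey_b`, the `FIELD_MAP` and `VALUE_MAP` tables and both functions are hypothetical, the code 1 for "female" is assumed (the comment only gives 0 for "male"), and the item matching uses plain string similarity from the standard library's `difflib` as a stand-in for a real matcher.

```python
from difflib import SequenceMatcher

# Step 1, field mapping: each source's column name -> a shared variable name.
FIELD_MAP = {
    "survey_a": {"sex": "sex"},
    "survey_b": {"male/female": "sex"},
}

# Step 2, value mapping: each source's codes -> shared levels.
# 0 = "male" follows the comment's example; 1 = "female" is an assumption.
VALUE_MAP = {
    ("survey_a", "sex"): {0: "male", 1: "female"},
    ("survey_b", "sex"): {"male": "male", "female": "female"},
}


def harmonize(source, record):
    """Rename fields and recode values of one record into the shared scheme."""
    out = {}
    for raw_name, raw_value in record.items():
        name = FIELD_MAP[source].get(raw_name)
        if name is None:
            continue  # no agreed mapping for this field, so drop it
        recode = VALUE_MAP.get((source, name), {})
        out[name] = recode.get(raw_value, raw_value)
    return out


def normalize(title):
    """Lowercase, turn colons into spaces, and collapse whitespace."""
    return " ".join(title.lower().replace(":", " ").split())


def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_items(left, right, threshold=0.5):
    """Step 3, item linkage: pair each left item with its most similar right item."""
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            links.append((a, best))
    return links


print(harmonize("survey_a", {"sex": 0}))
print(harmonize("survey_b", {"male/female": "male"}))
print(link_items(["Star Wars"],
                 ["Star Wars IV: A New Hope", "Star Trek: The Motion Picture"]))
```

Note what the sketch does not handle: the five-level versus seven-level income scales have no clean value map, and string similarity says nothing about whether two databases even share a notion of item (one movie versus each cut and dub, household versus individual). Those are exactly the cases the comment singles out as hard.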