Tools worth knowing about:
Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.
A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help. (Admittedly, 49760×951 is bigger than I’d really like to handle in the browser with JavaScript, but on a subset it should work.) [I might write this example up later.]
Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use: leading or trailing spaces, mixed decimal indicators, different units or transformations used in the same column, mixed lettercase leading to false duplicates, and that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)!
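To make a few of those problems concrete, here is a rough sketch in plain Python. It is not Refine, and it is not meant to show how Refine works internally. It only shows how stray spaces and mixed lettercase hide duplicates, and how a decimal comma can be read as a number. The function names and the sample column are made up for illustration, and `parse_decimal` assumes there are no thousands separators in the data.

```python
from collections import Counter


def normalize_text(value: str) -> str:
    """Trim surrounding whitespace, collapse internal runs of whitespace, fold case."""
    return " ".join(value.split()).casefold()


def parse_decimal(value: str) -> float:
    """Parse a number written with either ',' or '.' as the decimal indicator.

    Assumption: the input has no thousands separators (e.g. no "1,234.5").
    """
    return float(value.strip().replace(",", "."))


def count_clusters(values):
    """Count how many raw values collapse onto each normalized key."""
    return Counter(normalize_text(v) for v in values)


# Hypothetical messy column, invented for this example.
names = ["Smith ", "smith", " SMITH", "Jones", "jones  ", "Lee"]

clusters = count_clusters(names)

# Keys that more than one raw value maps to are the false "distinct" entries.
duplicates = {key: n for key, n in clusters.items() if n > 1}
```

Here `duplicates` ends up holding only the keys that several raw spellings collapse onto, which is the kind of count a text facet gives you at a glance. Units and transformations mixed within one column are a harder problem and are not touched by this sketch.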
Refine 2.0 adds some data-collection tools for scraping and parsing web data. I have not had a chance to play with any of this kind of advanced scripting yet. I also have not had occasion to use Freebase, which seems somewhat similar to infochimps in that it is mostly open data with web APIs (for more on this, see the infochimps R package by Drew Conway).