There is a lot of data on the web, meant to be looked at by people, but how do you turn it into a spreadsheet people could actually analyze statistically?
The technique for turning web pages intended for people into structured data sets intended for computers is called “screen scraping.” It has just been made easier by ScraperWiki (http://scraperwiki.com/), a combined wiki and community.
ScraperWiki provides libraries to extract information from PDF and Excel files, to fill in forms automatically, and so on. Moreover, the community aspect should help researchers doing similar things to connect with one another. It’s very good. Here’s an example of scraping road accident data or Port of London ship arrivals.
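To make the idea concrete, here is a minimal sketch of screen scraping in Python, using only the standard library rather than ScraperWiki's own tools. It assumes the data of interest sits in a plain HTML table; the names `TableScraper`, `html_table_to_csv`, and `SAMPLE_PAGE` are hypothetical, and the ship names and dates in the sample are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample page standing in for a real site (hypothetical data).
SAMPLE_PAGE = """
<html><body><table>
<tr><th>Ship</th><th>Arrived</th></tr>
<tr><td>Example One</td><td>2010-07-01</td></tr>
<tr><td>Example Two</td><td>2010-07-02</td></tr>
</table></body></html>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_PAGE))
```

The resulting CSV text can be saved to a file and opened in a spreadsheet or a statistics package. Real pages are usually messier than this (nested tables, merged cells, data spread over several pages), which is where shared scrapers and purpose-built libraries earn their keep.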
You can already find collections of structured data online; examples are Infochimps (“find the world’s data”) and Freebase (“An entity graph of people, places and things, built by a community that loves open data.”). There’s also a repository system for data, TheData (“An open-source application for publishing, citing and discovering research data”).
The challenge is how to keep these efforts alive and active. One early company helping people screen-scrape was Dapper, which now helps retailers advertise by scraping their own websites. Perhaps library funding should go towards tools like that rather than towards piling up physical copies of expensive journals that everyone reads only online.
That's very useful information for a student of social sciences. I'm sure I'll be referring to this post again — thank you.
The greatest tool a scraper can have in his toolkit is Perl and its library collection, CPAN. Its text-handling capabilities make it *the* tool for making custom scrapers and for handling file formats like PDF and Excel.
Perl (as in Allan's comment) and Python (as in the scraper wiki) are great tools for this kind of thing, as they have both a lot of general text processing tools and libraries for dealing with HTML. However, you can also accomplish a lot of this with the R library "XML," which can parse HTML either locally or directly off the web.
For instance, I wrote a script which uses the library's "readHTMLTable()" to clean an archive of data I'd already scraped with wget. Here's my script if you're interested.
I think you should also read this post about how to compare and choose web scraping tools.
http://www.fornova.net/blog/?p=18
Awesome post! By the way, one of our cofounders at Infochimps, Dhruv Bansal (@dhruvbansal), is a Columbia alum.