Archive of entries posted by

New New York data research organizations

In a single day, New York City obtained two data analysis/statistics/machine learning organizations: Microsoft Research New York City, with John Langford (machine learning), Duncan Watts (networks), and Dave Pennock (algorithmic economics), and an eBay technology center focusing on data, led by Chris Dixon, the co-founder of the recommendation engine company Hunch, which has recently been acquired [...]

Agreement Groups in US Senate and Dynamic Clustering

Adrien Friggeri has a lovely visualization of US senators’ movement between clusters: You have to click the image and play with it to appreciate it. The methodology isn’t yet published – but I can see how this could be very illuminating. The dynamic clustering aspect hasn’t been researched much – one of the notable pieces [...]

Factual – a new place to find data

Factual collects data on a variety of topics, organizes them, and allows easy access. If you ever wanted to do a histogram of calorie content in Starbucks coffees or plot warnings with a live feed of earthquake data – your life should be a bit simpler now. Also see DataMarket, InfoChimps, and a few older [...]

Rare name analysis and wealth convergence

Steve Hsu summarizes the research of economic historians Greg Clark and Neil Cummins: Using rare surnames we track the socio-economic status of descendants of a sample of English rich and poor in 1800, until 2011. We measure social status through wealth, education, occupation, and age at death. Our method allows unbiased estimates of mobility rates. [...]

Statistical Murder

Robert Zubrin writes in “How Much Is an Astronaut’s Life Worth?” (Reason, Feb 2012): …policy analyst John D. Graham and his colleagues at the Harvard Center for Risk Analysis found in 1997 that the median cost for lifesaving expenditures and regulations by the U.S. government in the health care, residential, transportation, and occupational areas ranges [...]

Beautiful Line Charts

I stumbled across a chart that is, in my opinion, the best way to express a comparison of quantities through time: It compares the new PC companies, such as Apple, to traditional PC companies like IBM and Compaq, but on the same scale. If you’d like to see how iPads and other novelties compare, see here. [...]

Data mining efforts for Obama’s campaign

From CNN: In July, KDNuggets.com, an online news site focused on data mining and analytics software, ran an unusual listing in its jobs section: “We are looking for Predictive Modeling/Data Mining Scientists and Analysts, at both the senior and junior level, to join our department through November 2012 at our Chicago Headquarters,” read the ad. “We [...]

DBQQ rounding for labeling charts and communicating tolerances

This is a mini research note, not deserving of a paper, but perhaps useful to others. It reinvents what has already appeared on this blog. Let’s say we have a line chart with numbers between 152.134 and 210.823, with a mean of 183.463. How should we label the chart with about 3 ticks? Perhaps 152.132, [...]
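One common way to pick round labels for a range like this is the "nice numbers" heuristic, which snaps the tick spacing to 1, 2, or 5 times a power of ten. The sketch below is that generic heuristic, not the DBQQ rounding the note goes on to describe; the function names `nice_step` and `nice_ticks` and the rounding thresholds are assumptions made for illustration.

```python
import math


def nice_step(raw_step):
    """Snap a positive raw step to 1, 2, 5, or 10 times a power of ten."""
    exponent = math.floor(math.log10(raw_step))
    fraction = raw_step / 10 ** exponent  # lies in [1, 10)
    # Thresholds are a conventional choice, not taken from the post.
    if fraction < 1.5:
        nice = 1
    elif fraction < 3:
        nice = 2
    elif fraction < 7:
        nice = 5
    else:
        nice = 10
    return nice * 10 ** exponent


def nice_ticks(lo, hi, target=3):
    """Return round tick positions inside [lo, hi], aiming for about `target` ticks."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    step = nice_step((hi - lo) / max(target - 1, 1))
    first = math.ceil(lo / step)   # index of the first multiple of step >= lo
    last = math.floor(hi / step)   # index of the last multiple of step <= hi
    return [k * step for k in range(first, last + 1)]


print(nice_ticks(152.134, 210.823, 3))
```

For the range in the example, `nice_ticks(152.134, 210.823, 3)` gives 160, 180, and 200.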

Luck or knowledge?

Joan Ginther has won the Texas lottery four times. First, she won $5.4 million, then a decade later, she won $2 million, then two years later $3 million and in the summer of 2010, she hit a $10 million jackpot. The odds of this have been calculated at one in eighteen septillion and luck like this could only [...]

Examining US Legislative process with “Many Bills”

This is Many Bills, a visualization of US bills by IBM: I learned about it a few days ago from Irene Ros at Foo Camp. It definitely looks better than my own analysis of US Senate bills.

Traffic Prediction

I always thought that traffic for a particular day and time could be easily predicted from historical data with regression. Google Maps now has this feature: It would be good to actually include season, holiday and similar information: the predictions would be better. I wonder if one can find this data easily, or if [...]

Data mining and allergies

With all this data floating around, there are some interesting analyses one can do. I came across “The Association of Tree Pollen Concentration Peaks and Allergy Medication Sales in New York City: 2003-2008″ by Perry Sheffield. There they correlate pollen counts with anti-allergy medicine sales – and indeed find that two days after high pollen [...]

Weather visualization with WeatherSpark

WeatherSpark: prediction and observation quantiles, historic data, multiple predictors, zoomable, draggable, colorful, wonderful: Via Jure Cuhalev.

Get the Data

At GetTheData, you can ask and answer data-related questions. Here’s a preview: I’m not sure a Q&A site is the best way to do this. My pipe dream is to create a taxonomy of variables and instances, and collect spreadsheets annotated this way. Imagine doing a search of type: “give me datasets, where an [...]

Poverty, educational performance – and what can be done about it

Andrew has pointed to Jonathan Livengood’s analysis of the correlation between poverty and PISA results, whereby schools with poorer students get poorer test results. I’d have written a comment, but then I couldn’t have inserted a chart. Andrew points out that a causal analysis is needed. This reminds me of an intervention that has been [...]

Fattening of the world and good use of the alpha channel

In the spirit of Gapminder, the Washington Post created an interactive scatterplot viewer that uses the alpha channel to tell apart overlapping fat dots, which works better than the sorting by circle size that Gapminder uses: Good news: the rate of fattening of the USA appears to be slowing down. Maybe because of high gas prices? But what’s happening with Oceania?

Model Makers’ Hippocratic Oath

Emanuel Derman and Paul Wilmott wonder how to get their fellow modelers to give up their fantasy of perfection. In a Business Week article they proposed, not entirely in jest, a model makers’ Hippocratic Oath: I will remember that I didn’t make the world and that it doesn’t satisfy my equations. Though I will use [...]

R Advertised

The R language is definitely going mainstream:

Bribing statistics

I Paid a Bribe by Janaagraha, a Bangalore based not-for-profit, harnesses the collective energy of citizens and asks them to report on the nature, number, pattern, types, location, frequency and values of corruption activities. These reports would be used to argue for improving governance systems and procedures, tightening law enforcement and regulation and thereby reduce [...]

Google’s word count statistics viewer

Word count stats from the Google books database prove that Bayesianism is expanding faster than the universe. An n-gram is a tuple of n words.
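To make the definition concrete, here is a minimal sketch that splits text on whitespace and counts word n-grams. The function name `ngrams` and the sample sentence are made up for illustration, and this is not how Google's viewer builds its counts.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in `text` as a tuple."""
    words = text.split()
    # If the text has fewer than n words, the range is empty and so is the result.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Count bigrams (n = 2) in a sample sentence.
counts = Counter(ngrams("to be or not to be", 2))
print(counts.most_common(1))
```

Here the bigram ("to", "be") occurs twice and every other bigram occurs once.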

Why a bonobo won’t play poker with you

ScienceDaily has posted an article titled Apes Unwilling to Gamble When Odds Are Uncertain: The apes readily distinguished between the different probabilities of winning: they gambled a lot when there was a 100 percent chance, less when there was a 50 percent chance, and only rarely when there was no chance. In some trials, however, [...]

Diabetes stops at the state line?

From Discover: Razib Khan asks: But follow the gradient from El Paso to the Illinois-Missouri border. The differences are small across state lines, but the consistent differences along the borders really don’t make sense. Are there state-level policies or regulations causing this? Or, are there state-level differences in measurement? This weird pattern shows up in other [...]

Getting a job in pro sports… as a statistician

Posted at MediaBistro: The Harvard Sports Analysis Collective is the group that tackles problems such as “Who wrote this column: Bill Simmons, Rick Reilly, or Kevin Whitlock?” and “Should a football team give up free touchdowns?” It’s all fun and games, until the students land jobs with major teams. According to the Harvard Crimson, sophomore [...]

Statistics of food consumption

Visual Economics shows statistics on average food consumption in America: My brief feedback is that water is confounded with these results. They should have subtracted water content from the weight of all dietary items, as it inflates the proportion of milk, vegetable and fruit items that contain more water. They did that for soda (which [...]

Journalism in the age of data

Journalism in the age of data is a video report including interviews with many visualization people. It’s also a great example of how citations and further information can appear alongside the video, showing us the future of video content online.