Model Makers’ Hippocratic Oath

Emanuel Derman and Paul Wilmott wonder how to get their fellow modelers to give up their fantasy of perfection. In a BusinessWeek article they proposed, not entirely in jest, a model makers’ Hippocratic Oath:

  • I will remember that I didn’t make the world and that it doesn’t satisfy my equations.

  • Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.

  • I will never sacrifice reality for elegance without explaining why I have done so. Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.

  • I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.

Found via Abductive Intelligence.

Bribing statistics

I Paid a Bribe by Janaagraha, a Bangalore-based not-for-profit, harnesses the collective energy of citizens, asking them to report the nature, number, pattern, types, location, frequency, and value of corrupt activities. These reports are to be used to argue for improving governance systems and procedures, tightening law enforcement and regulation, and thereby reducing the scope for corruption.

Here’s a presentation of data from the application:
bribe.png

Transparency International could make something like this much more widely available around the world.

While awareness is good, follow-up is even better. For example, it’s known that New York’s subway signal inspections were being falsified. Signal inspections are serious business, since failures lead to disasters such as the one in Washington. Yet nothing much happened afterward: the person responsible (making $163k a year) was merely reassigned.

Why a bonobo won’t play poker with you

ScienceDaily has posted an article titled Apes Unwilling to Gamble When Odds Are Uncertain:

The apes readily distinguished between the different probabilities of winning: they gambled a lot when there was a 100 percent chance, less when there was a 50 percent chance, and only rarely when there was no chance. In some trials, however, the experimenter didn’t remove a lid from the bowl, so the apes couldn’t assess the likelihood of winning a banana. The odds from the covered bowl were identical to those from the risky option: a 50 percent chance of getting the much sought-after banana. But apes of both species were less likely to choose this ambiguous option.

Like humans, they showed “ambiguity aversion” — preferring to gamble more when they knew the odds than when they didn’t. Given some of the other differences between chimps and bonobos, Hare and Rosati had expected to find the bonobos to be more averse to ambiguity, but that didn’t turn out to be the case.

Thanks to Stan Salthe for the link.

Diabetes stops at the state line?

From Discover:

stateline.png

Razib Khan asks:

But follow the gradient from El Paso to the Illinois-Missouri border. The differences are small across state lines, but the consistent differences along the borders really don’t make sense. Are there state-level policies or regulations causing this? Or are there state-level differences in measurement? This weird pattern shows up in other CDC data I’ve seen.

It turns out that the CDC isn’t providing data; they’re providing a model. Frank Howland answered:

I suspect the answer has to do with the manner in which the county estimates are produced. I went to the original data source, the CDC, and then to the relevant FAQ.

There they say that the diabetes prevalence estimates come from the “CDC’s Behavioral Risk Factor Surveillance System (BRFSS) and data from the U.S. Census Bureau’s Population Estimates Program. The BRFSS is an ongoing, monthly, state-based telephone survey of the adult population. The survey provides state-specific information”

So the CDC then uses a complicated statistical procedure (“indirect model-dependent estimates” using Bayesian techniques and multilevel Poisson regression models) to go from state to county prevalence estimates. My hunch is that the state level averages thereby affect the county estimates. The FAQ in fact says “State is included as a county-level covariate.”
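To make the mechanism concrete, here is a minimal sketch in the same spirit: a county-level Poisson regression in which state enters as a covariate. This is my own toy illustration, not the CDC’s actual procedure, and every number in it is invented.

```python
# Hedged sketch only: a county-level Poisson regression with "state" as a covariate,
# the mechanism that pulls county estimates toward their state means.
# All counts and covariate values below are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

counties = pd.DataFrame({
    "cases":       [120,  95,  310,  80,  140,  200],   # respondents reporting diabetes
    "respondents": [900, 700, 2100, 650, 1000, 1500],    # respondents surveyed in the county
    "median_age":  [34.2, 38.1, 40.5, 36.7, 39.0, 41.2],
    "state":       ["TX", "TX", "IL", "IL", "TX", "IL"],
})

# log(respondents) enters as an exposure offset, so the model describes rates.
fit = smf.glm(
    "cases ~ median_age + C(state)",
    data=counties,
    family=sm.families.Poisson(),
    offset=np.log(counties["respondents"]),
).fit()
print(fit.summary())
```

Because the state indicator soaks up much of the variation, counties within the same state get pulled toward a common level, which would produce exactly the kind of discontinuities at state borders that Razib noticed.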

I’d prefer to have the real data, not a model; I’d rather do the modeling myself, thank you. Data by itself is tricky enough, as J. Stamp said.

Getting a job in pro sports… as a statistician

Posted at MediaBistro:

The Harvard Sports Analysis Collective is the group that tackles problems such as “Who wrote this column: Bill Simmons, Rick Reilly, or Kevin Whitlock?” and “Should a football team give up free touchdowns?”

It’s all fun and games, until the students land jobs with major teams.

According to the Harvard Crimson, sophomore John Ezekowitz and junior Jason Rosenfeld scored gigs with the Phoenix Suns and the Shanghai Sharks, respectively, in part based on their work for HSAC.

It’s perhaps not a huge surprise that the Sharks would be interested in taking advantage of every available statistic. They are owned by Yao Ming, who plays for the Houston Rockets. The Rockets, in turn, employ general manager Daryl Morey, whom Simmons nicknamed “Dork Elvis” for his ahead-of-the-curve analysis. (See Michael Lewis’s The No-Stats All-Star for an example.) But still, it’s very cool to see the pair get an opportunity to change the game.

Statistics of food consumption

Visual Economics shows statistics on average food consumption in America:

food.jpg

My brief feedback is that water confounds these results. Water content should have been subtracted from the weight of every dietary item, because it inflates the share of water-rich items such as milk, vegetables, and fruit. They did subtract it for soda (which is represented as sugar/corn syrup), which only amplifies the inconsistency.
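As a back-of-the-envelope illustration of the adjustment I have in mind, here is a minimal sketch; the foods, annual weights, and water fractions are hypothetical placeholders, not the figures from the Visual Economics graphic.

```python
# Hypothetical annual consumption (lbs) and water fractions, for illustration only.
consumption_lbs = {"milk": 181.0, "vegetables": 415.0, "fruit": 273.0, "soda_as_syrup": 42.0}
water_fraction  = {"milk": 0.87, "vegetables": 0.92, "fruit": 0.85, "soda_as_syrup": 0.0}

# Subtract water content so items are compared on a dry-weight basis.
dry_weight = {food: lbs * (1 - water_fraction[food])
              for food, lbs in consumption_lbs.items()}

for food, lbs in sorted(dry_weight.items(), key=lambda kv: -kv[1]):
    print(f"{food:>14}: {lbs:6.1f} lbs per year, excluding water")
```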

Time Magazine had a beautiful gallery that visualizes diets around the world in a more appealing way.

Why I blog?

Sometimes there is a line of news, a thought, or an article sufficiently aligned with the general topics of this blog that it is worth sharing.

I could have emailed it to a few interested friends. Or I could have gone through the relative hassle of opening up the blog administration interface, cleaning it up a little, adding some thoughts, and making it pretty enough to post on the blog. And then there is poring through hundreds of spam messages, just to find the two or three false positives among a thousand spams. Or finding the links, ideas, and comments reproduced on another blog without attribution or credit. Or even finding the whole blog mirrored on another website.

It might seem like all work and no fun, but what keeps me coming back is your comments: the discussions, the additional links, the information and insights you provide. That is what makes it all worthwhile.

Thanks, those of you who are commenters! And let us know what would make your life easier.

DataMarket

It seems that every day brings a better system for exploring and sharing data on the Internet. From Iceland comes DataMarket, which is very good at visualizing individual datasets, with interaction and animation. The “market” aspect hasn’t been developed yet, and all access is free.

Here’s an example of visualizing the rankings of countries competing in the World Cup:

datamarket1.png

And here’s a lovely example of visualizing population pyramids:


datamarket2.png

In the future, the visualizations will also include state of the art models for predicting and imputing missing data, and understanding the underlying mechanisms.

Other posts: InfoChimps, Future of Data Analysis

Probability-processing hardware

Lyric Semiconductor posted:

For over 60 years, computers have been based on digital computing principles. Data is represented as bits (0s and 1s). Boolean logic gates perform operations on these bits. A processor steps through many of these operations serially in order to perform a function. However, today’s most interesting problems are not at all suited to this approach.

Here at Lyric Semiconductor, we are redesigning information processing circuits from the ground up to natively process probabilities: from the gate circuits to the processor architecture to the programming language. As a result, many applications that today require a thousand conventional processors will soon run in just one Lyric processor, providing 1,000x efficiencies in cost, power, and size.

Om Malik has some more information, also relating to the team and the business.

The fundamental idea is that computing architectures work deterministically, even though the world is fundamentally stochastic. In a lot of statistical processing, especially in Bayesian statistics, we take the stochastic world, force it into determinism, simulate the stochastic world by computationally generating deterministic pseudo-random numbers, and simulate stochastic matching with deterministic likelihood computations. What Lyric could do is bypass this highly inefficient deterministic intermediate step. That way, we’d be able to fit bigger and better models much faster.
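Here is a tiny illustration of that deterministic intermediate step, entirely my own and unrelated to Lyric’s hardware: a Bayesian coin-flip posterior approximated by rejection sampling driven by a seeded, fully deterministic pseudo-random number generator.

```python
# My own illustration of the deterministic intermediate step: approximate a
# Bayesian posterior using pseudo-random draws from a deterministic generator.
import random

random.seed(42)          # deterministic pseudo-randomness: same seed, same draws
heads, flips = 7, 10     # observed data

# Rejection sampling for p(theta | data) under a uniform prior: draw theta,
# simulate 10 flips, keep theta whenever the simulated count matches the data.
kept = []
while len(kept) < 2000:
    theta = random.random()
    simulated = sum(random.random() < theta for _ in range(flips))
    if simulated == heads:
        kept.append(theta)

# The exact posterior is Beta(8, 4), whose mean is 8/12, about 0.67.
print("approximate posterior mean:", sum(kept) / len(kept))
```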

They’re also working on a programming language: PSBL (Probability Synthesis for Bayesian Logic), but there are no details. Here is their patent for Analog Logic Automata, indicating applications for images (filtering, recognition, etc.).

[D+1: J.J. Hayes points to a US patent indicating that one of the circuits optimizes the sum-product belief propagation algorithm. This type of algorithm is popular in machine learning for various recognition and denoising problems. One way to explain it to a statistician is that it does imputation with an underlying loglinear model.]
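For readers who haven’t met it, here is a hedged sketch of the sum-product idea on a tiny chain of three binary variables; the potentials are made-up numbers for illustration and have nothing to do with the patented circuits.

```python
# Sum-product (belief propagation) on a chain x1 - x2 - x3 of binary variables.
import numpy as np

# Pairwise potential: neighbouring variables prefer to agree (illustrative numbers).
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])
# Local evidence (unary potentials) for x1, x2, x3.
phi = [np.array([0.8, 0.2]),
       np.array([0.5, 0.5]),
       np.array([0.3, 0.7])]

# Forward messages along the chain: x1 -> x2, then x2 -> x3.
m12 = psi.T @ phi[0]
m23 = psi.T @ (phi[1] * m12)
# Backward messages: x3 -> x2, then x2 -> x1.
m32 = psi @ phi[2]
m21 = psi @ (phi[1] * m32)

# Marginal belief at each node = local evidence times incoming messages, normalised.
beliefs = {
    "x1": phi[0] * m21,
    "x2": phi[1] * m12 * m32,
    "x3": phi[2] * m23,
}
for name, b in beliefs.items():
    print(name, b / b.sum())
```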

Turning pages into data

There is a lot of data on the web meant to be looked at by people, but how do you turn it into a spreadsheet that people can actually analyze statistically?

The technique of turning web pages intended for people into structured datasets intended for computers is called “screen scraping.” It has just been made easier by a wiki/community, http://scraperwiki.com/.

They provide libraries to extract information from PDF and Excel files, to automatically fill in forms, and so on. Moreover, the community aspect should help researchers doing similar things get connected. It’s very good. Here are examples of scraping road accident data and Port of London ship arrivals.
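For those who haven’t scraped before, here is a generic sketch of the idea using plain requests and BeautifulSoup rather than ScraperWiki’s own libraries; the URL and table layout are hypothetical.

```python
# Generic screen-scraping sketch: pull an HTML table from a page and write it
# out as structured rows. The URL and table layout are hypothetical.
import csv
import requests
from bs4 import BeautifulSoup

url = "http://example.org/road-accidents.html"    # hypothetical page with a data table
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:             # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

with open("accidents.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```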

You can already find collections of structured data online; examples are Infochimps (“find the world’s data”) and Freebase (“An entity graph of people, places and things, built by a community that loves open data”). There’s also a repository system for data, TheData (“An open-source application for publishing, citing and discovering research data”).

The challenge is how to keep these efforts alive and active. One early company helping people screen-scrape was Dapper, which is now helping retailers advertise by scraping their own websites. Perhaps library funding should go towards tools like these rather than towards piling up physical copies of expensive journals that everyone reads online anyway.

Some earlier posts on this topic [1], [2].

(Partisan) visualization of health care legislation

Congressman Kevin Brady from Texas distributes this visualization of reformed health care in the US (click for a bigger picture):

obamacare.png

Here’s a PDF at Brady’s page, and a local copy of it.

Complexity has its costs. Beyond the cost of writing it, learning it, and following it, there’s also the cost of checking it. John Walker has some funny examples of what’s hidden in the almost 8,000 pages of IRS code.

Text mining and applied statistics will hopefully solve all that. Anyone interested in developing a pork-detection system for the legislation? Or an analysis of how much entropy each congressman contributed to the legal code?

There are already spin detectors that help you tell whether the writer is a Democrat (“stimulus,” “health care”) or a Republican (“deficit spending,” “ObamaCare”).
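A toy version of such a spin detector takes only a few lines; the keyword lists below are just the examples from this paragraph, not a validated classifier.

```python
# Toy "spin detector": count party-flavoured keywords in a text.
DEMOCRAT_WORDS = ("stimulus", "health care")
REPUBLICAN_WORDS = ("deficit spending", "obamacare")

def spin(text: str) -> str:
    """Crudely label a text by which keyword set appears more often."""
    t = text.lower()
    d = sum(t.count(w) for w in DEMOCRAT_WORDS)
    r = sum(t.count(w) for w in REPUBLICAN_WORDS)
    if d == r:
        return "unclear"
    return "leans Democrat" if d > r else "leans Republican"

print(spin("The stimulus bill expands health care coverage."))   # leans Democrat
print(spin("ObamaCare is just more deficit spending."))          # leans Republican
```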

D+0.1: Jared Lander points to versions by Rep. Boehner and Robert Palmer.

Why don’t we have peer reviewing for oral presentations?

Panos Ipeirotis writes in his blog post:

Everyone who has attended a conference knows that the quality of the talks is very uneven. There are talks that are highly engaging, entertaining, and describe nicely the research challenges and solutions. And there are talks that are a waste of time. Either the presenter cannot present clearly, or the presented content is impossible to digest within the time frame of the presentation.

We already have reviewing for the written part. The program committee examines the quality of the written paper and vouches for its technical content. However, by looking at a paper it is impossible to know how nicely it can be presented. Perhaps the seemingly solid but boring paper can make a very entertaining presentation. Or an excellent paper may be written by a horrible presenter.

Why not have a second round of reviewing, where the authors of accepted papers submit their presentations (slides and a YouTube video) for presentation at the conference? The paper will be accepted and included in the proceedings anyway, but having a paper does not mean that the author gets a slot for an oral presentation.

Under an oral presentation peer review, a committee looks at the presentation, votes on accept/reject and potentially provides feedback to the presenter. The best presentations get a slot on the conference program.

While I’ve enjoyed quiet time for meditation during boring talks, this is a very interesting idea – cost permitting. As the cost of producing a paper and a presentation that pass peer review grows to weeks of work, a lot of super-interesting early-stage research just moves off the radar.

The Three Golden Rules for Successful Scientific Research

A famous computer scientist, Edsger W. Dijkstra, was writing short memos on a daily basis for most of his life. His memo archive contains a little over 1,300 memos. I guess today he would be writing a blog, although his memos do tend to be slightly more profound than what I post.

Here are the rules (follow the link for commentary), which I have tried to summarize:

  • Pursue quality and challenge, avoid routine. (“Raise your quality standards as high as you can live with, avoid wasting your time on routine problems, and always try to work as closely as possible at the boundary of your abilities. Do this, because it is the only way of discovering how that boundary should be moved forward.”)

  • When pursuing social relevance, never compromise on scientific soundness. (“We all like our work to be socially relevant and scientifically sound. If we can find a topic satisfying both desires, we are lucky; if the two targets are in conflict with each other, let the requirement of scientific soundness prevail.”)

  • Solve the problems nobody can solve better than you. (“Never tackle a problem of which you can be pretty sure that (now or in the near future) it will be tackled by others who are, in relation to that problem, at least as competent and well-equipped as you.”)

[D+1: Changed “has been” into “was” – the majority of commenters decided Dijkstra is better treated as a dead person who was, rather than an immortal who “has been”, is, and will be.]

Demographics: what variable best predicts a financial crisis?

A few weeks ago I wrote about the importance of demographics in political trends. Today I’d like to show you how demographics help predict financial crises.

Here are a few examples of countries with major crises.

  3. China’s working-age population, age 15 to 64, has grown continuously. The labor pool will peak in 2015 and then decline.

There are more charts in the Demography and Growth report from the Reserve Bank of Australia:

working-age-share-growth.png

Wikipedia surveys the causes of the financial crisis, such as a “liquidity shortfall in the United States banking system caused by the overvaluation of assets.” Oh my! Slightly better than the usual Democrat-Republican finger-pointing, but no, no, no, no. One has to pick the right variables to explain these things. Why were the assets overvalued?

There is a simple answer: not enough people have been looking at demographic trends to understand that it’s the working-age population that buys most of the goods, services and real estate.

Back in 2007 I found a book by Harry S. Dent, whose main message was nicely summarized by this March 2000 article in BusinessWeek:

The bull market is as vast and powerful as the baby boomer generation, and the two are inextricable. The 80 million or so boomers–those born between 1946 and 1964–are hitting their peak earning, spending, and investing years, and that’s what’s driving the economy’s incredible performance and the stock market’s spectacular returns. His target for the Dow is 40,000–which he believes it will hit somewhere around 2008.

After that, watch out. As an economic force, the boomers will have peaked, and there just aren’t enough Generation Xers to sustain the economic and stock market boom. Even the revolutionary changes wrought by the rapid growth of the Internet don’t change that. In Dent’s view, the economy goes into a deflationary funk for another 10 or so years, until the boomers’ children–the 83 million ”echo baby boom” generation–reach their economic prime.

Unfortunately for Dent, there was the 2001 dip before the 2008 drop.

Assuming economists finally learn the demographics they should have known for a long time, the next financial crisis will be caused by some other under-appreciated variable.

Cost of communicating numbers

Freakonomics reports:

A reader in Norway named Christian Sørensen examined the height statistics for all players in the 2010 World Cup and found an interesting anomaly: there seemed to be unnaturally few players listed at 169, 179, and 189 centimeters and an apparent surplus of players who were 170, 180, and 190 centimeters tall (roughly 5-foot-7 inches, 5-foot-11 inches, and 6-foot-3 inches, respectively). Here’s the data:

Freaky-Heights-blogSpan.jpg

It’s not costless to communicate numbers. When we compare “eighty” (6 characters) with “seventy-nine” (12 characters), how much information are we gaining from twice the number of characters? Do people really care about height to ±0.5 cm, or is ±1 cm enough?

It’s also harder to communicate odd numbers (“three” vs. “four” or “two,” “seven” vs. “six” or “eight,” “nine” vs. “ten”) than even ones. Since language tends to follow our behavior, people have been rounding like this for a long time: we remember the shorter description of a quantity.

This is my theory of why we end up with rounded numbers. It is also partially why Benford’s law holds: we change scales and measurement units so that we can store numbers in our minds more economically. Compare “ninety-nine” (11 characters) with “hundred” (7), or “nine hundred ninety-nine” (24) with “thousand” (8).
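Here is a quick way to check the character-count argument for yourself; it assumes the third-party num2words package, which spells out numbers in English.

```python
# Compare the written length of a number with that of its rounded neighbour.
# Assumes the num2words package (pip install num2words).
from num2words import num2words

for exact, rounded in [(79, 80), (99, 100), (169, 170), (999, 1000)]:
    w_exact, w_rounded = num2words(exact), num2words(rounded)
    print(f"{exact}: '{w_exact}' ({len(w_exact)} chars)  vs  "
          f"{rounded}: '{w_rounded}' ({len(w_rounded)} chars)")
```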

For our advanced readers, let me give you another example. Let’s say I estimate something to be 100. The fact that I said 100 implies that there is a certain amount of uncertainty in my estimate. I could have written it as 1e2, implying that the real quantity is somewhere between 50 and 150. If I said 102, I’d be implying that the real quantity is between 101 and 103. If I said 103, I’d be implying that the real quantity is between 102.5 and 103.5. If I said 50, the real quantity is probably between 40 and 60.

This way, by rounding, I am both economical in my expression and able to honestly communicate my standard error.

Ultimately, increased accuracy is not always worth the increased cost of communication and memorization.

So, do you still think World Cup players are being self-aggrandizing, or are they perhaps just economical or even conscious of standard errors?

[D+1: Hal Varian points to number clustering in asset markets. Thanks also to Janne, who helped improve the above presentation.]

What makes a community on the Internet?

Before I’d commit to anything, I’d like to know:

  1. Who owns the content?
  2. Who’s making money and how much of it?
  3. Can I take my content and reputation elsewhere?
  4. Can someone else take my content and reputation elsewhere?
  5. What are the controls limiting freeloading and abusive behavior?

Wikipedia was set up as a foundation, providing clear answers to 1 and 2. Usenet was an Internet community effort that was destroyed by 4 (DejaNews and Google Groups) and 5.

I cannot endorse any initiative that pretends to be a community without being transparent about these matters. Without that transparency, the operators of the website just expect free labor from volunteers who are filling someone else’s coffers. The world would be a better place if these volunteers volunteered where it matters.

Question & Answer Communities

StackOverflow has been a popular community where software developers help one another. They recently raised some VC funding, and to turn a profit they are selling job postings and expanding the model to other areas. Metaoptimize LLC has started a similar website for fields such as statistics and machine learning, using the open-source OSQA framework. Here’s a description:

You and other data geeks can ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization.

Here you can ask and answer questions, comment and vote for the questions of others and their answers. Both questions and answers can be revised and improved. Questions can be tagged with the relevant keywords to simplify future access and organize the accumulated material.

If you work very hard on your questions and answers, you will receive badges like “Guru,” “Student,” or “Good Answer.” Just like a computer game! In return, well-meaning question answerers will be helping feed Google and numerous other companies with good information, which those companies will offer to the public alongside the sponsored information that someone is paying for.

I’ll join the party myself when they introduce the “Rent,” “Mortgage Payment,” “Medical Bill”, and “Grocery” badges. Until then, I’ll be spending time and money, and someone else will be saving time and earning money. For a real community, there has to be some basic fairness.

[9:15pm: Included Ryan Shaw’s correction to my post, pointing out that MetaOptimize is based on OSQA and not on the StackOverflow platform.]
[D+1, 7:30am: Igor Carron points to an initiative that’s actually based on the StackOverflow platform.]