Bing is preferred to Google by people who aren’t like me

This one is fun because I have a double conflict of interest: I’ve been paid (at different times) both by Google and by Microsoft.

Here’s the story:

Microsoft, September 2012:

An independent research company, Answers Research based in San Diego, CA, conducted a study using a representative online sample of nearly 1000 people, ages 18 and older from across the US. The participants were chosen from a random survey panel and were required to have used a major search engine in the past month. Participants were not aware that Microsoft was involved.

In the test, participants were shown the main web search results pane of both Bing and Google for 10 search queries of their choice. Bing and Google search results were shown side-by-side on one page for easy comparison – with all branding removed from both search engines. The test did not include ads or content in other parts of the page such as Bing’s Snapshot and Social Search panes and Google’s Knowledge Graph. For each search, the participant was asked which search engine provided the best results – “Left side search engine”, “Right side search engine”, or “Draw.” After each participant performed 10 searches, their votes were totaled to determine the winner (Bing, Google or Draw, in the case of a tie).

When the results were tallied, the outcome was clear – people chose Bing web search results over Google nearly 2:1 in the blind comparison tests. Specifically, of the nearly 1000 participants: 57.4% chose Bing more often, 30.2% chose Google more often; 12.4% resulted in a draw.
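To make the scoring rule in that description concrete, here is a minimal Python sketch. The votes are invented for illustration and are not data from the Microsoft study, and the function names (`participant_winner`, `summarize`) are hypothetical. It shows that the share of participants who chose an engine more often is a different quantity from the share of individual queries on which that engine was preferred.

```python
from collections import Counter

def participant_winner(votes):
    """Return 'Bing', 'Google', or 'Draw' for one participant's per-query votes."""
    counts = Counter(votes)
    if counts["Bing"] > counts["Google"]:
        return "Bing"
    if counts["Google"] > counts["Bing"]:
        return "Google"
    return "Draw"

def summarize(all_votes):
    """Share of participants whose overall winner is Bing, Google, or Draw."""
    winners = Counter(participant_winner(v) for v in all_votes)
    n = len(all_votes)
    return {k: winners[k] / n for k in ("Bing", "Google", "Draw")}

# Hypothetical votes: three participants, ten searches each (made-up numbers).
example = [
    ["Bing"] * 6 + ["Google"] * 4,
    ["Bing"] * 5 + ["Google"] * 4 + ["Draw"],
    ["Google"] * 7 + ["Bing"] * 3,
]

shares = summarize(example)

# Per-query share for Bing, pooling all searches across participants.
total_queries = sum(len(v) for v in example)
query_share_bing = sum(v.count("Bing") for v in example) / total_queries

print(shares)
print(round(query_share_bing, 3))
```

In this toy data, two of the three participants pick Bing more often, yet Bing is preferred on fewer than half of the individual searches. Whether anything like that happened in the real study cannot be determined from the published summary.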

A group at Yale Law School, October 2013:

In advertisements associated with its “Bing It On” campaign, Microsoft claimed that “people preferred Bing web search results nearly 2:1 over Google in blind comparison tests.” We tested Microsoft’s claims by way of a randomized experiment involving U.S.-based MTurk subjects and conducted on Microsoft’s own www.bingiton.com website. We found that (i) a statistically-significant majority of participants preferred Google search results to Bing search results (53% to 41%); and (ii) participants were significantly less likely to prefer Bing results when randomly assigned to use popular search terms or self-selected search terms instead of the search terms Microsoft recommends test-takers employ on its website.

And then the punchline:

Our findings suggest that each of these implicit claims is likely false and might provide the basis for a viable Lanham Act claim by Google.

It seems a bit much to claim false advertising because a 2013 survey on Mechanical Turk differs from a 2012 survey on a random sample of Americans. Also, and possibly more important than the populations being studied, the experimental conditions were different. In Microsoft’s original study, it appears that the purpose of the study was somewhat concealed, whereas the Yale study went directly through the Bing website. So, lots of differences from two very different studies performed on two very different populations under two very different conditions (see Bob’s comment for more on that point), a year apart.

The authors of the 2013 paper argue that the Turk sample is better than the U.S. random sample because it better captures internet users. That’s a reasonable point but I think I’d prefer to go with the random sample and just reweight. In any case, it hardly seems like deceptive advertising for Microsoft to make a claim about Americans, based on a random sample of Americans.
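As a rough illustration of what "just reweight" could mean here, the sketch below poststratifies a sample to a target population. Everything in it is an assumption made for the example: the two strata, the sample counts, the preference rates, and the target shares are made up, and `poststratify` is a hypothetical helper, not something either study used.

```python
def raw_mean(stratum_means, stratum_counts):
    """Unweighted sample mean, pooling all respondents."""
    total = sum(stratum_counts.values())
    return sum(stratum_means[s] * stratum_counts[s] for s in stratum_means) / total

def poststratify(stratum_means, target_shares):
    """Weight each stratum's mean by its share of the target population."""
    assert abs(sum(target_shares.values()) - 1.0) < 1e-9
    return sum(stratum_means[s] * target_shares[s] for s in stratum_means)

# Hypothetical strata and numbers, chosen only to show the mechanics.
# Proportion preferring Bing within each stratum of the sample:
stratum_means = {"heavy_users": 0.40, "light_users": 0.60}
# Respondents per stratum in the sample:
stratum_counts = {"heavy_users": 300, "light_users": 700}
# Assumed makeup of the population you actually care about:
target_shares = {"heavy_users": 0.60, "light_users": 0.40}

print(round(raw_mean(stratum_means, stratum_counts), 3))   # pooled sample estimate
print(round(poststratify(stratum_means, target_shares), 3))  # reweighted estimate
```

With these invented inputs the pooled sample estimate is 0.54 and the reweighted estimate is 0.48, so the adjustment alone is enough to flip which engine appears to be ahead.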

Just imagine the results had gone the other way around. Suppose Microsoft had done a MTurk study and the Yale group turned around and found a much different result using a third-party random sample. I can only assume that they’d be saying that Microsoft was doing illegal advertising by representing the results of a convenience sample as if it were real.

But maybe I’m not really putting on my “lawyer hat” here. I was thinking that, because the authors of this 2013 paper wrote that Microsoft could be sued for false advertising, they believe that Microsoft actually engaged in illegal false advertising. But the threat is more powerful than the execution, right? You only have to threaten to sue, maybe that’s enough? It seems a bit creepy to me, along the lines of throwing mud at something in the hope that it might stick. (Microsoft’s own response is here.)

That said, I personally prefer Google to Bing. My impression is that Google is a bit more tailored to academic users, and Bing doesn’t really work for me.

But I’m with the Bing guys in the above debate. I’m no fan of their search engine, but their study seems much more professional to me. Really it seems like no contest. Ultimately, Bing wins for some searches and for some users, and Google wins for other searches and for other users. For all I know, they’re using very similar algorithms, just with different weights. Or maybe when you type something into Bing, it just googles it. Whatever.

The statistical message here is that in the presence of large and systematic variation, different experimental conditions will give you different average treatment effects. (See Jeremy Miles’s comment for more elaboration on this point.) If I had to pick just one of the experiments discussed here, I’d choose Microsoft’s, hands down. But the variation is the interesting story. To their credit, the Yale team did explore the variation a bit, before they went into the legal mumbo-jumbo.
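A small numerical sketch of that point, under stated assumptions: suppose the "treatment effect" is the Bing-minus-Google preference margin, and suppose it differs systematically by type of search. The search types, the margins, and the two mixes below are invented for illustration and are not estimates from either study.

```python
def average_effect(effect_by_type, mix):
    """Average treatment effect for a given mix of search types."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(effect_by_type[t] * mix[t] for t in effect_by_type)

# Hypothetical Bing-minus-Google preference margin for each search type.
effect_by_type = {"everyday": 0.20, "specialized": -0.30}

# Two hypothetical experimental conditions that differ only in the mix of searches.
condition_a = {"everyday": 0.80, "specialized": 0.20}
condition_b = {"everyday": 0.40, "specialized": 0.60}

print(round(average_effect(effect_by_type, condition_a), 3))  # positive: Bing ahead
print(round(average_effect(effect_by_type, condition_b), 3))  # negative: Google ahead
```

The per-type effects are identical in both conditions. Only the mix changes, and that is enough to move the average from +0.10 to -0.10, so two honest experiments can disagree about the sign.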

A fight that benefits both sides

Often we say that, in a fight, both sides lose. In this case, though, maybe both sides win.

From Microsoft’s side, more attention is drawn to their Bing-is-preferred-to-Google study, and the attack is so weak that it lends credence to Microsoft’s pro-Bing claims. (Just consider: if Microsoft had just released this study on its own, not in the context of a ham-handed attack, maybe I’d be posting on all the reasons not to take the Bing claims seriously.) Bing is the forgotten and distant #2 in searches, so I assume they’ll be happy with anything that keeps them in the news.

And, from the Yale team’s side, I can’t imagine that these people are really so fired up angry about Microsoft’s purportedly illegal misrepresentation. Rather, they did a cute little project (the authors are a professor and 5 students) and thought it would be fun to get some press. No harm in that: I do fun projects myself from time to time and am not averse to seeing my name in the newspaper. And they might even get some legal business from this, I suppose, for example if Google or some legal entrepreneur wanted to actually try out their suggested lawsuit.

I do feel a little bit bad about contributing to the publicity. But I’ll do it, for the statistical content.

P.S. I tried to play “bingiton” myself but when I clicked on the website (“Bing It On is a side-by-side Bing versus Google search-off challenge. . . .”), I got switched directly to www.bing.com. I guess bingiton is no more?

P.P.S. I purposely did not post the names of the Microsoft psychologist or the Yale law professor here, because I’d like to avoid personalizing the conflict. Microsoft vs. Yale seems better off than Mr. A vs. Mr. B. (If you care, all the names and links are in this news article by Will Oremus.)

20 Comments

  1. Rahul says:

    Andrew: Can you elaborate on why the Microsoft study “seems much more professional” to you?

    Both seem to be using online respondents: Yale used MTurk, whereas MS used “a representative online sample of nearly 1000 people”.

    This seems to be largely about external validity, but if anything I’m swayed a tiny bit in favor of Yale because:

    (1) Yale has declared more methodological details than MS’s short blog post.

    (2) Yale has less of a conflict of interest here. Had MS’s consultant concluded that Google did better, would they have publicized it? It’d be perfectly legal, BTW, for MS to shop around till they found a study that made Bing look good.

    (3) Most non-Yale, non-MS informal studies I’ve read of in the past seem to support the Google better than Bing hypothesis.

    • Andrew says:

      Rahul:

      – I prefer a random sample to a sample of volunteers, especially when it’s Mturk and you don’t know who those people are. I understand the value of convenience samples but I prefer the random sample where possible, unless the nonrandom sample has compensating advantages such as a much larger sample size, less bias, or whatever.

      – I couldn’t be sure, but my impression is that the participants in the Microsoft study were not told they were in a Bing vs. Google matchup, whereas the participants in the Yale study were given this information. I think that, in this sort of study, it’s better for the participants not to be told what the study is about.

      – Beyond all this, the way the studies were described, the MS study seemed more professional. It looked like standard survey research, whereas the Yale study looked more amateurish. This can’t be the only criterion to use—amateurs can do good work too—but given that the two studies gave different results, I’m inclined to go with the pros.

      I take your point on the conflict of interest. The Bing people have an obvious conflict here, whereas I’m not aware of any conflicts involving the Yale people. The Bing study was done via a third-party research organization, but you’re right that, for all we know, they commissioned 100 different studies and are only reporting the one that makes them look best. I kind of doubt it, but who knows.

      Regarding your last point: I prefer Google too! But maybe I have different search goals, compared to typical Americans. As noted in the above post, I’m guessing that there are scenarios where Google does better, scenarios where Bing does better, and many many scenarios where they do the same.

      • Anonymous says:

        Andrew:

        Regarding your statement: “I prefer a random sample to a sample of volunteers, especially when it’s Mturk and you don’t know who those people are.”

        What is the basis of your conclusions regarding the quality of the Microsoft-study sample? All I could see here is Microsoft’s statement:

        “An independent research company […] conducted a study using a representative online sample of nearly 1000 people, ages 18 and older from across the US. The participants were chosen from a random survey panel and were required to have used a major search engine in the past month.”

        From what I know, there are many “research companies” that sell you “representative data”, but that doesn’t mean it actually represents much of anything. What these companies usually do is maintain a large pool of people, their survey panelists, whom they are allowed to keep in their databases in exchange for whatever incentives they provide. Then, when the company launches a new survey, it sends tons of emails to its panelists (who get extra incentives for clicking through the questionnaires) and stops collecting data once certain quotas (e.g., 20% African-American etc.) are fulfilled and the target N is reached. So the sample would actually be representative only if (a) the population from which they draw the sample represented the target population, and (b) the sample were actually a random sample. It is easy to see that neither of these is the case if the company does it the way I described above.

        Disclaimer: I am no expert with online surveys but occasionally work with data from such companies. I cannot tell how the company that was hired by Microsoft actually did it. However, the point is that if somebody claims (s)he had a representative online survey, then it usually isn’t really representative.

        • Andrew says:

          Anon:

          What you say could be. I think I’d still prefer this panel to Turk, as at least it attempts to match the US adult population, but maybe it’s not so clean as I thought.

        • zbicyclist says:

          I’m with Anonymous on this one.

          “random survey panel” This is not to be confused with an actual random sample, although it’s not clear that a true random sample is possible for this type of study, since survey response rates are often in single digits and continue to decline. As a result, we’re really closer to quota sampling territory.

          Anyway, I would doubt that the different conclusions have much to do with the nature of the samples.

  2. Jeremy Miles says:

    The problem with the Bing It ON study (IMHO) is that the searches are artificial (and I’m not sure about the Yale study). I tried the Bing study, and when you’re asked to search for 10 things, you don’t do the kind of searches that you do normally. You look for your name, your hometown, things like that, which are relatively easy to find. But when you’re actually searching, you’re not so sure what to search for.

    For example, if I search “Andrew Gelman” most of the links on the Google home page are about the author of this blog, with a couple of other Andrew Gelman’s home pages thrown in. Same for Bing, except the non-author of this blog pages are LinkedIn pages. That’s possibly more useful, but if I was looking for someone and Google didn’t help me, LinkedIn is the next place I’d go.

    But what about something harder? Say I’ve forgotten the name of the last movie I saw, it was about a kid who worked in a water park. So I type (without quotes) “movie about kid who works in a water park” into both Bing and Google. In Google 9 of the links on the front page are about the movie “The Way, Way Back” (which is the movie I was thinking of), on Bing, I went to the 5th page and didn’t find a link to the movie.

    Another example: I’m interested in Andrew (and Jennifer Hill)’s book, but I can’t remember its title – I know it’s a book about regression, and it uses R and BUGS. So I put that into Google: the 6th hit is the Amazon page for the book, and it’s the first link to a book (although one of them is a link to a tutorial associated with a different book). In Bing, I’m over halfway down the second page before I find what I was looking for, and there are several books listed before that.

    But (to reiterate) these aren’t searches I would have come up with if you’d said “Hey, do a search and tell me which of these results you like best.”

    • Z says:

      Still interesting that Bing seems to do a better job in the think-of-something-off-the-top-of-your-head category. I wonder if they knew they had that advantage and that’s why they did the study?

  3. K? O'Rourke says:

    > to use—amateurs can do good work too—but given that the two studies gave different results, I’m inclined to go with the pros

    So would I, but very hard to formalise!

  4. jonathan says:

    1. As a user of both Bing and Google, I prefer Bing’s presentation of results. But I prefer Google’s actual results. The two come together, of course, and I suppose you could draw usability of results versus quality of results curves and generate a sweet spot for average users. I think that for many, Bing may be more usable, though I realize I’m conflating the appearance of results with their usability.

    2. This kind of “testing” depends greatly on framing the question. One example is the famous Pepsi challenge of old. Pepsi understood that Coke, at least as it tasted then (and this was straight sugared Coke versus straight sugared Pepsi because the world had not yet changed to diet and fake sugars), was harsher in the first sip or two. Drinkers preferred the lighter, somewhat sweeter taste of Pepsi in first sips … but preferred Coke when they drank more. So they offered you a sip or two to frame the question to get the “right” answer.

    3. Another example is dog food. Something I learned recently from Gulp is that they take the largely tasteless dry bits and add stuff to them – sprays, tumbled in powder, etc. They then manipulate the question by showing dogs running to the food bowl. They don’t show that dogs may then switch to the other product, preferring it in the long run. The companies choose their additives to accentuate that eagerness … or I suppose they balance eagerness with longer term chowing down. You can imagine the commercials: see, Fido wants x brand or see, Fido may taste x brand but he really wants y brand for long term satisfaction. You can see in that the way we treat our pets: treats versus care over time.

    4. Imagine it’s 10 years from now and companies can identify who you are – with an anonymized ID – by what you log into from what and where. Now take your search results. They look not only at what you want to search for and of course any social network connections but they “sweeten” your dog food by giving you some candy, must-have treat. Say they think – their algorithms determine – you’re a straight male so they not only give you your results but include an image line of models in bikinis. And remember, they could determine from time of day, etc. whether you’d likely be just you or if the kids would be home, etc. The pet food people refer to “cat crack”, meaning the additive spray-ons cats love in their food. Why should we be treated differently?

    • Andrew says:

      Jonathan:

      Good idea! I’ll have to add some “cat crack” to my blog. So, political-science-oriented readers will see election predictions, statistics-oriented readers will see our latest research on Gaussian processes, and long-time blog readers will get a steady diet of Wegman and Kanazawa posts.

    • K? O'Rourke says:

      jonathan: Do you have a reference for the Pepsi story?

      We were repeatedly told in MBA school that the challenge was limited to those who only drank one brand, based on research showing that they could not tell the difference, and so 50% preferring the other was just the expected result (as they could not distinguish).

      We did redo the experiments ourselves (most often with beer brands) and it seemed to replicate.

      But I have always doubted the “story” even though my advisor was quite certain it was a case of flagrant mis-use of research findings.

      • zbicyclist says:

        I don’t have a reference, but Jonathan’s story is the one that usually kicks around marketing research circles.

        At the time, there was a Bill Cosby ad for Coke which was based on the idea that in the long run you’d like Coke because it wasn’t as cloyingly sweet as Pepsi, so Coke seems to have thought this was the problem.

        This isn’t the ad I remember, but it’s a Cosby Coke ad with that theme: http://www.youtube.com/watch?v=YnJPk3-N92I

        And, remember, “new Coke” was formulated to taste more like Pepsi.

        • K? O'Rourke says:

          > usually kicks around marketing research circles
          Not in 1981-83 at the University of Toronto as three of the faculty were in marketing ;-)

  5. There are a bunch of situations like this for A/B comparisons being skewed.

    The search engines change so much over time that it’s difficult to measure. Google also runs many in parallel for experimentation, so they’re not even consistent there. And everyone’s customizing for users based on IP address and/or cookies. For a while, I found Bing searches better for non-technical terms (if you searched for “scala”, Google only brought back programming language hits and Bing only the prom dress company on the top page; both are more diverse now).

    A while back a “people prefer cheap wines” study was making the rounds. Most reporters took it as an opportunity to make fun of the “wine snobs”. Those who read further, also reported that survey respondents with prior knowledge of wine had different responses, and in fact preferred more expensive wines. So the point is that people who don’t profess to like wine tend to prefer very fruity wines without a lot of acidity or tannins. Hence the birth of sweet, unstructured wines like Yellow Tail. You also have the A/B comparison problem where a delicate wine (like say, a Burgundy) is drowned out by more assertive wines (like say, a Rhone).

    If you see TV displays in a store, they’re all set to super high contrast, because it looks better in A/B tests. But for long-term watching, have that set calibrated to something more reasonable.

    There’s been an ongoing race to the top in audio levels, where the baseline level is much higher (10 decibels+) than it was 20 years ago. This cuts down on dynamics and actually loses information (and compresses other signal), but louder sounds better in A/B comparisons, so louder and louder they go. And it makes it impossible for me to play my CD collection on top-level shuffle (everything’s ripped in raw format, so I haven’t level-adjusted the rips). See: http://en.wikipedia.org/wiki/Loudness_war

  6. Z says:

    Anyone know if it’s legal for a company to perform multiple trials like this and only report the ones that are favorable?

  7. Zygmunt says:

    > Bing is the forgotten and distant #2 in searches

    Exactly. Maybe people get pissed because in the big picture Microsoft’s results are so obviously out of sync with reality.

  8. SergeyK says:

    It would be interesting to know whether the original Microsoft study always kept the “Left side search engine” and “Right side search engine” as Bing and Google respectively (I hope not :), because I wouldn’t be surprised if there is a tendency to prefer the left one, since that’s the first one western people will look at.