It seemed to me that most destruction was being done by those who could not choose between the two

Amateurs, dilettantes, hacks, cowboys, clones — Nick Cave

[Note from Dan 11Sept: I wanted to leave some clear air after the StanCon reminder, so I scheduled this post for tomorrow. Which means you get two posts (one from me, one from Andrew) on this in two days. That’s probably more than the gay face study deserves.]

I mostly stay away from the marketing aspect of machine learning / data science / artificial intelligence. I’ve learnt a lot from the methodology and theory aspects, but the marketing face makes me feel a little ill.  (Not unlike the marketing side of statistics. It’s a method, not a morality.)

But occasionally stories of the id and idiocy of that world wander past my face and I can’t ignore them any more. Actually, it’s happened twice this week, which—in a week of two different orientations, two different next-door frat parties until two different two A.M.s, and two grant applications that aren’t as finished as I want them to be—seems like two times too many to me.

The first one can be dispatched with quickly.  This idiocy came from IBM who are discovering that deep learning will not cure cancer.  Given the way that governments, health systems, private companies, and  research funding bodies are pouring money into the field, this type of snake oil pseudo-science is really quite alarming. Spectacularly well-funded hubris is still hubris and sucks money and oxygen from things that help more than the researcher’s CV (or bottom line).

In the end, the “marketing crisis” in data science is coming from the same place as the “replication crisis” in statistics.  The idea that you can reduce statistics and data analysis (or analytics) down to some small set of rules that can be automatically applied is more than silly, it’s dangerous. So just as the stats community has begun the job of clawing back the opportunities we’ve lost to p=0.05, it’s time for data scientists to control their tail.

Untouchable face

The second article occurred, as Marya Dmitriyevna (who is old-school) in Natasha, Pierre, and the Great Comet of 1812 shrieks after Anatole (who is hot) tries to elope with Natasha (who is young), in my house. So I’m going to spend most of this post on that. [It was pointed out to me that the first sentence is long and convoluted, even for my terrible writing. What can I say? It’s a complicated Russian novel, everyone’s got nine different names.]

(Side note: The only reason I go to New York is Broadway and cabaret. It has nothing to do with statistics and the unrepeatable set of people in one place. [please don’t tell Andrew. He thinks I like statistics.])

It turns out that deep neural networks join the illustrious ranks of “men driving past in cars”, “angry men in bars”, “men on the internet”,  and “that one woman at a numerical analysis conference in far north Queensland in 2006” in feeling obliged to tell me that I’m gay.

Now obviously I rolled my eyes so hard at this that I almost pulled a muscle. I marvelled at the type of mind that would decide “is ‘gay face’ real?” would be a good research question. I also really couldn’t understand why you’d bother training a neural network. That’s what Instagram is for. (Again, amateurs, dilettantes, hacks, cowboys, clones.)

But of course, I’d only read the headline. I didn’t really feel the urge to read more deeply. I sent a rolled-eyes emoji to the friend who thought I’d want to know about the article and went about my life.

Once, twice, three times a lady

But last night, as the frat party next door tore through 2am (an unpleasant side effect of faculty housing, it seems), it popped up again on my Facebook feed. This time it had grown an attachment to that IBM story. I was pretty tired and thought “Right. I can blog on this”.

(Because that’s what Andrew was surely hoping for, three thousand words on whether “gay face” is a thing. In my defence, I did say “are you sure?” more than once and reminded him what happened that one time X let me post on his blog.)

(Side note: unlike odd, rambling posts about papers we’d just written, posts about bad statistics published in psychology journals are right in Andrew’s wheelhouse. So why aren’t I waiting until mid-2018 for a post on this that he may or may not have written to appear? [DS 11Sept: Ha!] There’s not much that I can be confident that I know a lot more about than Andrew does, but I am very confident I know a lot more about “gay face”.)

So I read the Guardian article, which (wonder of wonders!) actually linked to the original paper. Attached to the original paper was an authors’ note, and an authors’ response to the inevitable press release from GLAAD calling them irresponsible.

Let’s just say that there’s some drift from the paper to the press reports. I’d be interested to find out if there was at any point a press release marketing the research.

The main part that has gone AWOL is a rather long, dystopian discussion that the authors have about how they felt morally obliged to publish this research because people could do this with “off the shelf” tools to find gay people in countries where it’s illegal. (We will unpack that later.)

But the most interesting part is the tension between footnote 10 of the paper (“The results reported in this paper were shared, in advance, with several leading international LGBTQ organizations”) and the GLAAD/HRC press release that says

 Stanford University and the researchers hosted a call with GLAAD and HRC several months ago in which we raised these myriad concerns and warned against overinflating the results or the significance of them. There was no follow-up after the concerns were shared and none of these flaws have been addressed.

With one look

If you strip away all of the coverage, the paper itself does some things right. It has a detailed discussion of the limitations of the data and the method. (More on that later.) It argues that, because facial features can’t be learned, these findings provide evidence towards the prenatal hormone theory of sexual orientation (ie that we’re “born this way”).

(Side note: I’ve never liked the “born this way” narrative. I think it’s limiting. Also, getting this way took quite a lot of work. Baby gays think they can love Patti and Bernadette at the same time. They have to learn that you need to love them with different parts of your brain or else they fight. My view is more “We’re recruiting. Apply within”.)

So do I think that this work should be summarily dismissed? Well, I have questions.

(Actually, I have some serious doubts, but I live in Canada now so I’m trying to be polite. Is it working?)

Behind the red door

Male facial image brightness correlates 0.19 with the probability of being gay, as estimated by the DNN-based classifier. While the brightness of the facial image might be driven by many factors, previous research found that testosterone stimulates melanocyte structure and function leading to a darker skin. (Footnote 6)

Again, it’s called Instagram.

In the Authors’ notes attached to the paper, the authors recognise that “[gay men] take better pictures”, in the process acknowledging that they themselves have gay friends [honestly, there are too many links for that] who also possess this power. (Once more for those in the back: they’re called filters. Straight people could use them if they want. Let no photo go unmolested.)

(Side note: In 1989 Liza Minnelli released a fabulous collaboration with the Pet Shop Boys that was apparently called Results because that’s what Janet Street-Porter used to call one of her outfits. I do not claim to understand straight men [nor am I looking to], but I have seen my female friends’ Tinder. Gentlemen: you could stand to be a little more Liza.)

Back on top

But enough about methodology, let’s talk about data. Probably the biggest criticism that I can make of this paper is that they do not identify the source of their data. (Actually, in the interview with The Economist that originally announced the study, Kosinski says that this is intentional to “discourage copycats”.)

Obviously this is bad science.

(Side note: I cannot imagine the dating site in question would like its users to know that it allows its data to be scraped and analysed. This is what happens when you don’t read the “terms of service”, which I imagine the authors (and the ethics committee at Stanford) read very closely.)

[I’m just assuming that, as a study that used identifiable information about people who could not consent to being studied and deals with a sensitive topic, this would’ve gone through an ethics process.]

This failure to disclose the origin of the data means we cannot contextualise it within the Balkanisation of gay desire. Gay men (at least, I will not speak for lesbians and I have nothing useful to say about bisexuals except they’re ignored by this study [and hence this blogpost] and deserve better from the latter [and probably aren’t crying over the former]) will tell you what they want (what they really really want). This perverted and wonderful version of “The Secret” has manifested in multiple dating platforms that cater to narrow subgroups of the gay community.

Withholding information about the dating platform prevents people from independently scrutinising how representative the sample is likely to be. This is bad science.

(Side note: How was this not picked up by peer review? I’d guess it was reviewed by straight people. Socially active gay men should be able to pick a hole in the data in three seconds flat.  That’s the key point about diversity in STEM. It’s not about ticking boxes or meeting quotas. It’s that you don’t know what a minority can add to your work if they’re not in the room to add it.)

If you look at figure 4, the composite “straight” man is a trucker, while the composite “gay” man is a twunk. This strongly suggests that this is not my personal favourite type of gay dating site: the ones that cater to those of us who look like truckers. It also suggests that the training sample is not representative of the population the authors are generalising to. This is bad science.

(Terminology note: “Twunk” is the past tense of “twink”, which is gay slang for a young (18 to early-20s), skinny, hairless gay man.)

The reality of gay dating websites is that you tailor your photographs to the target audience. My Facebook profile picture (we will get to that later) is different to my Grindr profile picture is different to my Scruff profile picture. In the spirit of Liza, these are chosen for results. In these photos, I run a big part of the gauntlet between those two “composite ideals” in Figure 4. (Not all the way because, never having been a twink, I never twunk.)

(Side note: The last interaction I had on Scruff was a two hour conversation about Patti LuPone. This is an “off-label” usage that is not FDA approved.)

So probably my biggest problem with this study is that the training sample is likely unrepresentative of the population at large. This means that any inferences drawn from a model trained on this sample will be completely unable to answer questions about whether gay face is real in Caucasian Americans. By withholding critical information about the data, the authors make it impossible to assess the extent of the problem.

One way to assess this error would be to take the classifier trained on their secret data and use it to, for example, classify face pics from a site like Scruff. There is a problem with this (mentioned in the GLAAD/HRC press release): activity and identity are not interchangeable. So some of the men who have sex with men (MSM, itself a somewhat controversial designation) will identify as neither gay nor bisexual, and this is not necessarily information that would be available to the researcher. Nevertheless, it is probably safe to assume that people who have a publicly visible face picture on a dating app mainly used by MSM are not straight.
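
(A minimal sketch of that check, assuming we already had the model’s predicted probabilities for face pictures scraped from such an app. No such data or model is provided here; this is purely illustrative.)

```python
# Illustrative only: summarise a trained classifier's predictions on an external,
# MSM-dominated sample. "predicted_probs" is a hypothetical array of P(gay) values.
import numpy as np

def share_predicted_gay(predicted_probs, threshold=0.5):
    """Fraction of external faces the classifier labels gay at a given threshold."""
    predicted_probs = np.asarray(predicted_probs)
    return float((predicted_probs > threshold).mean())

# If the model generalises, this share should sit near the (assumed) high base rate
# of gay/bi-identifying users on such an app; a much lower share would point to the
# training sample not representing the wider population.
```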

If the classifier worked on this sort of data, then there is at least a chance that the findings of the study will replicate. But, those of you who read the paper or the authors’ notes howl, the authors did test the classifier on a validation sample gathered from Facebook.

At this point, let us pause to look at stills from Derek Jarman films

Pictures of you

First, we used the Facebook Audience Insights platform to identify 50 Facebook Pages most popular among gay men, including Pages such as: “I love being Gay”, “Manhunt”, “Gay and Fabulous”, and “Gay Times Magazine”. Second, we used the “interested in” field of users’ Facebook profiles, which reveals the gender of the people that a given user is interested in. Males that indicated an interest in other males, and that liked at least two out of the predominantly gay Facebook Pages, were labeled as gay.

I beseech you, in the Bowels of Christ, think it possible that your validation sample may be biased.

(I mean, really. That’s one hell of an inclusion criterion.)
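
(For concreteness, here is that inclusion rule as a sketch. The field names and the user record are hypothetical, not anything from Facebook’s actual API, and gay_pages stands in for the 50 Pages identified via Audience Insights.)

```python
# Hypothetical sketch of the validation-labelling rule quoted above.
def label_male_user(user, gay_pages):
    """Label a male user "gay" if he is interested in males AND likes at least
    two of the predominantly gay Pages; otherwise assign no label."""
    interested_in_males = "male" in user.get("interested_in", [])
    liked_gay_pages = len(set(user.get("liked_pages", [])) & set(gay_pages))
    return "gay" if (interested_in_males and liked_gay_pages >= 2) else None
```

Which is to say: the “gay” half of the validation sample is, by construction, men who are loudly and publicly out on Facebook.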

Rebel Girl / Nancy Boy

So what of the other GLAAD/HRC problems? They are mainly interesting to show the difference between the priorities of an advocacy organisation and statistical priorities. For example, the study only considered Caucasians, which the press release criticises. The paper points out that there was not enough data to include people of colour. Without presuming to speak for LGB (the study didn’t consider trans people, so I’ve dropped the T+ letters [they also didn’t actively consider bisexuals, but LG sells washing machines]) people of colour, I can’t imagine that they’re disappointed to be left out of this specific narrative. That being said, the paper suggests that these results will generalise to other races. I am very skeptical of this claim.

Those composite faces also suggest that fat gay men don’t exist. Again, I am very happy to be excluded from this narrative.

What about the lesbians? Neural networks apparently struggle with lesbians. Again, it could be an issue of sampling bias. It could also be that the mechanical turk gender verification stage (4 of 6 people needed to agree with the person’s stated gender for them to be included) is adding additional bias.  The most reliable way to verify a person’s gender is to ask them.  I am uncertain why the researchers deviated from this principle. Certain sub-cultures in the LGB community identify as lesbian or gay but have either androgynous style, or purposely femme or butch style.  Systematically excluding these groups (I’m thinking particularly of butch lesbians and femme gays) will bias the results.

Breaking the law

So what of the dystopian picture the authors paint of governments using this type of procedure to find and eliminate homosexuals? Meh.

Just as I’m opposed to marketing in machine learning, I’m opposed to “story time” in otherwise serious research.  This claim makes the paper really exciting to read—you get to marvel at their moral dilemma. But the reality is much more boring. This is literally why universities (and I’m including Stanford in that category just to be nice) have ethics committees. I assume this study went through the internal ethics approval procedure, so the moralising is mostly pantomime.

The other reason I’m not enormously concerned about this is that I am skeptical of the idea that there is enough high-quality training data to train a neural network in countries where homosexuality is illegal (which are not majority-Caucasian countries). Now I may well be very wrong about this, but I think we can all agree that it would be much harder than in the Caucasian case. LGBT+ people in these countries have serious problems, but neural networks are not one of them.

Researchers claiming that gay men, and to a slightly lesser extent lesbians, are structurally different from straight people is a problem. (I think we all know how phrenology was used to perpetuate racist myths.)  The authors acknowledge this during their tortured descriptions of their moral struggles over whether to publish the article. Given that, I would’ve expected the claims to be better validated. I just don’t believe that either their data set or their  validation set gives a valid representation of gay people.

All science should be good science, but controversial science should be unimpeachable. This study is not.

Live and let die (Geri Halliwell version)

Ok. That’s more than enough of my twaddle. I guess the key point of this is that marketing and storytelling, as well as being bad for science, get in the way of effectively disseminating information.

Deep learning / neural networks / AI are all perfectly effective tools that can help people solve problems. But just like it wasn’t a generic convolutional neural network that won games of Go, this tool only works when deployed by extremely skilled people. The idea that IBM could diagnose cancer with deep learning is just silly. IBM knows bugger all about cancer.

In real studies, selection of the data makes the difference between a useful step forward in knowledge and “junk science”.  I think there are more than enough problems with the “gay face” study (yes I will insist on calling it that) to be immensely skeptical. Further work may indicate that the authors’ preliminary results were correct, but I wouldn’t bet the farm on it. (I wouldn’t even bet a farm I did not own on it.) (In fact, if this study replicates I’ll buy someone a drink. [That someone will likely be me.])

Afternotes

Some things coming after the fact:

Comments

  1. As I mentioned on twitter, I am writing a paper which has a deep learning NN being able to correctly classify linguists as psycho linguists with 91% accuracy. Sneak preview: psycho linguists have whiter hair and slacker jaws (from the repeated surprises when they analyze their data).

  2. “[they also didn’t actively consider bisexuals, but LG sells washing machines]”

    I actually laughed out loud at that.

    But more to the point of the study, I don’t see anything wrong with asking the question of whether certain facial structures are correlated with people who identify as gay. It seems at least plausible, and anecdotally certainly seems to be the case. This study isn’t a good example of using a DNN to do that, for all the reasons you point out, but I think the question itself is interesting.

    The same kind of machine learning could be used to detect “gay voice,” another topic fraught with social landmines.

      • Agreed. I just have seen some take the legitimate criticism of this and other problematic research, and go further to suggest that the question itself is homophobic, or that the researchers’ motivation to ask the question is based on homophobia.

      • I don’t think there’s a problem questioning the motivation of the research. Some controversial research (I mentioned phrenology) is motivated by things like homophobia.

        There is also the very relevant criticism that this type of research takes marginalised communities and exploits them for publicity.

        How can you avoid those accusations? By doing the research carefully and getting it right.

        • Dan:

          Unfortunately, “doing the research carefully and getting it right” is not necessarily enough to keep someone from being accused of working in bad faith. I think if you work in a controversial area you just have to accept that your motivations may be questioned, and you just have to keep with it, if you think the topic is important enough.

          What’s not cool, in my opinion, is when people pick a controversial area, bask in the publicity, and then act all surprised when their motivations are questioned.

          And, just to be clear, I’m making zero comments one way or another on the motivations of the people who did this particular study; I’m just saying that positive and negative publicity are two sides of the same coin. I had the same problem with those psychology researchers we sometimes talk about on the blog, who don’t seem to mind adoring coverage of their controversial claims but then get angry when people publicly question the quality of their work.

        • That’s true. But getting the research right is really the only thing you can control about the whole scenario.

          I also don’t think anyone should be protected from having their motivations questioned. It can go too far (Authors’ notes include death threats, but there are a whole range of unacceptable interactions below “death threat”), but a part of the freedom to do controversial research is the requirement that you can defend it.

          Without ascribing motivation to the authors, I do think they work in a field and research culture that rewards certain types of publicity.

        • I agree with what both you and Andrew have written. To amend my comment, you are totally right that motivations can and should be questioned. And that having your motivations questioned is the price of admission if you want to study controversial topics. I was trying to say that sometimes bad faith by researchers is assumed, rather than established.

          I am sensitive to the dangers of exploiting already marginalized communities, however, it is easy to use that concern as a justification to attack research merely because it may make a particular community “look bad.” Or on the flip side, uncritically embrace poor research because it makes a particular community “look good.”

        • It doesn’t even have to be explicit bad-faith motivations; a variety of things affect what sort of explanation they are hoping to jump to at the slightest hint of an ‘effect’.

          See also
          > That’s the key point about diversity in STEM. It’s not about ticking boxes or meeting quotas. It’s that you don’t know what a minority can add to your work if they’re not in the room to add it.

          Why were they so keen to attribute the difference to neonatal testosterone exposure? And the faux worry about societal consequences is ‘hilarious’, as mentioned in the other thread.

    • the question of whether certain facial structures are correlated with people who identify as gay…This study isn’t a good example of using a DNN to do that

      They didn’t use a neural network to do that at all, though. The DNN was just used as a data reduction step. They took the output of the last fully connected layer of a pre-trained DNN (VGG-Face: trained to distinguish between different people’s faces), then did some further summarizing followed by logistic regression:

      We used a simple prediction model, logistic regression, combined with a standard dimensionality-reduction approach: singular value decomposition (SVD). SVD is similar to principal component analysis (PCA), a dimensionality-reduction approach widely used by social scientists. The models were trained separately for each gender. Self-reported sexual orientation (gay/heterosexual) was used as a dependent variable; 4,096 scores, extracted using VGG-Face, were used as independent variables.
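
      (A minimal sketch of that pipeline in scikit-learn, using random placeholder arrays since the real VGG-Face features and orientation labels are not public; the SVD dimension here is arbitrary.)

      ```python
      # Sketch of the quoted pipeline: pre-extracted features -> SVD -> logistic regression.
      # X and y are random stand-ins, so the cross-validated AUC should come out near 0.5.
      import numpy as np
      from sklearn.decomposition import TruncatedSVD
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline

      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 4096))   # stand-in for the 4,096 VGG-Face scores per image
      y = rng.integers(0, 2, size=1000)   # stand-in for self-reported orientation (0/1)

      model = make_pipeline(TruncatedSVD(n_components=100, random_state=0),
                            LogisticRegression(max_iter=1000))
      print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
      ```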

  3. When I read about what passes for science and research these days it just makes me sad. I mean, this paper not only got published… someone actually spent portions of their life carrying it out… someone spent money on this, people sat in IRBs and approved it… it was published on news sites, people received salaries and stipends and things to do this stuff…

    I need some kind of happy talk to keep me from just going insane

    https://www.youtube.com/watch?v=SJUhlRoBL8M

    • Think of it this way. At least we don’t have to shift the production possibilities frontier to be better; we just have to move a little inside what’s already attainable. Surely that’s reason for optimism!

  4. Is it possible to see value in the ability of their model to outperform human judges by a considerable margin while still recognizing that the social science part of the paper is largely junk? I too am skeptical about most of the claims in the paper, but I think the problem of identifying which dating site a picture came from is non-trivial, if the human subjects only had about 60% accuracy. Would there be a lot of problems with the paper if it had been written as a modest machine learning problem? Or am I missing something?

    • This can happen. Think of the human as a classifier trained on the whole population: you’d expect a classifier trained on the strange subpopulation to outperform that global classifier on another data set with structure similar to the training subpopulation.
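
      (A toy simulation of that point, with every number invented: a classifier fit to a skewed subpopulation beats a “global” classifier when the test data resemble that subpopulation, without that telling us much about anyone else.)

      ```python
      # Toy simulation: the subpopulation's signal lives in different features than the
      # population-level signal, so the "global" classifier is useless on the
      # subpopulation-like test set while the subpopulation classifier does well.
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(1)

      def draw(n, signal_dims):
          y = rng.integers(0, 2, size=n)
          X = rng.normal(size=(n, 10))
          X[:, signal_dims] += y[:, None]   # class shifts only the listed features
          return X, y

      X_glob, y_glob = draw(5000, signal_dims=[0, 1])   # wide-population signal
      X_sub, y_sub = draw(5000, signal_dims=[5, 6])     # subpopulation-specific signal

      global_clf = LogisticRegression().fit(X_glob, y_glob)
      sub_clf = LogisticRegression().fit(X_sub, y_sub)

      X_test, y_test = draw(2000, signal_dims=[5, 6])   # test set resembles the subpopulation
      print("global classifier accuracy:", global_clf.score(X_test, y_test))       # ~0.5
      print("subpopulation classifier accuracy:", sub_clf.score(X_test, y_test))   # well above 0.5
      ```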

      • Certainly this depends on if the model is validated similarly across both datasets (as well as being structurally similar).

        I took Joe’s comment to be more along the lines of: could we appreciate that the authors created a classifier for the data they had and that it did improve upon a “control”? The answer to that, I think, is yes. It’s neat to make classifiers and to model all kinds of things. Does that mean we should all run around writing papers hyping the models we’ve made? Probably not.

        Maybe a blog post would have been more appropriate.

    • Joe:

      Yes, as I wrote in my post yesterday, it’s fine as a classification exercise, and it can be interesting to see what happens to show up in the data (lesbians wear baseball caps!), but the interpretation is way over the top.

  5. Thanks for this distraction – I was starting to feel ill realizing how similar a statistical methods paper is getting to this, with its limitation statement being: if the study has XYZ technical problems (which all studies realistically will have, to varying degrees), our methods won’t apply.

    As David Donoho puts it “glitter and deceit”.

  6. I did a brief skim of the IBM article and I don’t see anything particularly abhorrent in IBM’s approach or discussion of Watson. The commercial marketing may have been over the top (I never saw an advertisement but for argument’s sake let’s say they stretched) but a company hyping their “new” technology isn’t exactly novel.

    I am a little perplexed at their approach of using literature, case studies and the like to train the model but then letting doctors, from one country, from one hospital, more or less have the final say for all doctors in all countries. What exactly was the point of the model, I wonder? Really this thing is an information retrieval engine for the medical literature…

    • IBM does a lot of ML things that make little sense to me. For instance, they got a lot of publicity over a tool which could supposedly predict your big-5 scores from a text sample.

      But seen from a machine learning perspective, big-5 scores are just an intermediate representation. An embedding (ML people call them embeddings, I know they’re technically projections) of what people answer on big-5 questionnaires. A dead simple linear one at that. Why not use the raw answers from the big-5 questionnaires to predict whatever it was that you were trying to predict with the big-5 scores?

      I think it’s because IBM are institutionally in love with dimensionality reduction techniques applied to personality, cultural variation, competence, intelligence, politics etc. They think they represent profound truths, even when they get few concrete answers out of them.

      I suppose it makes sense if their big business customers are also in love with their posh charts. Which they probably are.

  7. Probably the biggest criticism that I can make of this paper is that they do not identify the source of their data. (Actually, in the interview with The Economist that originally announced the study, Kosinski says that this is intentional to “discourage copycats”.)

    Obviously this is bad science.

    If this is genuinely your biggest criticism, then this is a very good paper! Sometimes good ethics are a bar to good science, and I’m a lot less sanguine than it seems most of the commentariat is about making this replicable.

    • “I’m a lot less sanguine than it seems most of the commentariat is about making this replicable.”

      Can you reword that to make it clearer what you’re saying? Are you saying that you think it’s never going to happen because there are obstacles to releasing the data, or are you saying that even if they released the data you think people would find that no one can replicate the result?

      My personal take on this is that it seems likely that developmental biology events affect many aspects of development, including bone development and brain development, so it seems likely that the distribution of facial features among self-identified gay people likely *is* different than the distribution of facial features of self-identified straight people (also, that the distribution of facial features of people with diabetes is different from the distribution of facial features of people with MS… or whatever… lots of differences exist in populations and are related to biology), but that this paper seems to have completely failed to actually do a good job of investigating this claim. It’s about a convenience sample on DATING websites. Any result here is going to be interpretable purely in terms of “people who seek dates on site X look different from people who seek dates on site Y”

      • regarding “people who seek dates on site X look different from people who seek dates on site Y”..

        If I am reading the paper correctly, the gay population and straight population were from the same website. “We obtained facial images from public profiles posted on a U.S. dating website. We recorded 130,741 images of 36,630 men and 170,360 images of 38,593 women between the ages of 18 and 40, who reported their location as the U.S. Gay and heterosexual people were represented in equal numbers. Their sexual orientation was established based on the gender of the partners that they were looking for (according to their profiles).”

        • This is true. The first problem is that the people who use that site are probably not representative of the population of white Americans. The second problem is that it’s likely that gay and straight users of the site self-selected using *different* criteria (the Balkanisation I talked about), which means that you’re not comparing two populations that differ in only one feature. This makes the causal interpretation (that this provides evidence for a certain hormone-based theory of why people are gay) incorrect.

      • I’m saying that I think making the data available has worse potential consequences than other people commenting here seem to think. A party that seeks to use these tools for nefarious purposes would likely want to start by replicating the work reported in the paper prior to using the tools on some other set of data; so I think that the authors’ choice to stymie that straightforward approach by not making the data available is justified by ethical concerns, “bad science” be damned. (We all know that converting the rawest of raw data into a form suitable for analysis requires a significant proportion of the total effort needed to carry out an analysis.)

    • Corey:

      Maybe not bad science but non-science – to be scientific means to set out how others in the community can bring out what you did and discovered for themselves – without you.

      Lots of good non-science things to do and they can be important and/or profitable.

      My guess is that various nefarious regimes are already working on this with more relevant data sets that have labelled outcomes determined by informants and direct surveillance. To get an excuse to talk about these possibilities, they should only have needed to report some preliminary success that could be verified by other researchers, for instance ones whose REB will attest to their proper ethical conduct (which should never bar the assessment of claimed success).

      But sizzle is better than steak, and glitter and deceit (really starting to like this expression of Donoho’s) is the current academic high road.

      Good thing there are alternatives to being in academia and struggling to avoid doing bad science.
      (The percentage that does avoid doing bad science is unknown and I am hoping it’s at least 50%.)

        • I don’t know, claiming a thing is science doesn’t make it so. I mean, I’d rather say something like “it’s masquerading as science”. There is these days a tendency to say “science is whatever scientists do” and “scientists” are just “people trained by universities in science”, but I think this significantly devalues the word “science”. As we’ve clearly seen over the last few years here on Andrew’s blog, lots of stuff that is “done by ‘scientists’” is non-scientific, little more than superstition: “the gods (Stata) told me in a dream (script) that I was the chosen one (p was less than 0.05)”.

        • That’s fine. I don’t want to be the “science” gatekeeper (being, as I am, fallible). I’d rather take people on their word and then write unbearably long blog posts castigating them for doing it badly.

        • Have to agree, science is what scientists do, so this is just degenerate science in that others cannot replicate it even if they wanted to.

        • Keith:

          We sometimes use the term “junk science” or “cargo cult science” for Kanazawa- or Wansink-like work that has the form of science but which because of lack of understanding and noisy data cannot have any hope of directly advancing scientific knowledge.

        • Keith: If you can claim, “science is what scientists do,” it seems you can equally claim, “scientists are people who do science,” so this becomes circular and meaningless.

        • Not really. If you self identify as a scientist, then your claims should be critiqued scientifically. If no one else would identify you as a scientist, things typically fall apart at this point.

        • This is fine so long as only a small minority of scientists are “scientists” with scare quotes. When whole fields are full of “scientists” then … not so much. That’s more or less what seems to have happened to Meehl, he was one of the few people who could see that vast swaths of the rest of his field were cargo-cultists, but one guy against a whole cult?

  8. I have some questions arising from the discussion of Watson. Let’s say that Watson has a well-refined model for the treatment of MALT lymphoma (a gastric cancer) and one day a paper comes out that reports “Surprise! A bacteria named Helicobacter pylori may well be the cause and if your patient is infected eradicating it will do the trick, often in dramatic fashion.” How do you model that? Is Watson supposed to update its recommendation (maybe “give chemo but also check for …”) or throw it away and adopt the new one? How do you model evidential plausibility? How could Watson possibly model not just a potential shift (supplementation/modification) in the current treatment paradigm but its sudden collapse and replacement by one never previously considered? Given the number of such recent discoveries (oftentimes published in non-elite journals precisely because they threaten prevailing dogma) it’s difficult for me (a long time participant in IBM’s dividend re-investment plan, alas) to see how Watson, as imagined, can be anything other than (a) a poor substitute for a committed and open-minded cancer doc up on her literature; (b) a very expensive Mechanical Turk; or, (c) both.

    • Personally, I think that this is actually an area where Watson has a whole lot of potential, but would take a long time to realize it. Since people want results tomorrow, this may never happen.

      Why do I think Watson has a lot of potential? Well, its strength is that it can review a very large number of documents in an amount of time that is simply impossible for human researchers. This is why Watson won Jeopardy; all the answers to the questions asked are sitting in Wikipedia, and it can easily parse all of Wikipedia.

      How does that translate to utility in medical research? I don’t think that’s by being up on all the latest research papers. Rather, it’s the issue of medical records. There is a vast amount of information in medical records, but trying to aggregate them to find patterns is simply not possible for human researchers. If Watson was able to answer the question “What set of symptoms are over-represented in the last 2 weeks in this 20 mile radius?”, that would be hugely beneficial.

      However, that’s a much more difficult question to answer than copying and pasting together answers from Wikipedia. So I think a good deal of enhancement to Watson would be required before this utility was ever reached.

    • I think something like Watson is being taken as if it were some kind of strong AI, when in reality it should be seen as a kind of fancy database retrieval system. The thing that’s needed is the skill of “being a doctor who can use and interpret sophisticated database retrieval systems”. Perhaps in the future doctors will learn less about, say, the microanatomy of the dermis, and more about, say, how to ask Watson for a quick run-down on the microanatomy of the dermis, and how to interpret what it gives back.

        • This is fantastic. But since humans do seem to be bad at integrating information for decisions (Meehl’s “little book” on clinical versus actuarial judgment), we might add a further Watson layer that says: these are the possible afflictions, with, say, predicted probabilities of the patient suffering each, or some more easily interpreted score. For what it’s worth, Meehl also included consideration of what he called “broken legs” (due to an extended analogy) — humans can make better judgments than algorithms, but mostly when they have access to information that the algorithm cannot/does not account for. I’m not sure what the right way of doing this is, but some *systematic* combination of human judgment and machine learning algorithm seems plausible.

        • Seth’s comments seem on point. Adding to this, the following quote from the article struck me:

          “Pilar Ossorio, a professor of law and bioethics at University of Wisconsin Law School, said Watson should be subject to tighter regulation because of its role in treating patients. “As an ethical matter, and as a scientific matter, you should have to prove that there’s safety and efficacy before you can just go do this,” she said.

          Norden dismissed the suggestion IBM should have been required to conduct a clinical trial before commercializing Watson, noting that many practices in medicine are widely accepted even though they aren’t supported by a randomized controlled trial.

          “Has there ever been a randomized trial of parachutes for paratroopers?” Norden asked. “And the answer is, of course not, because there is a very strong intuitive value proposition. … So I believe that bringing the best information to bear on medical decision making is a no-brainer.””

          Norden’s comment seems pretty scary to me.

    • “Watson” isn’t going to come up with anything not in the data. It is a classifier trained via supervised learning. Maybe it would come up with some cheaper way of distinguishing between different cancers/patient needs. However, apparently doctors don’t agree enough, so there is no way to train it for this:

      But three years after IBM began selling Watson to recommend the best cancer treatments to doctors around the world… It is still struggling with the basic step of learning about different forms of cancer.

      https://www.statnews.com/2017/09/05/watson-ibm-cancer/

      It sounds like they are just feeding the algorithm conflicting information then complaining when it doesn’t make a confident decision…

      “It’s been a struggle to update, I’ll be honest,” said Dr. Mark Kris, Memorial Sloan Kettering’s lead Watson trainer. He noted that treatment guidelines for every metastatic lung cancer patient worldwide recently changed in the course of one week after a research presentation at a cancer conference.

      […]

      its treatment recommendations are not based on its own insights from these data. Instead, they are based exclusively on training by human overseers, who laboriously feed Watson information about how patients with specific characteristics should be treated.

      […]

      Given the same clinical scenario, doctors can — and often do — disagree about the best course of action, whether to recommend surgery or chemotherapy, or another treatment.

      […]

      The system is essentially Memorial Sloan Kettering in a portable box. Its treatment recommendations are based entirely on the training provided by doctors, who determine what information Watson needs to devise its guidance as well as what those recommendations should be.

      […]

      That training does not teach Watson to base its recommendations on the outcomes of these patients, whether they lived, or died or survived longer than similar patients. Rather, Watson makes its recommendations based on the treatment preferences of Memorial Sloan Kettering physicians.

      […]

      In Denmark, oncologists at one hospital said they have dropped the project altogether after finding that local doctors agreed with Watson in only about 33 percent of cases.

      […]

      Sometimes, the recommendations Watson gives diverge sharply from what doctors would say for reasons that have nothing to do with science, such as medical insurance.

      https://www.statnews.com/2017/09/05/watson-ibm-cancer/

      So… the current state of the art is to make arbitrary decisions, and “Watson” is simply detecting this.

      • Yes, I think Watson does not have a ton of utility in predicting diagnoses. By the time a doctor has written down a medical record, they are already prepared to give their diagnosis. Given that all doctors have variety in what and how they record things, the best Watson could hope to do is give back the diagnosis the doctor already came up with by the time they filled out the record.

        My point is that medical records should have a lot of information in them, but they are *extremely* hard to use due to their non-uniformity. At a recent talk I attended, one of the researchers was discussing how much time they had to spend extracting data from medical records and stated “The expression goes: once you’ve seen one VA’s records…you’ve seen one VA’s records”.

        Presumably, medical records are information rich documents (I’m saying presumably as I’ve never read one). But using all this data takes too much time to parse key bits of information from. What is needed is a super-fancy database system that can easily query this messy data set.

      • Ojm:

        We have serious concerns about the ability of authoritarian states to use our ESP research to zap people’s minds. We’ll publish this work anyway and splash it all over the news media, but for the record we are very attuned to these important ethical issues.

        • It would be irresponsible not to consider the dangerous implications of this research. In addition, we are fully aware of the risks of working on such a taboo project. It would be easy for us to be politically correct and avert our eyes from the science of ESP but we must reluctantly admit that our research is trailblazingly innovative.

          In addition, we are aware that unscrupulous businesspeople could easily make BILLIONS OF DOLLARS off our technology. This bothers us very much, and we certainly would NOT want any of these evildoers to consider PAYING US to explain in more detail exactly how our SECRET SAUCE works and why they should definitely avoid the many juicy opportunities for PROFITS in this work.

          (To give Daryl Bem his due, it seems that at some level he realized his ideas were all bogus and I’ve not heard any rumors that he tried to monetize them.)

        • For what it’s worth – Daryl Bem may have dropped off a bit for personal reasons:
          https://www.nytimes.com/2015/05/17/magazine/the-last-day-of-her-life.html?_r=0

          (On the other hand, I heard him discussing working on the ESP stuff on a bus from NYC to Ithaca circa 2013 – I think he really believes it. It might be bad science, but he’s not a cynical “exploiter”. I think he *wants* it all to be true-moreover, he’s convinced it is.)

        • I appreciate the parody, Andrew. You might appreciate the irony that what got the reproducibility crisis started was that people concluded that Bem’s methods must be faulty, *not* because they were familiar with the evidence in that field, but because they *already knew*, a priori, that his findings couldn’t be true. I believe that you might be an exception here, but I’m pretty sure that 95% of the people coming to that conclusion had no grasp of the empirical literature on ESP, so their priors were literally uninformed. That’s such an obviously awful way to do science and the fact that this still happens today and everybody jumps on the bandwagon tells me we still have a long way to go to keep our social biases out of science. Just one thought: how do people reasoning like that ever expect to make new and even groundbreaking discoveries (which the future no doubt holds in store for us)?

        • Alex:

          I think it’s fine for people to study things that they think might not be real, and to investigate anomalies, and it should also be fine for people to be open to the idea that their data don’t provide the answer they want. (Recall Tukey’s “aching desire” quote.)

          My problem with the studies of Bem, Kanazawa, Cuddy, Fiske, Gilbert, etc., is not that the phenomena they care about can’t exist or don’t exist, but that their measurement tools are too crude to have a chance of making the discoveries they want to make.

          To put it another way: Galileo discovered the moons of Jupiter. To do that he had to invent the telescope. Bem etc. are not inventing any telescopes; they’re just screwing around.

          Some of the blame for this situation also goes to the promoters of statistical methods based on null hypothesis significance testing. Possibly well-intentioned but definitely methodologically ignorant researchers such as Bem, Kanazawa, Cuddy, Fiske, and Gilbert are misled by their low p-values into thinking that they have strong evidence. The whole system disincentivizes them from putting any serious thought or effort into measurement.

        • Another way to put this is they took their models way too seriously.

          Randomized trials and adequate meta-analyses are an attempt to get close to the properties that follow from the idealizations involved. But you never get that close in practice, and so it becomes a question of “the studies and their analyses were not as good as the assumptions we made” versus ESP.

          Folks were arguing that you need to have a very skeptical prior to ignore the combined likelihood (evidence), when they should have been thinking that our prior on the assumptions being so closely met is just way too hopeful.

        • Andrew, Keith,
          I don’t disagree with your points, but I’m making a different point. I’m not talking about the researchers, but about the people judging the research. My claim here is that 95% of these people were not in any way familiar with the ESP literature and judged Bem’s paper on ideological grounds, by simply rehearsing mainstream science’s metaphysical commitment to materialism. If I’m right, then this is one more case of scientists finding only their own biases in the data. If that’s not worrisome, then I don’t know what is.

        • Openness is fine, but probably a red herring in this context. It’s about applying the same standards of informed criticism across the board. People commenting on this blog are usually careful pointing out when something is not their area of expertise. I don’t recall people doing this with Bem: “I’m not an expert in the field of ESP, but…”, “I think X and Y, but then I’m not familiar with the research literature in ESP…”.

          Why not? What makes it OK – scientifically acceptable – to dismiss ESP without knowing the field and without qualifying your judgment, while in most other areas this would be considered absolutely unacceptable?

        • Alex:

          It’s complicated because it also seems that bad work on ESP, embodied cognition, etc., gets a break because journals want to be open to innovation. In terms of measurement and statistics, Bem’s paper was just horrible, and it would’ve been just as horrible had he been studying a phenomenon whose connections to existing science are better understood. But maybe JPSP published this crappy paper in part because (a) if the claims really had been supported by experimental data, the claims would’ve been newsworthy, and (b) the editors didn’t want to be censors.

        • Andrew,
          I think we’re talking past each other. You’ve come to the conclusion that Bem’s findings don’t hold up because you examined the methods he used. I claim (and I might be wrong, but that’s the point I’m making) that the vast majority of people forming a negative opinion on his paper did not do this. They rejected it because of a scientifically uninformed belief that ESP can’t be real (maybe because “there’s no supernatural stuff”) or that all this research is bogus and performed by people who just “want to believe” or by outright fraudsters.

          I suspect that this fact played a considerable role in academia’s broad acceptance of the reproducibility crisis. If a study shows that ESP is real, there surely must be something wrong with its methods! If these are the methods psychology uses routinely…, o my! Again, I might be completely off, but so far no one here has engaged the issue.

        • I have yet to hear any ESP researcher give a mechanistic model for how a person shuffling cards in Milan can transmit information about the card draws to Seattle and cause a person in Seattle to predict them with some accuracy or the like. Until a mechanism is hypothesized I think it’s appropriate to write much of this stuff off as fundamentally violating known and highly validated physics in favor of known to be problematic statistical techniques for detecting deviations from randomness where we have no compelling reason to believe the literal truth of the randomness model in the first place.

        • Daniel,

          your heuristic will probably give good results in most cases, but the problem is that it is of little help when it comes to discovering phenomena outside the currently known and accepted boundaries of physical theory – unless, that is, you’re willing to assume that there is nothing to discover outside those boundaries or a priori specifiable extensions to those boundaries. I think that would be a very foolish (and arrogant) assumption, given the history of science.

          This point is not just abstractly relevant but has a particular urgency today, since there is actually a gaping hole in the middle of metaphysical materialism. That hole is consciousness (here meant *only* in the sense of “phenomenal experience” or “qualia” (https://plato.stanford.edu/entries/qualia/)). Everything we experience, learn and know about the world is mediated through consciousness. In fact, the one and only thing we know about with certainty is that we are conscious, everything else can be doubted (e.g. solipsism could be true). And the current state of evidence and theorizing is that consciousness cannot be reduced to or explained by physical concepts, even more, that there is not even a conceivable mechanism in the vicinity of physical theory that could account for it. Consciousness is as far away from any physical thing as could possibly be.

          That means, consciousness both exists with a certainty that is not attached to any physical thing and it strongly violates your heuristic of judging a phenomenon to be real only when a conceivable mechanism based on “known and highly validated physics” can be hypothesized for it.

        • I have no interest in getting into an argument over the “true nature” of consciousness or whatever. But I’m *totally* open to considering new phenomena not explained by current physics. All you have to do is explain your mechanistic theory and how it can be manipulated, and tested. ESP research (that I’ve seen) doesn’t do that; it seeks to “detect anomaly” without any explanation for how such an anomaly can arise. I say let ESP researchers waste their time ignored by people like me who have no interest in “detecting anomalies” until such time as they have detected sufficient anomalies to hypothesize a mechanism, a theory of how the whole thing works, one which is specifically testable and manipulatable.

          A theory is a scientific theory to the extent that it makes specific predictions of what happens under various experimental manipulations and those predictions are validated against such experiments. ESP and a large fraction of psych research doesn’t do that. It “predicts” that if “nothing is going on” that observed data will be “as if from a null-random number generator” and then “detects a deviation from this null hypothesis”. That simply isn’t science whether you call it ESP or Psychology research, or Evidence Based Medicine, or whatever.

        • : I have no interest in getting into an argument over
          : the “true nature” of consciousness or whatever.

          Of course you don’t. But let me reassure you, what Monty Python once said about the human body is also true of consciousness: “There’s nothing filthy or disgusting about consciousness, except for the intestines and parts of the bottom.”

          Really, I’m fine with your approach.

        • Alex, I’m one of the people you speak of who heard Bem’s conclusion and immediately figured “that can’t be right”. And I am not familiar with the literature on ESP, and I’m enough of a skeptic that I don’t really think ESP “can’t” exist – only that what Bem claimed to discover would be shocking.

          Do you think it’s wrong for people to casually apply the principle that extraordinary claims require extraordinary evidence? Bem claimed that there are psychological phenomena in which effect precedes cause. That’s pretty bold. If I see someone claim evidence for this based on p-values, my default assumption is going to be p-hacking, incorrectly specified models, or random chance. I’m not saying pre-cognition is impossible, only that other explanations seem more likely.

          The Global Consciousness Project provides another similar example: http://noosphere.princeton.edu/results.html

          They claim that they have standard normal RNGs set up worldwide, and that when they specify the times of “global events that engage our hearts and minds” and extract z-values from their RNGs at these specified times, these z-values aggregate to a Stouffer’s Z value that has a “one in a trillion” odds of occurring by chance.

          I don’t believe that “global consciousness” is pushing RNGs to create positive z-values. But I admit that I haven’t looked into the literature. My presumption is that they’ve made a mistake somewhere – my best guesses are that either their RNGs aren’t really normal and they crank out “z” values with expectation > 1, or that there is some data-dependent selection going on. But I don’t know this, and I wouldn’t bet my life on the global consciousness hypothesis being false. I just look at these results and figure that the chance they’re missing something greatly exceeds the chance they’ve discovered a way to detect changes in global consciousness. Is that unreasonable?
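
          (For reference, Stouffer’s method just sums the z-values and rescales by the square root of the number of events, so even a small systematic miscalibration of the per-event z-values snowballs into a huge aggregate Z once there are enough events. A toy illustration with invented numbers:)

          ```python
          # Toy illustration: Stouffer's Z = sum(z_i) / sqrt(n). A slight systematic bias
          # in the per-event z-values (here mean 0.15 instead of 0) yields a "significant"
          # aggregate Z purely by construction once n is large.
          import numpy as np

          rng = np.random.default_rng(3)
          n = 500                                       # number of "global events"
          z = rng.normal(loc=0.15, scale=1.0, size=n)   # slightly miscalibrated z-values
          stouffer_z = z.sum() / np.sqrt(n)
          print(stouffer_z)                             # expected around 0.15 * sqrt(500), i.e. about 3.4
          ```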

        • Ben:

          I’ve thought about the question “is this reasonable?” and I think the best answer I can give is that, yes, your (and Daniel’s) approach strike me as very reasonable, *given* the background assumptions of current metaphysical materialism.

          However, I also think that if we take seriously the massive irregularity that consciousness represents in a materialistic framework, and consider the history of our current science’s metaphysical commitment, the landscape shifts and standards of reasonability along with them. I have said this above: the thing most familiar to us, conscious experience, is at the same time something that does not fit in any way into the materialistic explanatory edifice of science. It’s an anomaly on a scale that should make any – reasonable? – person very skeptical of claims that we’ve basically figured out how the world works, and new discoveries will fit neatly into existing categories – no surprises.

          At the same time, looking into history, materialism has only achieved its current status as the dominant world view by systematically neglecting the conscious aspects of phenomena. It *has* been successful, yes! But only at the cost of leaving consciousness unexplained. The phenomenon of heat, for example, could be ± reduced to the motion of molecules, but only by leaving out the aspect of subjective conscious experience of heat. Or colors: there are some (not always exception-free) mappings of physical dimensions (such as wavelength) to the perceptual qualities of colors, but the question of why and how we have conscious color experiences in the first place remains as mysterious as ever.

          There’s even a good reason to skip over the conscious aspects of phenomena: we would simply be stumped if we had to explain the conscious aspects along with the physical aspects. Also, the consciousness-excluding accounts of many phenomena work very well, so we may leave the conscious aspects aside. That’s reasonable again, but there’s a danger that we forget the price of our strategy and start believing that we’re actually doing a pretty good job at explaining *everything* materialistically that comes our way. But consciousness is still there, it’s still big, and still unexplained.

          What this means for me is that I don’t give particular weight to our current materialistic commitments. There’s nothing particularly compelling about them. It also means that I have a clear expectation that the phenomenon of consciousness might be the gateway to a world of fundamentally different stuff that obeys different rules than the physics we know (in fact, quantum physics already points in that direction).

          So when you say, for example, that “extraordinary claims require extraordinary evidence”, then the question really is, what is an extraordinary claim? For me, an extraordinary claim is the widely shared belief that consciousness would yield smoothly to an explanation in terms of current physics. And that’s not *just* an opinion of mine, it is based on decades of philosophers and scientists trying and coming up empty. It is based on the fact that not only can we not *explain* consciousness, we cannot even conceive of a *possible* physical explanation.

          Or take the claim that ESP phenomena would violate highly validated physics. Such a claim, for me, doesn’t amount to much, because it assumes that what we now know or believe to know about the laws of physics will forever be valid. Not reasonable to me, this one, rather foolish.

          There’s one thing that strikes me about many “mainstream” scientists’ stance towards ESP, which I regard as further evidence that their rejection of ESP is the result of social processes, not of scientific evaluation. That’s the fact that these scientists seem strangely uninterested in examining phenomena that *could* be related to ESP. You, Ben, for example tell me that your reaction to researchers reporting that events of widespread emotional interest (such as the 9/11 terrorist attacks) produce deviations from randomness in the output of random-event generators is to believe that they must have fucked up their statistical analysis. That’s of course possible. What I don’t understand is that for many scientists, such a possible explanation is enough to reassure them that the matter is not worthy of further investigation. In other words, they just assume that if somebody comes up with a *potential* explanation *not* in terms of ESP, that must be the *actual* explanation.

          Why is everybody not thinking: wow, this REG thing would be the most ground-breaking discovery since , we absolutely have to check these results, examine them critically, do our own experiments, reproduce them or not etc. After all, it seems that testing deviations from randomness should be a statistically easy thing to do.

          This holds for many more ESP-ish things. A lot of people know the experience of a baffling synchrony of thought/speech in people strongly emotionally related to each other (spouses etc). Prima facie, this is striking, and no one denies it. Why is that not a major research topic? Because, even if this has nothing to do with ESP, it is a striking phenomenon, something we’d want to know about. Something that could teach us a lot about how minds and brains work. However, it seems that scientists are socially conditioned to avoid such research questions because they are often brought up in an ESP context, which is socially taboo, and so they shun it. Being scientists, they at least want some minimal assurance that ESP is not involved, so they seize on any “conventional” account that *could* explain the phenomenon and their minds are appeased.

          Overall, I think that given the sheer amount of anecdotal evidence of potential ESP phenomena, the fact that many people have either personally experienced something of this sort or heard it from people whose judgements they trust, also the fact that there might be a sizeable file drawer of “anomalous” data that never gets published (either because researchers don’t take it seriously, don’t want to publish it because it could harm their reputation, or because journals would not publish it anyway), should at least make many more researchers very *curious* about these phenomena. Nobody is going to convince me that the fact that we don’t see more research in such areas is only due to scientific reasonableness. It is to a large extent a social phenomenon.

  9. For those who didn’t have the energy to follow all of Dan Simpson’s links, there is a great line in the linked article by Greggor Mattson:

    “In Maciej Ceglowski’s memorable phrase, “Machine learning is like money laundering for bias.” ”

    PS – I know the discussion here is about serious stuff, so excuse my levity, but are Dan’s classes like his last two posts?? If so, can I get the class notes?

    PPS – I think I have followed the reasoning on about half of the links in the two posts. I must be getting slow!

  10. This was cathartic, thanks!

    The other thing that comes to mind is that it’s probably still pretty common to be closeted even in America, unfortunately (e.g. http://www.nytimes.com/interactive/2013/12/08/sunday-review/where-the-closet-is-still-common.html). It seems to me like closeted and “out” populations probably vary substantially in the typicality of their gender presentation, which means you’re getting a biased subset of gay men/lesbians even before you consider the self-selection bias associated with using Tinder vs. OKCupid vs. Scruff vs. Craigslist vs. Christian Mingle.
