Extinct Champagne grapes? I can be even more disappointed in the news media

Happy New Year. This post is by Lizzie.

Over the end-of-year holiday period, I always get the distinct impression that most journalists are on holiday too. I felt this more acutely when I found an “urgent” media request in my inbox when I returned to it after a few days away. Someone at a major reputable news outlet wrote:

We are doing a short story on how the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction. We were hoping to do a quick interview with you on the topic….Our deadline is asap, as we plan to run this story on New Years.

It was late on 30 December so I had missed helping them, but I still had to reply that I hoped they found some better information, because 'the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction' was not good information in my not-so-entirely-humble opinion, as I study this and can think of zero-zilch-nada evidence to support it.

This sounded like the sort of insane news I would expect from far less reputable media outlets. I tracked down what I assume was the lead they were following (see here), and it seems to trace back to an AI start-up, which I will not do the service of naming, that is just looking for more press. They seem to put out splashy-sounding agricultural press releases often, and so they must have put out one about Champagne grapes being on the brink of extinction to go with New Year's.

I am on a bad roll with AI just now, or, more exactly, with the intersection of human standards and AI. There's no good science showing that "the climate crisis is causing certain grapes, used in almost all champagne, to be on the brink of extinction." The whole idea of this is offensive to me when human actions are actually driving species extinct. And it ignores tons of science on winegrapes and the reality that they're pretty easy to grow (growing excellent ones? Harder). So, poor form on the part of the zero-standards-for-our-science AI startup. But I am more horrified by the media outlets that cannot see through this. I am sure they're inundated with crazy bogus stories every day, but I thought their job was to report on the ones that matter and that they have some evidence are true.

What did they do instead? They gave a platform to a "highly adaptable marketing manager and content creator" to talk about some bogus "study," and a few soundbites to a colleague of mine who actually knew the science (Ben Cook from NASA).

Here’s a sad post for you to start the new year. The Onion (ok, an Onion-affiliate site) is plagiarizing. For reals.

How horrible. I remember when The Onion started. They were so funny and on point. And now . . . What’s the point of even having The Onion if it’s running plagiarized material? I mean, yeah, sure, everybody’s gotta bring home money to put food on the table. But, really, what’s the goddam point of it all?

Jonathan Bailey has the story:

Back in June, G/O Media, the company that owns A.V. Club, Gizmodo, Quartz and The Onion, announced that they would be experimenting with AI tools as a way to supplement the work of human reporters and editors.

However, just a week later, it was clear that the move wasn’t going smoothly. . . . several months later, it doesn’t appear that things have improved. If anything, they might have gotten worse.

The reason is highlighted in a report by Frank Landymore and Jon Christian at Futurism. They compared the output of A.V. Club’s AI “reporter” against the source material, namely IMDB. What they found were examples of verbatim and near-verbatim copying of that material, without any indication that the text was copied. . . .

The articles in question have a note that reads as follows: “This article is based on data from IMDb. Text was compiled by an AI engine that was then reviewed and edited by the editorial staff.”

However, as noted by the Futurism report, that text does not indicate that any text is copied. Only that “data” is used. The text is supposed to be “compiled” by the AI and then “reviewed and edited” by humans. . . .

In both A.V. Club lists, there is no additional text or framing beyond the movies and the descriptions, which are all based on IMDb descriptions and, as seen in this case, sometimes copied directly or nearly directly from them.

There’s not much doubt that this is plagiarism. Though A.V. Club acknowledges that the “data” came from IMDb, it doesn’t indicate that the language does. There are no quotation marks, no blockquotes, nothing to indicate that portions are copied verbatim or near-verbatim. . . .

Bailey continues:

None of this is a secret. All of this is well known, well-understood and backed up with both hard data and mountains of anecdotal evidence. . . . But we’ve seen this before. Benny Johnson, for example, is an irredeemably unethical reporter with a history of plagiarism, fabrication and other ethical issues that resulted in him being fired from multiple publications.

Yet, he’s never been left wanting for a job. Publications know that, because of his name, he will draw clicks and engagement. . . . From a business perspective, AI is not very different from Benny Johnson. Though the flaws and integrity issues are well known, the allure of a free reporter who can generate countless articles at the push of a button is simply too great to ignore.

Then comes the economic argument:

But in there lies the problem, if you want AI to function like an actual reporter, it has to be edited, fact checked and plagiarism checked just like a real human.

However, when one does those checks, the errors quickly become apparent and fixing them often takes more time and resources than just starting with a human author.

In short, using an AI in a way that helps a company earn/save money means accepting that the factual errors and plagiarism are just part of the deal. It means completely forgoing journalism ethics, just like hiring a reporter like Benny Johnson.

Right now, for a publication, there is no ethical use of AI that is not either unprofitable or extremely limited. These “experiments” in AI are not about testing what the bots can do, but about seeing how much they can still lower their ethical and quality standards and still find an audience.

Ouch.

Very sad to see an Onion-affiliated site doing this.

Here’s how Bailey concludes:

The arc of history has been pulling publications toward larger quantities of lower quality content for some time. AI is just the latest escalation in that trend, and one that publishers are unlikely to ignore.

Even if it destroys their credibility.

No kidding. What next, mathematics professors who copy stories unacknowledged, introduce errors, and then deny they ever did it? Award-winning statistics professors who copy stuff from wikipedia, introducing stupid-ass errors in the process? University presidents? OK, none of those cases were shocking, they’re just sad. But to see The Onion involved . . . that truly is a step further into the abyss.

Plagiarism means never having to say you’re clueless.

In a blog discussion on plagiarism ten years ago, Rahul wrote:

The real question for me is, how I would react to someone’s book which has proven rather useful and insightful in all aspects but which in hindsight turns out to have plagiarized bits. Think of whatever textbook, say, you had found really damn useful (perhaps it’s the only good text on that topic; no alternatives) and now imagine a chapter of that textbook turns out to be plagiarized.

What’s your reaction? To me that’s the interesting question.

It is an interesting question, and perhaps the most interesting aspect to it is that we don’t actually see high-quality, insightful plagiarized work!

Theoretically such a thing could happen: an author with a solid understanding of the material finds an excellent writeup from another source—perhaps a published article or book, perhaps something on wikipedia, maybe something written by a student—and inserts it directly into the text, not crediting the source. Why not credit the source? Maybe because all the quotation marks would make the resulting product more difficult to read, or maybe just because the author is greedy for acclaim and does not want to share credit. Greed is not a pretty trait, but, as Rahul writes, that’s a separate issue from the quality of the resulting product.

So, yeah, how to think about such a case? My response is that it’s only a hypothetical case, that in practice it never occurs. Perhaps readers will correct me in the comments, but until that happens, here’s my explanation:

When we write, we do incorporate old material. Nothing we write is entirely new, nor should it be. The challenge is often to put that old material into a coherent framework, which requires some understanding. When authors plagiarize, they seem to do this as a substitute for understanding. Reading that old material and integrating it into the larger story, that takes work. If you insert chunks of others’ material verbatim, it becomes clear that you didn’t understand it all, and not acknowledging the source is a way of burying that meta-information. To flip it around: as a reader, that hypertext—being able to track to the original source—can be very helpful. Plagiarists don’t want you to be aware of the copying in large part because they don’t want to reveal that they have not put the material all together.

To use statistical terminology, plagiarism is a sort of informative missingness: the very fact that the use of outside material has not been acknowledged itself provides information that the copyist has not fully integrated it into the story. That's why Basbøll and I referred to plagiarism as a statistical crime. Not just a crime against the original author (though, yeah, as someone whose work has been plagiarized, it annoys me a lot) but also against the reader. As we put it in that article:

Much has been written on the ethics of plagiarism. One aspect that has received less notice is plagiarism’s role in corrupting our ability to learn from data: We propose that plagiarism is a statistical crime. It involves the hiding of important information regarding the source and context of the copied work in its original form. Such information can dramatically alter the statistical inferences made about the work.

To return to Rahul’s question above: have I ever seen something “useful and insightful” that’s been plagiarized? In theory it could happen: just consider an extreme example such as an entirely pirated book. Take a classic such as Treasure Island, remove the name Robert Louis Stevenson and replace it with John Smith, and it would still be a rollicking good read. But I don’t think this is usually what happens. The more common story would be that something absolutely boring is taken from source A and inserted without checking into document B, and no value is added in the transmission.

To put it another way, start with the plagiarist. This is someone who's under some pressure to produce a document on topic X but doesn't fully understand the topic. One available approach is to plagiarize the difficult part. From the reader's perspective, the problem is that the resulting document has undigested material: the copied part could actually be in error or could be applied incorrectly. By not disclosing the source, the author is hiding important information that could otherwise help the reader better parse the material.

If I see some great material from another source, I’ll copy it and quote it. Quotations are great!

Music as a counterexample

In his book, It’s One for the Money, music historian Clinton Heylin gives many examples of musicians who’ve used material from others without acknowledgment, producing memorable and sometimes wonderful results. A well known example is Bob Dylan.

How does music differ from research or science writing? For one thing, “understanding” seems much more important in science than in music. Integrating a stolen riff into a song is just a different process than integrating an explanation of a statistical method into a book.

There’s also the issue of copyright laws and financial stakes. You can copy a passage from a published article, with quotes, and it’s no big deal. But if you take part of someone’s song, you have to pay them real money. So there’s a clear incentive not to share credit and, if necessary, to muddy the waters to make it more difficult for predecessors to claim credit.

Finally, in an academic book or article it’s easy enough to put in quotation marks and citations. There’s no way to do that in a song! Yes, you can include it in the liner notes, and I’d argue that songwriters and performers should acknowledge their sources in that way, but it’s still not as direct as writing, “As X wrote . . .”, in an academic publication.

What are the consequences of plagiarism?

There are several cases of plagiarism by high-profile academics who seem to have suffered no consequences (beyond the occasional embarrassment when people like me bring it up or when people check them on wikipedia): examples include some Harvard and Yale law professors and this dude at USC. The USC case I can understand—the plagiarist in question is a medical school professor who probably makes tons of money for the school. Why Harvard and Yale didn't fire their law-school plagiarists, I'm not sure; maybe it's a combination of "Hey, these guys are lawyers, they might sue us!" and a simple calculation along the lines of: "Harvard fires prof for plagiarism" is an embarrassing headline, whereas "Harvard decides to do nothing about a plagiarist" likely won't make it into the news. And historian Kevin Kruse still seems to be employed at Princeton. (According to Wikipedia, "In October 2022, both Cornell, where he wrote his dissertation, and Princeton, where he is employed, ultimately determined that these were 'citation errors' and did not rise to the level of intentional plagiarism." On the plus side, "He is a fan of the Kansas City Chiefs.")

Other times, lower-tier universities just let elderly plagiarists fade away. I’m thinking here of George Mason statistician Ed Wegman and Rutgers political scientist Frank Fischer. Those cases are particularly annoying to me because Wegman received a major award from the American Statistical Association and Fischer received an award from the American Political Science Association—for a book with plagiarized material! I contacted the ASA to suggest they retract the award and I contacted the APSA to suggest that they share the award with the scholars who Fischer had ripped off—but both organizations did nothing. I guess that’s how committees work.

We also sometimes see plagiarists get canned. Two examples are Columbia history professor Charles Armstrong and Arizona State historian Matthew Whitaker. Too bad for these guys that they weren’t teaching at Harvard, Yale, or Princeton, or maybe they’d still be gainfully employed!

Outside academia, plagiarism seems typically to have more severe consequences.

Journalism: Mike Barnicle, Stephen Glass, etc.
Pop literature: that spy novelist (also here), etc.

Lack of understanding

The theme of this post is that, at least regarding academics, plagiarism is a sign of lack of understanding.

A common defense/excuse/explanation for plagiarism is that whatever had been copied was common knowledge, just some basic facts, so who cares if it’s expressed originally? This is kind of a lame excuse given that it takes no effort at all to write, “As source X says, ‘. . .'” There seems little doubt that the avoidance of attribution is there so that the plagiarist gets credit for the words. And why is that? It has to depend on the situation—but it doesn’t seem that people usually ask the plagiarist why they did it. I guess the point is that you can ask the person all you want, but they don’t have to reply—and, given the record of misrepresentation, there’s no reason to suspect a truthful answer.

But, yeah, sometimes it must be the case that the plagiarist understands the copied material and is just being lazy/greedy.

What’s interesting to me is how often it happens that the plagiarist (or, more generally, the person who copies without attribution) evidently doesn’t understand the copied material.

Here are some examples:

Weggy: copied from wikipedia, introducing errors in the process.

Chrissy: copied from online material, introducing errors in the process; this example of unacknowledged copying was not actually plagiarism because it was stories being repeated without attribution, not exact words.

Armstrong (not the cyclist): plagiarizing material by someone else, in the process getting the meaning of the passage entirely backward.

Fischer (not the chess player): OK, I have to admit, this one was so damn boring I didn’t read through to see if any errors were introduced in the copying process.

Say what you want about Mike Barnicle and Doris Kearns Goodwin, but I think it’s fair to assume that they did understand the material they were ripping off without attribution.

In contrast, academic plagiarists seem to copy not so much out of greed as from laziness.

Not laziness as in, too lazy to write the paragraph in their own words, but laziness as in, too lazy to figure out what’s going on—but it’s something they’re supposed to understand.

That’s it!

You’re an academic researcher who is doing some work that relies on some idea or method, and it’s considered important that you understand it. This could be a statistical method being used for data analysis, it could be a key building block in an expository piece, it could be some primary sources in historical work, something like that. Just giving a citation and a direct quote wouldn’t be enough, because that wouldn’t demonstrate the required understanding:
– If you’re using a statistical method, you have to understand it at some level or else the reader can’t be assured that you’re using it correctly.
– In a tutorial, you need to understand the basics; otherwise, why are you writing the tutorial in the first place?
– In historical work, often the key contribution is bringing in new primary sources. If you’re not doing that, a lot more of a burden is placed on interpretation, which maybe isn’t your strong point.

So, you plagiarize. That’s the only choice! OK, not the only choice. Three alternatives are:
1. Don’t write and publish the article/book/thesis. Just admit you have nothing to add. But that would be a bummer, no?
2. Use direct quotes and citations. But then there may be no good reason for anyone to want to read or publish the article/book/thesis. To take an extreme example, is Wiley Interdisciplinary Reviews going to publish a paper that is a known copy of a wikipedia entry? Probably not. Even if your buddy is an editor of the journal, he might think twice.
3. Put in the work to actually understand the method or materials that you’re using. But, hey, that’s a lot of effort! You have a life to lead, no? Working out math, reading obscure documents in a foreign language, actually reading what you need to use, that would take effort! Ok, that’s effort that most of us would want to put in; indeed, that’s a big reason we became academics in the first place: we enjoy coding, we enjoy working out math, understanding new things, reading dusty old library books. But some subset of us doesn’t want to do the work.
If, for whatever reason, you don’t want to do any of the above three options, then maybe you’ll plagiarize. And just hope that, if you get caught, you receive the treatment given to the Harvard and Yale law professors, the USC medical school professor, and the Princeton history professor or, if you do it late enough in your career, the George Mason statistics professor and the Rutgers political science professor. So, stay networked and avoid pissing off powerful people within your institution.

As I wrote last year regarding scholarly misconduct more generally:

I don’t know that anyone’s getting a pass. What seems more likely to me is that anyone—left, center, or right—who gets more attention is also more likely to see his or her work scrutinized.

Or, to put it another way, it’s a sad story that perpetrators of scholarly misconduct often “seem to get a pass” from their friends and employers and academic societies, but this doesn’t seem to have much to do with ideological narratives; it seems more like people being lazy and not wanting a fuss.

The tell

The tell, as they say in poker, is that the copied-without-attribution material so often displays a lack of understanding. Not necessarily a lack of ability to understand: Ed Wegman could’ve spent an hour reading through the Wikipedia passage he’d copied and avoided introducing an error; Christian Hesse could’ve spent some time actually reading the words he typed, and maybe even done some research, and avoided errors such as this one, reported by chess historian Edward Winter:

In 1900 Wilhelm/William Steinitz died, a fact which did not prevent Christian Hesse from quoting a remark by Steinitz about a mate-in-two problem by Pulitzer which, according to Hesse, was dated 1907. (See page 166 of The Joys of Chess.) Hesse miscopied from our presentation of the Pulitzer problem on page 11 of A Chess Omnibus (also included in Steinitz Stuck and Capa Caught). We gave Steinitz’s comments on the composition as quoted on page 60 of the Chess Player’s Scrap Book, April 1907, and that sufficed for Hesse to assume that the problem was composed in 1907.

Also, I can only assume that Korea expert Charles Armstrong could’ve carefully read the passage he was ripping off and avoided getting its meaning backward. But having the ability to do the work isn’t enough. To keep the quality up in the finished product, you have to do the work. Understanding new material is hard; copying is easy. And then it makes sense to cover your tracks. Which makes it harder for the reader to spot the mistakes. Etc.

In his classic essay, “Politics and the English Language,” the political journalist George Orwell drew a connection between cloudy writing and cloudy content, which I think applies to academic writing as well. Something similar seems to be going on with copying without attribution. It happens when authors don’t understand what they’re writing about.

P.S. I just came across this post from 2011, “A (not quite) grand unified theory of plagiarism, as applied to the Wegman case,” where I wrote, “It’s not that the plagiarized work made the paper wrong; it’s that plagiarism is an indication that the authors don’t really know what they’re doing.” I’d forgotten about that!

“Reading Like It’s 1965”: Fiction as a window into the past

Raghu Parthasarathy writes:

The last seven books I read were all published in 1965. I decided on this literary time travel after noticing that I unintentionally read two books in a row from 1965. I thought: Why not continue? Would I get a deep sense of the mid-1960s zeitgeist? I don’t think so . . .

Contra Raghu, I do think that reading old books gives us some sense of how people used to live, and how they used to think. I have nothing new to offer on this front, but here are some relevant ideas we’ve discussed before:

1. The Speed Racer principle: Sometimes the most interesting aspect of a scientific or cultural product is not its overt content but rather its unexamined assumptions.

2. Storytelling as predictive model checking: Fiction is the working out of possibilities. Nonfiction is that too, just with more constraints.

3. Hoberman and Deliverance: Some cultural artifacts are striking because of what they leave out. My go-to example here is the book Deliverance, which was written during the U.S.-Vietnam war and, to my mind, is implicitly all about that war even though I don’t think it is mentioned even once in the book.

4. Also, Raghu mentions Stoner so I’ll point you to my post on the book. In the comments section, Henry Farrell promises us an article called “What Meyer and Rowan on Myth and Ceremony tells us about Forlesen.” So, something to look forward to.

5. And Raghu mentions Donald Westlake. As I wrote a few years ago, my favorite Westlake is Killing Time, but I also like Memory. And then there’s The Axe. And Slayground’s pretty good too. And Ordo, even if it’s kind of a very long extended joke on the idea of murder. Overall, I do think there’s a black hole at the center of Westlake’s writing: as I wrote a few years ago, he has great plots and settings and charming characters, but nothing I’ve ever read of his has the emotional punch of, say, Scott Smith’s A Simple Plan (to choose a book whose plot would fit well into the Westlake canon). But, hey, nobody can do everything. Also see here and here.

Russell’s Paradox of ghostwriters

A few months ago we discussed the repulsive story of a USC professor who took full credit for a series of books that were ghostwritten. It turned out that one of the books had “at least 95 separate passages” of plagiarism, including “long sections of a chapter on the cardiac health of giraffes.”

You’d think you’d remember a chapter on the cardiac health of giraffes. Indeed, if I hired someone to write a chapter under my name on the cardiac health of giraffes, I think I’d read it, just out of curiosity! But I guess this guy has no actual curiosity. He just wants another bestselling book so he can go on TV some more and mingle with rich and famous people.

OK, I’ve ranted enough about this guy. What I wanted to share today is a fascinating story from a magazine article about the affair, where the author, Joel Stein, writes, “Nearly all experts and celebrities use ghostwriters,” and then links to an amusing magazine article from 2009 subtitled, “If Sarah Palin can write a memoir in four months, can I write my life story in an afternoon?”:

When I heard that Sarah Palin wrote her upcoming 400-page autobiography, Going Rogue: An American Life, in four months, I thought, What took her so long? To prove that introspection doesn’t need to be time-consuming, I decided to try to write my memoir in one day. Since Palin had a ghostwriter, I figured it was only fair that I have help too, so I called Neil Strauss, who co-wrote the best-selling memoirs of Marilyn Manson, Mötley Crüe, Dave Navarro and Jenna Jameson. . . .

The whole article is fun. They wrote a whole memoir in an afternoon!

That particular memoir-book was a gag, but it got me thinking of this general idea of recursive writing. A writer hiring a ghostwriter . . . what a great idea! Of course this happens all the time when the writer is a brand name, as with James Patterson. But then what if Patterson’s ghostwriter is busy and hires a ghostwriter of his own . . .

Perhaps the most famous ghostwritten book is The Autobiography of Malcolm X, by Alex Haley. After Roots came out, the Malcolm X autobiography was promoted heavily based on the Haley authorship. On the other hand, parts of Roots were plagiarized, which is kind of like a ghostwriter hiring a ghostwriter.

A writer hiring a writer to do his writing . . . that sounds so funny! But should it? I’m a professional writer and I call upon collaborators all the time. Collaborative writing is very rare in literary writing; it sometimes happens in nonliterary writing (for example here, or for a less successful example, here), but usually there it follows a model of asymmetric collaboration, as with Freakonomics, where Levitt supplied the material and Dubner supplied the writing, though I assume that both the content and the writing benefited from conversations between the authors.

One of the common effects of ghostwriting is to give a book a homogenized style. Writers of their own books will have their original styles—most of us cannot approach the caliber of Mark Twain, Virginia Woolf, or Jim Thompson, but style is part of how you express yourself—and nonprofessional writers can have charming idiosyncratic styles of their own. The homogenized, airport-biography style comes from writers who are talented enough to produce this sort of thing on demand, while having some financial motivation not to express originality. In contrast, Malcolm Gladwell deserves credit for producing readable prose while having his own interesting style. I doubt he uses a ghostwriter.

Every once in awhile, though, there will be a ghostwriter who adds compelling writing of his own. One example is the aforementioned Alex Haley; another is the great Leonard Shecter. I’d say Stephen Dubner too, but I see him as more of a collaborator than a hired gun. Also Ralph Leighton: much of the charm in the Feynman memoirs is that voice, and you gotta give the ghostwriter some of the credit here, even if only to keep that voice as is and not replace it with generic prose.

There must be some other ghostwriters who added style rather than blandness, although I can’t think of any examples right now.

More generally, I remain interested in the idea that collaboration is so standard in academic writing (even when we are writing fiction) and for Hollywood/TV scripts (as discussed in comments) and so unusual elsewhere, with the exception of ghostwriting.

Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that’s another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don’t want! The third sentence isn’t horrible, but it’s still a little bit long (starting with the nearly-contentless “It is also not clear how one is supposed to account for” and ending with the unnecessary “in such analyses”). Also, we don’t even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model (see the sketch after this list).
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition the method requires distributional assumptions on the weights; also it’s a new method so who knows how useful it will be in practice.
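
To make point 4 concrete, here is a minimal sketch in Python of the simplest version of the idea: bin the weights into cells, estimate the outcome within each weight cell, estimate each cell's share of the population from the weights themselves, and poststratify. The data and the function name (poststratify_on_weights) are invented for illustration; the paper's actual method uses a joint Bayesian regression of the outcome and the weights, so treat this as a toy, not the implementation.

```python
import numpy as np

def poststratify_on_weights(y, w, n_cells=5):
    """Toy version of the idea: condition on the sampling weights themselves,
    then poststratify over an estimated population distribution of the weights."""
    # (1) discretize the weights into quantile cells
    edges = np.quantile(w, np.linspace(0, 1, n_cells + 1))
    cell = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, n_cells - 1)

    # (2) outcome estimate within each weight cell
    #     (stand-in for a regression of y on the weight, plus other predictors)
    y_cell = np.array([y[cell == j].mean() for j in range(n_cells)])

    # (3) estimated population share of each cell:
    #     each respondent represents roughly w_i population units
    pop_share = np.array([w[cell == j].sum() for j in range(n_cells)])
    pop_share = pop_share / pop_share.sum()

    # (4) poststratify: average the cell estimates over the population shares
    return float(np.dot(y_cell, pop_share))

# Fake data: weights correlated with the outcome, so the unweighted mean is off.
rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
y = 1.0 + 0.5 * np.log(w) + rng.normal(size=1000)

print("unweighted mean:      ", round(float(y.mean()), 3))
print("classical weighted:   ", round(float(np.average(y, weights=w)), 3))
print("poststratified cells: ", round(poststratify_on_weights(y, w), 3))
```

As I read the abstract, the paper's version swaps the within-cell means for a regression on the weights (and any other available predictors) and models the population distribution of the weights rather than reading it off the sample, which is what makes small-area estimates and regressions possible, not just the overall mean.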

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

I disagree with Geoff Hinton regarding “glorified autocomplete”

Computer scientist and “godfather of AI” Geoff Hinton says this about chatbots:

“People say, It’s just glorified autocomplete . . . Now, let’s analyze that. Suppose you want to be really good at predicting the next word. If you want to be really good, you have to understand what’s being said. That’s the only way. So by training something to be really good at predicting the next word, you’re actually forcing it to understand. Yes, it’s ‘autocomplete’—but you didn’t think through what it means to have a really good autocomplete.”

This got me thinking about what I do at work, for example in a research meeting. I spend a lot of time doing “glorified autocomplete” in the style of a well-trained chatbot: Someone describes some problem, I listen and it reminds me of a related issue I’ve thought about before, and I’m acting as a sort of FAQ, but more like a chatbot than a FAQ in that the people who are talking with me do not need to navigate through the FAQ to find the answer that is most relevant to them; I’m doing that myself and giving a response.

I do that sort of thing a lot in meetings, and it can work well, indeed often I think this sort of shallow, associative response can be more effective than whatever I’d get from a direct attack on the problem in question. After all, the people I’m talking with have already thought for awhile about whatever it is they’re working on, and my initial thoughts may well be in the wrong direction, or else my thoughts are in the right direction but are just retracing my collaborators’ past ideas. From the other direction, my shallow thoughts can be useful in representing insights from problems that these collaborators had not ever thought about much before. Nonspecific suggestions on multilevel modeling or statistical graphics or simulation or whatever can really help!

At some point, though, I’ll typically have to bite the bullet and think hard, not necessarily reaching full understanding in the sense of mentally embedding the problem at hand into a coherent schema or logical framework, but still going through whatever steps of logical reasoning that I can. This feels different than autocomplete; it requires an additional level of focus. Often I need to consciously “flip the switch,” as it were, to turn on that focus and think rigorously. Other times, I’m doing autocomplete and either come to a sticking point or encounter an interesting idea, and this causes me to stop and think.

It’s almost like the difference between jogging and running. I can jog and jog and jog, thinking about all sorts of things and not feeling like I’m expending much effort, my legs pretty much move up and down of their own accord . . . but then if I need to run, that takes concentration.

Here’s another example. Yesterday I participated in the methods colloquium in our political science department. It was Don Green and me and a bunch of students, and the structure was that Don asked me questions, I responded with various statistics-related and social-science-related musings and stories, students followed up with questions, I responded with more stories, etc. Kinda like the way things go here on the blog, but spoken rather than typed. Anyway, the point is that most of my responses were a sort of autocomplete—not in a word-by-word chatbot style, more at a larger level of chunkiness, for example something would remind me of a story, and then I’d just insert the story into my conversation—but still at this shallow, pleasant level. Mellow conversation with no intellectual or social strain. But then, every once in awhile, I’d pull up short and have some new thought, some juxtaposition that had never occurred to me before, and I’d need to think things through.

This also happens when I give prepared talks. My prepared talks are not super-well prepared—this is on purpose, as I find that too much preparation can inhibit flow. In any case, I’ll often find myself stopping and pausing to reconsider something or another. Even when describing something I’ve done before, there are times when I feel the need to think it all through logically, as if for the first time. I noticed something similar when I saw my sister give a talk once: she had the same habit of pausing to work things out from first principles. I don’t see this behavior in every academic talk, though; different people have different styles of presentation.

This seems related to models of associative and logical reasoning in psychology. As a complete non-expert in that area, I’ll turn to wikipedia:

The foundations of dual process theory likely come from William James. He believed that there were two different kinds of thinking: associative and true reasoning. . . . images and thoughts would come to mind of past experiences, providing ideas of comparison or abstractions. He claimed that associative knowledge was only from past experiences describing it as “only reproductive”. James believed that true reasoning could enable overcoming “unprecedented situations” . . .

That sounds about right!

After describing various other theories from the past hundred years or so, Wikipedia continues:

Daniel Kahneman provided further interpretation by differentiating the two styles of processing more, calling them intuition and reasoning in 2003. Intuition (or system 1), similar to associative reasoning, was determined to be fast and automatic, usually with strong emotional bonds included in the reasoning process. Kahneman said that this kind of reasoning was based on formed habits and very difficult to change or manipulate. Reasoning (or system 2) was slower and much more volatile, being subject to conscious judgments and attitudes.

This sounds a bit different from what I was talking about above. When I’m doing “glorified autocomplete” thinking, I’m still thinking; this isn’t automatic and barely conscious behavior along the lines of driving to work along a route I’ve taken a hundred times before. I’m just thinking in a shallow way, trying to “autocomplete” the answer. It’s pattern-matching more than it is logical reasoning.

P.S. Just to be clear, I have a lot of respect for Hinton’s work; indeed, Aki and I included Hinton’s work in our brief review of 10 pathbreaking research articles during the past 50 years of statistics and machine learning. Also, I’m not trying to make a hardcore, AI-can’t-think argument. Although not myself a user of large language models, I respect Bob Carpenter’s respect for them.

I think that where Hinton got things wrong in the quote that led off this post was not in his characterization of chatbots, but rather in his assumptions about human thinking, in not distinguishing autocomplete-like associative reasoning from logical thinking. Maybe Hinton’s problem in understanding this is that he’s just too logical! At work, I do a lot of what seems like autocomplete—and, as I wrote above, I think it’s useful—but if I had more discipline, maybe I’d think more logically and carefully all the time. It could well be that Hinton has that habit or inclination to always be in focus. If Hinton does not have consistent personal experience of shallow, autocomplete-like thinking, he might not recognize it as something different, in which case he could be giving the chatbot credit for something it’s not doing.

Come to think of it, one thing that impresses me about Bob is that, when he’s working, he seems to always be in focus. I’ll be in a meeting, just coasting along, and Bob will interrupt someone to ask for clarification, and I suddenly realize that Bob absolutely demands understanding. He seems to have no interest in participating in a research meeting in a shallow way. I guess we just have different styles. It’s my impression that the vast majority of researchers are like me, just coasting on the surface most of the time (for some people, all of the time!), while Bob, and maybe Geoff Hinton, is one of the exceptions.

P.P.S. Sometimes we really want to be doing shallow, auto-complete-style thinking. For example, if we’re writing a play and want to simulate how some characters might interact. Or just as a way of casting the intellectual net more widely. When I’m in a research meeting and I free-associate, it might not help immediately solve the problem at hand, but it can bring in connections that will be helpful later. So I’m not knocking auto-complete; I’m just disagreeing with Hinton’s statement that “by training something to be really good at predicting the next word, you’re actually forcing it to understand.” As a person who does a lot of useful associative reasoning and also a bit of logical understanding, I think they’re different, both in how they feel and also in what they do.

P.P.P.S. Lots more discussion in comments; you might want to start here.

P.P.P.P.S. One more thing . . . actually, it might deserve its own post, but for now I’ll put it here: So far, it might seem like I’m denigrating associative thinking, or “acting like a chatbot,” or whatever it might be called. Indeed, I admire Bob Carpenter for doing very little of this at work! The general idea is that acting like a chatbot can be useful—I really can help lots of people solve their problems in that way, also every day I can write these blog posts that entertain and inform tens of thousands of people—but it’s not quite the same as focused thinking.

That’s all true (or, I should say, that’s my strong impression), but there’s more to it than that. As discussed in my comment linked to just above, “acting like a chatbot” is not “autocomplete” at all, indeed in some ways it’s kind of the opposite. Locally it’s kind of like autocomplete in that the sentences flow smoothly; I’m not suddenly jumping to completely unrelated topics—but when I do this associative or chatbot-like writing or talking, it can lead to all sorts of interesting places. I shuffle the deck and new hands come up. That’s one of the joys of “acting like a chatbot” and one reason I’ve been doing it for decades, long before chatbots ever existed! Walk along forking paths, and who knows where you’ll turn up! And all of you blog commenters (ok, most of you) play helpful roles in moving these discussions along.

Hey, check this out! Here’s how to read and then rewrite the title and abstract of a paper.

In our statistical communication class today, we were talking about writing. At some point a student asked why it was that journal articles are all written in the same way. I said, No, actually there are many different ways to write a scientific journal article. Superficially these articles all look the same: title, abstract, introduction, methods, results, discussion, or some version of that, but if you look in detail you’ll see that you have lots of flexibility in how to do this (with the exception of papers in medical journals such as JAMA which indeed have a pretty rigid format).

The next step was to demonstrate the point by going to a recent scientific article. I asked the students to pick a journal. Someone suggested NBER. So I googled NBER and went to its home page.

I then clicked on the most recent research paper, which was listed on the main page as “Employer Violations of Minimum Wage Laws.” Click on the link and you get this more dramatically-titled article:

Does Wage Theft Vary by Demographic Group? Evidence from Minimum Wage Increases

with this abstract:

Using Current Population Survey data, we assess whether and to what extent the burden of wage theft — wage payments below the statutory minimum wage — falls disproportionately on various demographic groups following minimum wage increases. For most racial and ethnic groups at most ages we find that underpayment rises similarly as a fraction of realized wage gains in the wake of minimum wage increases. We also present evidence that the burden of underpayment falls disproportionately on relatively young African American workers and that underpayment increases more for Hispanic workers among the full working-age population.

We actually never got to the full article (but feel free to click on the link and read it yourself). There was enough in the title and abstract to sustain a class discussion.

Before going on . . .

In class we discussed the title and abstract of the above article and considered how it could be improved. This does not mean we think the article, or its title, or its abstract, is bad. Just about everything can be improved! Criticism is an important step in the process of improvement.

The title

“Does Wage Theft Vary by Demographic Group? Evidence from Minimum Wage Increases” . . . that’s not bad! “Wage Theft” in the first sentence is dramatic—it grabs our attention right away. And the second sentence is good too: it foregrounds “Evidence” and it also tells you where the identification is coming from. So, good job. We’ll talk later about how we might be able to do even better, but I like what they’ve got so far.

Just two things.

First, the answer to the question, “Does X vary with Y?”, is always Yes. At least, in social science it’s always Yes. There are no true zeroes. So it would be better to change that first sentence to something like, “How Does Wage Theft Vary by Demographic Group?”

The second thing is the term “wage theft.” I took that as a left-wing signifier, the same way in which the use of a loaded term such as “pro-choice” or “pro-life” conveys the speaker’s position on abortion. So I took the use of that phrase in the title as a signal that the article is taking a position on the political/economic left. But then I googled the first author, and . . . he’s an “Adjunct Senior Fellow at the Hoover Institution.” Not that everyone at Hoover is right-wing, but it’s not a place I associate with the left, either. So I’ll move on and not worry about this issue.

The point here is not that I’m trying to monitor the ideology of economics papers. This is a post on how to write a scholarly paper! My point is that the title conveys information, both directly and indirectly. The term “wage theft” in the title conveys that the topic of the paper will be morally serious (they’re talking about “theft,” not just some technical violations of a law); it also has this political connotation. When titling your papers, be aware of the direct and indirect messages you’re conveying.

The abstract

As I said, I liked the title of the paper—it’s punchy and clear. The abstract is another story. I read it and then realized I hadn’t absorbed any of its content, so I read it again, and it was still confusing. It’s not “word salad” (there’s content in that abstract); it’s just put together in a way that I found hard to follow. The students in the class had the same impression, and indeed they were kinda relieved that I too found it confusing.

How to rewrite? The best approach would be to go into the main paper, maybe start with our tactic of forming an abstract by taking the first sentence of each of the first five paragraphs. But here we’ll keep it simple and just go with the information right there in the current abstract. Our goal is to rewrite in a way that makes it less exhausting to read.

Our strategy: First take the abstract apart, then put it back together.

I went to the blackboard and listed the information that was in the abstract:
– CPS data
– Definition of wage theft
– What happens after minimum wage increase
– Working-age population
– African American, Hispanic, White

Now, how to put this all together? My first thought was to just start with the definition of wage theft, but then I checked online and learned that the phrase used in the abstract, “wage payments below the statutory minimum wage,” is not the definition of wage theft; it’s actually just one of several kinds of wage theft. So that wasn’t going to work. Then there’s the bit from the abstract, “falls disproportionately on various demographic groups”—that’s pretty useless, as what we want to know is where this disproportionate burden falls, and by how much.

Putting it all together

We discussed some more—it took surprisingly long, maybe 20 minutes of class time to work through all these issues—and then I came up with this new title/abstract:

Wage theft! Evidence from minimum wage increases

Using Current Population Survey data from [years] in periods following minimum wage increases, we look at the proportion of workers being paid less than the statutory minimum, comparing different age groups and ethnic groups. This proportion was highest in ** age and ** ethnic groups.

OK, how is this different from the original?

1. The three key points of the paper are “wage theft,” “evidence,” and “minimum wage increases,” so that’s now what’s in the title.

2. It’s good to know that the data came from the Current Population Survey. We also want to know when this was all happening, so we added the years to the abstract. Also we made the correction of changing the tense in the abstract from the present to the past, because the study is all based on past data.

3. The killer phrase, “wage theft,” is already in the title, so we don’t need it in the abstract. That helps, because then we can use the authors’ clear and descriptive phrase, “the proportion of workers being paid less than the statutory minimum,” without having to misleadingly imply that this is the definition of wage theft, and without having to lugubriously state that it’s a kind of wage theft. That was so easy!

4. We just say we’re comparing different age and ethnic groups and then report the results. This to me is much cleaner than the original abstract which shared this information in three long sentences, with quite a bit of repetition.

5. We have the ** in the last sentence because I’m not quite clear from the abstract what are the take-home points. The version we created is short enough that we could add more numbers to that last sentence, or break it up into two crisp sentences, for example, one sentence about age groups and one about ethnic groups.

In any case, I think this new version is much more readable. It’s a structure much better suited to conveying, not just the general vibe of the paper (wage theft, inequality, minority groups) but the specific findings.

Lessons for rewriters

Just about every writer is a rewriter. So these lessons are important.

We were able to improve the title and abstract, but it wasn’t easy, nor was it algorithmic—that is, there was no simple set of steps to follow. We gave ourselves the relatively simple task of rewriting without the burden of subject-matter knowledge, and it still took a half hour of work.

After looking over some writing advice, it’s tempting to think that rewriting is mostly a matter of a few clean steps: replacing the passive with the active voice, removing empty words and phrases such as “quite” and “Note that,” checking for grammar, keeping sentences short, etc. In this case, no. In this case, we needed to dig in a bit and gain some conceptual understanding to figure out what to say.

The outcome, though, is positive. You can do this too, for your own papers!

“Modeling Social Behavior”: Paul Smaldino’s cool new textbook on agent-based modeling

Paul Smaldino is a psychology professor who is perhaps best known for his paper with Richard McElreath from a few years ago, “The Natural Selection of Bad Science,” which presents a sort of agent-based model that reproduces the growth in the publication of junk science that we’ve seen in recent decades.
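
For readers who don't know that paper, here is the flavor of the mechanism in code. To be clear, this is a stripped-down toy that I am improvising for illustration, not Smaldino and McElreath's actual specification; all the numbers and functional forms below are made up. The idea it tries to capture: labs choose an effort level, lower effort yields more publishable positive results (many of them false), and labs that publish more are preferentially imitated, so average effort can drift downward.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model, loosely in the spirit of the "natural selection of bad science"
# story (my own improvisation, not the paper's actual model): each lab has an
# effort level in (0, 1]; higher effort means fewer studies and fewer false
# positives; labs are copied in proportion to their publication counts.

n_labs, n_steps = 100, 2000
base_rate = 0.1                      # share of tested hypotheses that are true
effort = np.full(n_labs, 0.75)       # everyone starts out fairly careful

mean_effort = [effort.mean()]
for t in range(n_steps):
    studies = rng.poisson(lam=5.0 * (1.0 - 0.5 * effort))        # low effort -> more studies
    false_pos = 0.05 + 0.5 * (1.0 - effort)                      # low effort -> sloppier methods
    pubs = studies * (base_rate + (1.0 - base_rate) * false_pos) # publishable "positive" results

    # selection step: a random lab retires and a new lab copies the effort of a
    # publication-weighted "successful" lab, with a little mutation
    p = (pubs + 1e-9) / (pubs + 1e-9).sum()
    dead = rng.integers(n_labs)
    parent = rng.choice(n_labs, p=p)
    effort[dead] = np.clip(effort[parent] + rng.normal(0.0, 0.02), 0.01, 1.0)

    mean_effort.append(effort.mean())

print("mean effort at start:", round(float(mean_effort[0]), 2))
print("mean effort at end:  ", round(float(mean_effort[-1]), 2))
```

It just prints the mean effort at the start and at the end of the run; with these made-up settings the selection pressure pushes effort downward over time, which is the flavor of the dynamic in the paper.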

Since then, it seems that Smaldino has been doing a lot of research and teaching on agent-based models in social science more generally, and he just came out with a book, “Modeling Social Behavior: Mathematical and Agent-Based Models of Social Dynamics and Cultural Evolution.” The book has social science, it has code, it has graphs—it’s got everything.

It’s an old-school textbook with modern materials, and I hope it’s taught in thousands of classes and sells a zillion copies.

There’s just one thing that bothers me. The book is entertainingly written and bursting with ideas, and it does a great job of raising concerns about the models it simulates rather than acting as if everything’s already known. My concern is that nobody reads books anymore. If I think about students taking a class in agent-based modeling and using this book, it’s hard for me to picture most of them actually reading it. They’ll start with the homework assignments and then flip through the book to try to figure out what they need. That’s how people read nonfiction books nowadays, which I guess is one reason that books, even those I like, are typically repetitive and low on content. Readers don’t expect a delightful reading experience, so authors don’t deliver one, and then readers stop wanting it, etc.

To be clear: this is a textbook, not a trade book. It’s a readable and entertaining book in the way that Regression and Other Stories is a readable and entertaining book, not in the way that Guns, Germs, and Steel is. Still, within the framework of being a social science methods book, it’s entertaining and thought-provoking. Also, I like it as a methods book because it’s focused on models rather than on statistical inference. We tried to get a similar feel with A Quantitative Tour of the Social Sciences, but with less success.

So it kinda makes me sad to see this much care put into a book that probably very few students will read paragraph by paragraph. I think things were different 50 years ago: back then there was nothing to read online; you’d buy a textbook, it would be right there in front of you, and so you’d read it. On the plus side, readers can now go in and make the graphs themselves—I assume that Smaldino has a website somewhere with all the necessary code—so there’s that.

P.S. In the preface, Smaldino is “grateful to all the modelers whose work has inspired this book’s chapters . . . particularly want to acknowledge the debt owed to the work of,” and then he lists 16 names, one of which is . . . Albert-László Barabási!

Huh?? Is this the same Albert-László Barabási who said that scientific citations are worth $100,000 each? I guess he did some good stuff too? Maybe this is worthy of an agent-based model of its own.

Academia corner: New candidate for American Statistical Association’s Founders Award, Enduring Contribution Award from the American Political Science Association, and Edge Foundation just dropped

Bethan Staton and Chris Cook write:

A Cambridge university professor who copied parts of an undergraduate’s essays and published them as his own work will remain in his job, despite an investigation upholding a complaint that he had committed plagiarism. 

Dr William O’Reilly, an associate professor in early modern history, submitted a paper that was published in the Journal of Austrian-American History in 2018. However, large sections of the work had been copied from essays by one of his undergraduate students.

The decision to leave O’Reilly in post casts doubt on the internal disciplinary processes of Cambridge, which rely on academics judging their peers.

Dude’s not a statistician, but I think this alone should be enough to make him a strong candidate for the American Statistical Association’s Founders Award.

And, early modern history is not quite the same thing as political science, but the copying thing should definitely make him eligible for the Aaron Wildavsky Enduring Contribution Award from the American Political Science Association. Long after all our research has been forgotten, the robots of the 21st century will be able to sift through the internet archive and find this guy’s story.

Or . . . what about the Edge Foundation? Plagiarism isn’t quite the same thing as misrepresenting your data, but it’s close enough that I think this guy would have a shot at joining that elite club. I’ve heard they no longer give out flights to private Caribbean islands, but I’m sure there are some lesser perks available.

According to the news article:

Documents seen by the Financial Times, including two essays submitted by the third-year student, show nearly half of the pages of O’Reilly’s published article — entitled “Fredrick Jackson Turner’s Frontier Thesis, Orientalism, and the Austrian Militärgrenze” — had been plagiarised.

Jeez, some people are so picky! Only half the pages were plagiarized, right? Or maybe not? Maybe this prof did a “Quentin Rowan” and constructed his entire article based on unacknowledged copying from other sources. As Rowan said:

It felt very much like putting an elaborate puzzle together. Every new passage added has its own peculiar set of edges that had to find a way in.

I guess that’s how it felt when they were making maps of the Habsburg empire.

On the plus side, reading about this story motivated me to take a look at the Journal of Austrian-American History, and there I found this cool article by Thomas Riegler, “The Spy Story Behind The Third Man.” That’s one of my favorite movies! I don’t know how watchable it would be to a modern audience—the story might seem a bit too simplistic—but I loved it.

P.S. I laugh but only because that’s more pleasant than crying. Just to be clear: the upsetting thing is not that some sleazeball managed to climb halfway up the greasy pole of academia by cheating. Lots of students cheat, and some of these students become professors, etc. The upsetting thing is that the organization closed ranks to defend him. We’ve seen this sort of thing before, over and over—for example, Columbia never seemed to make any effort whatsoever to track down whoever was faking its U.S. News numbers—so this behavior by Cambridge University doesn’t surprise me, but it still makes me sad. I’m guessing it’s some combination of (a) the perp is plugged in, and the people who make the decisions are his personal friends, and (b) a judgment that the negative publicity from letting this guy stay in his job is less bad than the negative publicity from firing him.

Can you imagine what it would be like to work in the same department as this guy?? Fun conversations at the water cooler, I guess. “Whassup with the Austrian Militärgrenze, dude?”

Meanwhile . . .

There are people who actually do their own research, and they’re probably good teachers too, but they didn’t get that Cambridge job. It’s hard to compete with an academic cheater, if the institution he’s working for seems to act as if cheating is just fine, and if professional societies such as the American Statistical Association and the American Political Science Association don’t seem to care either.

The Ministry of Food sez: Potatoes are fattening

As is often the case, I was thinking about George Orwell, and some googling led me to this post from 2005 that had some fun food-related quotes from Bernard Crick’s biography of Orwell, including this:

Just before they moved to Mortimer Crescent [in 1942], Eileen [Orwell’s wife] changed jobs. She now worked in the Ministry of Food preparing recipes and scripts for ‘The Kitchen Front’, which the BBC broadcast each morning. These short programmes were prepared in the Ministry because it was a matter of Government policy to urge or restrain the people from eating unrationed foods according to their official estimates, often wrong, of availability. It was the time of the famous ‘Potatoes are Good For You’ campaign, with its attendant Potato Pie recipes, which was so successful that another campaign had to follow immediately: ‘Potatoes are Fattening’.

Good stuff.

I looked up Crick on wikipedia and it turns out that early in his career he wrote a book that “identified and rejected [the] premises that research can discover uniformities in human behaviour, that these uniformities could be confirmed by empirical tests and measurements, that quantitative data was of the highest quality, and should be analysed statistically, that political science should be empirical and predictive . . .”

Interesting. Too bad he’s no longer around, as I’d have liked to talk with him about this, given how different it is from my own perspective. Crick’s Orwell biography is great, so much so that I’d hope he and I would have been able to find some common ground. He died in 2008, three years after the appearance of the above-linked post, so at least in theory I could’ve tried to reach him and have a discussion of quantitative political science and goats.

The Freaky Friday that never happened

Speaking of teaching . . . I wanted to share this story of something that happened today.

I was all fired up with energy, having just taught my Communicating Data and Statistics class, taking notes off the blackboard to remember what we’d been talking about so I could write about it later, and students were walking in for the next class. I asked them what it was, and they said Shakespeare. How wonderful to take a class on Shakespeare at Columbia University, I said. The students agreed. They love their teacher—he’s great.

This gave me an idea . . . maybe this instructor and I could switch classes some day, a sort of academic Freaky Friday. He could show up at 8:30 and teach my statistics students about Shakespeare’s modes of communication (with his contemporaries and with later generations including us, and also how Shakespeare made use of earlier materials), and I could come at 10am to teach his students how we communicate using numbers and graphs. Lots of fun all around, no? I’d love to hear the Shakespeare dude talk to a new audience, and I think my interactions with his group would be interesting too.

I waited in the classroom for a while so I could ask the instructor when he came into the room, during the shuffling-around period before class officially starts at 10:10. Then 10:10 came and I stood outside to wait as the students continued to trickle in. A couple of minutes later I saw a guy approaching, about my age. I ask if he teaches the Shakespeare class. Yes, he does. I introduce myself: I teach the class right before, on communicating data and statistics, maybe we could do a switch one day, could be fun? He says no, I don’t think so, and goes into the classroom.

That’s all fine: he has no obligation to do such a thing, and I came at him unexpectedly at a time when he was already in a hurry, arriving late to class (I came to class late this morning too. Mondays!). His no was completely reasonable.

Still . . . it was a lost opportunity! I’ll have to brainstorm with people about other ways to get this sort of interdisciplinary exchange going on campus. We could just have an interdisciplinary lecture series (Communication of Shakespeare, Communication in Statistics, Communication in Computer Science, Communication in Medicine, Communication in Visual Art, etc.), but it would be a bit of work to set up such a thing, and I’m guessing it wouldn’t reach so many people. I like the idea of doing it using existing classes, because (a) the audience is already there, and (b) it would take close to zero additional effort: you’re teaching your class somewhere else, but then someone else is teaching your class, so you get a break that day. And all the students are exposed to something new. Win-win.

The closest thing I can think of here is an interdisciplinary course I organized many years ago on quantitative social science, for our QMSS graduate program. The course had 3 weeks each of history, political science, economics, sociology, and psychology. It was not a statistics course or a methods course; rather, each segment discussed some set of quantitative ideas in the field. The course was wonderful, and Jeronimo Cortina and I turned it into a book, A Quantitative Tour of the Social Sciences, which I really like. I think the course went well, but I don’t think QMSS offers it anymore; I’m guessing it was just too difficult to organize a course with instructors from five different departments.

P.S. I read Freaky Friday and its sequel, A Billion for Boris, when I was a kid. Just noticed them on the library shelves. The library wasn’t so big; I must have read half the books in the children’s section at the time. Lots of fond memories.

Advice on writing a discussion of a published paper

A colleague asked for my thoughts on a draft of a discussion of a published paper, and I responded:

My main suggestion is this: Yes, your short article is a discussion of another article, and that will be clear when it is published. But I think you should write it to be read on its own, which means that you should focus on the points you want to make, and only then talk about the target article and other discussions.

So I’d do it like this:

paragraph 1: Your main point. The one takeaway you want the reader to get.

the next few paragraphs: Your other points. Everything you want to say.

a few paragraphs more: How this relates to the articles you are discussing. Where you agree with them and where you disagree. If there are things in the target article you like, say so. Readers will in part use the discussion to make their judgment on the main article, so if your discussion reads as purely negative, that will take its toll. Which is fine, if that’s what you want to do.

final paragraph: Summary and pointers to future work.

I hope this is helpful. This advice might sound kinda generic but I actually wrote it specifically with your article in mind!

A while ago I gave some advice on writing research articles. This is the first time I recall specifically giving advice on writing a discussion.

My two courses this fall: “Applied Regression and Causal Inference” and “Communicating Data and Statistics”

POLS 4720, Applied Regression and Causal Inference:

This is a fast-paced one-semester course on applied regression and causal inference based on our book, Regression and Other Stories. The course has an applied and conceptual focus that’s different from other available statistics courses.
Topics covered in POLS 4720 include:
• Applied regression: measurement, data visualization, modeling and inference, transformations, linear regression, and logistic regression.
• Simulation, model fitting, and programming in R.
• Causal inference using regression.
• Key statistical problems include adjusting for differences between sample and population, adjusting for differences between treatment and control groups, extrapolating from past to future, and using observed data to learn about latent constructs of interest.
• We focus on social science applications, including but not limited to: public opinion and voting, economic and social behavior, and policy analysis.
The course is set up using the principles of active learning, with class time devoted to student-participation activities, computer demonstrations, and discussion problems.

The primary audience for this course is Poli Sci Ph.D. students, and it should also be ideal for statistics-using graduate students or advanced undergraduates in other departments and schools, as well as students in fields such as computer science and statistics who’d like to get an understanding of how regression and causal inference work in the real world!
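If you haven’t seen the fake-data simulation workflow the course leans on, here is a minimal sketch. It’s in Python rather than the R we use in class, and the numbers are invented for this post rather than taken from any course material: simulate data from a regression model with known parameters, fit the model, and check that you recover them.

import numpy as np

# Minimal fake-data simulation (toy illustration, not course material):
# simulate from a known linear model, fit by least squares, and compare the
# estimates to the assumed true values.
rng = np.random.default_rng(0)

n = 100
a, b, sigma = 2.0, 0.5, 1.0                  # assumed "true" intercept, slope, error sd
x = rng.uniform(0, 10, n)
y = a + b * x + rng.normal(0, sigma, n)

X = np.column_stack([np.ones(n), x])          # design matrix with an intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true (a, b):     ", (a, b))
print("estimated (a, b):", np.round(coef, 2))

That simulate-then-fit loop is the basic move we keep coming back to for checking models and building intuition.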

STAT 6106, Communicating Data and Statistics:

This is a one-semester course on communicating data and statistics, covering the following modes of communication:
• Writing (including storytelling, writing technical articles, and writing for general audiences)
• Statistical graphics (including communicating variation and uncertainty)
• Oral communication (including teaching, collaboration, and giving presentations).
The course is set up using the principles of active learning, with class time devoted to discussions, collaborative work, practicing and evaluation of communication skills, and conversations with expert visitors.

The primary audience for this course is Statistics Ph.D. students, and it should also be ideal for Ph.D. students who do quantitative work in other departments and schools. Communication is sometimes thought of as a soft skill, but it is essential to statistics and scientific research more generally!

See you there:

Both courses have lots of space available, so check them out! In-person attendance is required, as class participation is crucial for both. POLS 4720 is offered Tu/Th 8:30-10am; STAT 6106 will be M/W 8:30-10am. These are serious classes, with lots of homework. Enjoy.

Artificial intelligence and aesthetic judgment

This is Jessica. In a new essay reflecting on how we get tempted to aestheticize generative AI, Ari Holtzman, Andrew, and I write: 

Generative AIs produce creative outputs in the style of human expression. We argue that encounters with the outputs of modern generative AI models are mediated by the same kinds of aesthetic judgments that organize our interactions with artwork. The interpretation procedure we use on art we find in museums is not an innate human faculty, but one developed over history by disciplines such as art history and art criticism to fulfill certain social functions. This gives us pause when considering our reactions to generative AI, how we should approach this new medium, and why generative AI seems to incite so much fear about the future. We naturally inherit a conundrum of causal inference from the history of art: a work can be read as a symptom of the cultural conditions that influenced its creation while simultaneously being framed as a timeless, seemingly acausal distillation of an eternal human condition. In this essay, we focus on an unresolved tension when we bring this dilemma to bear in the context of generative AI: are we looking for proof that generated media reflects something about the conditions that created it or some eternal human essence? Are current modes of interpretation sufficient for this task? Historically, new forms of art have changed how art is interpreted, with such influence used as evidence that a work of art has touched some essential human truth. As generative AI influences contemporary aesthetic judgment we outline some of the pitfalls and traps in attempting to scrutinize what AI generated media “means.”

I’ve worked on a lot of articles in the past year or so, but this one is probably the most out-of-character. We are not exactly humanities scholars. And yet, I think there is some truth to the analogies we are making. Everywhere we seem to be witnessing the same sort of beauty contest, where some interaction with ChatGPT or another generative model is held up for scrutiny, and the conclusion drawn that it lacks a certain emergent “je ne sais quoi” that human creative expressions  like great works of art achieve. We approach our interactions as though they have the same kind of heightened status as going to a museum, where it’s up to us to peer into the work to cultivate the right perspective on the significance of what we are seeing, and try to anticipate the future trajectory of the universal principle behind it.   

At the same time, we postulate all sorts of causal relationships where conditions under which the model is created are thought to leave traces in the outputs – from technical details about the training process to the values of the organizations that give us the latest models  – just like we analyze the hell out of what a work of art says about the culture that created it. And so we end up in a position where we can only recognize what we’re looking for when we see it, but what we are looking for can only be identified by what is lacking. Meanwhile, the artifacts that we judge can be read as a signal of anything and everything at once.

If this sounds counterproductive (because it is), it’s worth considering why these kinds of contradictory modes of reading objects have arisen in the past over the history of art: to keep fears at bay. By making our judgments as spectators seem essential to understanding the current moment, we gain a feeling of control.  

And so, despite these contradictions, we see our appraisals of model outputs in the current moment as correct and arising from some innate ability we have to recognize human intelligence. But aesthetic judgments have never been fixed – they have always evolved along with innovations in our ability to represent the world, whether through painting or photography or contemporary art. We should expect the same with judgments of generative AI. We conclude by considering how the idea of taste and aesthetic judgment might continue to shape our interactions with generative model outputs, from “wireheading” to generative AI as a kind of art historical tool we can turn toward taste itself.

Blogging is “destroying the business model for quality”?

I happened to come across this post from 2011 about a now-forgotten journalist who was upset that bloggers were “destroying the business model for quality” in writing by flooding the market with free and crappy content.

It’s all so quaint. To be a journalist and to think that Public Enemy #1 is blogging . . . wow!

Here we are 12 years later and blogging has pretty much disappeared. This makes me sad. But that dude might well be happy about this state of affairs!

(from 2017 but still relevant): What Has the Internet Done to Media?

Aleks Jakulin writes:

The Internet emerged by connecting communities of researchers, but as Internet grew, antisocial behaviors were not adequately discouraged.

When I [Aleks] coauthored several internet standards (PNG, JPEG, MNG), I was guided by the vision of connecting humanity. . . .

The Internet was originally designed to connect a few academic institutions, namely universities and research labs. Academia is a community of academics, which has always been based on the openness of information. Perhaps the most important to the history of the Internet is the hacker community composed of computer scientists, administrators, and programmers, most of whom are not affiliated with academia directly but are employed by companies and institutions. Whenever there is a community, its members are much more likely to volunteer time and resources to it. It was these communities that created websites, wrote the software, and started providing internet services.

“Whenever there is a community, its members are much more likely to volunteer time and resources to it” . . . so true!

As I wrote a few years ago, Create your own community (if you need to).

But it’s not just about community; you also have to pay the bills.

Aleks continues:

The skills of the hacker community are highly sought after and compensated well, and hackers can afford to dedicate their spare time to the community. Society is funding universities and institutes who employ scholars. Within the academic community, the compensation is through citation, while plagiarism or falsification can destroy someone’s career. Institutions and communities have enforced these rules both formally and informally through members’ desire to maintain and grow their standing within the community.

Lots to chew on here. First, yeah, I have skills that allow me to be compensated well, and I can afford to dedicate my spare time to the community. This is not new: back in the early 1990s I wrote Bayesian Data Analysis in what was essentially my spare time; indeed, my department chair advised me not to do it at all—master of short-term thinking that he was. As Aleks points out, there was a time when a large proportion of internet users had this external compensation.

The other interesting thing about the above quote is that academics and tech workers have traditionally had an incentive to tell the truth, at least on things that can be checked. Repeatedly getting things wrong would be bad for your reputation. Or, to put it another way, you could be a successful academic and repeatedly get things wrong, but then you’d be crossing the John Yoo line and becoming a partisan hack. (Just to be clear, I’m not saying that being partisan makes you a hack. There are lots of scholars who express strong partisan views but with intellectual integrity. The “hack” part comes from getting stuff wrong, trying to pass yourself off as an expert on topics you know nothing about, ultimately being willing to say just about anything if you think it will make the people on your side happy.)

Aleks continues:

The values of academic community can be sustained within universities, but are not adequate outside of it. When businesses and general public joined the internet, many of the internet technologies and services were overwhelmed with the newcomers who didn’t share their values and were not members of the community. . . . False information is distracting people with untrue or irrelevant conspiracy theories, ineffective medical treatments, while facilitating terrorist organization recruiting and propaganda.

I’ve not looked at data on all these things, but, yeah, from what I’ve read, all that does seem to be happening.

Aleks then moves on to internet media:

It was the volunteers, webmasters, who created the first websites. Websites made information easily accessible. The website was property and a brand, vouching for the reputation of the content and data there. Users bookmarked those websites they liked so that they could revisit them later. . . .

In those days, I kept current about the developments in the field by following newsgroups and regularly visiting key websites that curated the information on a particular topic. Google entered the picture by downloading all of Internet and indexing it. . . . the perceived credit for finding information went to Google and no longer to the creators of the websites.

He continues:

After a few years of maintaining my website, I was no longer receiving much appreciation for this work, so I have given up maintaining the pages on my website and curating links. This must have happened around 2005. An increasing number of Wikipedia editors are giving up their unpaid efforts to maintain quality in the fight with vandalism or content spam. . . . On the other hand, marketers continue to have an incentive to put information online that would lead to sales. As a result of depriving contributors to the open web with brand and credit, search results on Google tend to be of worse quality.

And then:

When Internet search was gradually taking over from websites, there was one area where a writer’s personal property and personal brand were still protected: blogging. . . . The community connected through the comments on blog posts. The bloggers were known and personally subscribed to.

That’s where I came in!

Aleks continues:

Alas, whenever there’s an unprotected resource online, some startup will move in and harvest it. Social media tools simplified link sharing. Thus, an “influencer” could easily post a link to an article written by someone else within their own social media feed. The conversation was removed from the blog post and instead developed in the influencer’s feed. As a result, carefully written articles have become a mere resource for influencers. As a result, the number of new blogs has been falling.
Social media companies like Twitter and Facebook reduced barriers to entry by making it so easy to refer to others’ content . . .

I hadn’t thought about this, but, yeah, good point.

As a producer of “content”—for example, what I’m typing right now—I don’t really care if people come to this blog from Google, Facebook, Twitter, an RSS feed, or a link on their browser. (There have been cases where someone’s stripped the material from here and put it on their own site without acknowledging the source, but that’s happened only rarely.) Any of those legitimate ways of reaching this content is fine with me: my goal is just to get it out there, to inform people and to influence discussion. I already have a well-paying job, so I don’t need to make money off the blogging. If it did make money, that would be fine—I could use it to support a postdoc—but I don’t really have a clear sense of how that would happen, so I haven’t ever looked into it seriously.

The thing I hadn’t thought about was that, even if it doesn’t matter to me where our readers are coming from, it does matter to the larger community. Back in the day, if someone wanted to link or react to something on a blog, they’d do it on their own blog or in a comment section. Now they can do it from Facebook or Twitter. The link itself is no problem, but there’s less of an expectation of providing new content along with it. Also, Facebook and Twitter are their own communities, which have their strengths but which are different from those of blogs. In particular, blogging facilitates a form of writing where you fill in all the details of your argument, where you can go on tangents if you’d like, and where you link to all relevant sources. Twitter has the advantage of immediacy, but it often seems more like community without the content, where people can say what they love or hate but without the space for giving their reasons.

“They got a result they liked, and didn’t want to think about the data.” (A fish story related to Cannery Row)

John “Jaws” Williams writes:

Here is something about a century-old study that you may find interesting, and could file under “everything old is new again.”

In 1919, the California Division of Fish and Game began studying the developing sardine fishery in Monterey. Ten years later, W. L. Scofield published an amazingly thorough description of the fishery, the abstract of which begins as follows:

The object of this bulletin is to put on record a description of the Monterey sardine fishery which can be used as a basis for judging future changes in the conduct of this industry. Detailed knowledge of changes is essential to an understanding of the significance of total catch figures, or of records of catch per boat or per seine haul. It is particularly necessary when applying any form of catch analysis to a fishery as a means of illustrating the presence or absence of depletion or of natural fluctuations in supply.

As detailed in this and subsequent reports, the catch was initially limited by the market and the capacity of the fishing fleet, both of which grew rapidly for several decades and provided the background for John Steinbeck’s “Cannery Row.” Later, the sardine population famously collapsed and never recovered.

Sure enough, just as Scofield feared, scientists who did not understand the data subsequently misused it as reflecting the sardine population, as I pointed out in this letter (which got the usual kind of response). They got a result they liked, and didn’t want to think about the data.

The Division of Fisheries was not the only agency to publish detailed descriptive reports. The USGS and other agencies did as well, but generally they have gone out of style; they take a lot of time and field work, are expensive to publish, and don’t get the authors much credit.

This comes to mind because I am working on a paper about a debris flood on a stream in one of the University of California’s natural reserves, and the length limits for the relevant print journals don’t allow for a reasonable description of the event and a discussion of what it means. However, now I can write a separate and more complete description, and have it go as on-line supplementary material. There is some progress.
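The statistical point is worth spelling out with a toy example (the numbers below are invented, nothing to do with Scofield’s actual data). If fleet capacity grows fast enough, total catch keeps rising even while the stock declines, so reading raw catch figures as an index of abundance gets the trend exactly backwards:

import numpy as np

# Invented numbers, not Scofield's data: catch depends on effort as well as
# abundance (catch = q * effort * population), so catch can rise while the
# population falls if the fleet grows quickly enough.
years = np.arange(1920, 1946)
t = years - 1920
population = 1000 * np.exp(-0.06 * t)    # stock declining ~6% per year (arbitrary units)
effort = 5 * np.exp(0.12 * t)            # fleet capacity growing ~12% per year
q = 0.01                                 # catchability coefficient
catch = q * effort * population

for year, c, pop in zip(years, catch, population):
    print(f"{year}: catch {c:7.1f}   population {pop:7.1f}")

Looking at the catch column alone, you’d conclude the fishery was booming right up until the moment it wasn’t.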

The Ten Craziest Facts You Should Know About A Giraffe

Palko points us to this story:

USC oncologist David Agus’ new book is rife with plagiarism

The publication of a new book by Dr. David Agus, the media-friendly USC oncologist who leads the Lawrence J. Ellison Institute for Transformative Medicine, was shaping up to be a high-profile event.

Agus promoted “The Book of Animal Secrets: Nature’s Lessons for a Long and Happy Life” with appearances on CBS News, where he serves as a medical contributor, and “The Howard Stern Show,” where he is a frequent guest. Entrepreneur Arianna Huffington hosted a dinner party at her home in his honor. The title hit No. 1 on Amazon’s list of top-selling books about animals a week before its March 7 publication.

However, a [Los Angeles] Times investigation found at least 95 separate passages in the book that resemble — sometimes word for word — text that originally appeared in other published sources available on the internet. The passages are not credited or acknowledged in the book or its endnotes. . . .

The passages in question range in length from a sentence or two to several continuous paragraphs. The sources borrowed from without attribution include publications such as the New York Times and National Geographic, scientific journals, Wikipedia and the websites of academic institutions.

The book also leans heavily on uncredited material from smaller and lesser-known outlets. A section in the book on queen ants appears to use several sentences from an Indiana newspaper column by a retired medical writer. Long sections of a chapter on the cardiac health of giraffes appear to have been lifted from a 2016 blog post on the website of a South African safari company titled, “The Ten Craziest Facts You Should Know About A Giraffe.”

Never trust a guy who wears a button down shirt and sweater and no tie.

The author had something to say:

“I was recently made aware that in writing The Book of Animal Secrets we relied upon passages from various sources without attribution, and that we used other authors’ words. I want to sincerely apologize to the scientists and writers whose work or words were used or not fully attributed,” Agus said in a statement. “I take any claims of plagiarism seriously.”

From the book:

“I’m not pitching a tent to watch chimpanzees in Tanzania or digging through ant colonies to find the long-lived queen, for example,” he writes. “I went out and spoke to the amazing scientists around the world who do these kinds of experiments, and what I uncovered was astonishing.”

All good, except that when he said, “I went out and spoke to the amazing scientists around the world,” he meant to say, “I went on Google and looked up websites of every South African safari company I could find.”

“The Ten Craziest Facts You Should Know About A Giraffe,” indeed.

And here are a few relevant screenshots:

I have no idea what that light bulb thingie is doing in that last image, but here’s some elaboration:

“Research misconduct,” huh? I guess if USC ever gives Dr. Agus a hard time about that, he could just move a few hundred miles to the north, where they don’t care so much about that sort of thing.

Why is every action hero named Jack, John, James, or, occasionally, Jason, but never Bill, Bob, or David?

Demetria Glace writes:

I wasn’t the first to make the connection, but once I noticed it, it was everywhere. You walk past a poster for a new movie and think, Why is every action hero named Jack, John, James, or, occasionally, Jason?

I turned to my friends and colleagues, asking desperately if they had also noticed this trend, as I made my case by listing off well-known characters: John Wick, Jason Bourne, Jack Reacher, John McClane, James Bond, Jack Bauer, and double hitter John James Rambo. . . .

As a data researcher, I [Glace] had to get to the bottom of it. What followed was months of categorizing hundreds of action movies, consulting experts in the field of name studies, reviewing academic papers and name databases, and seeking interviews with authors and screenwriters as to the rationale behind their naming decisions. . . .

Good stuff. It’s fun to see a magazine article with the content of a solid blog post.

Don’t get me wrong, I enjoy reading magazines. But magazine articles, even good magazine articles, follow a formula: they start off with a character and maybe an anecdote, then they ease into the main topic, they follow through with a consistent story, ending it all with a pat summary. By contrast, a blog post can start anywhere, go wherever it wants, and, most importantly, does not need to come to a coherent conclusion. The above-linked article on hero names was like that, and I was happy to see it running in Slate.