Usually I don’t post answers to questions right away, but Mark Liberman was kind enough to answer my question yesterday so I think I should reciprocate. Mark asks:
I’ve been playing around with data from Coursera transaction logs, for an economics course and a modern poetry course so far.
For the Modern Poetry course, where there’s quite a bit of activity in the forums, the instructor (Al Filreis) is interested in what the factors are that lead to discussion threads being longer or shorter. For example, he wonders whether his own (fairly frequent) interventions have the effect of prolonging discussion or cutting it off.
With respect to Al’s specific question, my thought was to look at each of his comments, each one being the nth in some sequence, and to look at the empirical probability of continuing (at all, or perhaps for at least 1,2,3,… additional turns) in those cases compared to the same numbers for nth comments contributed by others.
More generally, I wondered about doing some kind of regression to predict the probability of continuation as a function of whatever factors (current length of the thread, size of the current comment, some function of the words in the current comment, etc.) But this is presumably a well-trodden path in survival analysis, about which I know very little. Any suggestions for reading material?
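Mark's first idea, comparing how often a thread continues after Al's nth comment with how often it continues after someone else's nth comment, can be sketched in a few lines. This is only an illustration under assumptions of my own: each thread is a list of author names in posting order, the instructor is labeled "al", and the function name and toy data are invented.

```python
from collections import defaultdict

def continuation_rates(threads, instructor="al"):
    """For each position n, return the share of nth comments that were
    followed by at least one more comment, split by who wrote them.

    threads: list of threads, each a list of author names in posting order
    (a hypothetical data layout, not the actual Coursera log format).
    """
    # counts[n][group] = [comments followed by another comment, total comments]
    counts = defaultdict(lambda: {"instructor": [0, 0], "others": [0, 0]})
    for authors in threads:
        for i, author in enumerate(authors):
            group = "instructor" if author == instructor else "others"
            continued = i < len(authors) - 1  # is there a later comment?
            counts[i + 1][group][0] += int(continued)
            counts[i + 1][group][1] += 1
    rates = {}
    for n, groups in sorted(counts.items()):
        # None marks a position where that group never posted
        rates[n] = {g: (c / t if t else None) for g, (c, t) in groups.items()}
    return rates

# Toy data, invented purely for illustration
toy_threads = [
    ["s1", "al", "s2", "s3"],
    ["s2", "s4"],
    ["s1", "al"],
    ["s3", "s1", "s2"],
]
rates = continuation_rates(toy_threads)
print(rates[2])
```

One caution: a thread that is still open when the logs are pulled will look as if it ended at its last comment, so the rates for recent threads may be biased downward.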
My reply: I think just about any analysis you do will be useful. (I say this partly because Mark’s analyses on Language Log always look pretty excellent to me, also because more generally it’s a good idea to play with the data rather than sitting in the corner trying to come up with the ideal analysis.)
But, to be more specific, my inclination, given the way the question was framed, is to set it up as an observational study. In this case, the experimental items are class periods, or segments of class periods; the “treatment group” is the set of periods where Al intervenes, and the “control group” is the set of periods where Al doesn’t intervene. To me, the natural way to proceed is to put together a bunch of treatment and control cases from your data, then get pre-treatment background variables (time of day, day of week, time during the semester, whether there is an exam coming up, number of discussions the previous day, etc.) and compare various outcomes of interest (basically, whatever flows from the treatment). The treatment itself can also be considered as a continuous variable (for example, how much Al intervenes rather than simply whether he does). Chapter 9 of my book with Jennifer should give the basic idea. Any “survival analysis” aspect of the problem will come up naturally in terms of the pre-treatment variables used in the regression, or in the coding of the post-treatment outcomes.
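To make the treatment/control comparison concrete, here is a minimal sketch. It is not a full regression: it stands in a simpler technique, stratifying on a single pre-treatment variable and averaging the within-stratum differences. The field names ("treated", "outcome", "day_type") and the toy numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def stratified_effect(cases, stratum_key):
    """Weighted average of within-stratum differences in mean outcome
    (treated minus control), weighted by stratum size.

    cases: list of dicts with "treated" (bool), "outcome" (number), and
    pre-treatment variables such as stratum_key (hypothetical layout).
    Strata that lack either treated or control cases are skipped.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for case in cases:
        strata[case[stratum_key]][case["treated"]].append(case["outcome"])

    weighted_sum, total_weight = 0.0, 0
    for groups in strata.values():
        treated, control = groups[True], groups[False]
        if not treated or not control:
            continue  # no comparison possible within this stratum
        weight = len(treated) + len(control)
        weighted_sum += weight * (mean(treated) - mean(control))
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no stratum contains both treated and control cases")
    return weighted_sum / total_weight

# Toy data: outcome = number of later comments (all values invented)
toy_cases = [
    {"treated": True,  "outcome": 5, "day_type": "weekday"},
    {"treated": True,  "outcome": 7, "day_type": "weekday"},
    {"treated": False, "outcome": 4, "day_type": "weekday"},
    {"treated": False, "outcome": 4, "day_type": "weekday"},
    {"treated": True,  "outcome": 3, "day_type": "weekend"},
    {"treated": False, "outcome": 2, "day_type": "weekend"},
    {"treated": False, "outcome": 2, "day_type": "weekend"},
    {"treated": False, "outcome": 2, "day_type": "weekend"},
]
adjusted = stratified_effect(toy_cases, "day_type")
raw = (mean(c["outcome"] for c in toy_cases if c["treated"])
       - mean(c["outcome"] for c in toy_cases if not c["treated"]))
print(adjusted, raw)
```

In the toy data the raw difference is larger than the adjusted one because the interventions happen mostly on weekdays, when discussion is livelier anyway. With several background variables you would move to a regression, but the logic of comparing like with like is the same.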
Also, if you have enough data, you can try to uncover treatment interactions, so that the question is not, “Do the prof’s interventions prolong discussion or cut it off?” but, rather, “Under what circumstances do the prof’s interventions prolong discussion?” etc.
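In the simplest case, with a yes/no treatment and a yes/no background variable, a treatment interaction is just the difference between two treatment effects. The sketch below assumes cases stored as dicts with a boolean "treated" flag, a numeric "outcome", and a boolean moderator I have called "long_thread"; all of these names and numbers are hypothetical.

```python
from statistics import mean

def interaction_2x2(cases, moderator_key):
    """Return (effect when moderator is False, effect when moderator is True,
    their difference), where each effect is the treated-minus-control
    difference in mean outcome within that subgroup."""
    def effect(flag):
        treated = [c["outcome"] for c in cases
                   if c["treated"] and c[moderator_key] == flag]
        control = [c["outcome"] for c in cases
                   if not c["treated"] and c[moderator_key] == flag]
        if not treated or not control:
            raise ValueError("subgroup lacks treated or control cases")
        return mean(treated) - mean(control)

    low, high = effect(False), effect(True)
    return low, high, high - low

# Toy data, invented: outcome = number of later comments
toy_cases = [
    {"treated": True,  "long_thread": False, "outcome": 3},
    {"treated": True,  "long_thread": False, "outcome": 5},
    {"treated": False, "long_thread": False, "outcome": 2},
    {"treated": False, "long_thread": False, "outcome": 2},
    {"treated": True,  "long_thread": True,  "outcome": 6},
    {"treated": True,  "long_thread": True,  "outcome": 6},
    {"treated": False, "long_thread": True,  "outcome": 8},
    {"treated": False, "long_thread": True,  "outcome": 10},
]
short_effect, long_effect, interaction = interaction_2x2(toy_cases, "long_thread")
print(short_effect, long_effect, interaction)
```

In this made-up example the intervention appears to prolong short threads and cut off long ones, which is exactly the kind of pattern an overall average would hide. Subgroup estimates like these are noisy, which is why this only makes sense with enough data.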