Mark Liberman posted some interesting summaries of telephone speech records from the Linguistic Data Consortium. He writes:
I [Mark Liberman] took a quick look at demographic variation in the frequency of the filled pauses conventionally written as “uh” and “um”. For technical reasons that I won’t go into here, I used the frequency of the definite article “the” as the basis for comparison. Thus I selected a group of speakers (e.g. men aged 60-69), counted how often they were transcribed as saying “uh”, and to normalize that count (since the number of people in each category was different) I divided by the number of times the same speakers were transcribed as saying “the”.
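To make that normalization concrete, here is a minimal Python sketch. The function name `filler_rate`, the whitespace tokenization, and the toy transcripts are all my own assumptions for illustration; this is not Mark's code and the data are made up, not taken from the LDC records.

```python
from collections import Counter


def filler_rate(transcripts, filler, baseline="the"):
    """Return count(filler) / count(baseline) over a group's transcripts.

    Tokens are obtained by lowercasing and splitting on whitespace,
    which assumes punctuation has already been stripped.
    """
    counts = Counter()
    for text in transcripts:
        counts.update(text.lower().split())
    if counts[baseline] == 0:
        raise ValueError(f"baseline word {baseline!r} never occurs")
    return counts[filler] / counts[baseline]


# Made-up transcripts, for illustration only (not LDC data).
groups = {
    "men 60-69": ["uh the dog chased the cat", "the um the uh weather"],
    "women 60-69": ["um the train left the station", "the uh ticket"],
}

# One normalized "uh" rate per demographic group.
rates = {group: filler_rate(texts, "uh") for group, texts in groups.items()}
```

Dividing by the count of “the” plays the role of a per-group exposure measure, so groups with different numbers of speakers (or more talkative speakers) can be compared on the same footing.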
He also did “Um”:
And now, some contentless comments about graphical presentation:
1. I like the clear axis labels and titles, and even more importantly, that the lines are labeled directly (rather than using different dotted lines and a key). Good labeling is important–I do it even for the little graphs I’m making in my own research when exploring data or model fits.
2. I would’ve used blue for boys and pink for girls–easier to remember–although perhaps Mark was purposely trying to be non-stereotypical.
3. My biggest change would have been to (a) put the two graphs on a common scale and (b) make them smaller and put them next to each other. Smaller graphs allow us to see more at once, and to see patterns that can be obscured when we are forced to scroll back and forth between multiple plots. In R, I do par(mfrow=c(2,2)) as a default.
4. I would have the bottom of each graph go to 0, since that’s a natural baseline (the zero-uh and zero-um level that we might all like to try to reach!). There’s been some debate about the “start-at-zero rule” but I usually favor it in a situation such as this, where it doesn’t require much extension of the axis.
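Points 3 and 4 together amount to choosing one y-axis range for both panels, with the lower end fixed at zero. Here is a small Python sketch of that calculation. The helper name `shared_ylim`, the 5% headroom, and the numbers are hypothetical, and it assumes all values are nonnegative (as ratios of counts are). The resulting pair can be handed to whatever plotting tool you use, for example as `ylim` in R's plot().

```python
def shared_ylim(*series, pad=0.05):
    """Return (lower, upper) y-limits common to every series.

    The lower limit is fixed at 0 (the natural baseline); the upper limit
    is the largest value in any series plus a little headroom.
    Assumes all values are nonnegative.
    """
    top = max(max(s) for s in series)
    return (0.0, top * (1 + pad))


# Made-up ratios by age group, for illustration only.
uh_over_the = [0.10, 0.12, 0.15]
um_over_the = [0.08, 0.06, 0.04]

# The same limits would be used for both the "uh" and the "um" panel.
ylim = shared_ylim(uh_over_the, um_over_the)
```

With a common range, a difference in height between the two panels reflects a difference in the data and not a difference in axis scaling.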
Anyway, Mark’s blog entry has much more on this interesting data source.
Caroline says “emmm” instead of “ummm.” Is this standard among native Spanish speakers?