Empirical implications of Empirical Implications of Theoretical Models

Robert Bloomfield writes:

Most of the people in my field (accounting, which is basically applied economics and finance, leavened with psychology and organizational behavior) use ‘positive research methods’, which are typically described as coming to the data with a predefined theory, and using hypothesis testing to accept or reject the theory’s predictions. But a substantial minority use ‘interpretive research methods’ (sometimes called qualitative methods, for those that call positive research ‘quantitative’). No one seems entirely happy with the definition of this method, but I’ve found it useful to think of it as an attempt to see the world through the eyes of your subjects, much as Jane Goodall lived with gorillas and tried to see the world through their eyes.

Interpretive researchers often criticize positive researchers by noting that the latter don’t make the best use of their data, because they come to the data with a predetermined theory, and only test a narrow set of hypotheses to accept or reject. In contrast, the interpretive researchers dwell in their data (interviews and observation) to develop their theories. Sure, it might fail Popper’s requirements of falsifiability, but is that such an unreasonable tradeoff if you get to use all of your data?

Now to my questions. First, how different is the interpretive argument from the usual Bayesian critique of frequentist hypothesis testing? Isn’t much of the argument that frequentist hypothesis-testers are ignoring the bulk of their data in order to get a yay or nay on one narrow question? Second, does the increased focus on big data make quantitative statistics even more consistent with the interpretive view? If someone is given the first crack at a big data set, isn’t the Bayesian recommendation to wallow in the data and use it to develop theory, rather than spell out some narrow hypotheses and try to reject or accept them using (most likely) a small slice of the big data?

No doubt I’ve muddied the issues by using loaded terms like Bayesian, frequentist, and big data. But hopefully the question makes enough sense to be answered. Any references you or your readers could provide would be greatly appreciated.

As an aside, I should mention that I think that most positive researchers aren’t nearly as pure as Popper and other falsificationists would demand. Most of the studies in my field are statistical analyses of the exact same data sets studied by many researchers before them (stock prices, accounting disclosures and the like), so basically every paper tests theories that have been developed by others who examined almost exactly the same data sets. And no one admits it, but they all run a host of analyses before reporting their final hypothesis tests. I’m not criticizing, only noting that they seem to dwell in their data more like an interpretive researcher than one would guess from reading their published papers.

My reply:

In political science, I’ve sometimes heard the phrase, “empirical implications of theoretical models,” which sounds similar to what is called “positive research methods” in your field. In either case it looks like a sort of extreme Popperism that goes as follows:

1. Researcher A uses theory to come up with a clear set of testable hypotheses.

2. Researcher B gathers data and comes to a conclusion, which is either:

(a) Whew! The hypothesis is not rejected. It lives another day and is stronger by virtue of surviving the test; or

(b) Wow! The hypothesis is rejected. Start the revolution.

3. Either way, you win. You’ve either supported a model you like, or you’ve shot down a formerly-live model. Typically, though, the goal is (a). You’re supposed to specify the hypothesis in a way that can be testable, then you’re supposed to design the data collection in a way so that you’re likely to learn something about the hypothesis (or perhaps to distinguish between two hypotheses, although that formulation has always made me uncomfortable, in that once we entertain hypothesis X and hypothesis Y, I’d like to do continuous model expansion and consider X and Y to be special cases of a more general hypothesis Z, and at this point I consider “X or Y” to be a dead question).

In theory, researchers A and B can come from different research groups—they don’t even need to know each other—but there seems to be a sense that it’s best for them to be the same person. In this way, empirical implications of theoretical models seems to be following the James Taylor rather than the Frank Sinatra model. The singer is the songwriter.

One difficulty of the Empirical Implications paradigm is that, in practice, the testable hypotheses can be pretty vague and can map many different ways into empirical data. This is what Eric Loken and I call the garden of forking paths.

Popper turned upside down

Even setting aside any difficulties of specifying research hypotheses, one odd thing about the above Popperian approach is that the hypothesis being tested is, in the terms of classical statistical analysis, an “alternative hypothesis” not a “null hypothesis.”

To put it yet another way, the research hypothesis (the “theoretical model” whose “empirical implications” are being tested) is (typically) something you want to uphold, whereas the statistical null hypothesis is a straw man (or, as Dave Krantz would say, a straw person) that you want to reject.

And I’m not just playing with words here. The two hypotheses are tightly linked, in that support of the research hypothesis (an outcome of the form 2(a) in the above story) is typically defined precisely as a statistical rejection of the null hypothesis.

This is Popper turned upside down. It could still all make sense—but we just have to keep in mind that rejection in the Neyman-Pearson sense counts as non-rejection in the Popper sense, and non-rejection in the Neyman-Pearson sense counts as rejection in the Popper sense. I don’t think this point is generally clear at all.

To get back to the original question

In my own research, I think data exploration is super-important. Much of my most influential and important applied work never could or would have happened had I been required to follow the Empirical Implications playbook and write out my hypotheses ahead of time. For example, see here, here, and here. In all these cases, it was only through lots of data analyses that my collaborators and I came to our research hypotheses. I see the appeal of Empirical Implications in pushing researchers to a sort of intellectual discipline but I don’t think it would work for me.

Finally, I think the whole Bayesian thing is a red herring here. As Uri Simonsohn has explained, you can p-hack just as well with Bayes as with any other approach. As I see it, the problem is not with classical rules or even with p-values but rather comes earlier, with a confusion between research hypotheses and statistical hypotheses and an attitude that an extreme p-value or posterior probability or whatever can be taken as strong evidence, without reference to how that number came about.
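To make the “red herring” point concrete, here is a minimal simulation sketch (mine, not anything from Simonsohn’s paper; the five-outcomes setup and the BIC-based Bayes factor approximation are just convenient assumptions). With no true effect and five outcome measures to choose from, reporting the most favorable result inflates the rate of small p-values and of large Bayes factors alike:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_outcomes, n_sims = 50, 5, 2000    # 50 per group, 5 outcomes the analyst can pick from
p_wins = bf_wins = 0

for _ in range(n_sims):
    best_p, best_bf = 1.0, 0.0
    for _ in range(n_outcomes):        # the analyst tries each outcome in turn
        treat = rng.normal(size=n)
        control = rng.normal(size=n)   # no true difference between the groups
        res = stats.ttest_ind(treat, control)
        t, p = res.statistic, res.pvalue
        n_tot = 2 * n
        # rough BIC approximation to the Bayes factor for "difference" vs "no difference"
        bf10 = np.exp((n_tot * np.log(1 + t**2 / (n_tot - 2)) - np.log(n_tot)) / 2)
        best_p, best_bf = min(best_p, p), max(best_bf, bf10)
    p_wins += best_p < 0.05
    bf_wins += best_bf > 3             # "positive evidence" by a common rule of thumb

print("null studies with some p < .05 :", p_wins / n_sims)   # around .23 rather than .05
print("null studies with some BF10 > 3:", bf_wins / n_sims)  # several times the single-test rate

Either summary statistic can be made to look like evidence once the analyst gets to choose which of the forking paths to report.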

35 thoughts on “Empirical implications of Empirical Implications of Theoretical Models”

  1. Quantitative should always be sandwiched between speculative qualitative and reflective qualitative, and it is maybe non-productive to argue which is more important.

    Roughly, possible -> actual -> new possible -> actual -> new new possible … until a momentary lack of doubt is encountered and one stops.

  2. 1. I think we could learn more and advise the questioner more, if he gave us some tactical examples of his work. Plus, heck it would be interesting. Hate abstracted discussions, let’s get some grease on our hands. For instance, how is accounting research different from finance research? (Just the department you are in? For example, you teach different topics to students but research is same? Or not?)

    2. With finance (and econ), there’s a pretty long history of people trying to come up with counter-conventional hypotheses (e.g. that you can beat CAPM, beat efficient markets). For example, saying that yesterday’s price move corresponds to today’s. Or the triple witching hour. Or the like. Usually, they are not that meaningful, for a couple of reasons. First, they don’t hold up in true out-of-sample testing [by which I mean AFTER the paper, even collecting several years’ data if needed]. This is because people have p-fished. Second, if they really have found an opportunity…people will trade on it and it will go away, once known.

    P.s. I still would like to know if Hotelling was a frequentist. Cause I really like him. And he’s famous. So if he was a frequentist, I can go “neener, neener” to the Bayesian cheerleaders. (Wikipedia and Google failed me.)

    • I don’t know what Hotelling called himself. Back then they probably wouldn’t have used “Bayesian” and “Frequentist”, rather they likely would have said “inverse probability” and just “statistics” respectively.

      His work was strongly in the Frequentist camp though, and indeed he can fairly be called one of the founders of Classical Statistics, so I don’t think anyone will object if you call him a Frequentist.

      It’s safe to say that without Frequentist Statisticians like Hotelling, Economics wouldn’t be the paradigm of predictive accuracy and reproducibility that it is today.

      So “neener, neener” away!

    • Here is one example that might make some of these issues more concrete. The Financial Accounting Standards Board (FASB) proposes and votes on standards for how US firms should account for events and transactions. They vote after receiving comment letters from anyone who wants to weigh in.

      In this paper, Allen and Ramanna compiled a data set of the votes of FASB members over many decades, and a separate data set indicating the extent to which comment letters thought the statement sacrificed reliability for relevance. For example, a standard that requires firms to report the fair value of an asset, rather than its original cost, is typically thought to make financial statements more relevant, because if the fair value were calculated appropriately, investors would much rather know that fair value than what the asset cost when it was bought. However, fair value standards are also thought to make the statements less reliable, because firms can massage reported fair values more easily than they can original cost numbers.

      The authors’ goal was to test theories about how standard setters vote. From the SSRN abstract, “we find FASB members with backgrounds in financial services are more likely to propose standards that decrease “reliability” and increase “relevance,” partly due to their tendency to propose fair-value methods. We find opposite results for FASB members affiliated with the Democratic Party, although only when excluding financial-services background as an independent variable.”

      The paper is written as a mix of hypothesis testing and ‘dwelling in the data’. The authors state some theories suggesting that auditors will tend to value reliability, that former bankers are comfortable with high-relevance/low-reliability standards, and that Democrats are less sensitive to corporate interests (with mixed implications for relevance and reliability). But it isn’t hard to see this as an exercise in letting the data speak.

      My biggest concern about the *conclusion* is that the Democrats on the board tend to come from the financial services industry, so it is hard to justify emphasizing party affiliation at all. But my concern about the *method* is quite different. The authors don’t interview any Board members to ask them why they voted as they did. They don’t dig into the text of the comment letters other than to use simple text searches for the first instance of words related to “relevance” or “reliability” to determine the letter writer’s concerns. They devote only a little effort to understanding the individual standards themselves. So they are letting their data speak, but not much else.

      I can imagine a very different type of paper, in which researchers would pore over the huge volume of due-process documents generated by FASB’s deliberations, and interview Board members and commenters, to determine why people vote the way they do, and how they balance corporate and investor interests (and ultimately relevance and reliability). While I can’t see an easy way to present this as a hypothesis-testing type of paper, it could easily form the foundation of a better version of Allen and Ramanna’s work. But such a paper would be difficult to publish (probably impossible) in a top Accounting journal.

      • As an aside, the original-cost versus fair-value tradeoff reminded me of the Frequentist vs. Bayesian choice (in some cases).

        Bayesian methods can be more relevant yet easier to manipulate to yield the conclusion of your choice.

        • Yes, I’ve thought about that connection as well. Accountants often talk about the virtue of being “precisely wrong”. Many rules, like “you can’t show internally-developed intellectual property as an asset” force businesses to report something known to be valuable as having a value of zero. We know that is wrong, but at least we know every law-abiding firm in the same circumstances is doing it the same way and getting the same wrong value. If firms estimated the value of their IP, they’d all be wrong in different ways.

      • > authors don’t interview any Board members to ask them why they voted as they did

        One of my concerns was/is that people in type 2 disciplines often skip over initial type 1 work that may be helpful or critical (and then also, afterwards, skip bringing the stories out to get salience, as Andrew would put it).

        RE: your query below, JC Gardin ( http://fr.wikipedia.org/wiki/Jean-Claude_Gardin ) did a fair bit of work on bringing the two solitudes together, in particular coming up with type 2 instantiations of type 1 work to understand it better (e.g. the computer program that generated a paper that looked so much like a Claude Lévi-Strauss (http://en.wikipedia.org/wiki/Claude_L%C3%A9vi-Strauss ) paper that the author mistakenly claimed it as his own work), but I’m not sure who to point to today.

  3. How about doing interpretive research to your heart’s content: dwell in the data, use all of it, play with it, yada yada, but ultimately come up with your novel, shining gem of a theory that the more boring, blinkered, pedantic “positive researchers” then get to verify against some new, wild data?

      • Robert Bloomfield’s comment sounded like an either/or. I agree with you that the modes are synergistic. It bothers me too when Type #1 papers try to pass themselves off as Type #2.

        • Andrew says “both types of papers can be published.” Perhaps in political science. In Accounting, 5 of the top 6 journals will basically publish papers only if they are (or are presented as) type 2 papers: they spell out formal hypotheses and test them statistically. There is only one journal, Accounting, Organizations and Society, based in the UK, which publishes interpretive/qualitative papers, almost always written by Europeans.

          The other problem we face is that the people who write interpretive/qualitative papers are typically not interested in or well trained in statistical analyses of data. So we really don’t get the type of interplay between type 1 and type 2 that Andrew talks about. I can’t think of a single example of a type 1-style paper that ‘dwells in data’, which has then provided the foundation for a type 2 paper that tests its theory more formally.

        • Is there a risk that a field allows too many Type-1 papers (often sensational) and that not many people want to do the drudgery of producing a follow-up Type-2 paper?

          Over time there’s a temptation to start thinking of the Type-1 papers not as exploratory but as conclusive. I think this sometimes happens. And that’s dangerous.

    • I should clarify. I think that most quantitative empirical papers in political science are of type 1 in your comment above, but they are mostly written as if they are of type 2. That’s what bothers me.

      • That’s why I advocate open kimono, full monty style papers. Keep it simple, be truthful. Don’t try to make it something it’s not. You might be surprised by how much easier such an approach gets through peer review. And how much better for readers!

        Just say it’s a type 1 paper, rack up the LPU and move onto the next notch on the belt. It’s good for you, additive to science, and just speeds things up for editors/reviewers. And it’s more manly.

        • Nony:

          The “manly” comment is tacky. I know you’re trolling, but it’s still tacky even as trolling. So please stop that. Otherwise, though, I agree with you. Open is good. Simple is fine too, but sometimes we have to be complicated because we care about complicated things or because complicated adjustments can be needed to adjust for data issues; for example, see my recent paper with Yair.

  4. It’s a bit misleading to think of Jane Goodall as a sort of ape ethnographer. She went into the field to collect data, not quite to “see the world through their eyes.” She and other researchers used the data they collected to test specific theories of primate group behavior. (Also, Goodall studied chimpanzees; Dian Fossey is better known for gorillas.) Primate field researchers might object to their work being characterized as “interpretive”. For a great account of the way data, theories, and researchers circulated in 20th century primatology, I recommend Donna Haraway’s book Primate Visions (Routledge, 1990).

  5. Interpretive methods are about a lot more than simply listening to the data vs. bringing a hypothesis to the data. There are many forms of interpretivism — so it is hazardous to say anything is fully general. It is quite common, though, for interpretivist methodologists to also (1) question the possibility of objectivity and (2) question the possibility of inference to general rules or laws.

    • I think (1) is silly & nihilistic & if you assume (2) why bother with interpretive methods even? The interpretivism gained us nothing beyond that particular experience.

      • You are certainly not alone in those replies. From an interpretivist perspective, the point is expression and the development of individual perspectives rather than the discovery of covering laws. There are several books from Dvora Yanow that set out the foundations of interpretivist methods for social sciences. I don’t endorse this perspective but I wanted to correct what was sounding like a limited and misleading characterization of interpretivist methods as simply a move away from Popperian approaches to falsification and a willingness to be more open to the data. For an approach closer to that, see “grounded theory” (Glaser and Strauss). The two (grounded theory and interpretivism) are not necessarily related but both fall under the umbrella term “qualitative.”

        • This was one attempt at such a dialogue – Representations in archaeology edited by Jean-Claude Gardin and Christopher S. Peebles. 1992.

  6. Using hypothesis testing requires some thought, yes, but I don’t think Neyman-Pearson nor Popper would have claimed to have removed this from inquiry.

    In some cases the null you test is your theory – eg when looking for discrepancies under newly testable conditions under which your theory has some chance to fail. In other cases the null is more of a ‘strawman’ ie a possible error you want to rule out – but didn’t Popper advocate comparative theory evaluation and that the best tested theory should be adopted? That should include a range of tests of a range of theories, including those you personally don’t want to be true but should rule out.

    WRT your discussion – if you observe an effect and one of the competing possible explanations is random chance then wouldn’t you want to test (and reject presumably) that competing hypothesis? How is this un-Popperian? Obviously you also want to test your personal theory too, either via embedding it as an alternative to this null or carrying out a different test.

    Is the problem that there is no guaranteed algorithm that directly translates Popperian ideas into Neyman-Pearson tests automatically? Surely this feature is Popperian through and through! And also seems to go against the accusations of rigidity compared to interpretive or whatever methods.

    • Hjk:

      To me, to be Popperian is to take the model that you like, that you want to use to explain the world, and test it, so that a rejection represents that something new was learned, highlighting a discrepancy between the model and reality.

      In contrast, in null hypothesis significance testing as is commonly framed, the null hypothesis is not a model that the researcher likes. Rather, the researcher finds confirmation by rejecting a null hypothesis. “Random chance” is not the model that the researcher typically is interested in.

      To put it another way, consider what happens to any model in social science if you get enough data or if you look hard enough at existing data. From my perspective (which I consider Popperian), you will eventually find problems with the model, which will cause the model to be discarded or to be improved (by altering the “protective belt” of theories that surround the model, as Lakatos would say). But from the classical null hypothesis significance testing perspective, the story is the opposite: you’ll eventually get enough ammunition to shoot down the null hypothesis, thus getting a confirmation. That was my point above: the language of “rejection” in hypothesis testing has, in practice, the opposite meaning as “rejection” in Popper’s philosophy (or in my own philosophy based on predictive checks, as discussed in my paper with Shalizi).

      And, no, I have never accused the null hypothesis significance testing approach of “rigidity.” It’s the opposite. The classical hypothesis-testing approach is all too flexible (recall the garden of forking paths) and it is used as a way to allow theories to appear to be confirmed repeatedly. “Psychological Science”-type theories are, as Popper said of Freudianism and Marxism, unrefutable: they can be interpreted in a way to fit any data. And significance testing is a tool that enables this. That’s not to say that these theories are empty. Freudianism and Marxism aren’t empty either. They just are frameworks, they’re not predictive models that can be refuted from data.

      • Thanks for your interesting response Andrew.

        You say:
        “To me, to be Popperian is to take the model that you like, that you want to use to explain the world, and test it, so that a rejection represents that something new was learned, highlighting a discrepancy between the model and reality.”

        – Isn’t this precisely how, say, physicists use null hypothesis significance testing? E.g. embed different models of gravity using a parameter, which when equal to zero (to some precision) gives the ‘standard’ theory/model you currently like (based on other observations/theoretical reasoning), while discrepancies indicate alternative effects which need to be taken into account. The null is now the theory you like, no? And notions like power/confidence intervals etc are there to deal with the whole ‘the effect is never exactly zero’ thing. (A toy sketch of this nested-parameter setup appears just after this exchange.)

        “In contrast, in null hypothesis significance testing as is commonly framed, the null hypothesis is not a model that the researcher likes. Rather, the researcher finds confirmation by rejecting a null hypothesis. “Random chance” is not the model that the researcher typically is interested in.”

        – See above. The null could be that there is no real effect *over and above* the model you like, no? Perhaps common practice is different in the social sciences than in other sciences, however this appears (appropriately?) primarily sociological?

        “And, no, I have never accused the null hypothesis significance testing approach of “rigidity.” It’s the opposite.”

        – I guess all frameworks are open to abuse. Again, notions like power, confidence intervals, standards for appropriate choice of hypotheses are needed.

        To clarify: do you think the null hypothesis significance testing approach is fundamentally flawed, even when practiced ‘properly’, or is it that you believe that it leads too easily to bad practice?

        • Re-reading your post, I agree with:

          “Finally, I think the whole Bayesian thing is a red herring here. As Uri Simonsohn has explained, you can p-hack just as well with Bayes as with any other approach. As I see it, the problem is not with classical rules or even with p-values but rather comes earlier, with a confusion between research hypotheses and statistical hypotheses and an attitude that an extreme p-value or posterior probability or whatever can be taken as strong evidence, without reference to how that number came about”

          Though, as discussed above, I see no problem with carefully translating research hypotheses into standard statistical hypotheses, and see no conflict with/reversal of a Popperian (in your sense) approach. Actually, rather, I see no alternative that is much better.
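As an aside on the nested-parameter setup described in the exchange above (embedding the preferred theory as the null via an extra parameter whose zero value recovers the standard model), here is a toy sketch; the straight-line model, the quadratic deviation term, and all the numbers are invented for illustration rather than taken from any real example:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # data really do follow the "standard" model

X0 = np.column_stack([np.ones(n), x])        # standard model: y = a + b*x
X1 = np.column_stack([np.ones(n), x, x**2])  # extended model: y = a + b*x + c*x^2, with c = 0 as the null

rss0 = np.sum((y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]) ** 2)
rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)

# F-test for the single extra parameter c; a small p-value would flag a
# discrepancy from the standard model, i.e. "something new was learned"
F = (rss0 - rss1) / (rss1 / (n - 3))
p = stats.f.sf(F, 1, n - 3)
print(f"F = {F:.2f}, p = {p:.2f}")   # typically unremarkable here, since c really is zero

In this setup, rejection counts against the model the researcher likes, which is the Popperian direction Andrew describes, rather than the reversed direction he criticizes.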

  7. Thanks to all of you for such thoughtful replies. You’ve given me some good leads for further reading.

    Just to give a little more insight about my field, here are some observations that might be relevant to what some of you said above.

    The most common area of research entails merging daily stock price and volume data (CRSP) with quarterly financial statement data (Compustat), and analyst projections of future earnings (I/B/E/S). We use this to understand how investors react to accounting information, look at games firms play to manipulate earnings to meet analyst estimates, assess the “quality” of earnings (how well do they reflect true performance?), and examine how earnings quality affects stock prices. Many studies look at how changes in accounting standards change these relationships, or look across countries and time periods, or bring in new data sets (management earnings forecasts, patent filings) to test additional theories.

    While we’ve learned quite a bit over the last 45 years (the first econometric study in accounting was in 1968), we are often studying very small effects. It is quite common to see multiple regressions with R^2 values of .001, but 3-digit t-statistics because of the huge sample sizes. (A toy simulation of this pattern appears just after this comment.)

    As a result, we often debate the economic significance of the results in ways that would make Andrew proud…except that the papers get published and heavily cited anyway. Maybe this makes sense, since so much money circulates in these markets that even a small effect can be important. But I also think it reflects the fact that we aren’t providing much context in our analyses. We are satisfied with very crude measures, and lump together very different situations, because the large sample sizes allow us to get high t-stats and low p-values (and publications!), even though the low R^2 values suggest we are missing something very important.

    At heart, this is the issue that led me to think more about the connection between interpretive/qualitative research and trying to get everything you can from a data set–even at the cost of sacrificing our idealized hypothesis-testing perspective. Which, as Andrew says above, is really more a description of how we present our research, rather than a description of what we actually do.
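A toy version of the tiny-R^2, huge-t pattern described in the comment above (the sample size, effect size, and single-regressor setup are invented for illustration): with about ten million observations and a true R^2 near .001, the slope’s t-statistic comes out around 100 even though the regression explains almost nothing.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000_000                        # large enough to need a few hundred MB of memory
x = rng.normal(size=n)
y = 0.0316 * x + rng.normal(size=n)   # true R^2 = 0.0316^2 / (0.0316^2 + 1), about .001

res = stats.linregress(x, y)
r2 = res.rvalue ** 2
t = res.slope / res.stderr            # t-statistic for the slope
print(f"R^2 = {r2:.4f}, t = {t:.0f}, p = {res.pvalue:.1g}")
# typical output: R^2 near 0.001, t near 100, p indistinguishable from zero --
# "highly significant" while explaining about a tenth of a percent of the variance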

