Statisticians take tours in other people’s data.
All methods of statistical inference rest on statistical models. Experiments typically have problems with compliance, measurement error, generalizability to the real world, and representativeness of the sample. Surveys typically have problems of undercoverage, nonresponse, and measurement error.
Real surveys are done to learn about the general population, but real surveys are not random samples. For another example, consider educational tests: what exactly are they measuring? Nobody knows. Medical research: even if it’s a randomized experiment, the participants in the study won’t be a random sample from the population for whom you’d recommend treatment. You don’t need random sampling to generalize the results of a medical experiment to the general population, but you do need some substantive theory to justify the assumption that effects in your nonrepresentative sample of people will be similar to effects in the population of interest.
Very rarely, the assumptions of a statistical model will be known to be correct. The only examples of this that I’ve ever seen up close have been samples of documents. For example, we had a spreadsheet with a list of a few thousand legal files and we took a random sample of 600. The sampled files were examined and then we used them to draw inferences about the full population. This doesn’t happen in surveys of people because we have nonavailability, nonresponse, and shifting sampling frames. But in rare cases we are sampling documents and the statistical theory is exactly correct.
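To make the document-sampling case concrete, here is a minimal sketch in Python of what "the statistical theory is exactly correct" looks like in practice. Everything specific in it is an assumption for illustration: the population size of 3000 (standing in for "a few thousand"), the yes/no attribute recorded for each file, the 20% rate used to simulate that attribute, and the function name `srs_proportion`. It draws a simple random sample of 600 without replacement and estimates a population proportion, using the usual finite population correction in the standard error.

```python
import math
import random


def srs_proportion(population, n, seed=0):
    """Estimate a proportion from a simple random sample without replacement.

    `population` is a list of 0/1 flags, one per document (hypothetical data).
    Returns the estimated proportion and its standard error, including the
    finite population correction (1 - n/N).
    """
    rng = random.Random(seed)
    N = len(population)
    sample = rng.sample(population, n)  # sampling without replacement
    p_hat = sum(sample) / n
    fpc = 1 - n / N  # finite population correction
    se = math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se


# Hypothetical population: 3000 files, each flagged 1 with probability 0.2.
# These numbers are made up for illustration, not taken from the real project.
sim_rng = random.Random(1)
N = 3000
population = [1 if sim_rng.random() < 0.2 else 0 for _ in range(N)]
true_p = sum(population) / N

p_hat, se = srs_proportion(population, 600)
print(f"estimate = {p_hat:.3f}, standard error = {se:.3f}, truth = {true_p:.3f}")
```

The point of the sketch is that the only assumption needed is the one we control: every file on the list had the same chance of being sampled. No model of nonresponse or of a shifting frame is required, which is exactly what fails to hold in surveys of people.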
Textbook statistical theory is like the physics in an introductory mechanics text that assumes zero friction, etc. Friction can be modeled, but that modeling turns out to be a bit “phenomenological,” that is, approximate.
Models are great and there’s no reason to be embarrassed about them. Assumptions are the levers that allow us to move the world.
Statisticians take tours in other people’s data. Assumptions about the underlying world + assumptions about the data collection process + the data themselves -> inferences about the world.