Advice on do-it-yourself stats education?

Dustin Palmer writes:

I am a recent graduate looking for a bit of advice. While I took intro classes on math and statistics in my undergraduate degree as a political science major, I find myself university-less and seeking to develop my statistics toolkit.

I work for an NGO in the international development field. I think that a solid statistics foundation would offer me not only more career opportunities, but more importantly, a deeper and more nuanced understanding of the processes and problems that interest me. I’m talking about field experiments and practical quantitative and qualitative data analysis.

I have plenty of free time, ambition, and enthusiasm to improve this part of my toolbox, but I lack an attachment to an institution and much in the way of financial resources. How would you go about making a concentrated effort at acquiring an understanding of the field and its actual application in something like R or Stata, which I admit to never having used?

Perhaps I am simply asking about web resources or best texts, but any broad advice would be much appreciated too.

My gut recommendation is to start with a problem you care about and figure out what you need to get a reasonable solution, then go to the next problem, and so forth. For books, you could start with The Statistical Sleuth and my book with Jennifer. If you want to learn R, just try to make some pretty and useful graphs; that will motivate you to do more.

Any other suggestions?

49 thoughts on “Advice on do-it-yourself stats education?”

    • As a graduate student at the institution, with friends who have taken this class, I have heard nothing but praise for the physical version of the class, which is a core 2nd semester grad course that has been taught for years. It is hard work by all accounts, but very practical and incorporates re-analysis of real past studies. Afaik, this is the first time it is being offered online, so I can’t speak to that aspect of it. Good luck.

      • Indeed. I had thought it was free and considered taking it as a refresher course of sorts but it costs $1,025 for no credit and $1,950 for graduate credit.

    • I doubt this is not good; the question is whether Dustin is ready for it or needs more training to get something useful out of it.

  1. Both Jeff Wooldridge’s introductory econometrics and his graduate textbook are written so well that they serve for self-study.

    • I’ll second the recommendation for Wooldridge’s Introductory Econometrics for an accessible presentation of how to do (mainly causal) analysis with observational (nonrandomized) data. Used copies should be cheap. Having said that, I’ve only read chapter 9 of the Gelman-Hill text and thought that was pretty good, too. Honest.

      No idea re: Field experiments and qualitative.

  2. The difficulty with just attacking a problem you care about and finding a reasonable solution is that without some knowledge about what is out there, you won’t have any knowledge about what is appropriate and what is inappropriate. Nor will you easily be able to interpret or verify your results. A text such as “R in Action” would be a good choice I think: http://amzn.to/AaItj7

  3. I went through something similar when I started my master’s program in survey methodology. My focus is on the social science side of things (as opposed to sampling and statistical methods), but even so I found my math and stats background was not even close to where it needed to be. Statistics in Plain English by Timothy Urdan was incredibly helpful in refreshing my memory from undergrad on all the basic concepts that I had forgotten. The first few sections in Gelman and Hill were great for learning linear and logistic regression. And then there are all of the online R resources like Quick-R or the UCLA R tutorials. This may be overkill for the reader’s purposes, but MIT’s OpenCourseWare has a great linear algebra class available on iTunes. I didn’t have time to get all the way through, but the first 5 or 6 classes filled in a lot of what I needed to know for the purposes of understanding matrix notation and the operations as written in a lot of statistics books.

  4. Why not take a class or two at a local community college? These classes are inexpensive and don’t require being a degree student.
    Or if those classes aren’t enough, see if a local university will let you take a class as a non-degree student.
    I’d also say check with your HR department to see if they have any sort of tuition reimbursement programs, or if they’re just willing to pay for the class–if you can make a case for why it will be a good investment for them, they may be willing to cover it.

    • Unfortunately I have to say “caveat emptor” to this suggestion. You can’t count on finding good courses at local colleges. I took a computer programming course at a local four-year state college a few years ago. In a one-semester course, required of future comp sci majors, we never got to looping structures. I don’t mean we didn’t cover detailed analysis of the computational efficiency of looping structures, or something like that. I mean we didn’t get the basic syntax of looping structures in the language we were using (C). It was the kind of class that makes you dumber than you were to start with, because of all the busy work and the effort you have to make to hold yourself back.

      Please believe I am not happy to say this. There are plenty of smart, well-prepared people teaching in all sorts of institutions. If you can find a good course in a community college, great. But in an educational system that is increasingly bifurcated into filet mignon for those who can pay and shit on a stick for those who can’t, make sure you’re not getting shit on a stick.

  5. I agree: find a problem you care about and for which you can get reasonable data, and attack it. Read every statistical blog you can. Watch YouTube videos and OpenCourseWare. Be prepared to work on your dataset for a year or two: as you keep examining it, or stumble onto new methods, or read something in a blog, you’ll realize that your analysis up to this point is lacking.

    Get all the free tools you can, particularly R. I also got gretl and X12-ARIMA (needed compiling on my Mac). I’m a package-a-holic and look regularly for new R packages that might be useful. Then I read their documentation and try to use them on my data. If you’re naive and think that because you have a package and it gives answer X, it must be correct, you’ll be burned of course. But if you instead use each package as a different view into the matter, you’ll learn.

    Seek out alternative presentations. I can’t say how many times I’ve hit a brick wall in one book, only to come back to it in 4 months and suddenly it makes sense. (Because I’d read about the topic from another perspective in the meantime.)

    Always keep in mind that a self-education will have blind spots: holes in your knowledge that you don’t know you have. That’s what I’m doing, anyhow.

  6. The online classes at statistics.com are good. I’ve taken three. Just be careful to pay attention to the labeling in terms of the course level and match it to your own. There are R courses there, also.

    In many cases, the instructor is using their own textbook. For example, before I started teaching forecasting I took the forecasting course as a refresher. This used Hanke and Wichern as the text (a very popular textbook in business grad schools). Wichern taught the course.

    The plus is that the other students ask good questions, you are time boxed into a four week format so you have to spend time on it, somebody marks the assignments so you can tell whether you are really learning stuff, etc.

  7. You should ask the professor of an econometrics course at your local university if you can sit in on their class. Better if it is a program that has a decent research economics department, of course. Just sit in, read the material and don’t get a grade.

    A 4000-level undergrad econometrics course would be a good start and will introduce you to most of the tools for program evaluation. You might also consider a development economics course if it is taught at a research university and is focused on microeconomics (you’ll get all the program eval stuff there too). Then you can try to learn R or Stata on the side using online tutorials and available books that others can recommend.

    With some knowledge of Stata and a solid education in undergrad econometrics or the stat dept. equivalent, you can tackle many of the problems you might encounter. And you’ll start to get a better sense of what else you want to know, so you can choose higher-level courses or texts to read at that point.

  8. For learning R as I go, I’ve found the Quick-R web site (www.statmethods.net) and Norman Matloff’s book _The Art of R Programming_ both very helpful.

  9. You have mentioned an interest in qualitative research as well as quantitative. It is normal for people to specialise in one or the other; both are vast fields, and it would take decades to become equivalently qualified/experienced in both areas. Statistical methods will not help you with qualitative research; even sampling can be very different. From a qualitative perspective, one of the worst things you can do is treat the data as if it were quantitative, e.g. counting the number of times a research participant mentions X. This is an easy trap to fall into, particularly at the start of learning qualitative research.

  10. I went the traditional route with statistics, but I learned to program from UC Berkeley’s online webcasts, and lots of practice. It looks like they have a number of statistics classes available at http://webcast.berkeley.edu/series.html#c,d,Statistics .

    A couple things to note:
    0. Do the classes in order (otherwise you will get lost)
    1. Buy the books (So you can follow along)
    2. Do the homework (there is a huge temptation to just watch, but the homework is where you learn)
    3. Make a schedule (I did one video every other work day during lunch)

    Self-motivated learning requires a lot of discipline, and it is easy to get distracted/lost and just say forget about it. On the plus side… with the information that is easily available now, you can become an expert all by yourself. I agree with Andrew that this should be supplemented with problems that you care about and need to solve.

    Good luck!!!

    • I would actually say the opposite. I started with R before I learned any other language. By the time I started learning programming, the R experience helped a lot. But it could also be that I used computers prior to when they had a GUI for the operating system, so R wasn’t that different from what I grew up with.

  11. I do not know where this individual lives, but maybe he should look into “Data without Borders” and volunteering there. Prof. Gelman, would you also recommend checking out journals like Annals of Applied Statistics, the UK organization’s “A” journal focused on statistics and society, and methodology-oriented journals? Or would it require more exploration and education to actually understand something like that?

    Also, since Prof. Gelman recommended working on problems of interest, if this individual is interested in sports, perhaps working on a question there could be a fruitful opportunity.

    • I’ll third that. The “real” Wooldridge is indeed quite theoretical, but the “baby” Wooldridge is wonderfully applied, full of examples and highly accessible (to a fault, I’d argue).
      I still always start there and then move up to the harder texts.

  12. I’ve managed to pick up a lot of skills by using scholars’ replication files. It provides practical skills, teaches you new programming languages (e.g., I learned some JAGS from Simon Jackman’s house effects replication set) and gives an application-based approach to abstract statistical techniques.

    It’s no substitute for a theoretically-driven course in statistics, but it’s doable for personal instruction.

  13. The issue Zachary Jones raises about not being aware of what you don’t know is an important one.

    One possibility is to look at papers in your field that use available data and try to replicate their results and understand why they choose the analyses they do. This should help to acquaint you with what a ‘reasonable solution’ looks like in an applied setting.

  14. I’ve been in a similar position. Currently completing a PhD in political science, and using a lot of statistical analysis. My university is weak in the social science statistics field, so I’ve been basically teaching myself. I’ve found a good way to work is to look at the academic literature in the field you’re interested in, finding models that are similar to what you need to use, then working out how they were done. Basically reverse engineering for statistics.

    If you don’t have access to a university’s online journal database, then simply look for useful articles on Google Scholar, then check the library websites of your local universities to see if they carry a physical copy of the journal. If they do you should be able to visit them and look at the article onsite.

    Texts such as Gelman and Hill can also be useful to fill in some of the blanks, as can a range of online sources, such as those outlined above and the guides for statistical software.

  15. I find myself in a pretty similar position. I took a few courses in math, stats and econometrics at uni, though I was probably too lazy to make best use of them. Also, them being taught without any clear application didn’t help.

    More recently, though, I’ve been making huge strides. I bought Greene’s bible, and started at the appendices. Now I’m about a third through the main book, and suddenly feel confident reading econometric papers. The thing that has REALLY worked for me is signing up to a website which emails me every day telling me that I have to read more pages. The consistency in reading has meant I don’t have to keep on re-learning the same slightly difficult building blocks.

  16. To learn R in the context of modern data analysis, nothing beats Gelman & Hill. Another important skill is learning how to plot data, and R is also great for that. I would recommend checking out the ggplot2 library in R. Also, one needs reasonable programming skills to be a proficient modern data analyst. Stay away from econometrics books: they are evil!

    • Why step away from econometrics books? Or why are they evil? My impression was that they seem to be fairly standard stats with economic data used for the examples?

      • At the very basic level it doesn’t matter much, but these textbooks point in the wrong direction (unless you have to publish in economics journals); so why not start on the right foot? Problems include: old-fashioned style; obsession with asymptotics; focus on the “unbiased” estimation of a single beta (thus assuming you know which beta we care about a priori); make too many assumptions, such as the Gauss-Markov assumptions, so that if they go, say, from a cross-sectional level to a longitudinal level they have to extend these assumptions, making everything more complicated, but we may not need them in the first place! assume that the goal of stats analysis is *always* causal inference and thus don’t teach about other useful stuff that can be done with regression, such as prediction or description; btw, implicitly or explicitly assume description is for losers; too theoretical: give the impression that stats is an area of applied math and thus that we should focus on the math instead of data analysis or even computation; focus on baby math algebraic calculation that, though mathematically trivial, distracts people from the ultimate goal of conceptual understanding; the emphasis on math solutions over simulations also has the downside that, despite the math effort, one ends up with trivially simple models; don’t help to develop stats programming skills;
        long tables of regression coefs as if we were still using punch cards (how do I spell that?) in SAS in the 1970s; not enough (or even any) emphasis on modern stats graphs for exploration, model results’ presentation and model diagnostics; don’t help to learn stats graphics skills; I can easily keep going … To sum up, these books still live in an era before the emergence of modern, powerful and cheap computers. Stay away from them unless you are an economist for a living. But of course, I can be wrong.

        • Agree with Antonio wholeheartedly! Econometrics is a very peculiar branch of applied statistics, where the complexity and elegance of the mathematical models take precedence over real-world considerations, such as whether the data actually obey the underlying assumptions.

          Most macroeconometric modeling is also too intellectually weak to be of much practical value. Partly this is because the observed behavior of markets is greatly altered by the expectations of the individual economic actors, in the fashion of rational expectations. Because of the feedback loops and anticipations, it becomes close to impossible to identify the true underlying motives of the actors. For example the stock market becomes a random walk, and non-forecastable.

          However, there is a light at the end of the tunnel, because in recent years there has been a move towards microeconometric modeling, which is more like conventional statistics in the rest of the social sciences.

          Another difference between econometrics and the rest of statistics is that they often hand out Nobel Prizes for this work, only to have the results ignominiously disconfirmed several years later. Their flawed understanding of risk and the statistical behavior of markets has several times come close to bringing down civilization itself.

          It’s an open question whether economics itself can ever become a true science.

          But I’m using some of this econometric stuff myself, e.g. the work of Heckman and Yatchew, so it can’t be completely dismissed out of hand.

  17. Perhaps below the level Dustin is looking for, but I’ve found the introductory statistics course offered through Carnegie Mellon’s Online Learning Initiative to be pretty good:
    http://oli.web.cmu.edu/openlearning/forstudents/freecourses/statistics
    No longer supported, but they also offered a course called “Causal and Statistical Reasoning” that might be helpful. The materials are still available:
    http://oli.web.cmu.edu/openlearning/forstudents/freecourses/59

  18. I’m also in a similar position. I’m a polsci grad student and I’m mostly self-learning statistics and R.
    I agree with Gelman and others that it is very important to work with data you care for. Otherwise it is very easy to get demotivated and just quit.
    A good starting point is finding a paper with some replication code and data. Try to run it and try to understand it.
    For me the key is not to rely on one textbook or class, but to read from different sources. Most of the time, something that I didn’t quite understand in one textbook is made clear in another one, or in a YouTube video, etc.
    I may not be very structured or organized while learning, but I certainly learned a lot since I started. In other words, whatever works for you.
    Regards.

  19. There is (apparently) an old Irish saying “That if I wanted to go there, I would not be wanting to start out here!”

    Then there is Geoff Hinton’s comment “You either have to make friends with a statistician or learn statistics [the punch line being he was not sure which was most difficult]”

    A good number of ideas have been posted, and so far I have enjoyed reading Gary King’s course intro slides, which are free online [as an aside, if you have the $1,000, it’s probably a steal].

    But it’s likely going to be a life long iterative struggle – it was for me (and still ongoing).

    Perhaps some short accounts of what happened for me.

    Did an intense research assignment in Psyc 101 to design an experiment to investigate the Gambler’s Fallacy years before my first stats course (got an A+ and was encouraged to consider experimental design as a career goal, though I had learned something about chance outcomes).

    After MBA school and avoiding their attempts to get me in their PhD program, I applied to a Biostats major. In the interview I was unable to answer the question about the distribution of p-values given all assumptions are fine and the treatment is strictly null. I now use that as my question to test whether anyone understands even a minimum about statistics.

    Learned little in that program other than a half course in discrete data analysis via GLMs but managed to pass another course based on rigorous treatment of the first few chapters of Lehmann’s new estimation text – so I was passed. I also received the lowest mark ever given in their lab course (that was an encouraging sign).

    Spent the next 10 years sitting in on university statistics and math courses (worked well for me, I think) and then I finally did a PhD at an expensive university (that unfortunately would have been much better for me had I done that 10 years earlier.)

    I spend a lot of time now trying to discern why stats is hard for others to learn…
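
The interview question in the comment above (how p-values are distributed when all assumptions hold and the treatment effect is exactly zero) has a standard answer: for a continuous test statistic they are uniform on [0, 1]. Here is a minimal simulation sketch of that fact, assuming a two-sample z-test with known unit variance. The function names, sample sizes, and seed are illustrative choices, not anything taken from the thread.

```python
import random
from statistics import NormalDist, fmean


def simulate_null_p_values(n_sims=5000, n_per_group=30, seed=1):
    """Simulate two-sided p-values when both groups share the same distribution."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference in means with known variance 1 in each group.
    se = (2.0 / n_per_group) ** 0.5
    p_values = []
    for _ in range(n_sims):
        # "Treated" and "control" are drawn identically, so the null is exactly true.
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values


def decile_shares(p_values):
    """Return the share of p-values falling in each tenth of [0, 1]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return [c / len(p_values) for c in counts]


p_values = simulate_null_p_values()
shares = decile_shares(p_values)
share_below_05 = sum(p < 0.05 for p in p_values) / len(p_values)

print("share of p-values in each decile:", [round(s, 3) for s in shares])
print("share below 0.05:", round(share_below_05, 3))
```

If the test is valid, each decile should hold roughly 10% of the simulated p-values and roughly 5% should fall below 0.05, which is exactly what a 5% false-positive rate means.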

  20. The comment thread is so long, I don’t know if anyone mentioned these free courses:

    OpenIntro Statistics was built by teachers, for teachers. We support open source and free products, including a free textbook OpenIntro Statistics and free online course management software. When we say “free”, that means free for you and free for your students.
    Browse our ever-growing collection of products, register, and get involved. We look forward to working with you.
    http://www.openintro.org/stat/
    (Created by folks from the Harvard Biostat department)

    Statistics: Introduces the basic concepts, logic, and issues involved in statistical reasoning. Topics include Exploratory Data Analysis, Producing Data and Study Design, Probability and Statistical Inference.
    http://oli.web.cmu.edu/openlearning/forstudents/freecourses/statistics
    (Created by the Open Learning Initiative at Carnegie Mellon University: http://oli.web.cmu.edu/openlearning/initiative)

  21. A short addition to a long comments section:
    – You do not need to pay to watch videos of Gary King’s Gov2001 lectures or read the lecture notes.
    – Stata is more common in development, but not free. If you decide to take the Stata route, Cameron and Trivedi’s book is a great book.
    – If you want to know about “field experiments and practical quantitative and qualitative data analysis” then start by reading some. Then you’ll identify what you don’t know but want to, and then you can fill in the knowledge you need. MHE is probably very useful for your specific needs.

  22. This is a very interesting thread, not just for the recommendations but also for what it reveals about the spectrum of attitudes (spectra, probably) among those who use statistics.

    The original poster did not mention economics, or econometrics, just an interest in international development and a background in political science. But that is enough for many to assert that what is wanted, or needed, here is econometrics — and then to provoke equally strong assertions that econometrics does not encapsulate the best of current statistical practice. I wonder whether such divergent responses would be evoked by a question on, say, calculus or geometry.

    Some additional thoughts focusing on books:

    A.C. Davison, Statistical models, Cambridge U.P. covers an extraordinarily rich range of material in a serious but also friendly manner. It’s wider than might be inferred from the title.

    W.S. Cleveland, The elements of graphing data and Visualizing data, both Hobart Press, remain well thought out surveys of most of the really good ideas in statistical graphics. If something more basic is sought Naomi Robbins’ Creating More Effective Graphs, Wiley is full of sensible practical tips and is much more valuable than several more recent but over-hyped books on graphics.

    Few books prepare people for the reality that much of one’s time is taken in data management, including data cleaning, and exploratory or initial data analysis. Thinking that analysis must culminate in a model and that you can move from here to there in one step is likely to lead to all kinds of frustration. The better books that I know of are mostly geared to particular software, such as several good books linked to Stata (start at http://www.stata.com/bookstore/). The recent book

    R.K. Pearson, Exploring data in engineering, the sciences and medicine, Oxford U.P.

    is a bit quirky, but that also makes it more interesting. It is likely to strike most experienced data analysts as wrong-headed in some respects, but that would be true of any other book, and I think it is well worth a look.

  23. See if your employer has a statistics tool they prefer and learn that one. R is nice and free, but if everyone in the office uses SAS you would be better off getting to know that tool so you can talk to others.

    If you like school, Penn and Texas A&M both have very reputable applied statistics Masters degree programs online.

    I second the suggestion of reading lots of statistical blogs – you will find that there is a lot of information and good perspective in those that you might not get otherwise.

    As Andrew suggests, find a problem you are interested in and dig into that problem. That will help you learn.

    MIT OpenCourseWare has a whole bunch of statistics courses, at a variety of different levels. Some are in comp sci, some in math, a data mining course under the Sloan School of Management, etc., so you will need to search.

    Along the same lines, hit up iTunes U and look at what is available there. Specifically, Berkeley has a lot of courses available at no cost. Just make sure you get the books and do the assignments.

    In my experience, books in isolation aren’t as helpful. I have a shelf full of them, but the context from a course helps you put that knowledge to use.

    Finally, hit Google and search for statistics syllabus – you will find TONS of them, with exercises, at various levels. That may help put structure around some of the techniques you are trying to learn.
