3 quick tricks to get into the data science/analytics field

John McCool writes:

Do you have advice getting into the data science/analytics field? I just graduated with a B.S. in environmental science and a statistics minor and am currently interning at a university. I enjoy working with datasets from sports to transportation and doing historical analysis and predictive modeling.

My quick advice is to avoid interning at a university as I think you’ll learn more by working in the so-called real world. If you work at a company and are doing analytics, try to do your best work, don’t just be satisfied with getting the job done, and if you’re lucky you’ll interact with enough people that you’ll find the ones who you like, who you can work with. You can also go to local tech meetups to stay exposed to new ideas.

But maybe my advice is terrible, I have no idea, so others should feel free to share your advice and experience in the comments.

34 thoughts on “3 quick tricks to get into the data science/analytics field”

  1. Andrew’s advice is sage: he needs to get into industry as soon as possible to get real-world experience. I’d add the following:
    1. Figure out what he enjoys doing – is it coding or is it problem solving? Those are two different jobs: one is software engineering, the other is more statistics and analysis. If he is in NYC, come to one of my public lectures at NYPL in which I explain how to pick a career path within this wide and exciting field.
    2. Once he has picked an area, and hopefully also an industry, then he needs to reach out and talk to as many people in industry as possible. Go to networking events and meetups.
    3. Then apply to jobs. The job search is a job in itself; keep applying until someone gives you a chance. You will encounter lots of rejection but keep trying.
    4. If nothing is working, consider going to a bootcamp. They are set up to give you practical skills that appeal to hiring managers. Talk to the bootcamp organizers to get a sense of what their vision is, and see if it’d help you make your case.

  2. Good advice. I would add that in the “so-called real world” you want to solve a practical problem, with the initial challenge being that you are often working with nontechnical people who do not state their problem in statistical terms. You hear “how much should we charge for this product/service?” or “how come it takes so long to get a version of this report?” or “I wish I knew which of these cases is likely to be most costly” (in High School, we called these word problems). Translating the question into a technical model and then communicating the practical results clearly is the key.

    PS. We’re hiring

    • Yes, and unfortunately, academic classes, especially STEM classes, fail to train people in critical thinking. Just the other day, I assigned a problem to students for which I deliberately did not provide all the data needed to plug into a formula. The challenge is for the students to (a) recognize that some data are missing, (b) realize that they must make assumptions to move the analysis forward, (c) use something they learned in class to make well-reasoned assumptions, and (d) look at the outcome and revise the assumptions as necessary. A good student wrote back, very flustered, because she had read through my notes many times and just couldn’t find a way through. After emailing with her, I learned she was stuck because she had noticed the missing data and couldn’t apply the formula. Then she thought she could use another formula to fill in the missing data, when all she needed to do was make a reasonable assumption based on the information given.

      If you are a student reading this, remember in the “real world”, you don’t have all the numbers you need, and there are no answer keys at the end of the book.

      • Reminds me of a conversation with a relatively fresh statistical consultant/collaborator (about a year post Phd) – “the clinicians did not provide all the information I need to do a power calculation they want. So I don’t need to do anything further for them – right?”.

        No, you need to help them find something they/you can do.

  3. Given the advice here so far – the career prospects for the “statistically able” have changed drastically.

    For instance, a primary motivation for leaving my family in 2001 for two years to get a Phd at Oxford was a decision not to hire me in a role I had excelled at for 10+ years at another university. This was by a university group that wanted to keep the position open in case someone with a Phd in Statistics did apply.

    However, I would worry about these three challenges of statistical inference that Andrew recently blogged on “(1) generalizing from sample to population, (2) generalizing from treatment group to control group, and (3) generalizing from measured data to underlying constructs of interest.”

    So my advice would be to try to be near a good university so that you can attend courses and talks to get more theory as you are practicing. Now, in some locations, the local tech meetups might be adequate for theory.

    • Don’t be so gloomy. At the moment I’ve got three clients, each of whom is in the business of selling things to enormous refineries. Though they’re little Davids to the refineries’ Goliaths, they each began to collect data on what their customers were buying, how much, how often and when. With some very (very) primitive modeling they’ve gotten to the point that they know more about preventive maintenance, looming failures and upsets than their customers. In some cases the customers have tried to run off with their employees (and algorithms), in other cases they’re trying to catch up (copycats), and in the best cases the little Davids are extending their leads. They are all desperate for people who can make sense of their data lakes – some of them collecting a petabyte of data every 12 hours off what once were dumb pumps and valves but which are now a part of the internet of things – signaling second by second, flow, volume, level, temperature and pressure. Something similar is unfolding in healthcare and I suspect a myriad other avenues of inquiry. If I could wind the clock back 2 decades + I’d do what you do and know what you know.

      • Thanatos:

        Did not mean to sound gloomy, the world changes but I do think it was common that without a Phd a career in statistics was limited to mostly working “under supervision” until 20??. I do think it was mostly the creation of data lakes as well as online clicks that has changed it.

        Perhaps it now is mostly, as Bob put it, more engineering than science but the three challenges of statistical inference remain behind it all. So I am speculating that a better grasp of dealing with those challenges will be helpful.

        That’s the role for the theory of applying statistics. I don’t see that as something one currently will get outside a university. But maybe that is a role local tech meetups might be able to take on if not even do better.

  4. Good advice so far. I’d also emphasize that you know “the basics” (stolen and summarized from a post on LinkedIn). Be able to do these things: run a basic SQL query to get data from a database, ingest/clean data in R/Python, fit a basic model in R/Python, generate basic visualizations in R/Python, explain your results in layman’s terms to your business partners. If you cannot do some of these things, learn how. If you cannot do most of these things, consider taking a bootcamp to boost your skills. Or, do a Kaggle or Kaggle-like competition. You’d be surprised how much you learn in one of those. If you can do these things, great! Make sure to be able to discuss (in depth) instances from school or your current work where you’ve put them to use.
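
For readers who want to see how small those basics can be, here is a minimal sketch in Python, using only the standard library. Everything in it is an assumption made for illustration: the `sales` table, its `ad_spend` and `revenue` columns, and the made-up numbers are hypothetical, dropping incomplete rows is just one simple cleaning choice, and the hand-written least-squares fit stands in for whatever modeling tool you would actually use. Plotting is left out to keep the example dependency-free.

```python
import sqlite3
from statistics import mean

# Hypothetical in-memory table standing in for a company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, None), (4.0, 9.0), (None, 4.0)],
)

# 1. Run a basic SQL query to get the data out.
rows = conn.execute("SELECT ad_spend, revenue FROM sales").fetchall()
conn.close()

# 2. Clean: drop rows with a missing value (one simple choice among many).
clean = [(x, y) for x, y in rows if x is not None and y is not None]

# 3. Fit a basic model: least squares for revenue = intercept + slope * ad_spend.
xs = [x for x, _ in clean]
ys = [y for _, y in clean]
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in clean) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# 4. Explain the result in plain terms for a nontechnical audience.
summary = (
    f"In this data, each extra unit of ad spend goes with about "
    f"{slope:.1f} more units of revenue."
)
print(summary)
```

The point of the sketch is the shape of the workflow (query, clean, fit, explain), not the particular tools; in practice you would likely reach for R or a Python data library instead of hand-rolled arithmetic.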

    When you interview, don’t forget your interview basics either: research the company, show interest in the company, ask good questions of your interviewer when the time comes. I’m interviewing people right now for our data science intern program. Of eight interviews so far, not a single one has shown specific interest in the company. Few of them have had good questions for me at the end. All of them have been technically qualified, so it’s the soft skills that make candidates stand out.

    When you’re looking for jobs, make sure to think outside of tech firms and start-ups, too. I work for a Fortune 100 company in a non-tech field, but we employ a couple of hundred data scientists – enough to have an internal data science conference every year. Some of the other benefits: I work a 40-hour week, I work on more fun projects than I would at a “big 5” tech company, we have TONS of internal continuing education opportunities, my company even pays for continuing education (MS degrees, for example).

    Good luck!!

      • Does anyone else find humor in “Annnnnon” asking me for personal information or is it just me? :)

        All kidding aside (please…don’t take it seriously…I really am JUST kidding), the point is not where I as an individual work, but that there are options outside tech companies, Silicon Valley, and startups. A lot of “regular” companies are hiring data scientists, and some of them are doing some pretty cool things! Also, if you have the skills to do cool stuff, get hired on somewhere and find a cool side project at work. Do up a proof of concept, figure out the cost-benefit, run it by your boss, and then you’re doing cool stuff too! It’s how I got into an image recognition project even though my bread and butter is still statistical/ML modeling. Data scientists are fairly lucky in that we can often nudge our jobs in the direction of the job we want. Do this.

  5. Having spent roughly 15 years in academia and 15 years in industry, my one takeaway is that there’s not much difference if you’re actually trying to solve applied problems or develop software.

    Most academic research settings in statistics or computer science aren’t actually trying to build sustainable solutions to applied problems. They’re trying to generate novel ideas that might have broad impact (NSF’s obsession with swinging for the fences [warning: first of a long series of mixed metaphors]). The payoff is in generating the ideas and getting them published and cited, not putting them into production in a workable way.

    Most academic classroom settings are artificial in the way Kaiser describes above. Andrew tries to get away from that by intentionally throwing confusing, open-ended, and incomplete exercises at students, but as Kaiser also points out, this is very very challenging in a controlled classroom setting where the presumption is that there is an easy, right answer. Just like players in Dungeons & Dragons who realize no sane dungeonmaster is going to throw an adult dragon at third level characters, students play the meta game with an assignment and assume the teacher’s not being so sloppy and conclude they must have missed something.

    If, like me, you’re in academia working on a project like Stan, you’re in the unusual position of trying to develop a serious product from within the ivory tower. There’s a bit of an impedance mismatch [if I may continue to freely mix my metaphors]. If, like Andrew, you’re in academia trying to work on many real-world modeling problems, then you have the same problem: they don’t translate neatly into least-publishable-unit papers. It’s a consequence of what Andrew likes to call “god being in every leaf of every tree”, meaning that when you set out to work on a project, it winds up touching on the entire field. That metaphor sounds better than it plays out; just ask Sean how bushy the work on SBC has become! [I’ll end with that (this?) final salvo. Sorry.]

  6. Having a public portfolio of work that you can point to is a great way to start.

    1) Pick a topic you are honestly interested/passionate about, something you are willing to work on during your free time.

    2) Create an R package (or whatever) devoted to that topic.

    3) Host the package on github, make a cool page for it using blogdown, and so on.

    4) Make everything as “professional” as you possibly can: good comments, unit tests, pass R CMD check and so on.

    Now, you have something to show off which highlights what you can do. This is the single thing most likely to impress someone in industry, especially coming from a new graduate.

    • Public portfolios are tricky for the following reasons:
      1) It takes a lot of time to put up a good one, and time spent working on projects competes with time spent applying for jobs, preparing for interviews, etc. You won’t get a job if you are not out there pounding the pavement
      2) It is hard to come up with a good project – that is not repetitive, has an interesting message, shows some creativity, etc. There are lots of me-too projects
      3) It is hard to find good datasets that are amenable to solving interesting problems. There are lots of bad datasets that lead to bad projects.
      4) Every hiring manager has different needs; it’s hard to have one vanity project that appeals to all. For example, many hiring managers want you to demonstrate industry knowledge (or at least, enthusiasm), so they want to see a project using data from their industry. (I don’t agree that industry knowledge is compulsory – I am just describing how many hiring managers think.)

      Would love to hear any suggestions on working around those issues!

      • Kaiser: Everything you say is true.

        Still: I argue that, if you have 100 hours to devote to a search for a job in data science as a newly minted BA, at least 50 of those hours (if not more) should be spent on creating a portfolio of your work to show people, even if it is mostly cleaned up versions of your class problem sets.

        • Putting this together with what Jim H said above, the best way to show interest in a project is by submitting a pull request. That’s how we hired Sean. It shows that you’re interested in the company and that you can make a positive contribution.

          One of the advantages of working on open source either as a job or on the side is that you instantly have a portfolio to show the next company.

          I took my own advice on this front. Here’s my portfolio project circa 2002* after the dot-com crash when I wanted a job. I’d been working as a professor and then at Bell Labs, so I had written lots of papers, a couple technical books, and even some open source software. But more recently, I’d been a professional programmer for two years, so I wanted something that looked a little closer to jobs I wanted, which involved more engineering than science. At the time, my best programming was on proprietary software, so I had nothing to show for it.


          * Turns out Radford Neal is everywhere. My arithmetic coder was based on an algorithm he wrote in C. This was before I’d entered the world of computational stats, so I didn’t know who Radford was!

  7. There are lots of opportunities to do data science in academia; they’re just not in computer science or statistics AFAICT. Public health, biology, ecology, conservation, basically all the even-remotely biological sciences are doing hard-core data science. The main issue is that there’s a lack of technical knowledge w.r.t. data science in those fields, so you need a second mentor and they/you need to have clout on scoping the project.

  8. I agree with Professor Gelman. There is only so much that you are likely to get exposed to in a university setting, and a lot of the really important things are not very well communicated to students. For what it’s worth, I wrote up an article about getting into data science here: https://www.linkedin.com/pulse/learn-data-science-doing-douglas-puett/. It was written specifically to a person who wouldn’t have much academic background in statistics, but probably would apply pretty generally as well. Feel free to reach out to me if you have any more questions, I love connecting to people looking to get started or continue in their data science careers, and my company (usertesting.com) is hiring for Data Science/Quantitative Researcher roles.

  9. I can second Kaiser’s suggestion above, to look for industry meetups. I’ve been to a couple of the big yearly ASA (American Statistical Association) conferences, and found them very stimulating. Those offer great opportunities to meet people at all strata within data science.

    A colleague just mentioned attending a big R conference in San Diego a couple of weekends ago.

    I would imagine these kinds of events could be helpful for you.

  10. I work as a data scientist and the job market is pretty competitive, at least at large tech and financial companies in New York (where I live). Unfortunately for you, whether you are actually qualified or not, you probably will struggle to get an interview without a CS, math, statistics, or physics degree from a good school. So my first recommendation would be to go to graduate school. If that’s not an option, try to do well in some Kaggle competitions; that would impress me.

  11. If you want to get into the field, there are few barriers to entry outside of your control. There are great classes you can take for free to brush up on almost any skill or subject matter you might need. The software is free. There is a plenitude of relevant volunteer opportunities, part-time jobs, and full-time jobs. There is really nothing keeping you out. However, Data Science isn’t a single monolithic thing. You need to figure out where your interests and talents are, and start putting in the work to get to the point where you have something to offer.

  12. One thing to keep in mind, especially as a junior person, is that despite learning all kinds of fancy machine learning and analytics techniques, at least 75% of your time, and perhaps much more, will be spent doing mundane data wrangling and cleaning chores. I actually enjoy that work and would rather do that than be in management, but don’t get the mistaken impression that you’re going to spend 40 hours a week applying the latest and greatest methods. In the private sector, simple and boring methods are often preferred because they’re “good enough” and easy to explain.

  13. Here’s some academic data science (from the little slice I recall right now):

    http://reichlab.io/flusight/
    – dengue prediction, the web app is sadly confidential: http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0004761
    http://db.ecosheds.org/
    http://conte-ecology.github.io/shedsGisData/
    http://ice.ecosheds.org/
    http://fpet.track20.org/fpet/
    https://nextflu.org/h3n2/ha/3y/
    https://www.movebank.org/
    – pacific fisheries research in general… not going to look up details…

  14. I’d offer that rather than getting a data science job, any job can become a data science job. It wasn’t that long ago that I was speaking to a group of interns at the company I work for and someone asked me how I got into the job I was in now (statistics, analytics, etc.). The answer was, I created it (albeit accidentally). I was “just” a programmer many years ago when I started creating simulations and models (because I thought it was interesting) to show how certain proposed solutions to problems might play out and it caught some vice president’s attention and they created a role and handed me my first full-time data job as a result.

    Instead of thinking necessarily that the job must be pre-specified to be “data science” think about the types of work that interest you and how you might use data in that role to differentiate yourself from your peers (and thereby creating the win-win for you and the company you work for). You can be a big fish in a little pond if you open up new places to take advantage of data.
