Multilevel modeling in Stan improves goodness of fit — literally.

John McDonnell sends along this post he wrote with Patrick Foley on how they used item-response models in Stan to get better clothing fit for their customers:

There’s so much about traditional retail that has been difficult to replicate online. In some senses, perfect fit may be the final frontier for eCommerce. Since at Stitch Fix we’ve burned our boats and committed to deliver 100% of our merchandise with recommendations, solving this problem isn’t optional for us. Fortunately, thanks to all the data our clients have shared with us, we’re able to stand on the shoulders of giants and apply a 50-year-old recommendation algorithm to the problem of clothes sizing. Modern implementations of random effects models enable us to scale our system to millions of clients. Ultimately our goal is to go beyond a uni-dimensional concept of size to a multi-dimensional understanding of fit, recognizing that each body is unique.

Cool! Their post even includes Stan code.

Comments

  1. Shilling a categorical model seems uncharacteristic. I thought you would recommend they model proportions directly…

    How mature is Stan anyway? Still not seeing the value relative to JAGS, other than being en vogue.

    • Contraire:

      If you can already solve your problems using Jags, or Stata, or Excel, or whatever, go for it. The point of any tool is to be able to solve problems that you can’t otherwise solve, or to solve them more easily. If you’re interested in using Stan to solve problems, I recommend you take a look at our manual, our example models, and our case studies, all of which you can find here: http://mc-stan.org/users/documentation/

    • The ratings of an item by a person are discrete: too small, just right, too big.

      Gelman and Hill cover IRT/ideal-point models and ordinal logistic models in their regression book. There are very detailed case studies of IRT models on the Stan web site in both the regular case studies section and in the Asilomar StanCon (2018).

      Each person’s size and each item’s size are modeled continuously; those are their variables alpha and beta. The cutpoints between the response categories are also continuous. (A minimal Stan sketch of this kind of model is below.)
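
      For concreteness, here is a minimal Stan sketch of that kind of ordinal item-response model. It is purely illustrative (my own made-up names and priors, not the Stitch Fix code):

        data {
          int<lower=1> N;                // number of fit ratings
          int<lower=1> J;                // number of clients
          int<lower=1> K;                // number of items
          int<lower=1, upper=J> jj[N];   // client giving rating n
          int<lower=1, upper=K> kk[N];   // item rated in rating n
          int<lower=1, upper=3> y[N];    // 1 = too small, 2 = just right, 3 = too big
        }
        parameters {
          vector[J] alpha;               // latent client size
          vector[K] beta;                // latent item size
          ordered[2] c;                  // cutpoints between the three responses
        }
        model {
          alpha ~ normal(0, 1);
          beta ~ normal(0, 1);
          for (n in 1:N)
            y[n] ~ ordered_logistic(beta[kk[n]] - alpha[jj[n]], c);
        }

      An item that runs much smaller than the client pushes beta[kk[n]] - alpha[jj[n]] down, making “too small” the most likely response; an item that runs much larger pushes it up toward “too big.”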

  2. They could also use this information to tailor marketing materials. Though I imagine they could do the same thing with just information on sizes purchased…

    • Daniel:

      All that really matters here is someone with funds thinking they can make money using the algorithm and finding out they are right. I don’t think that was happening in 2005.
      (By the way, in 1982 my MBA term project was to use Herbert Simon’s AI ideas to discern representations of fashion offerings that matched how women shopped at the time.)

      • I’m really down on the tech industry as a source of real innovation. The problem these days is that it’s very possible to make money without creating VALUE for consumers. When the govt is printing money like there’s no tomorrow, a lot of worthless fake “work” will happen… and it has been happening, and still is.

  3. Super cool stuff.

    It seems to me that there’s a wrinkle they haven’t addressed in this post – people’s sizes change over time. So it’s not just a problem about learning more over time about the “true” value – the true value is a moving target.

    I was also surprised by the apparent symmetry in the latent size vs purchase probability plot – I would have expected a steeper dropoff on the “too small” side, but I don’t see it. I wonder whether the shape of that plot varies by garment type (it might be easier to get away with an ill-fitting overcoat than to do the same with too-small pants).

    • I often reject clothing for being simultaneously too small (in the hips) and too large (in the waist). I’m not sure which box I’d check in that case.

      • A major part of the stuff I was working on at that Emeryville startup was trying to help people like you ;-)

        The hardest part about all this wasn’t figuring out how to fit you, but figuring out how to communicate to the buyer which style to try… You wind up needing something like 8 different “styles” to fit a wide enough range of body types, and then you need to be able to “tell” the buyer which style they need… not an easy task.

    • Mike:

      We’ve fit item response models in Stan to datasets with many thousands of items. I haven’t tried out this particular model, but I have no reason to think that the approach described in the above post can’t be scaled to real problems.

      • Andrew-

        Ok, but how? RAM is the limiting factor here. Fitting a model with ‘thousands’ of items across gigs and gigs of data is no small feat on any machine using any software. It would help participants if you were to either elaborate just a tiny bit more on the method employed or, perhaps, point to literature where this is discussed.

        • You can get terabytes of ram for not a lot of money these days, and one presumes they can go distributed when they get to a situation of needing to.

        • It certainly will be a lot if you have to have a motherboard custom built, so the fact that there are off-the-shelf components is, I think, meaningful. The mobos are only a few hundred bucks; it’s the CPU and RAM that cost the most. Still, I suspect you can build a terabyte server for less than $5k.

        • I don’t think that’s possible (at least if you have to buy the components at market rates). I’m not sure you can build a terabyte-capable server for $5k, and the memory itself will cost significantly more than $5k. Even if it were possible to build a single-terabyte system for $15k, the original claim was about “terabytes”, and I don’t think such a thing can be built for $50k, maybe not even for $100k.

        • Here is an example with just a motherboard and 1 TB of memory ($13,633.99): https://pcpartpicker.com/list/FgkXJV

          That motherboard could support up to 2 TB though:

          Up to 2TB ECC 3DS LRDIMM, up to DDR4-2400MHz; 16x DIMM slots

          https://www.supermicro.com/products/motherboard/xeon/c600/x10dax.cfm

          I didn’t see any 128 GB LRDIMM sticks on pcpartpicker, but these look compatible and are about $5k each, so it would be ~$80k for 16 of them.

          After looking around more it works out to about $750 per 64 GB module and $1.5k to $6k per 128 GB module. I’m not sure why there is such a large price range for those, perhaps compatibility issues.

          In conclusion, for 2 TB of memory it would be ~$24k worth of 64 GB modules or $24k–$96k of 128 GB modules. Going down to 32 GB modules looked like ~$350 each (~$22k total). So “not a lot of money these days” is $20k on the low end, and the memory will be by far the most expensive component of the machine.

          It may make sense if they are comparing to prices in the distant past:

          Higher-capacity chips remained expensive: A two-megabit chip, for example, was $599 in September 1985, making the cost-per-megabyte for larger-capacity RAM $300, per McCallum’s analysis. But even at the high end, the prices kept dropping: The cost-per-megabyte fell to as low as $133 in October of 1987.

          https://tedium.co/2016/11/24/1988-ram-shortage-history/

          I can’t tell what’s going on with bits/bytes in that article, but since there are 1024^2 MB per TB, that naively works out to $100–300 million per TB in 1988. Relative to that, current prices are quite cheap (about 2k to 20k times cheaper).

        • Server memory looks to be about $700 for 64 GB, so the RAM will cost you $11k; the processor is, say, another $1k and the mobo $300, so you’ll come in at close to $15k. There are a few mobos that support 1512 GB, so it’s not out of the question to get more than a terabyte; more than 2 terabytes is, I agree, very hard. I think the possibility of a terabyte of RAM for less than the cost of a used car is remarkable enough. $15k counts as cheap for commercial purposes compared to, say, the salary of the person you hire to write the software.

        • You still seem too optimistic if you think the CPU will be “say another $1k”, but as I said, maybe the system can be built for $15k (surely not for less). It might not be a lot of money, but it’s half an order of magnitude above $5k. It might be the price of the average used car, but it’s more than a cheap new car.

          I don’t say that’s not a fine achievement, but it’s not as good as it was supposed to be. According to this forecast from fifteen years ago, http://www.singularity.com/charts/page58.html, by now it should cost a couple hundred dollars to produce 1 TB of memory, not several thousand. Or maybe I’m not reading those reports correctly; I’m not a semiconductor expert. Also, I want my flying car.

        • You can get terabytes of ram for not a lot of money these days

          What do you mean “these days”? I recently went to build a new PC and didn’t, because RAM (and hence also GPUs) are at insane prices. And it wasn’t the type of thing I was looking to save costs on either; the prices were just way too high. I don’t know the authoritative source on memory prices, but here are a bunch of people saying so:

          https://old.reddit.com/r/buildapc/comments/8rqias/why_the_hell_are_ram_prices_still_outrageously/
          https://www.windowscentral.com/best-ram-deals
          https://www.hardcoregames.biz/ram-prices-still-sky-high/
          https://investorplace.com/2018/06/memory-prices-matter-mu-stock/
          https://www.marketwatch.com/story/micron-earnings-show-more-big-profit-and-sales-gains-but-stock-dips-slightly-2018-06-20

          Are you comparing to 1995?

        • GPUs are at insane prices because of cryptocurrency miners, not primarily RAM prices.

          See above: $15k for a full terabyte system is not a lot of money compared to the salary of the person you hire to write the software… So I still think the original statement is accurate from that standpoint.

        • I was thinking of this:

          In the midst of a global DRAM shortage, Digitimes reports that the market prices for graphics memory from Samsung and SK Hynix have increased by over 30% for August. This latest jump in memory prices is apparently due to the pair of DRAM manufacturers repurposing part of their VRAM production capacities for server and smartphone memories instead.

          https://www.anandtech.com/show/11724/samsung-sk-hynix-graphics-memory-prices-increase-over-30-percent

          Of course there is also the mining issue on top of that. Also, I have another post where I looked at prices in more depth that hasn’t shown up yet. These pages also look useful:

          https://www.gamersnexus.net/industry/3212-ram-price-investigation-ddr4-same-price-as-initial-launch
          https://jcmit.net/memoryprice.htm

          My conclusion is that if I can wait a bit, I should be able to get the same thing for half to a quarter of current prices.

        • The original claims were that the model “can’t be scalable beyond toy datasets”, that “RAM is the limiting factor here”, and that there are “gigs and gigs of data”. Stitch Fix is a company with “over 70” data scientists (https://newsroom.stitchfix.com/wp-content/uploads/2016/09/StitchFix_FactSheet-2.pdf), and in that context dropping $15k or more on a large-RAM machine is barely a rounding error.

          I agree, it is a lot of money for an individual, but if you’re scaling up to these data sizes I think you’ve gone past the realm of ‘hobbyist’ into ‘enthusiast’. If you’ve already crossed the line into ‘professional’, then you have to consider this in the category of ‘major expenditure needed for my profession’ and should probably consult an accountant. On the academic side, I’ve been out a long time, but a quick search suggests that a machine like this can be grant-justifiable, if not the kind of thing a department would be willing to fund.

          And of course if you don’t want to buy it yourself, you can rent: https://aws.amazon.com/about-aws/whats-new/2017/09/general-availability-a-new-addition-to-the-largest-amazon-ec2-memory-optimized-x1-instance-family-x1e32xlarge/

        • Mike:

          I don’t know why you put ‘thousands’ in quotes. When I said thousands, I meant thousands. Regarding details: I was thinking of a simulation that Bob Carpenter did a few years ago, which I can’t find anywhere now. You could take a look at this paper from a few years ago, but our Stan code is faster now, as we have hard-coded some models such as logistic regression.

        • Andrew-
          Putting thousands in quotes wasn’t meant as a gee-whiz moment, since thousands of features isn’t really that many; e.g., Google develops its predictive models using hundreds of millions of features.
          I think of both IRT models and models built using Stan as inferential, even confirmatory, approaches rooted in strong theory and based on a carefully specified model using a reduced set of predictors.
          What theory lends itself to inferential model building with thousands of features? What a priori hypotheses can be specified in that case?

        • How much better are the predictions from some kind of machine learning model involving, say, 1 million predictors compared to a model using, say, a random sample of 1000 of those predictors? I’d be shocked if 1e6 predictors did a lot better than 1e3.

          I think typically the reason people throw in 1M is because they have the data and they *don’t* have any kind of theory, not because 1M are needed to get good results.

        • A lot of the problems at which machine learning excels have ridiculously long tails. All the text in the world doesn’t build a very good simple language model to predict the next word given the previous half-dozen words. There are about a million forms of words in fairly common use, not counting names, so a half-dozen previous words makes for a lot of predictors. Of course, most of them won’t be observed in a finite text corpus, but if you have a billion words (no longer even considered “big data”), you will likely have tens or hundreds of millions of predictors. Even pruning them heavily doesn’t give you something workable with only a thousand or ten thousand predictors.

          For speech recognition, the input is often broken down into fifty or so psychoacoustically filtered predictors every hundredth of a second. And those big language models over words are used to guide the speech recognizer in narrowing the search for the next word. Using many fewer predictors or much less training data starts dropping accuracy fast.

          Sure, they’re not very good models—basically just remembering what they saw, but they do let us build applications like Alexa or Siri.

        • It’s not clear to me how a half dozen words makes for more than a half dozen predictors.

          But clearly this is a terminology issue… The point about needing a lot of data in building the prediction model is very clear.

        • Mike:

          An example of item response modeling would be a test with hundreds of questions taken by thousands of students. Or an ideal-point model fit to hundreds of questions asked of thousands of survey respondents. Or thousands of movies being rated by millions of viewers. There are lots and lots of applications of such models. Another example is the clothing-fitting problem in the above post.
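
          For the test example, here is a minimal Rasch-style sketch in Stan (again purely illustrative, with made-up names; the other applications swap in different outcome distributions but keep the same crossed person-by-item structure):

            data {
              int<lower=1> N;                // number of responses
              int<lower=1> J;                // number of students
              int<lower=1> K;                // number of questions
              int<lower=1, upper=J> jj[N];   // student for response n
              int<lower=1, upper=K> kk[N];   // question for response n
              int<lower=0, upper=1> y[N];    // 1 = correct, 0 = incorrect
            }
            parameters {
              vector[J] theta;               // student ability
              vector[K] b;                   // question difficulty
            }
            model {
              theta ~ normal(0, 1);
              b ~ normal(0, 1);
              y ~ bernoulli_logit(theta[jj] - b[kk]);
            }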

  4. This thread took an interesting direction. I come from a business perspective, where 5-20k transactions/sec is fairly common, so scale and dynamics are critical. We can deal with the dynamics by running algorithms in parallel and persisting parameters to be shared by the algorithms in production. But scale is a concern I have with Stan. I suspect that, since it’s accessing a compiled C++ object, we already have a bottleneck, and it’s going to be difficult to scale / thread out in a cluster. Desktops are not an option. Thoughts?

      • Andrew:

        CmdStan 2.18 is out! It has MPI-based (multi-core or multi-machine) parallelism in evaluating the log densities.
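
        Under the hood that is the map_rect function: you shard the data, write a function that returns each shard’s contribution to the log density, and Stan can farm the shards out to separate cores or MPI processes. Here is a toy sketch (my own example, nothing to do with the Stitch Fix model; as far as I recall you also need to build with STAN_MPI=true or STAN_THREADS=true to actually get the parallelism):

          functions {
            // log likelihood for one shard; phi carries the shared
            // parameters (mu, sigma), theta is unused here
            vector shard_lp(vector phi, vector theta, real[] x_r, int[] x_i) {
              return [normal_lpdf(to_vector(x_r) | phi[1], phi[2])]';
            }
          }
          data {
            int<lower=1> S;       // number of shards
            int<lower=1> n;       // observations per shard
            real y[S, n];         // data, pre-split into S shards
          }
          transformed data {
            int x_i[S, 0];        // no shard-specific integer data
            vector[0] theta[S];   // no shard-specific parameters
          }
          parameters {
            real mu;
            real<lower=0> sigma;
          }
          model {
            mu ~ normal(0, 5);
            sigma ~ normal(0, 5);
            // map_rect evaluates shard_lp once per shard, potentially in
            // parallel, and returns the concatenated results
            target += sum(map_rect(shard_lp, [mu, sigma]', theta, y, x_i));
          }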

        Ellen Terry:

        Stan’s not intended for high-throughput applications. If you have a business mindset, it might be helpful to think of it like a spreadsheet for statisticians. You can code things and do calculations, and it’s very flexible and powerful for what it does, but nobody would seriously suggest building a high-throughput application around it. Scalability and parallelism probably wouldn’t be a problem; power consumption and latency would be.

        We recommend Stan for when you need to perform full Bayesian inference to calculate posterior expectations accurately conditioned on static data. Andrew’s thinking about streaming data, but we’re nowhere near having a coherent solution yet, nor do I believe we ever will be able to do anything useful that’s not approximate or slow.

        I don’t understand why you think compiled C++ would be a bottleneck. It’s what companies like Google that really need to scale their front ends use. I think they’re a bit north of 20K queries/second, but they don’t have to be transactional the same way a bank transfer is.
