Using partial pooling when preparing data for machine learning applications

Geoffrey Simmons writes:

I reached out to John Mount/Nina Zumel over at Win Vector with a suggestion for their vtreat package, which addresses many common challenges in preparing data for machine learning applications.
The default behavior for impact coding high-cardinality variables had been a naive Bayes approach, which I found problematic due to its multimodal output (assigning probabilities close to 0 and 1 for low-sample-size levels). This seemed like a natural fit for partial pooling, so I pointed them to your work/book and demonstrated its usefulness from my own experience and applications. It’s now the basis of a custom-coding enhancement to their package.
You can find their write-up here.
Cool. I hope their next step will be to implement it in Stan.
It’s also interesting to think of Bayesian or multilevel modeling being used as a preprocessing tool for machine learning, which is sort of the flipped-around version of an idea we posted the other day, on using black-box machine learning predictions as inputs to a Bayesian analysis. I like these ideas of combining different methods and getting the best of both worlds.
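To make the contrast concrete, here is a minimal sketch of the idea (not vtreat’s actual implementation; the data, the column names x and y, and the use of lme4 are all illustrative assumptions): a naive per-level impact code versus a partially pooled one from a random-intercept model.

```r
# Illustrative sketch only; data and column names are made up.
library(lme4)

set.seed(1)
d <- data.frame(
  x = sample(sprintf("level_%02d", 1:50), 500, replace = TRUE)  # high-cardinality categorical
)
d$y <- rnorm(500) + ifelse(d$x == "level_01", 1, 0)

# Naive impact code: per-level mean minus the grand mean.
# Rare levels get extreme, unreliable estimates.
grand <- mean(d$y)
naive <- tapply(d$y, d$x, mean) - grand

# Partial pooling: a random-intercept model shrinks each level's
# effect toward zero in proportion to how little data it has.
fit <- lmer(y ~ 1 + (1 | x), data = d)
pooled <- ranef(fit)$x[["(Intercept)"]]

# The levels with the most extreme naive estimates (often the
# rarest) show the largest shrinkage under partial pooling.
comparison <- data.frame(naive = naive, pooled = pooled)
head(comparison[order(-abs(comparison$naive)), ])
```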

1 thought on “Using partial pooling when preparing data for machine learning applications”

  1. Definitely have some Stan projects in the pipeline.

    Also, it is fun to try to sneak some well-founded Bayesian methods into machine learning (and evidently also vice versa). I think there is a lot to be gained.

    Finally, I really suggest that R users working on machine learning or predictive modeling try out vtreat; it can be game-changing (it makes messy real-world data behave almost as well as example data).
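For readers who want to try it, here is a minimal sketch of typical vtreat use (the data frame and column names are hypothetical), using the package’s standard design/apply pair for a numeric outcome, designTreatmentsN() and prepare():

```r
# Illustrative sketch; data and column names are made up.
library(vtreat)

set.seed(2)
d <- data.frame(
  x1 = sample(c("a", "b", "c", NA), 20, replace = TRUE),  # categorical with missing values
  x2 = c(rnorm(18), NA, NA),                              # numeric with missing values
  y  = rnorm(20)                                          # numeric outcome
)

# Learn treatment plans from the data...
plan <- designTreatmentsN(d, varlist = c("x1", "x2"), outcomename = "y")

# ...then apply them to get a clean, all-numeric frame
# (missing values imputed and flagged, levels encoded).
d_treated <- prepare(plan, d)
head(d_treated)
```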
