Anthony Goldbloom writes:
In late August, Kaggle launched an open data platform where data scientists can share data sets. In the first few months, our members have shared over 300 data sets on topics ranging from election polls to EEG brainwave data. It’s only a few months old, but it’s already a rich repository for interesting data sets.
It’s also a nice place to share reproducible data science. We have built a tool called Kaggle Kernels, which allows data scientists and statisticians to share notebooks and scripts in Python or R on top of the data. If you find an analysis you want to extend, you can “fork” it, which gives you a reproducible version without going through the pain of replicating the author’s environment. It’s useful for learning new techniques (by forking and playing with others’ code), for sharing your side project with a large community, and for drawing attention to your research and storing it in a way that can be easily reproduced.
He adds:
We don’t support Stan yet but we inevitably will.
Sooner rather than later, I hope!
P.S. Jamie Hall of Kaggle writes:
We’ve got RStan and PyStan ready to go in Kernels now. It would be fantastic to see some examples of the best ways to use them.
P.P.S. Aki has made a Kaggle notebook Bayesian Logistic Regression with rstanarm, and it works just fine.
We’ve got RStan and PyStan ready to go in Kernels now. It would be fantastic to see some examples of the best ways to use them.
You’re in luck. We have all sorts of examples and a lot of doc:
http://mc-stan.org/documentation/
The manual has a lot of detail and covers a lot of modeling techniques with examples. The example models repo translates models from a lot of popular data sets and books in specific domains like cognitive science and ecology. The case studies have fully worked examples. RStan and PyStan themselves both have doc too; I know there’s a vignette for RStan with a lot of examples.
That’s great, thanks! It would be fantastic to see some of these techniques used in Kernels. In the past we’ve found that packages really take off with our community when there are executable and forkable examples to play with and build on, even if the docs have been around for a while.
I’m using SAS and R in combination. Is there anything like Kaggle Kernels that works for SAS? I suspect it’s unlikely to ever happen, given that SAS costs money and is a nightmare to install with its millions of optional parts (I have no idea what they do), but one can wish.
Alexia, we don’t support SAS yet. We basically prioritize what our community asks for. If we saw significant demand for SAS, we’d contact them about the possibility of having a SAS environment in Kaggle Kernels.
I should think that Kaggle would also want to have rstanarm and brms available for use with these Kernel things. Email me if you need any help getting them installed and set up.
Oh, interesting! Our R environment includes every package on CRAN, plus the basic ones from Bioconductor, and a bunch from GitHub that our community have requested. It seems that rstanarm and brms were installed ok, though nobody’s demoed how they should be used yet.
Best place to start would probably be the respective package vignettes:
https://cran.r-project.org/web/packages/rstanarm/index.html
https://cran.r-project.org/web/packages/brms/vignettes/brms.pdf
We also have a free rstanarm webinar on November 22 using LendingClub data that is also on Kaggle. Maybe you could pass the link along to people who are interested in using Kernels?
https://attendee.gotowebinar.com/register/58293186453699586
I tested a Kaggle kernel with rstanarm. I copied the rstanarm vignette https://cran.r-project.org/web/packages/rstanarm/vignettes/binomial.html into a Kaggle kernel. All the code works fine. To play with the code, you need to fork the kernel and edit it.
At first I couldn’t figure out how to create a new kernel from scratch, so I forked a random one. Only afterwards did I realize that kernels are associated with Kaggle datasets and that you have to choose the dataset first. Now my kernel is associated with the Pima Indians data, and that association cannot be changed.
You can see the kernel here
https://www.kaggle.com/avehtari/d/uciml/pima-indians-diabetes-database/glms-for-binary-and-binomial-data-with-rstanarm
but I guess I should remove it, as it does not use the Pima Indians dataset.
I updated the kernel to use the Pima Indians dataset, and the link changed:
https://www.kaggle.com/avehtari/d/uciml/pima-indians-diabetes-database/bayesian-logistic-regression-with-rstanarm
As the Pima Indians data differ from the arsenic wells example, the end of the kernel differs a lot from the original vignette. I also added new examples for computing classification errors, ROC, and AUC with PSIS-LOO (hopefully soon available in the loo package as well).
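For readers who want to see what the classification error and AUC summaries amount to, here is a minimal sketch in plain Python. It is not the code from the kernel, which is written in R with rstanarm, and it does not do the PSIS-LOO step: it assumes you already have one predictive probability per observation (in the kernel these would be the leave-one-out predictive probabilities). The function names and the toy labels and probabilities are made up for illustration.

```python
def classification_error(y_true, p_pred, threshold=0.5):
    """Share of observations whose thresholded probability disagrees with the label."""
    wrong = sum((p >= threshold) != bool(y) for y, p in zip(y_true, p_pred))
    return wrong / len(y_true)


def roc_auc(y_true, p_pred):
    """AUC as the chance that a random positive case gets a higher
    probability than a random negative case (ties count as half)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predictive probabilities, not taken from the Pima Indians data.
y = [1, 0, 1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]

print("classification error:", classification_error(y, p))
print("AUC:", roc_auc(y, p))
```

The pairwise AUC above is quadratic in the number of observations, which is fine for a dataset of a few hundred rows; a rank-based formula would be the usual choice for larger data.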