Bob, please correct me if I have misinterpreted this, but I just looked at your LingPipe logistic regression example, where you compare regression coefficients across folds for stability. The folds do not seem to be independent samples; instead they appear to be drawn by sampling with replacement, since the LingPipe documentation that shows your example logistic regression model references the Wikipedia bootstrapping article. All regression coefficients look stable when there is substantial overlap across training samples in these very artificial and contrived bootstrap validation schemes (with 100% overlap there is 100% stability). Yet such predictive models seldom replicate in the scientific sense, where validation requires truly independent data samples, as we now know from large quality control studies like MAQC-II and from Ioannidis.
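To put a rough number on the overlap at issue, here is a minimal sketch, assuming plain bootstrap resampling (n draws with replacement from n items) and, for comparison, ordinary k-fold cross-validation. The function names are hypothetical and nothing here is taken from LingPipe itself.

```python
import numpy as np

def bootstrap_overlap(n, rng):
    """Draw two bootstrap samples of size n and measure their overlap.

    Returns (fraction of the n original items present in sample A,
             fraction of A's distinct items that also appear in sample B).
    """
    a = np.unique(rng.integers(0, n, size=n))  # distinct items in sample A
    b = np.unique(rng.integers(0, n, size=n))  # distinct items in sample B
    shared = np.intersect1d(a, b)
    return a.size / n, shared.size / a.size

def kfold_training_overlap(k):
    """Fraction of one k-fold training set shared with another training set."""
    return (k - 2) / (k - 1)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
frac_distinct, frac_shared = bootstrap_overlap(10_000, rng)
print(frac_distinct, frac_shared, kfold_training_overlap(10))
```

For large n both bootstrap fractions should be close to 1 - 1/e, about 0.632, so any two bootstrap training samples share well over half of their distinct items. With 10-fold cross-validation, any two training sets share 8/9 of their items.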

Try to understand that the only reason you see such stability is that you have artificially forced it: your logistic regression training samples are not independent and instead overlap substantially with one another. RELR, on the other hand, does show extremely good stability across truly independent samples, and anyone is free to look at the simple toy Excel models that we provide on the Elsevier companion website to Calculus of Thought to see this.

Your logistic regression would have horrible stability in the real-world scenario where someone else replicates your classification models blindly and independently by pulling an independent sample of data. In fact, it is not unusual to see regression coefficients that have near-zero correlations across truly independent samples when standard logistic regression or shrinkage methods like L1 or L2 are applied to multicollinear data (see the data I present in my book). The now-famous quality control studies like MAQC-II, along with Ioannidis, suggest that predictive models built with the standard methods that Bob Carpenter advocates and sells through LingPipe seldom replicate. When we ask why, we have to look at the lame validation, and nothing is more lame than overlapping samples when you wish to assess the stability of regression coefficients.
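Anyone who wants to check the comparison for themselves could start from something like the following. It is a minimal sketch on synthetic data, not LingPipe's implementation and not RELR: the data generator, the collinearity level `rho`, the L2 penalty, and all function names are assumptions made purely for illustration. It fits an L2-penalised logistic regression to pairs of overlapping bootstrap resamples and to pairs of freshly drawn independent samples, then reports the average correlation between the two coefficient vectors in each case.

```python
import numpy as np

def make_data(n, rng, p=6, rho=0.9):
    # Hypothetical generator: p features sharing one latent factor,
    # so every pair of features has correlation rho (multicollinear).
    latent = rng.normal(size=(n, 1))
    X = np.sqrt(rho) * latent + np.sqrt(1 - rho) * rng.normal(size=(n, p))
    beta = np.linspace(-1.0, 1.0, p)  # assumed "true" coefficients
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = (rng.random(n) < prob).astype(float)
    return X, y

def fit_logistic(X, y, l2=1.0, iters=25):
    # Newton's method for L2-penalised logistic regression, no intercept.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) + l2 * w
        hess = (X * (p * (1 - p))[:, None]).T @ X + l2 * np.eye(X.shape[1])
        w = w - np.linalg.solve(hess, grad)
    return w

def coef_corr(w1, w2):
    # Pearson correlation between two coefficient vectors.
    return float(np.corrcoef(w1, w2)[0, 1])

def compare(n=200, reps=50, seed=0):
    rng = np.random.default_rng(seed)
    boot, indep = [], []
    for _ in range(reps):
        # Two bootstrap resamples of one data set (overlapping samples).
        X, y = make_data(n, rng)
        i1 = rng.integers(0, n, size=n)
        i2 = rng.integers(0, n, size=n)
        boot.append(coef_corr(fit_logistic(X[i1], y[i1]),
                              fit_logistic(X[i2], y[i2])))
        # Two freshly drawn data sets (independent samples).
        Xa, ya = make_data(n, rng)
        Xb, yb = make_data(n, rng)
        indep.append(coef_corr(fit_logistic(Xa, ya), fit_logistic(Xb, yb)))
    return float(np.mean(boot)), float(np.mean(indep))

boot_mean, indep_mean = compare()
print("bootstrap pairs:", boot_mean, "independent pairs:", indep_mean)
```

The size of any gap between the two numbers will depend on the sample size, the degree of collinearity, and the penalty, so those settings are worth varying before drawing conclusions.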

Beyond the obvious problem with overlapping samples when stability is assessed, the lack of replication seen in these large quality control studies is more fundamentally related to the problematic assumptions that traditional predictive modeling methods make. Until all the high-minded talk on this blog about lack of replication starts to look at why traditional predictive modeling methods do not replicate, including multi-level models that use MCMC where the controversial random effects assumption is usually not appropriate, there will be little real science here.

Obviously, giving a scientist the right to defend his work is not spam.

Best Regards,

Dan

Just to clarify: I did not specifically close that earlier discussion; we’ve set the blog to automatically close old discussions after some length of time, to reduce the amount of spam that we get.

Software patents are here to stay, as made clear in this week’s Supreme Court hearing, and sophisticated, novel, and useful machine learning methods seem to be the very kind of software patents that the justices indicated this week that they strongly support (even the opposing attorneys who wished to limit software patents gave mathematical algorithms for encryption as an example of the kind of machine-based method that deserves patenting). Business methods, like software that implements hedging, likely will not be patentable in the future, though, as was also evident in this Supreme Court hearing.

If it were a bogus method, why would you even be concerned about a patent? Obviously, you are concerned about the patent because you understand that it does exactly what it claims: it produces regression coefficients with dramatically reduced error, especially with multicollinear and/or high-dimensional candidate features and small samples, and it does so automatically, without any arbitrary user parameters. Because of the much lower error and the lack of arbitrary or biased user choices, RELR models are much more likely to replicate, which overcomes one of the big problems in science today: the lack of replication in predictive models based upon observational data.

Stochastic Gradient Hamiltonian Monte Carlo

http://arxiv.org/abs/1402.4102