
Stanny Stanny Stannitude


On the stan-users list, Richard McElreath reports:

With 2.4 out, I ran a quick test of how much speedup I could get by changing my old non-vectorized multi_normal sampling to the new vectorized form. I get a 40% time savings, without even trying hard. This is much better than I expected.

Timings with vectorized multi_normal:

# Elapsed Time: 96.8564 seconds (Warm-up)
# 87.8308 seconds (Sampling)
# 184.687 seconds (Total)

Timings without vectorized multi_normal:

# Elapsed Time: 168.544 seconds (Warm-up)
# 147.83 seconds (Sampling)
# 316.375 seconds (Total)

These timings are for a Gaussian mixed model with three random effects, 5000 cases, and 500 groups. The only difference between these two models was changing this line:

vary ~ multi_normal( rep_vector(0,3) , SIGMA );

to this line:

for ( j in 1:N_id ) vary[j] ~ multi_normal( rep_vector(0,3) , SIGMA );

Really nice stuff.

And it’s only gonna be getting better.
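For readers who want to see the two forms side by side in context, here is a minimal sketch of a model that uses the vectorized statement. It is not McElreath’s actual model: the data block, parameter declarations, and the prior on SIGMA are made up for illustration; only the two sampling statements come from his report.

data {
  int<lower=1> N_id;                      // number of groups
}
parameters {
  vector[3] vary[N_id];                   // one varying-effect vector per group
  cov_matrix[3] SIGMA;                    // covariance of the varying effects
}
model {
  // illustrative prior; the actual prior on SIGMA is not shown in the post
  SIGMA ~ inv_wishart(4, diag_matrix(rep_vector(1, 3)));

  // vectorized statement: one call over the whole array of vectors (new in 2.4)
  vary ~ multi_normal(rep_vector(0, 3), SIGMA);

  // equivalent, slower, element-wise form:
  // for (j in 1:N_id)
  //   vary[j] ~ multi_normal(rep_vector(0, 3), SIGMA);
}

The vectorized form lets Stan share work across the N_id terms (for example, the factorization of SIGMA is done once rather than once per group), which is presumably where the reported speedup comes from.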

We’ve also been talking about a Julia interface to go along with the R and Python interfaces.

The health policy innovation center: how best to move from pilot studies to large-scale practice?

A colleague pointed me to this news article regarding evaluation of new health plans:

The Affordable Care Act would fund a new research outfit evocatively named the Innovation Center to discover how to most effectively deliver health care, with $10 billion to spend over a decade.

But now that the center has gotten started, many researchers and economists are disturbed that it is not using randomized clinical trials, the rigorous method that is widely considered the gold standard in medical and social science research. Such trials have long been required to prove the efficacy of medicines, and similarly designed studies have guided efforts to reform welfare-to-work, education and criminal justice programs.

But they have rarely been used to guide health care policy — and experts say the center is now squandering a crucial opportunity to develop the evidence needed to retool the nation’s troubled health care system in a period of rapid and fundamental change. . . .

But not all economists think that randomization is the gold standard. Here’s James Heckman, for example, criticizing “the myth that causality can only be determined by randomization, and that glorifies randomization as the ‘gold standard’ of causal inference.” I try to put some of this in perspective here; see p. 956 of that article.

Meanwhile Prabhjot Singh offers some thoughts on the health policy innovation center:

The Innovations center that is in the cross hairs of this article is the one new federal level healthcare initiative that I truly think is transformative. Over just the past 3 years, they leveraged 1 billion dollars to change the organizational behavior of a 2.7 trillion dollar industry in the specific area of service delivery and payment systems. I have sat with the executive / strategy groups of 3 of this city’s largest hospital systems (covering 80% of the city’s patients) and many across the country as they scramble to figure out how their organization should shift strategy to capture some of this money. It’s actually sorta fascinating, because 1 billion divided by all the possible hospital systems in the US, who each submit multiple grants, is a pretty small amount of money. But the innovation center process is smart – hospital systems have to demonstrate that they would do what they suggest anyway. Everybody is angling to catch a piece of this large sounding amount of cash and figure out what the innovation center really wants to see – and they game their applications accordingly and whisper intel across their institutions and the required set of regional partners they have to assemble. By the time initiatives are submitted, there is a remarkable amount of synchronization and consensus across internally fragmented organizations about the sort of payment and delivery innovations they each think has the best shot. I was surprised by how many leaders of interdependent healthcare sub-systems were meeting for the first time through this process. In my view, the sheer internal prep and intra-institutional cooperation required to submit a proposal for an innovation center challenge—repeated across the country—is worth $1b. The shared awareness of common challenges alone will promote the diffusion of successful demonstrations across institutions with synchronized priorities. This is notoriously difficult even with an iron-clad RCT in hand.

When the goal is changing system behavior, the design of the process should be a central concern. It creates a clear navigation path so the entire ecosystem can strive forward, not simply a few lucky demonstration projects that post improvements compared to controls. The demonstrations are like fruit on a tree. In a well designed process, the tree remains after the fruit is long gone. It’s easy to ignore the tree, much less its own interdependencies, when you’re entirely focused upon comparing control and treatment apples. They matter too, but an apple needs to satisfice a set of contextual requirements. Moreover, we already have a dedicated RCT/CEA funding stream for well designed healthcare delivery experiments called PCORI. The problem with forcing all innovation center demonstration projects to contain an RCT is that it massively constrains the space of participants & potential solutions, and would fundamentally compromise the systems change process I described above. But I’d hardly expect people who seem to believe that informal domain knowledge is a fundamental source of bias to be eliminated to appreciate that.

Here [Singh writes] is a much more thoughtful exploration of these issues:

Over the last twenty or so years, it has become standard to require policy makers to base their recommendations on evidence. That is now uncontroversial to the point of triviality—of course, policy should be based on the facts. But are the methods that policy makers rely on to gather and analyze evidence the right ones? In Evidence-Based Policy, Nancy Cartwright, an eminent scholar, and Jeremy Hardie, who has had a long and successful career in both business and the economy, explain that the dominant methods which are in use now—broadly speaking, methods that imitate standard practices in medicine like randomized control trials—do not work. They fail, Cartwright and Hardie contend, because they do not enhance our ability to predict if policies will be effective.

The prevailing methods fall short not just because social science, which operates within the domain of real-world politics and deals with people, differs so much from the natural science milieu of the lab. Rather, there are principled reasons why the advice for crafting and implementing policy now on offer will lead to bad results. Current guides in use tend to rank scientific methods according to the degree of trustworthiness of the evidence they produce. That is valuable in certain respects, but such approaches offer little advice about how to think about putting such evidence to use. Evidence-Based Policy focuses on showing policymakers how to effectively use evidence, explaining what types of information are most necessary for making reliable policy, and offering lessons on how to organize that information.

Statistics and data science, again

Phillip Middleton writes in with two questions:

(1) Is html markdown or some other formatting script usable in comments? If so, what are the tags I may use?

(2) What are your views on the role of statistics in the evolution of the various folds of convergent science? For example, upon us there is this rather nebulously defined thing called data science which roughly equates to an amalgam of comp sci and stats. I tend to interpret it as somewhat computational stats, somewhat AI – or maybe it’s just an evolution of data mining, I don’t know yet.

The comp sci camp is marketing their brand in this area quite well. But is it me or is the statistical community perhaps taking a bit of a back seat? This feels more like a bit of competition than an invitation to converge ideas (so where does computational and graphical stats sit, then?). I had a brief conversation with Prof. Joe Blitzstein about this some time ago, and he expressed a number of concerns as well.

There are those that believe statisticians as a collective group have become less important in the work to translate data into meaning/knowledge than this unicorn (for now) ‘data scientist’. I believe part of the reason is that this new discipline has quite a bit of historical public ‘glitter’ and a few other shiny objects embedded (the term ‘big data’ for one, and a larger paycheck for another – quite the economically driven discipline), though it hasn’t in my opinion been fully defined. So what do you believe will become of it? Is it fad’ish or the storming/norming stage of a particular convergence between comp sci and stats? Or more?

I don’t see much official discussion by groups like the ASA or by any of the data science group/society outcroppings. What do you make of this? As much as I see the need for convergence of Bayesian and frequentist philosophies in statistics, I see the need for convergence between the statistical and comp sci camps to solve problems which each alone may not necessarily be good at. Is that even a reasonable expectation?

Physics has been doing this quite rapidly in recent years (convergence with medicine, biology, finance, economics, etc.), and it has integrated fairly seamlessly, I believe. I could be making the wrong comparison here, but it would seem to me statistics converges with, well … everything, and in no small way. Yet it seems to me there are forces ‘placing’ statistics as a whole into somewhat of a basement, or at least at the cusp of an identity crisis.

My reply:

(1) You can use html in comments. I’m not sure which are allowed and which aren’t, but, as I tell everyone, if your comment gets caught in the spam filter, just drop me an email. It happens.

(2) As I’ve written before (see also here), I think statistics (and machine learning, which I take to be equivalent to statistical inference but done by different people) is the least important part of data science. But I do think statistics has a lot to contribute, in design of data collection, model building, inference, and a general approach to making assumptions and evaluating their implications. Beyond this, I don’t really know what to say. I agree these things are important but somehow I feel I lack the big picture—having only enough of this picture to recognize that I lack it!

Perhaps others have more useful thoughts to add.

The Ben Geen case: Did a naive interpretation of a cluster of cases send an innocent nurse to prison until 2035?

In a paper called “Rarity of Respiratory Arrest,” Richard Gill writes:

Statistical analysis of monthly rates of events in around 20 hospitals and over a period of about 10 years shows that respiratory arrest, though about five times less frequent than cardio-respiratory arrest, is a common occurrence in the Emergency Department of a typical smaller UK hospital.

He has lots of detailed and commonsensical (but hardly routine) data analysis. Those of you who read my recent posts (here and here) on the World Cup might be interested in this, because it has lots of practical details of statistical analysis but of a different sort (and it’s a topic that’s more important than soccer). The analysis is interesting and clearly written.
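To give a concrete sense of the kind of model one might fit to such data (this is a generic sketch, not Gill’s actual analysis; all names and priors here are hypothetical), monthly counts of respiratory arrests could be modeled hierarchically across hospitals in Stan:

data {
  int<lower=1> N;                         // hospital-months
  int<lower=1> H;                         // hospitals
  int<lower=1, upper=H> hospital[N];      // hospital for each month of data
  int<lower=0> y[N];                      // respiratory arrests in that month
  vector[N] log_exposure;                 // e.g., log of ED attendances that month
}
parameters {
  real mu;                                // mean log rate across hospitals
  real<lower=0> tau;                      // between-hospital sd of log rates
  vector[H] log_lambda;                   // hospital-level log rates
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 1);                     // half-normal, via the lower bound
  log_lambda ~ normal(mu, tau);
  for (n in 1:N)
    y[n] ~ poisson_log(log_lambda[hospital[n]] + log_exposure[n]);
}

A model along these lines gives a baseline rate for each hospital against which an apparent cluster in one month can be judged, rather than eyeballed.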

But the subject seems pretty specialized, no? Why am I sharing with you an analysis of respiratory arrests (whatever that is) in emergency departments? The background is a possibly huge miscarriage of justice.

Here’s how Gill told it to me in an email:

I’m wondering if you can do anything to help Ben Geen – a British nurse sentenced to 30 years for a heap of crimes that were not committed by anyone. Show the world that in the right hands, statistics can actually be useful!

http://bengeen.wordpress.com/

There are important statistical issues and they need to be made known to the public.

There is in fact an international epidemic of falsely accused health care serial killers.

In Netherlands: Lucia de Berk
In the UK: Colin Norris, Ben Geen
In Canada: Susan Nelles
In the US: https://en.wikipedia.org/wiki/Ann_Arbor_Hospital_Murders

So I [Gill] got involved in the Ben Geen case (asked by the defence lawyer to write an expert statistical report). This is it:

http://arxiv.org/abs/1407.2731

I became convinced that most of what the media repeated (snippets of over the top info from the prosecution, out of context, misinterpreted) was lies, and actually the real evidence was overwhelmingly strong that Ben was completely, totally innocent.

Here’s the scientific side of the story. It’s connected to the law of small numbers (Poisson and super Poisson variation) and to the Baader-Meinhof effect (observer bias). And to the psycho-social dynamics in a present-day, dysfunctional (financially threatened, badly run) hospital.

In the UK, Colin Norris and Ben Geen are serving 30-year sentences and, absolutely clearly, they are completely innocent. Since no one was murdered, there never will be a confession by the true murderer. Because there were no murders, there won’t ever be new evidence pointing in a different direction. There will never be a new fact, so the system will never allow the cases to be reviewed. Since the medical profession was complicit in putting those guys away, no medical doctor will ever say a word to compromise his esteemed colleagues.

What is going on? Why this international epidemic of falsely accused “health care serial killers”?

Answer: in the UK, the scare which followed Shipman triggered increased paranoia in the National Health Service. Already stressed, overburdened, underfunded … managers, nurses, specialists all with different interests, under one roof in a hospital … different social classes, lack of communication.

So here’s the ingredients for a Lucia / Ben / Colin:

(1) a dysfunctional hospital (chaos, stress, short-cuts being taken)
(2) a nurse who is different from the other nurses. Stands out in the crowd. Different sex or age or class. More than average intelligence. Big mouth, critical.
(3) something goes wrong. Someone dies and everyone is surprised. (Why surprised: because of wrong diagnosis, disinformation, ….)
(4) Something clicks in someone’s mind (a paranoid doctor) and the link is made between the scary nurse and the event
(5) Something else clicks in … we had a lot more cases like that recently (e.g., the seasonal bump in respiratory arrests: 7 this month but usually 0, 1 or 2)
(6) The spectre of a serial killer has now taken possession of the mind of the first doctor who got alarmed, and he or she rapidly spreads the virus to close colleagues. They start looking at the other recent cases and letting their minds fall back to other odd things which happened in recent months and stuck in their minds. The scary nurse also stuck in their minds, and they connect the two. They go trawling and soon they have 20 or 30 “incidents” which are now bothering them. They check each one for any sign of involvement of the scary nurse, and if he’s involved the incident quickly takes on a very sinister look. On the other hand, if he was on a week’s vacation, then obviously everything must have been OK and the case is forgotten.
(7) Another conference, gather some dossiers – half a dozen very suspicious cases to report to the police to begin with. The process of “retelling” the medical history of these “star cases” has already started. Everyone who was involved and does know something about the screw-ups and mistakes says nothing about them but confirms the fears of the others. That’s a relief – there was a killer around, it wasn’t my prescription mistake or oversight of some complicating condition. The dossiers which will go to the police (and importantly, the layman’s summary, written by the coordinating doctor) do contain “truth” but not the *whole truth*. And there is lots of truth which is not even in hospital dossiers (culture of lying, of covering up for mistakes).
(8) The police are called in, the arrest is made, there is of course an announcement inside the hospital and there has to be an announcement to the press. Now of course the director of the hospital is in control – probably misinformed by his doctors, obviously having to show his “damage control” capacities and to minimize any bad PR for his hospital. The whole thing explodes out of control and the media feeding frenzy starts. Witch hunt, and then witch trial.

Then of course there is also the bad luck. The *syringe*, in Ben’s case, which clinches his guilt to anyone who nowadays does a quick Google search.

This is what Wendy Hesketh (a lawyer who is writing a book on the topic) wrote to me:

“I agree with your view on the “politics” behind incidences of death in the medical arena; that there is a culture endorsing collective lying”

“Inquiries into medico-crime or medical malpractice in the UK seem to have been commandeered for political purposes too: rather than investigate the scale of the actual problem at hand, or learn lessons on how to avoid it in future, the inquiries seem designed only to push through current health policy”

“The “Establishment” want the public to believe that, since the Shipman case, it is now easier to detect when a health professional kills (or sexually assaults) a patient. It’s good if the public think there will never be “another Shipman,” and Ben Geen and Colin Norris being jailed for 30 years apiece sent out that message, as has the string of doctors convicted of sexual assault. But statistics have shown that a GP would have to have a killing rate to rival Shipman’s in order to have any chance of coming to the attention of the criminal justice system. In fact, the case of Northumberland GP Dr. David Moor, who openly admitted in the media to killing (sorry, “helping to die”) around 300 patients (he wasn’t “caught”), reflects this. I argue in my book that it has not become easier to detect a medico-killer since Shipman, but it is much more difficult for an innocent person to defend themselves once accused of medico-murder.”

Indeed, the rate of serial killers in the UK’s National Health Service must be tiny, and if there are good ones around they won’t even be noticed. Yet it is so, so easy in a failing health care organization for the suspicion to arise that there is one around. And once the chances are aligned and the triggering event has happened, there is no going back. The thing snowballs. The “victim” has no chance.

Chance events are clustered!!! Pure chance gives little bunches of tightly clustered events with big gaps between them. When chances are changing (e.g. seasonal variation, changes in hospital policies, staffing, new personnel with new habits when filling in woefully inadequate diagnosis forms) then the phenomenon is stronger still!

E.g., three airliners crashed within a couple of days this week!!!

http://www.bbc.co.uk/news/magazine-28481060

How odd is a cluster of cases? Well, by the law of *small* numbers (Poisson and even super-Poisson variation – Poisson means pure chance, super-Poisson means pure chance but with the “chance per day” slowly varying in time), “short intervals between crashes are more likely than long ones”. (Actually, very short and very long intervals are both common. Pure chance means that accidents are *not* uniformly spread out in time. They are clustered. Big gap, cluster, biggish gap, smallish cluster… that’s pure randomness!!!)
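To spell out the law-of-small-numbers point in Gill’s parenthetical (a standard property of Poisson processes, not anything specific to the Geen case): for a homogeneous Poisson process with rate \lambda, the waiting time T between consecutive events is exponential,

\Pr(T > t) = e^{-\lambda t}, \qquad p(t) = \lambda e^{-\lambda t},

a strictly decreasing density, so short gaps are more likely than long ones and events arrive in clumps separated by occasional long stretches. If the rate itself drifts over time (super-Poisson variation), the count N in a fixed window is overdispersed,

\operatorname{Var}(N) = \mathrm{E}[\lambda] + \operatorname{Var}(\lambda) \ge \mathrm{E}[N] = \mathrm{E}[\lambda],

so the apparent clustering is stronger still.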

Then there is the Baader-Meinhof phenomenon

http://www.psmag.com/culture/theres-a-name-for-that-the-baader-meinhof-phenomenon-59670/

[I replaced an earlier link for this which pointed to a flaky news site -- AG]
“Baader-Meinhof is the phenomenon where one happens upon some obscure piece of information – often an unfamiliar word or name – and soon afterwards encounters the same subject again, often repeatedly. Anytime the phrase “That’s so weird, I just heard about that the other day” would be appropriate, the utterer is hip-deep in Baader-Meinhof.”

Another name for this is *observer bias*. You (a medical doctor having to fill in a diagnosis for a patient on a standard form, which is totally inadequate for the complexity of medicine) saw one case which you had to give a rather unusual label to, and in the next weeks that “unusual diagnosis” will suddenly come up several times.

Well, Professor Jane Hutton (Warwick University, UK) wrote all these things in her expert report for the appeal 6 years ago, but the judge said that this kind of statistical evidence “is barely more than common sense” and so refused the request for her to tell this common sense out loud in court.

OK, it’s me again. I haven’t looked at this case in any detail and so can’t add anything of substance to Gill’s analysis. But what I will say is, if Gill is correct, this example demonstrates both the dangers and the potential of statistics. The danger because it is statistical analysis that has been used to convict Geen (both in court and in the “court of public opinion” as measured by what can be found on Google). The potential because a careful statistical analysis reveals the problems with the case (again, I’m relying on Gill’s report here; that is, my comments here are conditional on Gill’s report being reasonable).

Just to be clear, I’m not saying that statistical arguments cannot or should not be used to make public health decisions. Indeed, I was involved last year in a case in which the local public health department made a recommendation based on statistical evidence, and this recommendation was questioned, and I (at the request of the public health department, and for no compensation) wrote a brief concurrence saying why I did not agree with the critics. So I am not saying that any statistical argument can be shot down, or that the (inevitable) reliance of any argument on assumptions makes that argument suspect. What I am doing is passing along Richard Gill’s analysis in this particular case, where he has found it possible for people to draw conclusions from noise, to the extent of, in his view, sending an innocent person to jail for 30 years.

SciLua 2 includes NUTS

The most recent release of SciLua includes an implementation of Matt’s sampler, NUTS (link is to the final JMLR paper, which is a revision of the earlier arXiv version).

According to the author of SciLua, Stefano Peluchetti:

Should be quite similar to your [Stan's] implementation with some differences in the adaptation strategy.

If you have time to have a look and give it a try I would appreciate your feedback. My website includes the LuaJIT binaries for 3 arches [Windows, Mac OS X, Linux], and the Xsys and Sci libraries needed to run NUTS.

I’ve also been working on a higher-level interface with some test models and datasets (the user can specify a simple script to describe the model, but it’s always Lua syntax, not a custom DSL).

Please notice that multi-pass forward automatic differentiation is used (I experimented with a single-pass, tape-based version, but the speed-up was not big, and for high-dimensional problems one wants to use reverse diff in any case; that’s on my TODO list).

For complex models you might want to increase loopunroll a little, and maybe callunroll as well. Feel free to ask any question.

I haven’t had time to try it myself. It looks like SciLua is built on LuaJIT, a just-in-time (JIT) compiler for Lua, in the same spirit as Julia, with models specified directly in Lua rather than in a custom modeling language.

Yummy Mr. P!


Chris Skovron writes:

A colleague sent the attached image from Indonesia.

For whatever reason, it seems appropriate that Mr. P is a delicious salty snack with the tagline “good times.”

Indeed. MRP has made the New York Times and Indonesian snack food. What more can we ask for?

A linguist has a question about sampling when the goal is causal inference from observational data

Nate Delaney-Busch writes:

I’m a PhD student of cognitive neuroscience at Tufts, and a question came up recently with my colleagues about the difficulty of random sampling in cases of highly controlled stimulus sets, and I thought I would drop a line to see if you had any reading suggestions for us.

Let’s say I wanted to disentangle the effects of word length from word frequency on the speed at which people can discriminate words from pseudowords, controlling for nuisance factors (say, part of speech, age of acquisition, and orthographic neighborhood size – the number of other words in the language that differ from the given word by only one letter). My sample can be a couple hundred words from the English language.

What’s the best way to handle the nuisance factors without compromising random sampling? There are programs that can automatically find the most closely-matched subsets of larger databases (if I bin frequency and word length into categories for a factorial experimental design), but what are the consequences of having experimental item essentially be a fixed factor? Would it be preferable to just take a random sample of the English language, then use a hierarchical regression to deal with the nuisance factors first? Are there measures I can use to quantify the extent to which chosen sampling rules (e.g. “nuisance factors must not significantly differ between conditions”) constrain random sampling? How would I know when my constraints start to really become a problem for later generalization?

Another way to ask the same question would be how to handle correlated variables of interest like word length and frequency during sampling. Would it be appropriate to find a sample in which word length and frequency are orthogonal (e.g. if I wrote a script to take a large number of random samples of words and use the one where the two variables of interest are the least correlated)? Or would it be preferable to just take a random sample and try to deal with the collinearity after the fact?

I replied:

I don’t have any great answers except to say that in this case I don’t know that it makes sense to think of word length or word frequency as a “treatment” in the statistical sense of the word. To see this, consider the potential-outcomes formulation (or, as Don Rubin would put it, “the RCM”). Suppose you want the treatment to be “increase word length by one letter.” How do you do this? You need to switch in a new word. But the effect will depend on which word you choose. I guess what I’m saying is, you can see how the speed of discrimination varies by word length and word frequency, and you might find a model that predicts well, in which case maybe the sample of words you use might not matter much. But if you don’t have a model with high predictive power, then I doubt there’s a unique right way to define your sample and your population; it will probably depend on what questions you are asking.

Delaney-Busch then followed up:

For clarification, this isn’t actually an experiment I was planning to run – I thought it would be a simple example that would help illustrate my general dilemma when it comes to psycholinguistics.

Your point on treatments is well-taken, though perhaps hard to avoid in research on language processing. It’s actually one of the reasons I’m concerned with the tension between potential collinearity and random sampling in cases where two or more variables correlate in a population. Theoretically, with a large random sample, I should be able to model the random effects of item in the same way I could model the random effects of subject in a between-subjects experiment. But I feel caught between a rock and a hard place when, on the one hand, a random sample of words would almost certainly be collinear in the variables of interest, but on the other hand, sampling rules (such as “generate a large number of potential samples and keep the one that is the least collinear”) undermine the ability to treat item as an actual random effect.

If you’d like, I would find it quite helpful to hear how you address this issue in the sampling of participants for your own research. Let’s say you were interested in teasing apart the effects of two correlated variables – education and median income – on some sort of political attitude. Would you prefer to sample randomly and just deal with the collinearity, or constrain your sample such that recruited participants had orthogonal education and median income factors? How much constraint would you accept on your sample before you start to worry about generalization (i.e. worry that you are simply measuring the fixed effect of different specific individuals), and is there any way to measure what effect your constraints have on your statistical inferences/tests?
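As a rough illustration of the “hierarchical regression to deal with the nuisance factors” option raised in the question above, here is a sketch of a varying-intercept model for items with the two correlated predictors entered directly. It is only an illustration, not a recommendation for this exact design: the variable names and priors are hypothetical, and it ignores subject effects and the other nuisance factors mentioned.

data {
  int<lower=1> N;                         // trials
  int<lower=1> W;                         // words (items)
  int<lower=1, upper=W> word[N];          // item index for each trial
  vector[N] rt;                           // log reaction time
  vector[W] length_z;                     // word length, standardized
  vector[W] freq_z;                       // log word frequency, standardized
}
parameters {
  real alpha;                             // overall intercept
  real beta_length;                       // effect of word length
  real beta_freq;                         // effect of word frequency
  vector[W] u;                            // item-level varying intercepts
  real<lower=0> sigma_item;               // sd of item intercepts
  real<lower=0> sigma;                    // residual sd
}
model {
  alpha ~ normal(0, 1);
  beta_length ~ normal(0, 1);
  beta_freq ~ normal(0, 1);
  sigma_item ~ normal(0, 1);
  sigma ~ normal(0, 1);
  u ~ normal(0, sigma_item);
  for (n in 1:N)
    rt[n] ~ normal(alpha + beta_length * length_z[word[n]]
                         + beta_freq * freq_z[word[n]]
                         + u[word[n]], sigma);
}

With a model like this, collinearity between length and frequency shows up as posterior correlation and wider uncertainty in beta_length and beta_freq, rather than as something that has to be eliminated by constraining the sample.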

Stan NYC Meetup – Thurs, July 31

The next Stan NYC meetup is happening on Thursday, July 31, at 7 pm. If you’re interested, registration is required and closes on Wednesday night: http://www.meetup.com/Stan-Users-NYC/events/193685802/


The third session will focus on using the Stan language. If you’re bringing a laptop, please come with RStan, PyStan, or CmdStan already installed.


We’re going to focus less on the math and more on the usage of the Stan language. We’ll cover:

• Stan language blocks

• Data types

• Sampling statements

• Vectorization


On deck this week

Mon: A linguist has a question about sampling when the goal is causal inference from observational data

Tues: The Ben Geen case: Did a naive interpretation of a cluster of cases send an innocent nurse to prison until 2035?

Wed: Statistics and data science, again

Thurs: The health policy innovation center: how best to move from pilot studies to large-scale practice?

Fri: The “scientific surprise” two-step

Sat, Sun: As Chris Hedges would say: It’s too hot for words

Stan 2.4, New and Improved

We’re happy to announce that all three interfaces (CmdStan, PyStan, and RStan) are up and ready to go for Stan 2.4. As usual, you can find full instructions for installation on the Stan website, http://mc-stan.org/.

Here are the release notes with a list of what’s new and improved:

New Features
------------
* L-BFGS optimization (now the default)
* completed higher-order autodiff (added all probability functions,
  matrix functions, and matrix operators);  tested up to 3rd order
* enhanced effective range of normal_cdf to prevent underflow/overflow
* added von Mises RNG
* added ability to use scalars in all element-wise operations
* allow matrix division for mixing scalars and matrices 
* vectorization of outcome variates in multivariate normal with efficiency boosts
* generalization of multivariate normal to allow row vectors as means

Reorganization
--------------
* move bin/print and bin/stanc to CmdStan;  no longer generating main
  when compiling model from Stan C++

New Developer
-------------
* Added Jeffrey Arnold as core Stan developer
* Added Mitzi Morris as core Stan developer

Bug Fixes
---------
* modified error messages so that they're all 1-indexed instead of 0-indexed
* fixed double print out of times in commands
* const added to iterators to allow VS2008 compiles
* fix boundary conditions on ordered tests
* fix for pow as ^ syntax to catch illegal use of vectors (which
  aren't supported)
* allow zero-length inputs to multi_normal and multi_student_t
  with appropriate log prob (i.e., 0)
* fixed bug in inverse-Wishart RNG to match MCMCPack results
  with slightly asymmetric inputs
* fixed problem with compiling user-defined function twice
* fixed problem with int-only parameters for user-defined functions
* fixed NaN init problems for user-defined functions
* added check that user variable doesn't conflict with user function + doc
* disallow void argument types in user-defined functions

Code Cleanup and Efficiency Improvements
----------------------------------------
* removed main() from models generated from C++ Stan (they are
  now available only in CmdStan); removed no_main command options
* reserve vector sizes for saving for sample recorder
* removing many instances of std::cout from API (more still to go)
* removed non-functional Nesterov optimization option
* optimization code refactoring for testing ease
* better constant handling in von Mises distribution
* removed tabs from all source files
* massive re-org of testing to remove redundant files and allow
  traits-based specializations, plus fixed for 1-indexing

Testing
-------
* added tests for log_softmax, multiply_lower_tri_self_transpose, tcrossprod
* break out function signature tests into individual files, add many
  more
* enhanced cholesky factor tests for round trip transforms and
  derivatives
* extensive unit testing added for optimization 
* remove use of std::cout in all tests

Example Models
--------------
* lots of cleanup in links and models in ARM examples
* added BUGS litter example with more stable priors than in the 
  BUGS version (the original model doesn't fit well in BUGS as is, 
  either)

Documentation
-------------
* add infix operators to manual
* categorical_logit sampling statement	
* Cholesky factor with unit diagonal transform example
* example of using linear regression for prediction/forecasting with
  notes
* clarified some relations of naive Bayes to clustering
  vs. classification and relation to non-identifiability
* new advice on multivariate priors for quad_form_diag
* fix typo in multiply_lower_tri_self_transpose (thanks to Alexey Stukalov)
* fix formatting of reserved names in manual
* fixed typo and clarified effective sample size doc