Why are functional programming languages so popular in the programming languages community?

Matthijs Vákár writes:

Re the popularity of functional programming and Church-style languages in the programming languages community: there is a strong sentiment in that community that functional programming provides important high-level primitives that make it easy to write correct programs. Functional code tends to be short and, thanks to referential transparency, easy to reason about, which makes it quick to review and maintain. This becomes even more important in domains where correctness is non-trivial, such as in the presence of probabilistic choice and conditioning (or concurrency).

The sense – I think – is that for the vast majority of your code, correctness and programmer-time are more important considerations than run-time efficiency. A typical workflow should consist of quickly writing an obviously correct but probably inefficient program, profiling it to locate the bottlenecks and finally – but only then – thinking hard about how to use computational effects like mutable state and concurrency to optimise those bottlenecks without sacrificing correctness.
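
A minimal sketch of that workflow (in Python, with invented names, just to make the idea concrete): start from an obviously correct pure function, and only after profiling identifies it as a bottleneck swap in an optimised version — referential transparency makes checking that the two agree trivial.

```python
from functools import lru_cache

def fib_spec(n):
    """Obviously correct reference implementation (exponential time)."""
    return n if n < 2 else fib_spec(n - 1) + fib_spec(n - 2)

@lru_cache(maxsize=None)
def fib_fast(n):
    """Same function after 'optimising the bottleneck': memoised, linear time."""
    return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)

# Because both are pure functions of their input, correctness of the
# optimised version reduces to checking agreement with the specification:
assert all(fib_spec(n) == fib_fast(n) for n in range(20))
```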

There is also a sense that compilers will get smarter with time, which should result in pretty fast functional code, allowing the programmer to think more about the what rather than the how. I think you could also see the lack of concern for the performance of inference algorithms in this light. This community is primarily concerned with correctness over efficiency. I'm sure they will get to more efficient inference algorithms eventually. (Don't forget that they are not statisticians by training, so it may take some time for knowledge about inference algorithms to percolate into the PL community.)

Justified or not, there is a real conviction in the programming languages community that functional ideas will become more and more important in mainstream programming. People will point to the increase of functional features in Scala, Java, C# and even C++. Also, people note that OCaml and Haskell these days are getting to the point where they are really quite fast. Jane Street would not be using OCaml if it weren’t competitive. In fact, one of the reasons they use it – as I understand it – is that their code involves a lot of maths which functional languages lend themselves well to writing. Presumably, this is also part of the motivation of using functional programming for prob prog? In a way, functional programming feels closer to maths.

Another reason I think people care about language designs in the style of Church is that they make it easy to write certain programs that are hard in Stan. For instance, perhaps you care about some likelihood-free model, like a model where some probabilistic choices are made, then a complicated deterministic program is run and only then the results are observed (conditioned on). An example that we have here in Oxford is a simulation of the standard model of particle physics. This effectively involves pushing forward your priors through some complicated simulator and then observing. It would be difficult to write down a likelihood for this model. Relatedly, when I was talking to Erik Meijer, I got the sense that Facebook wants to integrate prob prog into existing large software systems. The resulting models would not necessarily have a smooth likelihood wrt the Lebesgue measure. I am told that in some cases inference in these models is quite possible with rejection sampling or importance sampling or some methods that you might consider savage. The models might not necessarily be very hard statistically, but they are hard computationally in the sense that they do involve running a non-trivial program. (Note though that this has less to do with functional programming than it does with combining probabilistic constructs with more general programming constructs!)
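
The shape of that kind of likelihood-free inference can be sketched in a few lines of Python (the toy simulator and all names here are invented; a real simulator would be the complicated deterministic program described above): push prior draws through the black box and keep those whose output lands near the observation — plain rejection sampling, one of the methods "you might consider savage".

```python
import random

def simulator(theta, rng):
    # Stand-in for a complicated black-box program: we can only
    # sample outputs from it, never write down a likelihood.
    return theta + rng.gauss(0.0, 1.0)

def rejection_abc(observed, n_draws, tol, seed=0):
    """Keep prior draws whose simulated output falls within tol of the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-10.0, 10.0)          # draw from the prior
        if abs(simulator(theta, rng) - observed) < tol:
            accepted.append(theta)                # condition on the output
    return accepted

# The accepted draws approximate the posterior over theta.
post = rejection_abc(observed=3.0, n_draws=20_000, tol=0.5)
```

The acceptance step is the only place the data enters, which is why this works even when the model is "hard computationally" — at the cost of potentially very low acceptance rates.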

A different example that people often bring up is given by models where observations might be of variable length, like in language models. Of course, you could do it in Stan with padding, but some people don’t seem to like this.

Of course, there is the question of whether all of this effort is wasted if inference eventually turns out to be completely intractable in all but the simplest models. What I would hope eventually to see – and many people in the PL community with me – is a high-level language in which you can do serious programming (as in OCaml) and have access to serious inference capabilities (as in Stan). Ideally, the compiler should decide/give hints as to which inference algorithms to try — use NUTS when you can, but otherwise back off and try something else. And there should be plenty of diagnostics to figure out when to distrust your results.

Finally, though, something I have to note is that programming language people are the folks writing compilers. And compilers are basically the one thing that functional languages do best because of their good support for user-defined data structures like trees and recursing over such data structures using pattern matching. Obviously, therefore, programming language folks are going to favour functional languages. Similarly, because of their rich type systems and high-level abstractions like higher-order functions, polymorphism and abstract data types, functional languages serve as great hosts for DSLs like PPLs. They make it super easy for the implementer to write a PPL even if they are a single academic and do not have the team required to write a C++ project.

The question now is whether they also genuinely make things easier for the end user. I believe they ultimately have the potential to do so, provided that you have a good optimising compiler, especially if you are trying to write a complicated mathematical program.

I replied:

On one hand, people are setting up models in Church etc. that they may never be able to fit—it’s not just that Church etc are too slow to fit these models, it’s that they’re unidentified or just don’t make sense or have never really been thought out. (My problem here is not with the programming language called Church; I’m just here concerned about the difficulty with fitting any model correctly the first time, hence the need for flexible tools for model fitting and exploration.)

But, on the other hand, people are doing inference and making decisions all the time, using crude regressions or t-tests or whatever.

To me, it’s close to meaningless for someone to say they can write “an obviously correct but probably inefficient program” if the program is fitting a model that you can’t really fit!

Matthijs responded:

When I say “write an obviously correct but probably inefficient program”, I am talking about ordinary functional programming in a deterministic setting. I think many programming language (PL) folks do not fully appreciate what correctness for a probabilistic program means. I think they are happy when they can prove that their program asymptotically does the right thing. I do not think they quite realise yet that that gives no guarantees about what to think about your results when you run your program for a finite amount of time. It sounds silly, but I think at least part of the community has yet to realise that. Personally, I expect that people will wake up eventually and realise that correctness in a probabilistic setting is a much hairier beast. In particular, I expect that the PL community will realise that run-time diagnostics and model checking are the things they should have been doing all along. I am hopeful, though, that at some point enough knowledge of statistics will trickle through to them that genuinely useful collaboration is possible. I strongly feel that there is mostly just a lot of confusion rather than wilful disregard.

16 Comments

  1. a reader says:

    In my experience, computational statistics (or more generally, numerical analysis) is a bit of a special case when it comes to general programming. This is because we very frequently need to sacrifice readability for efficiency; if we don’t, our programs will never finish on even moderately sized datasets. Then we wrap everything up in an easy-to-use interface, and maybe 10 people in the world ever parse our internal code.

    On the other hand, if you’re working on a huge software project with a very large number of collaborators, readability is usually much more important than efficiency.

    • It’s not just comp stats or numerical analysis. In everything from operating systems to text editors, a small core of the code needs to be highly performant and the rest is largely a wrapper. The performant parts of a code base are always tricky to mesh with the rest. Although readability and efficiency are fundamentally a tradeoff, it’s not a zero-sum game. Breaking things out into comprehensible functional units, choosing good names, etc., can all help make even highly performant code readable. I didn’t mention documentation. Although it can help, I’m with a reader here that it’s the code itself that needs to be readable.

  2. The bit about likelihood-free inference brings up an interesting issue. I’ve got some projects where something like an agent-based model would be a meaningful thing to compute with. What we’d like to do is run our agent-based model forward in time, observe the spatial distribution of the agents and their types etc., and compare that to images of real-world stuff.

    Pretty obviously I’m never going to do that in Stan. Not that it’s theoretically impossible, just totally impractical. But furthermore, ABMs naturally tend to have random choices, an agent sees its environment at time t and makes a decision to do x, y, or z at random. So the detailed outcome is *not* deterministic, but some higher level functions, such as spatial averages or other kinds of summaries are nevertheless reasonably stable (Total percentage of Blue agents, or fraction of agents in some region that are either Red or Green for example).
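
    A deliberately minimal sketch of that stability (in Python; the model, the colours, and all parameters are invented, and real ABMs would have interacting agents): individual outcomes are random, yet a population-level summary barely varies across runs.

```python
import random

def run_abm(n_agents, n_steps, p_blue, seed):
    """Toy ABM: each step, every agent independently re-chooses its
    type at random (no interactions -- just enough to show the point)."""
    rng = random.Random(seed)
    agents = ["Blue" if rng.random() < p_blue else "Red"
              for _ in range(n_agents)]
    for _ in range(n_steps):
        agents = ["Blue" if rng.random() < p_blue else "Red"
                  for _ in agents]
    return agents

# The detailed outcome (which agent ends up which colour) differs run
# to run, but a summary like the Blue fraction is stable across seeds:
fracs = [run_abm(10_000, 5, 0.3, seed=s).count("Blue") / 10_000
         for s in range(5)]
```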

    So, with all that being said, I’d still like to take advantage of some sort of pseudo-Hamiltonian Monte Carlo to make sure that I efficiently explore the parameter space, even though there is no real smooth derivative. The same is really true for a physical system though. You might have a reacting flow in a pipe, but the molecules of the pipe itself are adding or removing heat, perturbing the system with a random external forcing function. This doesn’t invalidate Newton’s laws, though the system is no longer really Hamiltonian (energy isn’t conserved in the molecules of the fluid for example).

    Anyway I had some ideas about how this could be done, and I did a little bit of experimentation last year, and had some success on low dimensional sample problems, but I haven’t been able to continue with that. I’ve got a collaborator who is working on a grant related to this, so perhaps there will be some chance for further development…

    That all being said, Andrew, have you had discussions in the Stan group about what people should be doing when they have models of this type? Either black-box computational simulations, or simulations that are naturally stochastic in nature?

    • a reader says:

      The approach you describe sounds like Approximate Bayesian Computation, which is its own subfield.

    • Rahul says:

      @Daniel:

      In chemistry there’s the Gillespie algorithm to study reactions in a stochastic sense. But I wonder if there’s any low hanging fruit there to make it better? Have you worked on this?

      https://en.wikipedia.org/wiki/Gillespie_algorithm

      • I haven’t worked with that algorithm, but it seems like a straightforward idea. It’s related to what I’m talking about here because the ABM I’m thinking of similarly has discrete-time random decisions involved, including things like when the cell divides, when the cell dies, and when the cell changes type (say, from some kind of stem cell to a kind of muscle or bone or blood cell). That’s why there’s no sense in which a proper gradient is possible, and yet I think ideas from Hamiltonian Monte Carlo could still be useful.

  3. Daniel H says:

    I stick to the following pattern:
    – I solve most everyday problems with simple descriptive statistics
    – when that won’t do the job (e.g. effect sizes need to be estimated) I go for regression
    – when that doesn’t help (e.g. large, nested data) I go for Bayesian inference (started using rstanarm recently)
    … So maybe probabilistic programming is the thing someone (probably not me) uses when Stan won’t do the job?
    It would definitely fit the hierarchy of
    counting << linear regression << Stan << functional programming
    regarding computational and practical effort.

  4. Noah Motion says:

    …which inference algorithms to try — use NUTS when you can…

    This is almost certainly a losing (or already lost) battle, but referring to algorithms like NUTS as “inference algorithms” seems wrong to me. Statistical inference is a much higher-level process than is sampling from a posterior, and statistical inferences can be drawn entirely sans Bayes.

    It reminds me of my frustration when linguists use “linear” when “sequential” would have done just fine. It’s frustrating, since we’re trying to communicate technical concepts, and “linear” has a precise mathematical definition already (and this definition is not what linguists typically mean when they use it).

    There’s probably no putting either genie back in the bottle…

    • I see your point, but what’s your suggested terminology for “algorithm to compute some usable representation of the posterior distribution”? “Posterior sampling algorithm” might be a possibility, but it excludes algorithms like belief propagation for discrete models.

  5. Not to look a gift horse in the mouth, but I think there were two mistakes made in creating the Stan language, both of which revolve around imperative vs. declarative (which includes functional) programming:

    1) Writing the compiler in C++, which is not really a good language for writing sophisticated compilers. Right now you can’t really write a model in its clearest, most natural form; you have to transform it in various ways to get decent performance, and this causes three problems:

    * the transformed model is difficult to read and understand;
    * there’s always the fear that you may have made a mistake somewhere along the way, and your transformed model may not be equivalent to the original one;
    * it’s very time-consuming to make all these transformations and test that you didn’t make a mistake somewhere, which means that even a small change to the model can be pretty time-consuming to fully implement.

    There’s a lot that a compiler could do to automate many of these transformations, but C++ isn’t a great language for symbolic manipulation.

    2) The second mistake is introducing non-declarative (imperative) constructs into the language. Imperative constructs have complex semantics (e.g., I don’t think anyone has ever succeeded in giving a complete formal specification of the semantics of C), and this causes problems if you want to write an optimizing compiler that is going to do a lot of transformations on the source program. I understand the need for flexibility in defining the joint probability distribution, but this does not require non-declarative constructs.

    I’m currently writing a time-series model compiler that takes a high-level description of a Bayesian state-space model and compiles it to Stan, performing a variety of transformations and optimizations along the way. For the reasons given above, it is implemented in the functional programming language Haskell, and the source language itself is 100% declarative. Much of the work I’m doing would, I believe, be transferable to an effort to implement in Haskell a compiler for a purely declarative variant of the Stan language.

    • Andrew says:

      Kevin:

      Can you give an example of a transformed model that is difficult to read and understand? Thanks.

      • I think it already is less readable when you do

        real x;

        x ~ normal(35,18);

        and because the sd is 18 you need to rescale things so you do:

        real xx;
        xx ~ normal(0,1);

        real x;
        x = xx*18+35;

        This is the non-centered parameterization, a.k.a. the “Matt trick”. In actual Stan, these different components go in different blocks (parameters, transformed parameters, model), and their purpose is kinda opaque without extensive comments.
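
        The two parameterizations define the same distribution for x, which is easy to check numerically (a Python sketch; the constants 35 and 18 come from the example above):

```python
import random
import statistics

rng = random.Random(1)
N = 50_000

# Centered: draw x directly from normal(35, 18).
x_centered = [rng.gauss(35, 18) for _ in range(N)]

# Non-centered: draw a standard normal, then shift and scale.
x_noncentered = [18 * rng.gauss(0, 1) + 35 for _ in range(N)]

# Same location and scale, up to Monte Carlo error.
for xs in (x_centered, x_noncentered):
    assert abs(statistics.mean(xs) - 35) < 0.5
    assert abs(statistics.stdev(xs) - 18) < 0.5
```

        The transformation only changes which quantity the sampler sees as a parameter, which is what matters for the geometry of the posterior.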

        • Andrew says:

          Daniel:

          Fixing that is on the agenda, in three ways: first, we’re getting rid of the need for separate blocks; second, we’re allowing declaration of x and a probability distribution for x on the same line (e.g., real x ~ normal(a, b)); third, we’ll be implementing location-scale transformations (see here).

          To put it another way: the things that bother you, bother us too!

        • The non-centered parameterization (aka Matt trick) was developed for the case when the location and scale are parameters to remove the prior dependencies. It also helps with Stan’s adaptation.

          Blockless Stan is a long way off and it’s not certain we’ll do it. But we’ll certainly do something that lets you import submodules that cross block boundaries. We might allow the compound declaration/sampling statement.

          What we should have for 2.19 is this:

          real<offset = 35, multiplier = 18> x;
          ...
          x ~ normal(35, 18);
          

          which will have the same effect as the hand-written linear transform. We haven’t 100% settled on names. loc/scale or location/scale or offset/multiplier seem natural. For naming, we’d like to be able to deal with the vector case where the offset is a vector and the multiplier is a Cholesky factor. So we can’t really use intercept/slope.
