Models with constraints

I had an interesting conversation with Aki about monotonicity constraints. We were discussing a particular set of Gaussian processes that we were fitting to the arsenic well-switching data (the example from the logistic regression chapter in my book with Jennifer) but some more general issues arose that I thought might interest you.

The idea was to fit a model where the response (the logit probability of switching wells) was constrained to be monotonically increasing in your current arsenic level and monotonically decreasing in your current distance to the closest safe well. These constraints seem reasonable enough, but when we actually fit the model we found that doing Bayesian inference with the constraint pulled the estimate, not just toward monotonicity, but to a strong increase (for the increasing relation) or a strong decrease (for the decreasing relation). This makes sense from a statistical standpoint because if you restrict a parameter to be nonnegative, any posterior distribution will end up on the positive half of the line. See section 3 of this paper for an example.
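The pull toward a strong increase can be seen in a toy calculation: truncating a symmetric posterior at zero moves all its mass, and hence its mean, to the positive side. A minimal sketch using SciPy (the Normal(0, 1) "posterior" here is purely illustrative):

```python
from scipy.stats import norm, truncnorm

# Hypothetical unconstrained posterior for a slope: Normal(0, 1), centered at zero.
mu, sigma = 0.0, 1.0

# Restricting the slope to be nonnegative truncates the posterior at 0;
# all the mass that sat on the negative half-line gets pushed above zero.
constrained = truncnorm(a=(0.0 - mu) / sigma, b=float("inf"), loc=mu, scale=sigma)

print(norm(mu, sigma).mean())  # 0.0: no systematic increase without the constraint
print(constrained.mean())      # ~0.798: the constraint alone implies a clearly positive slope
```

Even though the unconstrained estimate is exactly zero, the constrained posterior mean is about 0.8 standard deviations above it.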

Thinking about it more, I’m not always comfortable with any strict constraint unless there is a clear physical reason. For example, yes it seems logical that increasing arsenic would increase the probability of switching but I could imagine that in any particular dataset there could be areas of negative slope. After all, it is observational data and for example there could be a village that happens to have arsenic in a particular high range but where for cultural reasons there would be less switching. Here there’s an omitted variable (“culture” or a village indicator) but the point is that these (hypothetical) data would not really support a strictly monotonic model, and including that restriction could distort things in other ways. This is a general principle, I think. It does not mean we should ignore prior information, of course, but it’s a reason that I prefer soft rather than hard constraints. Alternatively in this example one could put a hard constraint on the monotonicity and then add a latent omitted variable which would have the effect of turning it into a soft constraint, but I don’t usually see people do this.
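One way to make the hard/soft distinction concrete is in the log prior assigned to the slope: a hard constraint gives negative slopes probability zero, while a soft constraint merely penalizes them. A minimal sketch (the logistic penalty and its scale are arbitrary choices for illustration, not anything from the fitted model):

```python
import numpy as np

def hard_log_prior(slope):
    # Hard constraint: zero prior probability (log prob of -inf) for negative slopes.
    return 0.0 if slope >= 0 else -np.inf

def soft_log_prior(slope, scale=0.1):
    # Soft constraint: log sigmoid(slope / scale) penalizes negative slopes
    # smoothly, so mildly negative slopes survive if the data insist.
    return -np.logaddexp(0.0, -slope / scale)

for s in [-0.5, -0.05, 0.0, 0.5]:
    print(s, hard_log_prior(s), soft_log_prior(s))
```

Under the soft version, a slightly negative slope is penalized but not forbidden, which is exactly the behavior the hypothetical "village with an omitted variable" would call for.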

14 thoughts on “Models with constraints”

  1. I’ve been thinking about monotonicity constraints for a while, and I’ve come to believe that they are less useful as a fitting strategy than they might appear at first glance. I distinguish three scenarios:

    (1) monotonicity has very high posterior probability under the unconstrained model;
    (2) non-monotonicity has very high posterior probability under the unconstrained model;
    (3) neither of the above.

    There’s no need to create a constrained fit in the first two cases. In case (1), the constraint is superfluous — the prior+model+data specify monotonicity all on their own. In case (2), a puzzle has been uncovered, which is great from a scientific perspective. What’s called for is not a constrained fit, but rather a lot of thinking about why one’s prior expectation of monotonicity is being defeated.

    In case (3), the constrained fit is worth obtaining. Case (3) would almost surely become one of the previous two cases with more data. Assuming that case (1) would be the eventual fate, the imposition of the constraint essentially adds some portion of the information that more data would provide. But since it’s impossible to be sure that case (2) is not the eventual fate, the constrained fit must be regarded as tentative.

    • But this is true of any model, no? Monotonicity (hard or soft) is a strong model assumption: if it’s ultimately consistent with the data (or true model, depending on how your philosophies lie) then its incorporation will improve inferences, and if it’s not consistent then you have a bad model that you should diagnose and then tweak/generalize/improve.

      I think we’re on the same page; I just wouldn’t say that there’s anything special about monotonicity assumptions relative to any other model assumption.

      • Yup. That’s why I’m inclined toward Bayesian nonparametrics with proven consistency properties. (…in principle; in practice these are usually more trouble than they’re worth.)

      • Is any part of this discussion amenable to validation? That is, on real models, which approach works better, and/or how different are their predictive powers?

  2. There can be a pretty substantial difference in the prior you induce if you impose *strict* monotonicity vs allowing flat parts. This should be even more pronounced if you’re also imposing a lot of smoothness on your function, I imagine.

    As a subject matter nonexpert I think flat/nearly flat pieces and jumpy or even discontinuous functions (at the household level) seem like a possibility here, or really anywhere you’re modeling human decisionmaking. People are weird. These features aren’t necessarily the strong suit of “vanilla” GPs, particularly those with the ever popular squared exponential covariance functions, whether or not you impose monotonicity.
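    For intuition about how much smoothness the squared exponential kernel imposes, one can draw samples from the GP prior; every draw is infinitely differentiable, so flat stretches and jumps are essentially ruled out a priori. A rough sketch (the lengthscale and grid here are arbitrary):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 200)

    def se_cov(x, lengthscale=0.2, sigma=1.0):
        # Squared-exponential covariance matrix over the grid.
        d = x[:, None] - x[None, :]
        return sigma**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

    K = se_cov(x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability
    draws = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    print(draws.shape)  # (3, 200); each row is a very smooth function
    ```

    Plotting the rows makes the point visually: no jumps, no corners, no flat pieces, whether or not a monotonicity constraint is layered on top.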

  3. It’s also important to consider the purpose of the estimated functions. For example, if we include a concavity restriction on our estimated function then we can use standard maximization tools to study it, e.g. interior point methods. If our estimated function is not concave, a maximization program is not readily available.
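    As a hedged illustration of this point: once the fitted function is concave, any off-the-shelf local optimizer finds its global maximum (the quadratic below is just a stand-in for an estimated response surface):

    ```python
    from scipy.optimize import minimize

    # Hypothetical concave estimated function: unique global maximum at x = 2.
    f = lambda x: 3.0 - (x[0] - 2.0) ** 2

    # Concavity guarantees the local maximizer a standard solver finds is global.
    res = minimize(lambda x: -f(x), x0=[0.0])
    print(res.x[0])  # ~2.0
    ```

    Without concavity, the same solver could get stuck at a local bump, and one would need global optimization machinery instead.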

  4. In Stan, is it possible to calculate a “monotonicity score” and then place a prior over this score? For example, suppose you have a function f(x; c) which depends on some parameter vector c, and for any given value of c you can calculate S = sum(Ineg(d/dx f(x_i;c))dx), where Ineg is 1 if the argument is negative and 0 otherwise. Could you then somehow treat S as data and place a stochastic prior distribution over S?

    S ~ exponential(1/10) or something like that?

    A more general question, is it possible to calculate a function of observed data and then place a prior over the observed value of this function?

    I haven’t thought about this at all, so it may be quite obvious to the Stan illuminatus, or it may be something a bit obscure, I’m not sure.
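    For what it’s worth, here is one way the idea might look outside Stan: compute S from finite differences on a grid and add the exponential log density as a soft penalty on the log posterior. (In Stan this would be a `target +=` statement rather than a sampling statement, since S is a deterministic function of the parameters; note also that an indicator-based S has zero gradient almost everywhere, which HMC handles poorly.) All function and parameter choices below are hypothetical:

    ```python
    import numpy as np

    def monotonicity_score(f, xgrid, c):
        # S: number of grid intervals where f(.; c) decreases -- a finite-
        # difference stand-in for integrating the negative-derivative indicator.
        fx = f(xgrid, c)
        return float(np.sum(np.diff(fx) < 0))

    def log_penalty(S, rate=0.1):
        # "S ~ exponential(rate)" treated as a soft penalty on the log posterior.
        return np.log(rate) - rate * S

    f = lambda x, c: np.sin(c * x)          # hypothetical parametric function
    xgrid = np.linspace(0.0, np.pi, 100)
    S = monotonicity_score(f, xgrid, 1.0)   # sin is non-monotonic on [0, pi], so S > 0
    print(S, log_penalty(S))
    ```

    A monotone function gets S = 0 and the largest possible penalty term, so larger rates push the fit toward monotonicity without ever forbidding violations outright.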

  5. I have been working on an inverse “smoothing” problem where there are definite physical constraints on the function: strictly increasing, strictly convex.

    I have tinkered with a Gaussian spatial process, but have not yet implemented the shape constraints.

    I have what may be a naive concern:

    One thing I wondered about was when folks implement shape constraints on GaSPs what function is being constrained? An ensemble, an estimate, a prediction? Or the actual realizations of the GaSP?

    If the constraints are only imposed on a particular predictor, but it still permits realizations of the GaSP to violate the constraint, shouldn’t this bother me from a philosophical perspective?

    I suspect that this philosophical concern would not be limited to fitting GaSPs but to many types of regression as well.

  6. What are your thoughts on cases where I’m really sure something is monotonic, like the probability that a person will die in the next year as a function of their age?

    I guess I’m asking whether you’re making a claim (sort of like the example in ‘Bayesian model building through pure thought’) that imposing constraints in a way that seems natural can lead to unknowingly imposing a prior that has stronger behavior than you intend, or whether you’re really saying “sometimes you think something must be monotonic but it really isn’t, and in that situation you can be led far astray.” Or maybe it’s neither of those.

      • Phil, when stated carefully you can get a truly monotonic function, but as you’ve stated it, it’s either ambiguous or not monotonic.

      “The fraction of people born in X year who will have died prior to today” is monotonic increasing (no more people will be born in year X after the end of year X so they can only die).

      But the probability that a person will die in the next year is not a monotonic function of age. In their first year of life, babies tend to die much more frequently than in their second year of life for example. It’s perfectly possible that after certain life events people’s mortality increases or decreases. So for example after retirement perhaps they commute less and have lower chance of dying in the following year than the previous year. Also there is broken heart syndrome, where after the loss of a spouse there is a period of time when risk of death is higher, but if you survive long enough it could return to a baseline.

      We won’t even get into the distinction between the (Bayesian) “probability” for a given person and the “frequency” over a large population.

      I’m pretty sure you know this, but I’m pointing it out because it’s another example of how tricky it is to specify a model correctly even when you know what you’re doing, and when you make a strong assumption in an incorrectly specified model it will backfire.

  7. A new methodology to incorporate inequality constraints such as monotonicity, convexity, and boundedness into a GP emulator has recently been developed in Maatouk and Bay 2017: https://link.springer.com/article/10.1007/s11004-017-9673-2.

    A new article treating the case of noisy observations has recently been submitted, with a comparison to existing models, in Maatouk 2017, “Finite-dimensional approximation of Gaussian processes with inequality constraints.”
