From the Stan users list:

I’m trying to fit a time series that can have 0 or 1 change points, using the sample model from Ch 11.2 of the manual.

To determine whether the time series has a change-point or not, would I need to do a model comparison (via loo) between the 1-change model I developed and a model with no change point? Or would s being close to 1 be sufficient?

I replied:

You’re gonna hate this, but in general I don’t like models where you try to see if there is a change point. Instead I recommend allowing the size of the change to be a continuous parameter, so that small or zero changes are part of your model. Problem solved!

And then Bob followed up with a question to me:

I do hate it because it shows I don’t know where you want to apply model comparisons and where you don’t. There are chapters in BDA3 that cover LOO, chapters covering posterior predictive checks, and parts recommending continuous model expansion. I just can’t reconcile it all.

I think the overall suggestion is the same as mine: fit the change point model and see what the fit looks like. But do you then simplify the model if the change point isn’t doing anything?

How do you decide?

Bob is right that my philosophy (or, perhaps I should say, my practice) is incoherent. I don’t like model choice, model comparison, or model averaging—but then in practice I *do* choose models, I *do* have to compare the models I’m considering, and, much as I dislike model averaging, it’s gotta be better than choosing just one of several candidate models.

Another thing I often say is that I prefer continuous model expansion to discrete model averaging. But where do you draw the line? Any continuous class of models can be discretized, and then what do you do? So I think all I can really recommend is to go continuous as much as possible.

And it does turn out that many problems which are conventionally framed as discrete choices can instead be written continuously. An example is change point models.

I answered Bob’s question as follows:

There are times when we fit unrelated models, or we have models coming from different sources, and predictive model comparison is one way to evaluate these. Especially the new form where we get the contribution of loo (or waic) for each data point.

But if you have a model (in this case, the change-point model in which the size of the change is an unknown parameter) which includes the smaller model (in this case, the no-change-point model, which is theta=0) as a special case, I’d rather just fit the big model. If you want to say, hey, let’s use the smaller model because it’s simpler and easier to work with, then, sure, I do think that loo or waic can be a way to make this decision. But I definitely wouldn’t frame it as “To determine if the time series has a change-point or not.” The time series, whatever it is, has a change point at every time. The question might be, “Is a change point necessary to model these data?” That’s a question I could get behind.

Bob replied:

I never understand specifically what Andrew’s suggesting.

I think the takeaway on model-non-comparison was this:

If you have a model (in this case, the change-point model in which the size of the change is an unknown parameter) which includes the smaller model (in this case, the no-change-point model, which is theta=0) as a special case, I’d rather just fit the big model.

My guess is that he has a whole suite of models of varying granularity in mind. Because he goes on to say

The time series, whatever it is, has a change point at every time.

This implies to me that Andrew wants to unask the question of whether the sequence has 0, 1, or T-1 change points. By definition a sequence of T observations has T-1 places where a change can occur. So it’s just a question of what you want to model.

But it’s not a simple modeling change.

The no change-point model is easy

    for (t in 1:T) y[t] ~ normal(mu, sigma);

Pardon the non-vectorization; it’s just trying to make it unambiguous there’s a single mu and sigma and T different y.

To allow one change point, you need to marginalize it out as in the manual. Then it’s a question of how you model where the change occurs: for a proper prior on a latent continuous change point, you need the CDF differences (for the same reason as in our last discussion here on CDF diffs for rounding). Or you can just hack it up as uniform, as I did in the example in the manual where the times are evenly spaced.

But when you allow T-1 change points, all of a sudden you’re back to something like the first model, with a time series

    for (t in 1:T) y[t] ~ normal(mu[t], sigma);

Assuming no stochastic volatility (i.e., a fixed sigma). Then you need a prior for mu, probably one based on autoregression.
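To make the one-change-point case concrete, here is a minimal sketch of the marginalization Bob describes, written in plain Python rather than Stan so it can be run directly. It assumes a uniform prior over the change location s, normal errors, and fixed values for the two means and sigma (in a real fit those would be parameters); the function names are illustrative and not from the manual.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

def changepoint_posterior(y, mu_before, mu_after, sigma):
    """Marginalize a single change point s in 1..T under a uniform prior.

    Observations with t < s use mu_before, the rest use mu_after.
    Returns (log marginal likelihood, posterior probabilities over s).
    """
    T = len(y)
    lp = []
    for s in range(1, T + 1):
        total = -math.log(T)  # uniform prior on the change location
        for t, yt in enumerate(y, start=1):
            total += normal_lpdf(yt, mu_before if t < s else mu_after, sigma)
        lp.append(total)
    log_marginal = log_sum_exp(lp)
    return log_marginal, [math.exp(v - log_marginal) for v in lp]

# Toy series (made up for illustration) with an obvious jump before the 4th point.
y = [0.1, -0.2, 0.0, 3.1, 2.9, 3.0]
log_marginal, post = changepoint_posterior(y, mu_before=0.0, mu_after=3.0, sigma=1.0)
best_s = 1 + post.index(max(post))
```

The log marginal likelihood is what a sampler would add to the target; the per-location probabilities are a by-product that can be reported afterwards.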

And I clarified:

There are 2 issues here.

First is the distribution of change points over time. As Bob notes, I prefer to avoid change points entirely as I think that every moment has its own changes. Something that looks roughly like discrete change points can be constructed using continuous change points with a long-tailed distribution of the amount of change.

The second issue is that, if we do have a model with a discrete change point at just one time, we can avoid the question, “Is there a change point?” by saying that there definitely _is_ a change point of unknown magnitude.

It was the second point that I was trying to make in my response to that earlier question. I think of this second model as a bit of a compromise. Yes, it assumes a particular change point. But it is still a continuous-parameter model, and I like that.

In the fields you’re discussing, is there ever a way to decide on the number and kind of change point(s) based on some sort of theory or model of what they _mean_? (For example, a time series of political poll results would reasonably include change points (of unknown magnitude) at every debate, and other important events depending on how granular you want to get.) Or does this discussion only tend to come up when you don’t have that kind of context for your time series?

Also, and sorry if this is a basic question, but does “change point” here refer to a sudden jump in, say, mu? Or could it refer to a change in something like the slope?

Thanks!

Hey, this gives me an opportunity to plug a talk I gave a couple of years ago: It’s on modeling electric load time series using Stan but the principles apply to modeling other time series too. I do indeed have something that is the equivalent of mu[t] for every data point…indeed, I don’t just have more parameters than data points, I have _far_ more parameters than data points. But with a good prior it all works out. (I love this stuff and am trying to find support to do more work in this area.)

But as for Andrew’s “You’re gonna hate this, but in general I don’t like models where you try to see if there is a change point”, I do hate that but not for the reason Andrew expects. As my talk illustrates, I have no problem with the concept that the time series changes at every point, and indeed, I embrace the opportunity to simply go ahead and model that change. But I _still_ want the ability to model a discrete change as a separate thing, because sometimes that is exactly what happens: there is a step change and I want to include it or detect it. In the context of electric load time series, for example, it sometimes happens that a building operator will change the startup time for the building, by which I mean the time in the morning at which the thermostat set points are automatically changed. Or they’ll change the exterior lighting to turn on at 7pm instead of 6 pm. Or there is some big piece of machinery that is sometimes on and sometimes off.

I share Andrew’s belief that time series models shouldn’t usually be stationary models with occasional discrete changes. That is a very common type of model, and I’m glad we can do better; most people should do better. But we _also_ need the ability to have discrete changes. We just do.

Phil, I’ve done some work with electricity for a condo association, and your posting brings back fond memories. Can’t wait to watch the presentation all the way through. (I was only modeling monthly usage/demand so a much simpler problem overall. Still, things like a building’s balance point and considering the effect on a month’s bill of turning on the cooling towers a day too early, blowing up the previous month’s demand: that was all fun stuff. And trying to allocate usages across four buildings with only three bills. It was all fun.)

Well, I hope you enjoy the talk.

To me it is just a fantastic advertisement for Stan. I wrote down the model I wanted to fit; I translated it into Stan in very straightforward fashion; and it ran and generated reasonable results and the cross-validation worked out great. Just a few years ago it would have taken weeks and weeks to fit that model.

Just came back to confirm that I do like the talk a lot. It’s entertaining, and you also seemed to find the right balance for your audience: not too statisticsy, not dumbed down, addressed subject matter well. This looks a lot like a simpler model I made using a State Space approach with R’s `dlm` package. Are you familiar with that approach?

Phil, You can read a nice vignette explaining `dlm` at: https://cran.r-project.org/web/packages/dlm/vignettes/dlm.pdf . It’s a very nice method for creating time series models by joining together various partial models: random walk with trend + quarterly seasonal + …. Last I used it, prediction could get complicated, particularly if you had certain things that changed over time. (The bottom line being you do need to understand how State Space models and their matrices work, even though `dlm` makes it easy to specify them.)

I am not familiar with this but it looks quite interesting. I will first try looking at some examples (if I can find any) and may try adapting some of my models and see what results I get. Thanks for pointing this out.

@Phil: “where you *try to see if* there’s a change point” is different from “you know there are change-points you just don’t know when they occur or how big they are”

In other words, if your model of the world is that sometimes you know some stuff happens that causes a discrete change, then you’re in the world of that model and you need to find out what the evidence is for various possibilities for changes… that sounds like your case, operators go in and monkey with things every few months, there’s a known mechanism.

but if you’re in a situation where you have little idea of what the underlying model should be, then trying to “check to see if there’s a change point” is probably the wrong approach to modeling, I mean you might as well also put in “check to see if crime bosses moved into the region” or “check to see if rainfall has an effect” or “see if telephone congestion occurs at the same time” or “did Nintendo put out a new game console that month” or whatever… just throwing spaghetti at the wall and seeing what the shape is when it sticks…

As long as you’re basically looking for a flexible model class unmotivated by specific theory that can handle lots of stuff, you might as well stick with a pretty general one, like gaussian processes or something.

Daniel: You say “In other words, if your model of the world is that sometimes you know some stuff happens that causes a discrete change, then you’re in the world of that model and you need to find out what the evidence is for various possibilities for changes… that sounds like your case, operators go in and monkey with things every few months, there’s a known mechanism.

but if you’re in a situation where you have little idea of what the underlying model should be, then trying to “check to see if there’s a change point” is probably the wrong approach to modeling.”

I say:

Yes, I agree with all of this. I will elaborate on the theme a bit.

I think that in most time series there are different mechanisms that cause things to change, and some of these are discrete events and some are not. My fitness level varies, and thus so does my resting heart rate and my maximum heart rate etc., and all of these things vary continuously…but if I have a heart attack that is going to be a discrete change. I think you are often better off modeling those discrete changes as coming from a separate mechanism: they deserve a special place in the model. If I model my cardiac health at time t as being drawn from some distribution around whatever it was at time t-1, I think that’s fine in most cases but if you want to include the risk of a heart attack then you shouldn’t just try to give that distribution a long tail on the bad side so that sometimes there’s a big jump down in cardiac health; instead you should have some probability of a heart attack at time t, and a statistical distribution that controls what the negative effect of such an event would be. Or…well, at the very least I think you should be _able_ to write a model like this. And this is the one big thing I don’t like about Stan, that you can’t really do this. I loooooove Stan otherwise, but the inability to fit discrete models is really unfortunate.

Ah, I see what you’re saying. The lack of discrete parameters in Stan is a definite issue for expressiveness. However, I have successfully used mixture models for this kind of thing, using increment_log_prob and log_sum_exp: something like a t distribution with a fairly high number of degrees of freedom (sort of normal-ish) for the main effect, and then a normal for the “other type of event”.

something like (1-eps) t(15,0,1) + eps normal(-4,0.25) to model something that is either near 0 +- 1 (typical event) or near -4 +- 0.25 (bad event) without TOO much penalty for the intermediate state between say -1 and -4 so that it’s hard to jump that barrier.
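As a rough illustration of that mixture, the log density can be assembled with log_sum_exp as below. This is a Python sketch, not Stan code; the component parameters are the ones quoted in the comment, while the function names and the value of eps used in the check are assumptions chosen for illustration.

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def normal_lpdf(x, mu, sigma):
    """Log density of normal(mu, sigma) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def student_t_lpdf(x, nu, mu, sigma):
    """Log density of a location-scale Student t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def mixture_lpdf(x, eps):
    """log[(1 - eps) * t(15, 0, 1) + eps * normal(-4, 0.25)], for 0 < eps < 1."""
    typical = math.log1p(-eps) + student_t_lpdf(x, 15, 0.0, 1.0)
    bad_event = math.log(eps) + normal_lpdf(x, -4.0, 0.25)
    return log_sum_exp(typical, bad_event)

# Crude Riemann-sum check that the mixture is a proper density (eps = 0.05 is arbitrary).
step = 0.01
area = sum(math.exp(mixture_lpdf(-40 + i * step, 0.05)) * step for i in range(8001))
```

Working on the log scale matters here because the two components differ by many orders of magnitude away from their modes.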

The other thing I think you can do is a basis expansion in a basis that includes functions like -C/(1+exp(-(x-k)/a)) (i.e., a shifted, rescaled sigmoid), putting priors on the location of the shift (k), the size of the shift (C), and the quickness of the shift (a)…
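A small Python sketch of that basis function, assuming a > 0; the name smooth_step and the example values are made up for illustration. It evaluates the logistic in a way that avoids overflow for large arguments.

```python
import math

def smooth_step(x, C, k, a):
    """Evaluate -C / (1 + exp(-(x - k) / a)) for a > 0.

    The value is near 0 well before x = k, exactly -C/2 at x = k,
    and near -C well after x = k; smaller a gives a sharper transition.
    """
    z = (x - k) / a
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)  # avoids overflow when z is very negative
        s = e / (1.0 + e)
    return -C * s

# A drop of size 2 centred at x = 10 with a fairly quick transition.
curve = [smooth_step(x, C=2.0, k=10.0, a=0.1) for x in range(21)]
```

As a shrinks toward zero the curve approaches a true step function, which is the sense in which a continuous parameterization can stand in for a discrete change.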

There’s something to be said for approximating things as step functions, and there’s also something to be said for approximating things as infinitely smooth approximations to step functions :-)

getting outside the rigidity of your model and thinking about approximations to your model can be helpful.

Phil:

To link this up with our previous blog post . . . can you make your example into a Stan case study?

> But we _also_ need the ability to have discrete changes. We just do.

Agree, but may not necessarily mean we are not better off modeling it with a continuous model…

(Vague, I know, but it is something I am thinking about.)

I get the feeling that Andrew (and maybe others) often recommend modelling things as continuous variables, which is fine. But so long as we agree that there are real-world processes of a discrete nature (e.g. my air handler motor is either on or off, et cetera), even if you model it as a continuous change, at some point in the modelling exercise you must produce a mapping function that has binary output.

So you just push the discretization from one part of the model to another.

Daniel’s comment (which initially I missed) speaks to this “getting outside the rigidity of your model and thinking about approximations to your model can be helpful”

My thinking is that we need to distinguish the model (representation) from what it’s trying to represent (model) while being aware that non-literal representations (_taking_ something we are sure is discrete as continuous in some sense for some purposes) can be advantageous.

Or more poetically http://www.leonardcohenforum.com/viewtopic.php?t=20935

Suppose you flip a switch and turn on an air handler. What happens? First, when you flip the switch, at some point two pieces of metal come in contact. Initially they have perhaps a small area of contact and a high resistance, and possibly they dynamically bounce off each other; then after a few microseconds they weld more thoroughly at their minute contact point and the resistance drops to the nominal “zero”. Then current rushes into your motor, expanding a magnetic field and charging a starter capacitor, and the motor begins to turn. While it’s still fairly stationary it tends to pull a lot of current; after a few hundred milliseconds to a second or two it’s up to a steady-state speed and the current draw is down to a normal level. Or, possibly, there are “soft start” circuits in the motor/controller, which make all of this take longer and reduce the inrush currents to reduce the wear on the capacitors and switches and so forth.

In other words, even this thing we think of as “on or off” isn’t really. The choice to model it as on or off is due to our choice of timescale. If we’re the designer of the soft-start circuitry, our timescale of interest is maybe 1 second or 1 millisecond. If we’re the designer of the HVAC system our timescale is maybe 5 minutes to 12 hours or even a whole heating or cooling season.

so releasing ourselves from “it *really is* discrete” can be useful, and it isn’t even necessarily wrong.

Fair enough. But if I am the designer who works on the 5 minute timescale instead of the 1 millisecond time scale I want some way of mapping your continuum to the Yes/No decision which is all my Programmable Logic Controller understands.

So, I’m totally fine with a modeller assuming a continuum so long as his procedure includes the step to map it to my Yes / No decision.

Another complexity is that if, in a 5-second environment you bring in the complexity of 1-millisecond timescale processes you often don’t have the fine scale data to validate the 1-millisecond structure of your model.

i.e. If you model processes at scales not accessible to the modeller you risk introducing model components that cannot really be validated at that scale. So you are down to a black box again.

I don’t advocate reconsidering a continuous model *because* it’s more accurate at the millisecond timescale, I just advocate reconsidering a continuous model because it may be a good approximation, and it isn’t necessarily the case that your dichotomous discontinuous variable is really somehow fundamentally different and can’t be approximated by a continuous model.

In this case, you’re in the context of a Stan model, and Stan can’t handle discrete parameters. So maybe it’s fine to have a continuous parameter and a rapid transition function to map it through. It doesn’t violate some fundamental principle of logic or whatever.

What about the question “Does the trend in this data have a change-point?” (or “Is there a change in trend?”)? That’s the question I usually use change-point models for. If we had T-1 change-points we wouldn’t be modelling a trend.

The question of whether a change-point (change in trend) exists, (or is needed) or how many change-points exist (are needed), can be addressed in a single model by putting a prior distribution on the number of change-points (e.g. k~Bernoulli(0.5)) . This is easy to do with Dave Lunn’s reversible jump MCMC add-on for WinBUGS, which lets you put a prior on the number of knots in a piecewise linear (or even polynomial) spline model. Putting priors on both the number and location of change-points seems like the natural thing to do in the Bayesian context, and avoids the need for separate model selection. The approach also handles more gradual changes in trend by integrating over multiple change-points (numbers and locations), producing smooth posterior mean curves.

Although this is a single model, it can of course also be considered a form of model averaging, since we are integrating over discrete model structures with 0, 1, …, or k change-points. But maybe that just shows that the distinction between model averaging and model expansion is not always clear cut?

If the prior on the location of change-points is discrete (and you don’t allow too many change-points) you can do much the same thing by brute force with standard information criteria…e.g. https://tanytarsus.shinyapps.io/changepoint/ . Of course, unless you’re silly enough to build a shiny app for change-point analysis, it’s better to just use WinBUGS (which doesn’t work on shinyapps Linux servers).

Jim:

It’s my impression that these models can take forever to converge. But in any case I just don’t think these sort of models make sense in any example I’ve ever seen. Instead of putting priors on the number and location of change points, I’d recommend just setting the total number of change points to be reasonably large for the problem (that is, on the upper end of the “k” that you are considering) and then just let Stan estimate the locations and magnitudes of the jumps.

Try it. I’ll bet it will run fast and work just fine on your example.

Hi Andrew,

I’ve never really had convergence problems with trans-dimensional splines using RJMCMC – and it is surprisingly fast: sure you might have to wait a few minutes (or sometimes hours with very large amounts of data – and candidate model structures), but I find any wait is usually worth it.

I agree that questions about ‘whether’ (or how many) changes have occurred often make less sense than questions about when and by how much they changed – or just ‘what is the nature of the trend’, but I find trans-dimensional splines very useful for answering all of those questions (change-points are just a special case).

Anyway, I will try your approach in Stan, but what sort of priors would you suggest for the change-point locations if I want to allow for more than one? I’ve previously tried putting independent uniform priors spanning the full time period on each of (say) 3 location parameters, and had all sorts of problems with convergence (and interpretation) of both location and corresponding magnitude parameters. It seems you need some sort of joint prior for location parameters designated as the 1st, 2nd and 3rd location, where the 2nd location must be after the 1st etc (but you don’t know a priori where any of them are). Do you just restrict the locations to pre-specified periods, or is there a better approach?

You could put in a grid of change points and then a prior with long tails on the size, most of them are “nearly zero” but a few of them could be big.
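One way to read this suggestion, sketched in Python under assumptions of my own: change locations are fixed on a grid, each grid point gets a jump size, and the jump sizes get a heavy-tailed prior (a Cauchy here, which is one possible choice and not one named in the thread). The function names, grid, and scale are hypothetical.

```python
import math

def mean_from_jumps(T, mu0, grid, jumps):
    """Piecewise-constant mean for t = 0..T-1.

    Starts at mu0 and adds jumps[i] from time grid[i] onward.
    """
    ordered = sorted(zip(grid, jumps))
    mu, level, j = [], mu0, 0
    for t in range(T):
        while j < len(ordered) and ordered[j][0] <= t:
            level += ordered[j][1]
            j += 1
        mu.append(level)
    return mu

def cauchy_lpdf(x, scale):
    """Log density of a zero-centred Cauchy with the given scale."""
    return -math.log(math.pi * scale) - math.log1p((x / scale) ** 2)

def jump_log_prior(jumps, scale):
    """Independent heavy-tailed prior on every jump size."""
    return sum(cauchy_lpdf(d, scale) for d in jumps)

# Four candidate change times; three jumps are essentially zero, one is large.
mu = mean_from_jumps(T=10, mu0=1.0, grid=[2, 4, 6, 8], jumps=[0.0, 0.01, -3.0, 0.0])
log_prior = jump_log_prior([0.0, 0.01, -3.0, 0.0], scale=0.1)
```

With a small scale the prior pulls most jumps toward zero, while the heavy tails penalize the occasional large jump far less than a normal prior with the same scale would. Because the locations are fixed, there is no ordering or label-switching problem among location parameters.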

thanks Daniel…that should do it.

Plus x for x a positive real number for your continued emphasis on continuous models over discrete models

Thanks, Phil, for adding your example to the conversation. I’ve been faced with somewhat similar problems, as you know.

I’m working now in an area that faces similar issues. If you’re monitoring a business or production process of some sort, you may want to know if and when a process changes in some sense (at least you might, if you’ve been exposed to the ideas of Shewhart or Deming in the past). Shewhart control charts are an easy and quick tool for estimating when something might have changed in the way the process is running. Yes, the distinction in this case may be less clear than in Phil’s case. A company may have replaced the lighting or HVAC system in Phil’s example; in a process improvement case, the company may have changed the process (a discrete change), or some of the people doing the process may have let it evolve somewhat (more of a continuous change in a process parameter).

It’s on my list to try a mixture model approach to process monitoring instead of a control chart. I don’t think it would be that time-consuming to do; it’s just that something else has always crept ahead of that in priority.

+1 to the idea of Phil writing up his video as a case study (reading text is faster than listening to a video, especially when coming back to it later).

The mixture model approach to process monitoring sounds very interesting. Could you elaborate more on your ideas of how you’d do this?

Shewhart control charts are essentially built on the idea that you are tracking IID normal data from a process (that’s what aggregating into batches and averaging brings you), and you detect new points that are not part of that process by looking for batches that lie outside 2 or 3 SD, runs that seem excessive, or some similar rule. In essence, you’re trying to find when the underlying process changed.
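The basic rule can be sketched in a few lines of Python. This is a simplified illustration: it estimates the centre line and SD directly from a baseline set of batch means, whereas practical control charts typically use other sigma estimates and add run rules. The function name and the numbers are made up.

```python
import statistics

def shewhart_flags(baseline, new_points, n_sd=3.0):
    """Flag new batch means that fall outside centre +/- n_sd * SD of the baseline."""
    center = statistics.mean(baseline)
    sd = statistics.stdev(baseline)  # sample SD of the baseline batch means
    lower, upper = center - n_sd * sd, center + n_sd * sd
    return [not (lower <= x <= upper) for x in new_points]

baseline_means = [10.0, 10.2, 9.8, 10.1, 9.9]
flags = shewhart_flags(baseline_means, [10.05, 12.0])
```

Here the first new point sits inside the 3 SD limits and the second is flagged as a possible change in the process.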

What if you fitted a mixture model to the data and observed when points were assigned to one or another (or another) component? I did something like that a few years ago using R’s mixtools package to detect changes in operation of a building that led to changes in power draw (and Phil reviewed the work, finding both things that were good and things to improve). I later redid it using Stan, and it was pretty straightforward.

That would seem to have a few advantages over Shewhart charts. Besides bringing the Bayesian expressiveness (p(H|D), not p(D|H)), it would seem to show more directly both if a process changed and what it changed to. It would give a graph that showed the probability that any particular data point was from a particular process (plot the components of the simplex). The challenges might be determining how many components to include in the model in a way that was easy to do on a routine basis. That’s one advantage of a Shewhart diagram: processes have been set up so that those doing the work can produce the charts with little outside help.
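The per-point membership probabilities mentioned above can be sketched as follows, in Python, for a normal mixture with known weights, means, and SDs. In a full Bayesian fit these would be averaged over posterior draws; the component values below are invented for illustration and are not from the building analysis.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

def membership_probs(y, weights, means, sds):
    """Probability that observation y came from each mixture component."""
    logs = [math.log(w) + normal_lpdf(y, m, s)
            for w, m, s in zip(weights, means, sds)]
    top = max(logs)  # subtract the max before exponentiating for stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical "chiller off" and "chiller on" load levels, in kW.
weights, means, sds = [0.5, 0.5], [50.0, 120.0], [10.0, 10.0]
probs_by_reading = [membership_probs(kw, weights, means, sds) for kw in [48.0, 85.0, 125.0]]
```

Plotting these probabilities against time is the kind of graph described above: each reading gets a point on the simplex showing which operating state it most plausibly belongs to.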

These days with modern computers the simplicity advantage of Shewhart diagrams seems moot. I’m sure you could code up & black box your idea too so that the line operators could use it as easily as a Shewhart chart.

PS. Can you post a link to the building power draw work you mention?

Rahul, see https://ses.lbl.gov/publications/review-prior-commercial-building. Click on “PDF” near the bottom to see Phil’s evaluation of my analysis; my report is embedded in his.

Thanks!

Off topic, but it’s one tongue twister of a report title:

“Review of Prior Commercial Building Energy Efficiency Retrofit Evaluation”

Couple of questions, perhaps naive:

Why did you throw away the timestamp data from the meter readings? I’d have thought the hour-of-day / day-of-year / weekday-weekend would be important explanatory variables? Alternatively, is there any correlation between the kW used & outdoor temperature that day?

Also, isn’t a key point deconvoluting the effect of improved air handlers vs just external changes? e.g. Say 2012 was an unusually mild year; then you would see a power saving unrelated to the air handler itself. Or say due to a downturn the occupancy of the building was lower, et cetera.

Shouldn’t an incentive calculation focus on teasing out the air handler effect?

Rahul, you’ve got lots of questions, but they’re not naive. Let me have a go at them.

Meter readings vs. time of day: after some EDA (not shown), I realized that the predominant characteristic of the load was its largely bimodal characteristic. In discussions with our utility engineer, I came to understand that the bimodality was largely due to the chillers being on or off. Once a chiller is on for the day, one is loath to turn it off, for big pieces of equipment like that typically have minimum run times and minimum off times. Finer adjustments in temperature are typically taken care of by the VAV (variable air volume) boxes. Certainly air handlers and VAV boxes contributed, but they didn’t dominate. I had seen this in other buildings, so it wasn’t a complete surprise.

I could have used hour of week (as you suggest and as I think Phil used in his video) as a covariate, but I got the idea that such a model would be sensitive to changes in schedule, while a mixture model or HMM would not be.

I didn’t totally discard the timestamps, though; I did plot probabilities of being in each state as a function of time (not shown), which let us pinpoint an inadvertent, incremental, and increasing change in system state that was imperceptible to the building operator at first. They stabilized that as soon as they knew of it.

Doing it this way avoided the common utility question to the customer about their operating schedule (“When do you start things up in the morning?”, etc.). Now we knew their schedule from the data and could, if useful, use that information to help the customer refine their operating schedule.

kW vs. OAT (outdoor air temperature): the engineer wanted a good but fast evaluation of this site. I grabbed the meter data and started to go for temperature data, but she assured me it wasn’t important: OAT didn’t drive consumption in this building. I pushed back; she pushed back. Rinse, repeat, and she reminded me of her schedule. :-) I did the analysis as requested but noted (see pp. 11-12 of the report) that seasonal effects (i.e., temperature) may affect their accuracy.

Phil did do the analysis with temperature, and he found a minimal change in the result compared to an analysis without temperature (<5% difference sticks in my memory, but I don't recall how much less than 5%). Remember that this was a not-to-exceed contract, so it was important to show the customer had saved at least the specified amount; the exact savings were of interest but not as contractually important.

Besides, there are buildings that are externally dominated (i.e., weather plays a key role–your house may fit in that category), and there are those that are internally dominated (think of something industrial that produces lots of heat which has to be removed–that heat load is likely far greater than the load provided by the current weather). All the engineer was claiming is that this building is more internally than externally dominated.

Building occupancy could have been a key factor. As I recall (the project is now 4 years old), we had no reason to believe that occupancy changed significantly over the time period of interest.

So, yes, the incentive calculation should tease out the effects of the project, excluding changes due to weather and the like. We claimed we did a sufficient job at that task.

Incidentally, utilities sometimes do that by metering just the specific piece of changed equipment (the old chiller vs. the new chiller, for example). The complete system in this case consisted of several chillers, air handlers, and a heck of a lot of VAV boxes. We would have had to buy a heck of a lot of meters to cover all of those, and we would have amassed a heck of a lot of data. We believed this system-level approach promised to deliver as good or better insights at significantly less cost.

Does that help?

@Bill

Thanks! That helps a lot!

So, are there cases where you have rejected the contract incentive claim based on the promised savings not being achieved?

Even more interestingly, are there cases where your more sophisticated analysis accepts / rejects where a simplistic, back of the envelope calculation (average of pre years annual kWhr minus post year annual kWhr) would have given a different answer?

i.e. Contractually, why are the bimodality, confidence intervals and other nuances of your detailed analysis relevant? Have you ever declared a contractual saving as not achieved when a simplistic calculation would have let it pass?

Perhaps off topic, but from a Utilities viewpoint how much sense does it make to quantify incentives on aggregate kW-hrs saved?

i.e. Shouldn’t 100 kW shaved off the average load during peak demand hours matter hugely more than a 100 kW saved at night? Why does the district award incentives on total kWhr rather than a more nuanced measure?

Another question: Had someone naively used a 2 component mixture model for both the pre & post periods, how different would have been the conclusions? Did you run a sensitivity analysis of this sort?

Two more good questions. Timing of savings doesn’t make nearly the difference here in the Pacific Northwest (PNW) of the USA as it may elsewhere. Historically, what has mattered to us is total energy, for our resource is the water we’re holding behind a large series of dams, mostly in the Columbia River System. While we do have constraints (to a significant degree determined by river flow rates that are hospitable to our anadromous fish population), we can vary the flow of water through turbines quickly and over a rather broad range. Transmission capacity is likely our major power limitation. Besides, our weather is relatively mild–maritime climate to minimize temperature swings and very little air conditioning load–so the seasonal fluctuations here are likely smaller than in, say, the southeastern USA.

I’m even less of an expert on thermal systems than on hydro systems, but I gather they are more likely to be power- than energy-limited, and thus time of use counts. I do think that time of use will become increasingly important to us in the future.

You might be able to guess that we have focused more on energy than power by looking at the unit of energy we use in the PNW, which others may find confusing. Most people measure power in watts and energy in watt-hours, both with some suitable metric prefix. Our traditional unit of energy is the aMW, or average megawatt: the energy represented by 1 MW of power being drawn for 8760 hours (one year). I don’t think most utility people outside of the PNW even know this unit.
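For readers outside the PNW, the conversion is simple arithmetic. A minimal sketch, assuming a non-leap year of 8760 hours as in the definition above (the function names are illustrative, not a standard API):

```python
HOURS_PER_YEAR = 8760  # non-leap year, matching the definition of the aMW

def amw_to_mwh(amw):
    """Energy in MWh represented by `amw` average megawatts over one year."""
    return amw * HOURS_PER_YEAR

def mwh_to_amw(mwh):
    """Annual energy in MWh expressed in average megawatts."""
    return mwh / HOURS_PER_YEAR

print(amw_to_mwh(1.0))  # 1 aMW expressed in MWh
```

So 1 aMW is 8760 MWh, or 8.76 GWh, of energy over the year.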

Incidentally, we don’t pick our savings approaches in a vacuum, although my use of a mixture model was the first I know of in the region (and is legitimate for a so-called custom project). The Northwest Power Act set up the Northwest Power and Conservation Council to regulate such affairs, and there is a Regional Technical Forum (RTF: http://rtf.nwcouncil.org/) as part of the Council, whose job it is to deal with the technical issues. There’s a history of all this at http://www.nwcouncil.org/history/index/.

Sensitivity analysis regarding the model: I used mixtools to estimate the number of components, and it came up with some 7 or 8, most of them very close together. As we couldn’t figure out any physical reason for most of those, and they made no difference in the calculation, I settled on 2 in the pre and 3 in the post. When Phil was doing his review, I think I looked again and wished I had used 3 components in both cases for ease of explanation, but it didn’t make a significant difference in the result.

Change “Transmission capacity is likely our major power limitation” to “Transmission capacity has traditionally been our major power limitation.”

@Bill

Isn’t peak demand what drives most of power economics? The fact that coal plants are slow to start up and shut down, and that gas turbines etc. make expensive power?

@Rahul: yes, that’s a clearer statement than I made. If you have coal as your base, then it tends to run at a fixed power, and, traditionally, you ramp up by adding gas turbines or the like, all generation that uses more costly fuel. With hydro, assuming you have the water, won’t wash out the fish, and meet other constraints, you just open the valves more.

One idea I’ve had (but have yet to try) is to use the neural network covariance function for change-point modelling. That covariance function models arbitrary jumps close to the origin; to use it for change-point detection, it will be necessary to treat the origin as an unknown parameter and put a flat prior on it.
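A minimal sketch of what this could look like for one-dimensional inputs, assuming the arcsine form of the neural network covariance function from Williams (1998) applied to the augmented input (1, x - origin). The parameter names `origin`, `sigma0`, and `sigma1` and their default values are illustrative assumptions; in a real analysis `origin` would be given a prior and inferred along with the other hyperparameters.

```python
import numpy as np

def nn_cov(x1, x2, origin=0.0, sigma0=1.0, sigma1=10.0):
    """Neural network (arcsine) covariance on 1-D inputs shifted by `origin`.

    Uses Sigma = diag(sigma0**2, sigma1**2) on the augmented input (1, x - origin).
    """
    a = np.asarray(x1, dtype=float) - origin
    b = np.asarray(x2, dtype=float) - origin
    s0, s1 = sigma0 ** 2, sigma1 ** 2
    cross = s0 + s1 * np.outer(a, b)        # x1~' Sigma x2~
    norm_a = 1.0 + 2.0 * (s0 + s1 * a ** 2)  # 1 + 2 x1~' Sigma x1~
    norm_b = 1.0 + 2.0 * (s0 + s1 * b ** 2)  # 1 + 2 x2~' Sigma x2~
    return (2.0 / np.pi) * np.arcsin(2.0 * cross / np.sqrt(np.outer(norm_a, norm_b)))

# Covariance on a grid, with the candidate change point placed at x = 5.
x = np.linspace(0.0, 10.0, 21)
K = nn_cov(x, x, origin=5.0)
print(K[12, 14])  # two points on the same side of the origin (x = 6 and x = 7)
print(K[8, 12])   # two points on opposite sides (x = 4 and x = 6)
```

With a large `sigma1`, points on the same side of `origin` are strongly positively correlated while points on opposite sides are not, which is what lets draws from the process jump near the origin.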

That sounds like a reasonable approach, but it feels like covariance function choice is kind of hocus-pocus for most people. I’d love to see some kind of book with a massive set of short examples of using Gaussian processes by constructing a covariance function that has the properties you want… even if it was kind of cookbook style. It should, however, like a good cookbook, actually have real applied content, not made-up recipes just to illustrate a theoretical point (I rarely if ever bake theoretically pure but tasteless and inedible cakes, scones, or quickbreads).

From an intuition perspective, I think basis expansions are usually easier for people to understand.

So, looking at the following:

http://www.chicagobooth.edu/research/workshops/econometrics/docs/calder-topicsinconvolution.pdf

suggests that Gaussian processes can be thought of as filtered Brownian motion with the covariance function as a convolution kernel. It’s not clear to me how general this idea is, but if it’s general enough, then it seems like this is a powerful constructive method for thinking about Gaussian process models.
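A minimal numerical sketch of the constructive idea, assuming the usual process-convolution setup in which white noise (the increments of Brownian motion) is smoothed by a kernel. Under that assumption the implied covariance is the smoothing kernel convolved with itself, rather than the kernel itself. The Gaussian smoothing kernel, the grid, and the function name are illustrative choices, not taken from the linked notes.

```python
import numpy as np

def convolution_gp_cov(grid, length=1.0):
    """Covariance of f = sqrt(dx) * k @ w, where w is white noise on `grid`.

    k[i, j] = exp(-0.5 * ((grid[i] - grid[j]) / length)**2) is the smoothing
    kernel, so Cov(f) = dx * k @ k.T, a Riemann sum for the kernel self-convolution.
    """
    dx = grid[1] - grid[0]
    diff = grid[:, None] - grid[None, :]
    k = np.exp(-0.5 * (diff / length) ** 2)
    return dx * (k @ k.T)

grid = np.linspace(-10.0, 10.0, 401)  # spacing 0.05; index 200 is x = 0
C = convolution_gp_cov(grid, length=1.0)

# For a Gaussian kernel the self-convolution is available in closed form:
# length * sqrt(pi) * exp(-lag**2 / (4 * length**2)).
lag = grid[220] - grid[200]
print(C[200, 220], np.sqrt(np.pi) * np.exp(-lag ** 2 / 4.0))
```

Choosing the smoothing kernel therefore determines the covariance function, which is what makes the construction useful for building models piece by piece.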

Anyway, if you know of a good applied book on Gaussian processes I’d love to hear about it. I know about the excellent http://www.gaussianprocess.org/gpml/ but it’s more a textbook in the theory than an exploration of how to build models.

I don’t know of an applied GP book that fits the bill, but Aki Vehtari’s demos in his GPstuff toolbox are pretty good. I seem to recall that he’s done a write-up of one of the analyses in which the model (here meaning the covariance function) is built up term by term.

MacKay has some decent material floating around. Some filmed lectures somewhere. Try http://www.inference.phy.cam.ac.uk/mackay/gpB.pdf for a start.

I share Bob’s sentiment about this strange split between PPCs plus continuous model expansion on one side and LOO on the other. I don’t mean that in practice we have to face discrete decisions even though we don’t want to, but that LOO should be treated within a PPC framework. For any 0/1 decision we end up making based on posterior predictive p-values, we can do the same using LOO as the discrepancy measure. Even with a discrete set of models we should do general model checking and assess how well each fits the data, and LOO is one such measure. For decisions that compare these models, we should stick to discrepancy measures that are relevant to the application at hand.