## Barry Gibb came fourth in a Barry Gibb look alike contest

Every day a little death, in the parlour, in the bed. In the lips and in the eyes. In the curtains in the silver, in the buttons, in the bread, in the murmurs, in the gestures, in the pauses, in the sighs. – Sondheim

The most horrible sound in the world is that of a reviewer asking you to compare your computational method to another, existing method. Like bombing countries in the name of peace, the purity of intent drowns out the voices of our better angels as they whisper: at what cost.

Before the unnecessary drama of that last sentence sends you running back to the still-open browser tab documenting the world’s slow slide into a deeper, danker, more complete darkness than we’ve seen before, I should say that I understand that for most people this isn’t a problem. Most people don’t do research in computational statistics. Most people are happy.

So why does someone asking for a comparison of two methods for allegedly computing the same thing fill me with the sort of dread usually reserved for climbing down the ladder into my basement to discover, by the light of a single, swinging, naked lightbulb, that the evil clown I keep chained in the corner has escaped? Because it’s almost impossible to do well.

I go through all this before you wake up so I can feel happier to be safe again with you

Many many years ago, when I still had all my hair and thought it was impressive when people proved things, I did a PhD in numerical analysis. These all tend to have the same structure:

• survey your chosen area with a simulation study comparing all the existing methods,
• propose a new method that should be marginally better than the existing ones,
• analyse the new method, show that it’s at least not worse than the existing ones (or worse in an interesting way),
• construct a simulation study that shows the superiority of your method on a problem that hopefully doesn’t look too artificial,
• write a long discussion blaming the inconsistencies between the maths and the simulations on “pre-asymptotic artefacts”.

Which is to say, I’ve done my share of simulation studies comparing algorithms.

So what changed? When did I start to get the fear every time someone mentioned comparing algorithms?

Well, I left numerical analysis and moved to statistics and I learnt the one true thing that all people who come to statistics must learn: statistics is hard.

When I used to compare deterministic algorithms it was easy: I would know the correct answer and so I could compare algorithms by comparing the error in their approximate solutions (perhaps taking into account things like how long it took to compute the answer).

But in statistics, the truth is random. Or the truth is a high-dimensional joint distribution that you cannot possibly know. So how can you really compare your algorithms, except possibly by comparing your answer to some sort of “gold standard” method that may or may not work.

Inte ner för ett stup. Inte ner från en bro. Utan från vattentornets topp. (Not down a cliff. Not down from a bridge. But from the top of the water tower.)

[No I don’t speak Swedish, but one of my favourite songwriters/lyricists does. And sometimes I’m just that unbearable. Also the next part of this story takes place in Norway, which is near Sweden but produces worse music (Susanne Sundfør and M2M being notable exceptions)]

The first two statistical things I ever really worked on (in an office overlooking a fjord) were computationally tractable ways of approximating posterior distributions for specific types of models. The first of these was INLA. For those of you who haven’t heard of it, INLA (and its popular R implementation R-INLA) is a method for doing approximate posterior computation for a lot of the sorts of models you can fit in rstanarm and brms. So random effect models, multilevel models, models with splines, and spatial effects.

At the time, Stan didn’t exist (later, it barely existed), so I would describe INLA as being Bayesian inference for people who lacked the ideological purity to wait 14 hours for a poorly mixing BUGS chain to run, instead choosing to spend 14 seconds to get a better “approximate” answer.  These days, Stan exists in earnest and that 14 hours is 20 minutes for small-ish models with only a couple of thousand observations, and the answer that comes out of Stan is probably as good as INLA. And there are plans afoot to make Stan actually solve these models with at least some sense of urgency.

Working on INLA I learnt a new fear: the fear that someone else was going to publish a simulation study comparing INLA with something else without checking with us first.

Now obviously, we wanted people to run their comparisons past us so we could ruthlessly quash any dissent and hopefully exile the poor soul who thought to critique our perfect method to the academic equivalent of a Siberian work camp.

Or, more likely, because comparing statistical models is really hard, and we could usually make the comparison much better by asking some questions about how it was being done.

Sometimes, learning from well-constructed simulation studies how INLA was failing led to improvements in the method.

But nothing could be learned if, for instance, the simulation study was reporting runs from code that wasn’t doing what the authors thought it was. And I don’t want to suggest that bad or unfair comparisons come from malice (for the most part, we’re all quite conscientious and fairly nice), but rather that they happen because comparing statistical algorithms is hard.

And comparing algorithms fairly where you don’t understand them equally well is almost impossible.

Well did you hear the one about Mr Ed? He said I’m this way because of the things I’ve seen

Why am I bringing this up? It’s because of the second statistical thing that I worked on while I was living in sunny Trondheim (in between looking at the fjord and holding onto the sides of buildings for dear life because for 8 months of the year Trondheim is a very pretty mess of icy hills).

During that time, I worked with Finn Lindgren and Håvard “INLA” Rue on computationally efficient approximations to Gaussian random fields (which is what we’re supposed to call Gaussian Processes when the parameter space is more complex than just “time” [*shakes fist at passing cloud*]). Finn (with Håvard and Johan Lindström) had proposed a new method, cannily named the Stochastic Partial Differential Equation (SPDE) method, for exploiting the continuous-space Markov property in higher dimensions. Which all sounds very maths-y, but it isn’t.

The guts of the method says “all of our problems with working computationally with Gaussian random fields come from the fact that the set of all possible functions is too big for a computer to deal with, so we should do something about that”.  The “something” is to replace the continuous function with a piecewise linear one defined over a fairly fine triangulation on the domain of interest.
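To make the “piecewise linear” idea concrete, here is a minimal one-dimensional sketch, where the triangulation collapses to a set of mesh points on a line. It is only an illustration of the basis expansion, not the SPDE method itself: the function name `piecewise_linear`, the mesh and the weights are all made up for this example.

```python
from bisect import bisect_right


def piecewise_linear(mesh, weights, s):
    """Evaluate sum_j weights[j] * phi_j(s), where phi_j is the hat function at mesh[j].

    `mesh` must be sorted and contain at least two distinct points.
    """
    if not mesh[0] <= s <= mesh[-1]:
        raise ValueError("s lies outside the mesh")
    # Index of the cell [mesh[j], mesh[j + 1]] that contains s.
    j = min(bisect_right(mesh, s) - 1, len(mesh) - 2)
    t = (s - mesh[j]) / (mesh[j + 1] - mesh[j])
    # Only the two hat functions attached to this cell are non-zero at s.
    return (1.0 - t) * weights[j] + t * weights[j + 1]


# Hypothetical mesh and weights: the whole function is now described by
# four numbers instead of an infinite-dimensional object.
mesh = [0.0, 0.5, 1.5, 2.0]
weights = [1.0, -0.5, 2.0, 0.0]
print(piecewise_linear(mesh, weights, 1.0))
```

The point of the design is that inference only ever has to deal with the finite vector of weights, which is something a computer can cope with.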

A very exciting paper popped up on arXiv on Monday comparing a fairly exhaustive collection of recent methods for making spatial Gaussian random fields more computationally efficient.

Why am I not cringing in fear? Because if you look at the author list, they have included an author from each of the projects they have compared! This means that the comparison will probably be as good as it can be. In particular, it won’t suffer from the usual problem of the authors understanding some methods they’re comparing better than others.

The world is held together by the wind that blows through Gena Rowland’s hair

So how did they go?  Well, actually, they did quite well.  I like that:

• They describe each problem quite well
• The simulation study and the real data analysis use a collection of different evaluation metrics
• Some of these are proper scoring rules, which is the correct framework for evaluating probabilistic predictions
• They acknowledge that the wall clock timings are likely to be more a function of how hard a team worked to optimise performance on this one particular model than a true representation of how these methods would work in practice.

Not the lovin’ kind

But I’m an academic statistician. And our key feature, as a people, is that we loudly and publicly dislike each other’s work. Even the stuff we agree with.  Why? Because people with our skills who also have impulse control tend to work for more money in the private sector.

So with that in mind, let’s have some fun.

(Although seriously, this is the best comparison of this type I’ve ever seen. So, really, I’m just wanting it to be even bester.)

So what’s wrong with it?

It’s gotta be big. I said it better be big

The most obvious problem with the comparison is that the problem that these methods are being compared on is not particularly large.  You can see that from the timings.  Almost none of these implementations are sweating, which is a sign that we are not anywhere near the sort of problem that would really allow us to differentiate between methods.

So how small is small? The problem had 105,569 observations and required prediction at no more than 44,431 other locations. To be challenging, this data needed to be another order of magnitude bigger.

God knows I know I’ve thrown away those graces

(Can you tell what I’m listening to?)

The second problem with the comparison is that the problem is tooooooo easy. As the data is modelled with a Gaussian observation noise and a multivariate Gaussian latent random effect, it is a straightforward piece of algebra to eliminate all of the latent Gaussian variables from the model.  This leads to a model with only a small number of parameters, which should make inference much easier.

How do you do that?  Well, let the data be $y$, the Gaussian random field be $x$, and all the hyperparameters be $\theta$.  In this case, we can use conditional probability to write that

$p(\theta \mid y) \propto \frac{p(y,x,\theta)}{p(x \mid y, \theta)},$

which holds for every value of $x$ and particularly $x=0$.  Hence if you have a closed form full conditional (which is the case when you have Gaussian observations), you can write the marginal posterior out exactly without having to do any integration.
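If you don’t believe that the ratio really is free of $x$, here is a small numerical check. It is a sketch on a deliberately tiny toy model that I am assuming purely for illustration (one scalar $x \sim N(0, \text{field\_var})$ and one observation $y \mid x \sim N(x, \text{noise\_var})$), not the spatial model from the paper. It computes $\log p(y \mid \theta)$, which is the marginal posterior up to the prior on $\theta$, at several values of $x$ and compares with the direct answer.

```python
import math


def normal_logpdf(value, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def log_marginal_via_ratio(y, x, field_var, noise_var):
    """log p(y | theta) as log p(y | x) + log p(x | theta) - log p(x | y, theta).

    Toy model (an assumption for illustration): x ~ N(0, field_var) and
    y | x ~ N(x, noise_var), with theta = (field_var, noise_var).
    """
    # The full conditional p(x | y, theta) is Gaussian with these moments.
    cond_var = 1.0 / (1.0 / field_var + 1.0 / noise_var)
    cond_mean = cond_var * y / noise_var
    return (
        normal_logpdf(y, x, noise_var)
        + normal_logpdf(x, 0.0, field_var)
        - normal_logpdf(x, cond_mean, cond_var)
    )


y, field_var, noise_var = 1.3, 2.0, 0.5
# Evaluate the ratio at several x, including the convenient choice x = 0.
values = [log_marginal_via_ratio(y, x, field_var, noise_var) for x in (0.0, -1.0, 4.2)]
# Direct marginal for this toy model: y ~ N(0, field_var + noise_var).
direct = normal_logpdf(y, 0.0, field_var + noise_var)
print(values, direct)
```

All of the printed numbers agree up to floating point error, which is exactly the “no integration required” property that makes the Gaussian case easy.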

A much more challenging problem would have had Poisson or binomial data, where the full conditional doesn’t have a known form. In this case you cannot do this marginalisation analytically, so you put much more stress on your inference algorithm.

I guess there’s an argument to be made that some methods are really difficult to extend to non-Gaussian observations. But there’s also an argument to be made that I don’t care.

Don’t take me back to the range

The prediction quality is measured in terms of mean squared error and mean absolute error (which are fine), the continuous ranked probability score (CRPS) and the Interval Score (INT), both of which are proper scoring rules. Proper scoring rules (and follow the link or google for more if you’ve never heard of them) are the correct way to compare probabilistic predictions, regardless of the statistical framework that’s used to make the predictions. So this is an excellent start!
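For a feel of what “proper” buys you, here is a small sketch using the standard closed form of the CRPS for a Gaussian predictive distribution. The setup is an assumption made for illustration (truths drawn from $N(0,1)$, one honest forecaster and one overconfident one), and the function name `crps_gaussian` is mine rather than anything from the paper.

```python
import math
import random


def crps_gaussian(mu, sigma, y):
    """CRPS of the forecast N(mu, sigma^2) at the observed value y (smaller is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


rng = random.Random(0)
truths = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Honest forecaster reports the true predictive distribution N(0, 1).
honest = sum(crps_gaussian(0.0, 1.0, y) for y in truths) / len(truths)
# Overconfident forecaster has the right mean but claims sd = 0.3.
overconfident = sum(crps_gaussian(0.0, 0.3, y) for y in truths) / len(truths)
print(honest, overconfident)
```

The honest forecaster gets the smaller average score. Both forecasters have the same predictive mean, so this is something mean squared error on its own cannot see.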

But one of these measures does stand out: the prediction interval coverage (CVG) which is defined in the paper as “the percent of intervals containing the true predicted value”.  I’m going to parse that as “the percent of prediction intervals containing the true value”. The paper suggests (through use of bold in the tables) that the correct value for CVG is 0.95. That is, the paper suggests the true value should lie within the 95% interval 95% of the time.

This is not true.

Or, at least, this is considerably more complex than the result suggests.

Or, at least, this is only true if you compute intervals that are specifically built to do this, which is mostly very hard to do. And you definitely don’t do it by providing a standard error (which is an option in this competition).

Boys on my left side. Boys on my right side. Boys in the middle. And you’re not here.

So what’s wrong with CVG?

Why? Well first of all it’s a multiple testing problem. You are not testing the same interval multiple times, you are checking multiple intervals one time each. So it can only be meaningful if the prediction intervals were constructed jointly to solve this specific multiple testing problem.

Secondly, it’s extremely difficult to know what is considered random here. Coverage statements are statements about repeated tests, so how you repeat them will affect whether or not a particular statement is true. It will also affect how you account for the multiple testing when building your prediction intervals. (Really, if anyone did opt to just return standard errors, nothing good is going to happen for them in this criterion!)
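Here is a toy simulation of why “what is random” matters. Everything in it is an assumption made for illustration: the true values at the prediction locations are equicorrelated $N(0,1)$ variables (a crude stand-in for a smooth spatial field), and every pointwise interval is $\pm 1.96$, so each one is a perfectly calibrated 95% interval on its own. CVG is then computed once per simulated data set, the way a single table would report it.

```python
import math
import random
import statistics


def empirical_coverage(n_locations, rho, rng):
    """CVG for one simulated data set whose true values are equicorrelated N(0, 1)."""
    shared = rng.gauss(0.0, 1.0)  # component common to every location
    hits = 0
    for _ in range(n_locations):
        truth = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        hits += abs(truth) <= 1.96  # is the truth inside the pointwise 95% interval?
    return hits / n_locations


def coverage_spread(rho, n_locations=500, n_replicates=200, seed=1):
    """Mean and standard deviation of CVG across independent simulated data sets."""
    rng = random.Random(seed)
    cvg = [empirical_coverage(n_locations, rho, rng) for _ in range(n_replicates)]
    return statistics.mean(cvg), statistics.stdev(cvg)


mean_indep, sd_indep = coverage_spread(rho=0.0)
mean_corr, sd_corr = coverage_spread(rho=0.9)
print("independent truths:", mean_indep, sd_indep)
print("correlated truths: ", mean_corr, sd_corr)
```

With independent truths, CVG on a single data set sits very close to 0.95. With strongly correlated truths the spread across data sets is far larger, so one realisation can land a long way from 0.95 even though nothing is wrong with any individual interval.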

Thirdly, it’s already covered by the interval score.  If your interval is $[l,u]$ with nominal coverage $1-\alpha$ (so $\alpha = 0.05$ for a 95% interval), the interval score for an observation $y$ is

$\text{INT}_\alpha(l, u, y) = u - l + \frac{2}{\alpha}(l-y) \mathbf{1}\{y < l\} + \frac{2}{\alpha}(y-u)\mathbf{1}\{y>u\}.$

This score (where smaller is better) rewards you for having a narrow prediction interval, but penalises you every time the data does not lie in the interval, in proportion to how far outside it falls.  The expected score is minimised when $l$ and $u$ are the $\alpha/2$ and $1-\alpha/2$ predictive quantiles, so that $\Pr(y \in [l,u]) = 1-\alpha$. So this really is a good measure of how well the interval estimate is calibrated that also checks more aspects of the interval than CVG (which lacks the first term) does.
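A minimal sketch of the interval score next to CVG, reading $\alpha$ as the allowed miscoverage (so $\alpha = 0.05$ goes with a 95% interval, which is the convention the $2/\alpha$ penalty in the formula corresponds to). The setup is assumed for illustration: truths are $N(0,1)$ and three competitors report symmetric intervals $\pm c$ that are too narrow, nominal, and too wide.

```python
import random


def interval_score(lower, upper, y, alpha):
    """Interval score for the interval [lower, upper] at observation y (smaller is better)."""
    score = upper - lower  # width: narrow intervals are rewarded
    if y < lower:
        score += (2.0 / alpha) * (lower - y)  # penalty for missing below
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)  # penalty for missing above
    return score


def evaluate(half_width, truths, alpha=0.05):
    """Average interval score and empirical coverage (CVG) of the interval +/- half_width."""
    scores = [interval_score(-half_width, half_width, y, alpha) for y in truths]
    covered = [abs(y) <= half_width for y in truths]
    return sum(scores) / len(scores), sum(covered) / len(covered)


rng = random.Random(0)
truths = [rng.gauss(0.0, 1.0) for _ in range(20000)]
results = {c: evaluate(c, truths) for c in (1.0, 1.96, 3.0)}
for c, (score, cvg) in results.items():
    print(c, score, cvg)
```

The nominal interval has the smallest average score. The too-wide interval has excellent coverage but pays for its width, which is the part CVG cannot see.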

There’s the part you’ve braced yourself against, and then there’s the other part

Any conversation about how to evaluate the quality of an interval estimate really only makes sense in the situation where everyone has constructed their intervals the same way. Now the authors have chosen not to provide their code, so it’s difficult to tell what people actually did.  But there are essentially four options:

• Compute pointwise prediction means $\hat{\mu}_i$ and standard errors $\hat{\sigma}_i$ and build the pointwise intervals $\hat{\mu}_i \pm 1.96\hat{\sigma}_i$.
• Compute the pointwise Bayesian prediction intervals, which are formed from the appropriate quantiles (or the HPD region if you are Tony O’Hagan) of $\int \int p(\hat{y} \mid x,\theta) p(x,\theta \mid y)\,dxd\theta$.
• An interval of the form $\hat{\mu}_i \pm c\hat{\sigma}_i$, where $c$ is chosen to ensure coverage.
• Some sort of clever thing based on functional data analysis.

But how well these different options work will depend on how they’re being assessed (or what they’re being used for).

Option 1: We want to fill in our sparse observation by predicting at more and more points

(This is known as “in-fill asymptotics”). This type of question occurs when, for instance, we want to fill in the holes in satellite data (which are usually due to clouds).

This is the case that most closely resembles the design of the simulation study in this paper. In this case you refine your estimated coverage by computing more prediction intervals and checking if the true value lies within the interval.

Most of the easy to find results about coverage in this setting are from the 1D literature (specifically around smoothing splines and non-parametric regression). In these cases, it’s known that the first of the four interval constructions listed above is bad, the second will lead to conservative regions (the coverage will be too high), the third involves some sophisticated understanding of how Gaussian random fields work, and the fourth is not something I know anything about.

Option 2: We want to predict at one point, where the field will be monitored multiple times

This second option comes up when we’re looking at a long-term monitoring network. This type of data is common in environmental science, where a long term network of sensors is set up to monitor, for example, air pollution.  The new observations are not independent of the previous ones (there’s usually some sort of temporal structure), but independence can often be assumed if the observations are distant enough in time.

In this case, the first of the four interval constructions listed above will be the right way to construct your interval, the second will probably still be a bit broad but might be ok, and the third and fourth will probably be too narrow if the underlying process is smooth.

Option 3: Mixed asymptotics! You do both at once

Simulation studies are the last refuge of the damned.

I see the sun go down. I see the sun come up. I see a light beyond the frame.

So what are my suggestions for making this comparison better (other than making it bigger, harder, and dumping the weird CVG criterion)?

• randomise
• randomise
• randomise

What do I mean by that?  Well in the simulation study, the paper only considered one possible set of data simulated from the correct model. All of the results in their Table 2, which contains the scores and timings on the simulated data, depend on this particular realisation. And hence Table 2 is a realisation of a random variable that will have a mean and standard deviation.

This should not be taken as an endorsement of the frequentist view that the observed data is random and estimators should be evaluated by their average performance over different realisations of the data. This is an acknowledgement of the fact that in this case the data is actually a realisation of a random variable. Reporting the variation in Table 2 would give an idea of the variation in the performance of the method. And would lead to a more nuanced and realistic comparison of the methods. It is not difficult to imagine that for some of these criteria there is no clear winner when averaged over data sets.
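As a sketch of what “randomise” could look like in practice, here is a deliberately silly stand-in for the real study. The two “methods” (sample mean and sample median as estimates of a location), the squared error score and the replicate counts are all assumptions chosen for illustration and have nothing to do with the spatial methods in the paper. The shape of the harness is the point: simulate many data sets, score every method on each one, and report the spread as well as the average.

```python
import random
import statistics


def one_replicate(rng, n_obs=50):
    """Simulate one data set from N(0, 1) and score both toy methods on it."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    err_mean = statistics.mean(data) ** 2  # squared error of the sample mean
    err_median = statistics.median(data) ** 2  # squared error of the sample median
    return err_mean, err_median


def summarise(n_replicates=500, seed=2):
    """Average score, its standard deviation, and how often the sample mean wins."""
    rng = random.Random(seed)
    results = [one_replicate(rng) for _ in range(n_replicates)]
    mean_errs = [a for a, _ in results]
    median_errs = [b for _, b in results]
    return {
        "mean": (statistics.mean(mean_errs), statistics.stdev(mean_errs)),
        "median": (statistics.mean(median_errs), statistics.stdev(median_errs)),
        "mean_wins": sum(a < b for a, b in results) / n_replicates,
    }


summary = summarise()
print(summary)
```

On average the sample mean wins, as it should for Gaussian data, but it loses on a decent fraction of individual data sets. A table built from a single realisation could easily have crowned the wrong method.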

Where did you get that painter in your pocket?

I have very mixed feelings about the timings column in the results table.  On one hand, an “order of magnitude” estimate of how long this will actually take to fit is probably a useful thing for a person considering using a method.  On the other hand, there is just no way for these results not to be misleading. And the paper acknowledges this.

Similarly, the competition does not specify things like priors for the Bayesian solutions. This makes it difficult to really compare things like interval estimates, which can strongly depend on the specified priors.  You could certainly improve your chances of winning on the CVG computation for the simulation study by choosing your priors carefully!

What is this six-stringed instrument but an adolescent loom?

I haven’t really talked about the real data performance yet. Part of this is because I don’t think real data is particularly useful for evaluating algorithms. More likely, you’re evaluating your chosen data set as much as, or even more than, you are evaluating your algorithm.

Why? Because real data doesn’t follow the model, so even if a particular method gives a terrible approximation to the inference you’d get from the “correct” model, it might do very very well on the particular data set.  I’m not sure how you can draw any sort of meaningful conclusion from this type of situation.

I mean, I should be happy I guess because the method I work on “won” three of the scores, and did fairly well in the other two. But there’s no way to say that wasn’t just luck.

What does luck look like in this context? It could be that the SPDE approximation is a better model for the data than the “correct” Gaussian random field model. It could just be Finn appealing to the old Norse gods. It’s really hard to  tell.

If any real data is to be used to make general claims about how well algorithms work, I think it’s necessary to use a lot  of different data sets rather than just one.

Similarly, a range of different simulation study scenarios would give a broader picture of when different approximations behave better.

Don’t dream it’s over

One more kiss before we part: This field is still alive and kicking. One of the really exciting new ideas (that’s probably too new to be in the comparison) is that you can speed up the computation of the unnormalised log-posterior through hierarchical decompositions of the covariance matrix (there is also code). This is a really neat method for solving the problem.

There are a bunch of other things that are probably worth looking at in this article, but I’ve run out of energy for the moment. Probably the most interesting thing for me is that a lot of the methods that did well (SPDEs, Predictive Processes, Fixed Rank Kriging, Multi-resolution Approximation, Lattice Krig, Nearest-Neighbour Predictive Processes)   are cut from very similar cloth. It would be interesting to look deeper at the similarities and differences in an attempt to explain these results.

Edit:

• Finn Lindgren has pointed out that the code will be made available online. Currently there is a gap between intention and deed, but I’ll update the post with a link in the event that it makes it online.

1. “And sometimes I’m just that unbearable.” Only sometimes?

2. Andrew says:

Dan:

Thank you for this post. Perhaps we are all returning to sanity now in this blog and in our research.

3. Z says:

Great post, I’ll definitely be returning to this next time I need to compare methods using a simulation.

(Just wanted to drop some positive feedback here to encourage you to keep posting. One of these per day?)

• Daniel Simpson says:

Thanks! I’m not really expecting this post to be a big hitter, but it was fun to write (and a nice break from grant writing).

4. zbicyclist says:

I’m imagining this group of authors reading this and thinking: “OK, let’s get another paper out of testing a more complex problem. … and then another paper.” And why not?

• Daniel Simpson says:

Writing these papers is a lot of work so I doubt it. And also all these people have more exciting stuff to do!

5. Roy Mendelssohn says:

Made my day. Both entertaining with the usual god knows where he finds them links, as well as informative, especially since spatial-temporal models are something that interest me and are relevant to my work. I have been involved in some work that can handle very large space-time models, but unfortunately the details are tied up as intellectual property in a private company. But I am not just dropping a teaser, some idea of it can be found at this talk at the 2016 International ISBA meeting. If you go to:

http://www.corsiecongressi.com/isba2016/pdf/ISBA2016_book_abstract.pdf

and search for Lemos you will find the abstract. I believe the talk is available from Ricardo, as it was approved and given, but you would need to check with him.

• Daniel Simpson says:

That abstract looks very interesting. I can’t believe I missed it!

• Roy says:

Yes, I did a search and you were at the meeting. As I said, I believe the talk is publicly available, but don’t want to be quoted on that. That is why I said contact Ricardo. As of now, the full algorithm is considered the intellectual property of the company Ricardo works for, and I am under an NDA.

6. Jake says:

I agree that they should have run more randomized tests, but when one of the entrants takes 48 hours on 30 cores, and the other entrants need over 18 hours between them, it’s understandable why they wouldn’t want to. (That’s also an argument against going to bigger problem sizes).

• Dan Simpson says:

I’m ok with some methods reporting a missing value for that. Although if it takes 48 hours to solve a “small” large problem, then a method does not solve large problems.

7. Pancake, a bloody one says:

I was expecting this to be behind the blue “the fear” link:

As someone currently engaged in comparing methods by simulation I’m happy about this post. Even though my project is a bit different, but still, it is great to get some insight in how other people–who are far more talented than I am–see things.

• Dan Simpson says:

A probably more useful post would be “how to compare algorithms”, but that’s not something I can knock out in an afternoon :p

8. Christian Hennig says:

I’m strongly interested in the general workings and logic of comparing algorithms by simulation, so thanks a lot for this post!
I have to say though that some stuff in the post made me think that perfection is the enemy of the good. I have no reason to disagree with any specific point you make, but I’m worried that the overall sound of your discussion might be, at least to some people, that we should better keep our hands off such simulations, and reviewers should not ask for them. Unless we have a team with all the algorithm’s authors together, and all of them spend an obsessive amount of time thinking through loads of issues that ultimately have to be decided somehow, that is, and unless we can have the time to simulate lots of really realistically big datasets from every model we try, and these models better be complex to be convincing. And perhaps even then.

I have a somewhat sophisticated attitude to simulation studies but probably not sophisticated enough to avoid practices that can be shredded to tears in this kind of discussion. Even I have problems getting PhD-students to just try out their stuff and compare it to some “natural” competitors on some toy datasets just as a first sanity check, because they tend to be worried that they get all kinds of things wrong when doing that (the datasets are not realistic! – couldn’t the way they run the competing methods be criticised if one of their authors were a reviewer? – how many replicates do they have to do so that anybody takes their comparison seriously? – not that it’d be needed to address all these for initially just running some stuff for yourself to see whether you’re not totally off tracks!) and so some of them prefer agonizing for months about these things more before actually running something. (Of course it can be worthwhile to put effort into this, but it’s experimental science and trying stuff out will teach you more about how to improve it than getting intimidated by real and potential criticism.)

One problem with the never-ending complexity of setting up a good comparative simulation study is that people tend to implicitly accept that nobody can expect them to do that when for example proposing a new method, with the implication that most published studies are far more shabby than they needed be, and that one could quite easily set up something that adds interesting information to these… but why put even one more week into doing that if reviewers will list the same number of weaknesses and recommend to do twice as much whatever the initial quality is, because there is always so much to criticize?

As reviewer I have seen lots of simulation studies. I have thrown my fair amount of criticism at many of these, and I am certainly not alien to asking authors to compare their method with standard algorithm X on toy scenarios (the author needs to convince the reader that her new method can beat X at least in a cherry-picked situation, otherwise there’s no need for it at all). But I try consciously to accept and reward simulations that are reasonably done and argued (choice of competitors and how exactly they are run, choice of models to simulate from, measurement of quality, all information required for reproduction presented…) rather than making authors of such studies feel that they can never get it right because one can always find something.

Advice and discussion on comparative simulation studies is needed badly, but I’d like it to have an element of encouragement. Of course you’re going to get all kinds of things wrong when doing it but chances are you learn still much more from doing it than from not doing it!

• Dan Simpson says:

“I’m worried that the overall sound of your discussion might be, at least to some people, that we should better keep our hands off such simulations, and reviewers should not ask for them. Unless we have a team with all the algorithm’s authors together, and all of them spend an obsessive amount of time thinking through loads of issues that ultimately have to be decided somehow, that is, and unless we can have the time to simulate lots of really realistically big datasets from every model we try, and these models better be complex to be convincing. “

Well that’s mostly what I said. Science is hard if you want to do it well.

It’s not actually necessary to be a team like these people were, but I can’t see a good argument for publishing one of these sorts of comparisons without contacting and seeking advice from all the interested parties before the work is considered finished. As I said in the post, most of the time this will improve the comparison. It will, at the very least, not make it worse.

But the points that I wanted to make with this post are:
1) It is actually really hard to build a convincing simulation study
2) You should think about (and justify) how you evaluate methods
3) You should be as fair as possible to all methods
4) You should recognise the sources of randomness in a simulation study and report them appropriately.
5) Real data is basically useless for comparing algorithms *unless they are supposed to exactly compute the same thing*.

I’m not super interested in making people think this is easier than it is. As a community, that strategy tends to get us in trouble.

• Dan Simpson says:

One of the things that perplexes me about how we teach research students, is that we somehow treat the four tasks of computational stats (“design a method”, “prove a method works”, “compare a method with others”, and “apply a method”) as if only the first two are REALLY HARD.

• Christian Hennig says:

Again, I agree in principle with everything you write here, particularly every single one of your 5 points above (the paper you criticize apparently scored fairly well on these).

Still I don’t think that a good outcome of such a discussion would be that people who think they don’t have the time to do it as well as we think it should be done stop running comparative simulations altogether. Because I think we can learn a lot from these studies, even those that are not done really well (although probably not from the really shabby ones).
It is certainly worthwhile to put some work into improving the studies that people do, but I also think that rather than discouraging people from running such studies, we should encourage them to be appropriately modest about them. Often harm doesn’t come so much from deficiencies in the study but rather from the tendency to hide such deficiencies and to over-interpret the results.

• Dan Simpson says:

I still don’t see any value in pretending a hard task is easy to make people feel better. There’s a lot you can learn from a shoddy simulation study (and i’ve done many on the hoof in my time), but that doesn’t mean they are publishable. There’s a lot you can learn from a poorly applied NHST too.

• Keith O'Rourke says:

+1

But then pretending an easy task is hard (i.e. recommending a cultivator to pick a dandy lion) to make people feel dependent on you and lacking in background they might not need – also has little real value.

9. Christian Hennig says:

By the way, there are still many small datasets in reality that are worth analyzing, and in a simulation you often learn most from what happens in simple toy situations that you can control and understand (apart from the computational issue with running replicates with really big sizes mentioned above), so there’s definitely still some sense in running simulations with small and simple datasets.

• Dan Simpson says:

The paper was called “Methods for Analyzing Large Spatial Data”. So it defined its scope in the title.

10. Andrew says:

Christian:

Your two comments remind me of something else worth exploring, which is that there are two ways of learning from a simulation, or from an example, or from a story:

1. You learn what works better and what works worse: basically, the simulation or example or story is taken to be representative of a larger population of problems you’re working on. For example, when we fit ADVI to a couple hundred models sitting in Stan and found that ADVI failed on a lot of them, we learned that ADVI has problems. Or, for a more optimistic example, we took some weakly informative priors on logistic regression coefficients, and tried these out on the datasets from a big corpus: this gave us confidence that these priors could work in general.

2. You learn that some promising-seeming idea doesn’t work: here, what you get out of the simulation or example or story is a case of something not going as expected. This is related to my discussion with Basbøll on how we learn from stories.

I think some confusion arises because goals 1 and 2 are different but they are not always explicitly stated.

• Dan Simpson says:

We really should write something properly about simulation studies…

• Andrew says:

Stick it on the queue!

• Christian Hennig says:

I’m thinking about this for some time, too, but due to my low speed in everything nothing has yet materialised, although I’m currently member of a group working on benchmarking and comparisons in cluster analysis,
https://ifcs.boku.ac.at/repository/philosophy.html

There’s running simulations in the first place for yourself to learn about the method that you just developed and get some experience with it.

Then there are simulations that appear in papers in which a new method is proposed, and these are about demonstrating that a) there are some situations and/or respects in which the new method is best, so that there is some need for it, and b) investigating what are conditions under which the method works well and not so well. These are not comprehensive and unbiased comparisons of methods, and they should not be interpreted as such, neither by the authors nor by the readers!

The kind of study that you write about here is on another “interpretative” level, where an unbiased and reasonably comprehensive comparison is aimed at, and of course standards have to be higher there.

When discussing cluster benchmarking we were faced by the same issue that being demanding is tempting and there are very good arguments for it, but that it may imply that some people put even less effort into the comparisons they run because they think that fulfilling the required standards is out of reach anyway.
Just skimming through typical published work of this kind makes me feel that the impossibility of perfection is used as an excuse by many to present things that with even quite limited effort could be improved a lot!

11. Jack says: