Following our recent discussion of p-values, Anne commented:

We use p-values for something different: setting detection thresholds for pulsar searches. If you’re looking at, say, a million independent Fourier frequencies, and you want to bring up an expected one for further study, you look for a power high enough that its p-value is less than one in a million. (Similarly if you’re adding multiple harmonics, coherently or incoherently, though counting your “number of trials” becomes more difficult.) I don’t know whether there’s another tool that can really do the job. (The low computing cost is also important, since in fact those million Fourier frequencies are multiplied by ten thousand dispersion measure trials and five thousand beams.)

That said, we don’t really use p-values: in practice, radio-frequency interference means we have no real grasp on the statistics of our problem. There are basically always many signals that are statistically significant but not real, so we rely on ad-hoc methods to try to manage the detection rates.

I don’t know anything about astronomy–just for example, I can’t remember which way the crescent moon curves in its different phases during the month–but I can offer some general statistical thoughts.

My sense is that p-values are not the best tool for this job. I recommend my paper with Jennifer and Masanao on multiple comparisons; you can also see my talks on the topic. (There’s even a video version where you can hear people laughing at my jokes!) Our general advice is to model the underlying effects rather than thinking of them as a million completely unrelated outcomes.

The idea is to get away from the whole sterile game of p-value and Bayes-factor calculations and move toward statistical modeling.

Another idea that's often effective is to select a subset of your million possibilities for screening and then analyze that subset more carefully. The work of Tian Zheng and Shaw-Hwa Lo on feature selection (see the Statistics category here) might be relevant for this purpose.

This seems to me to be a perfect application for p-values or Bayes factors, actually. If you have millions of possibilities and you want to pick one for further study, it helps to have a single number that indexes how "sure" you are that a particular case is a pulsar. In this case, they are only using the statistic to make an efficient decision. It seems like they don't need the baggage of a fully fledged statistical model.

In this case, I think posterior odds would be better suited to the problem, but the p-value and posterior odds would probably be monotonically related anyway, so if you just pick a pulsar with high posterior odds, it is basically picking a pulsar with low p-value.

Posterior inference is fine. I don't see the benefit of framing this as "odds" or a "Bayes factor," but maybe that's just words. As one more fully accounts for the complexity of the data, the Bayesian analysis moves away from any simple computations of independent Bayes factors and toward a joint posterior distribution.

In any case, p-values are something different. The "monotonically related" thing doesn't tell the whole story because the actual monotone relation depends on the entire dataset; it's not a data-independent lookup table.

Sure, the exact monotone relation depends on the data set, but my impression of what was needed in this case was a kind of "sorting" of the possibilities. If the transformation is monotone, the order won't depend on the transformation. Which pulsar do we study today? Pick the one off the top of the list.

Now that I've reread the problem, though, it strikes me that there'd be no reason to correct for multiple comparisons if that's what they are doing, because that's just a monotone transformation too. I guess maybe it's more "pick the best candidate, if it is above a certain threshold?" Even then, it seems like simple posterior odds is a great way to set that threshold, and gets you 99% of the way there – studying a pulsar instead of tweaking a statistical model :)

Am I right in thinking that this is related to Andrew's point that the difference between "significant" and "not significant" is itself not significant? Doesn't it follow that it is a bad idea to rank-order p-values across different tests?

Kaiser, it depends on the amount of data used to compute them. If two tests are based on roughly the same amount of data, the ordering of the p-values will be the same as the ordering of the observed effect sizes. But the dependence on N is a good reason to use posterior odds and not p-values…
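A minimal illustration of this N-dependence, using a plain one-sided z-test and made-up numbers: the larger observed effect can end up with the larger p-value when it is backed by less data.

```python
import math

def z_test_p(effect, n, sigma=1.0):
    """One-sided p-value for a z-test of a mean effect from n observations."""
    z = effect * math.sqrt(n) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

p_big_effect = z_test_p(0.5, n=100)    # z = 5
p_small_effect = z_test_p(0.3, n=400)  # z = 6

# The smaller effect gets the smaller p-value, because it has more data
# behind it -- so rank-ordering p-values is not rank-ordering effect sizes.
```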

The astronomy problem sounds similar to the gene association one, where people look for genetic markers which are linked to disease. In that context, p-values with correction for multiple testing (hundreds of thousands of markers are considered) or subtle variations thereof seem to be the only game in town. And it's not just an ordering of markers by p-value that matters but the magnitude of those values since it may well be that no markers are associated with the disease.

I give more details about the survey problem here, but I should clarify: the expected number of pulsars per set of Fourier frequencies is less than about 10^-6; even if you lump all ten thousand per beam together the expected number is still probably less than one in a hundred. So if the data were ideal and we set our thresholds right (using what is effectively a Bonferroni correction), we'd see nothing at all from most beams.

Maybe I should divide the question into Problem A (assuming perfectly random noise, distributed just how we expect) and Problem B (the real, interference-laden survey).

For Problem A I'm a little puzzled about how a multilevel model can be applied – the Fourier frequencies are genuinely independent (to first order, anyway): the signal we're looking for is a single power measurement that's unusually large. And assuming we know the data statistics, it's pretty straightforward to set a threshold that will allow an expected number of false positives per beam of (say) 10^-6.
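As a sketch of that Problem-A arithmetic (the exponential noise model here is the standard assumption for normalized Fourier powers of Gaussian noise, not something stated in the comment): with a million independent bins and a target of 10^-6 expected false positives per beam, the per-bin tail probability and the corresponding power threshold follow directly.

```python
import math

# Assumption: in a normalized power spectrum of pure Gaussian noise, each
# power is exponentially distributed with unit mean, so P(power > T) = exp(-T).

n_bins = 1_000_000            # independent Fourier frequencies per beam
expected_fp_per_beam = 1e-6   # target expected number of false positives

per_bin_p = expected_fp_per_beam / n_bins   # about 1e-12 per bin
threshold = -math.log(per_bin_p)            # about 27.6 mean-noise units
```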

In practice, of course, we are dealing with Problem B, and even though we set such a threshold, we almost always see at least thirty candidates per beam meeting it. The precise details and origin of these bogus candidates depend on the radio-frequency environment of the telescope on the day the observation was taken, which is so hopelessly complicated we would really rather not model it if at all possible.

We do indeed have a multistage process for sifting the candidates. The initial search provides crude amplitudes and frequencies, which we "tune up" using interpolation; we then "sift" out the best few candidates per beam (from all ten thousand roughly independent DM trials) and "fold" the raw data at their best-fit parameters, producing a detailed output plot. That leaves us with a database of some hundred thousand output plots, from which we use various heuristics to select plots to look at by hand. The looking-at-by-hand part is not without its own problems, but it does provide additional information beyond what is available in the initial selection. The issue, in Problem B, is selecting the most plausible N candidates to fold, since folding is a somewhat slow process. For the "sifting" I can imagine all sorts of clever statistical tests, but for the initial pass over the raw FFT, I can't see anything better than a threshold (or a top-N list, though if N is substantially greater than one we're in trouble).

This sounds a lot like the "gene target" problem. You have some information about the sequence environment in the region of some gene, you know something about binding sites, you'd like to pick a list of promising candidate genes, ones likely to be regulated by a protein you're interested in.

I wrote a paper on using a simple regional averaging procedure for solving this problem, and it worked beautifully:

http://bioinformatics.oxfordjournals.org/cgi/cont…

The idea was to combine some biological knowledge about the process with some simple mathematical manipulations to get a score that could be treated as random noise plus signal; then, by setting various cutoffs, you can filter out the pure-noise regions.

I would guess that this is a good way to think about your pulsar detection, and perhaps by being a little more adventurous with your modeling (which I think is what Andrew was suggesting) you can find some way to generate a score that is tied more closely to your subject knowledge of what is real and what is not. I doubt Andrew meant modeling the electrical noise environment (i.e., some massive solution of PDEs with noise), but rather modeling your domain-specific knowledge of what is real and what is not.

@Daniel: that's very interesting – I hadn't imagined it would be useful to apply digital signal processing techniques to genomes!

We do in fact do some convolutional trickery: when searching for pulsars whose period changes over the course of a two-minute observation (because they're in a tight binary), we use a matched-filtering technique to remove the effect of acceleration. But leaving that aside for the moment, for unaccelerated pulsars we know exactly what our "signal" looks like: a single Fourier frequency with high power, or two adjacent Fourier frequencies with less-high power, and we detect both. The problem is in describing our noise. We do some of this, for example rejecting anything too close to 60 Hz (any 60 Hz pulsars will have to be found at European observatories!), but short of developing a detailed model of the noise I don't know that there's much we can do to take it into account.

One thing we do do, which bears some similarity to your paper, is use a local background estimate; in particular we use a "double top hat" filter to estimate the background power in a bin as the average power from all bins within 10 but at least 5 away from the central frequency. This helps a bit with interference that takes the form of broad "humps" in the power spectrum, but there are still plenty of narrow spikes that are hard to distinguish from pulsars.
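For concreteness, a minimal sketch of such a double-top-hat background estimate, written as a plain NumPy loop (the real pipeline surely does this more efficiently; the `inner`/`outer` offsets match the "within 10 but at least 5 away" rule described above):

```python
import numpy as np

def double_top_hat_background(power, inner=5, outer=10):
    """Background power at each bin: the mean over all bins whose offset d
    from it satisfies inner <= |d| <= outer (a "double top hat" window)."""
    n = len(power)
    background = np.empty(n)
    for i in range(n):
        idx = [i + d for d in range(-outer, outer + 1)
               if abs(d) >= inner and 0 <= i + d < n]
        background[i] = power[idx].mean()
    return background

# Whitening a toy spectrum against its local background:
spectrum = np.random.default_rng(1).exponential(size=4096)
whitened = spectrum / double_top_hat_background(spectrum)
```

Dividing by the local estimate flattens broad "humps" in the spectrum, which is exactly why narrow spikes remain the hard case.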

One approach we are looking into now is coherent harmonic summing. The idea is that most pulsars have sharply-peaked profiles, so we see power not just at the spin frequency of the pulsar but at many harmonics. Our current system just adds up the power incoherently, getting a root-mean-squared amplitude for the pulsar profile. But that ignores the phase of the harmonics; taking this into account should improve our sensitivity to pulsars as opposed to RFI (see also here for more details). So from that point of view, yes, we're trying to pick a measure of signal that provides the best available discrimination from noise. But as far as I can tell, the best way to set the first-pass detection threshold is still using a classical p-value.
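The incoherent version described above can be sketched as follows (toy bin indexing only: real search codes interpolate and stretch the spectrum rather than reading off exact bin multiples, and the coherent variant would keep the complex amplitudes rather than the powers):

```python
import numpy as np

def incoherent_harmonic_sum(power, n_harm=4):
    """For each candidate frequency bin f, add the powers at bins
    f, 2f, 3f, ..., n_harm*f, discarding all phase information."""
    n = len(power)
    summed = np.array(power, dtype=float)
    for h in range(2, n_harm + 1):
        idx = np.arange(n) * h    # bin of the h-th harmonic
        valid = idx < n           # harmonics past the Nyquist bin are dropped
        summed[valid] += power[idx[valid]]
    return summed
```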

In the second and later passes, we can afford to take advantage of the fact that pulsars are broad-band signals covering our whole range of radio frequencies, and that they are delayed in a frequency-dependent manner by the interstellar medium; this is a key tool in distinguishing pulsars from interference. I have no idea how to include this information in the first-pass detection without a massive increase in computational cost, though.

Serendipitously, I just came from a presentation by Susie Bayarri at Jim Berger's birthday party; she reminded me of the work of James Scott and Jim Berger (2008, 2009, two papers), which takes a Bayesian decision-theoretic approach to problems similar to the astronomy one discussed above. One application is DNA microarray testing, where you have a huge number of data points and want to find the few that are "hits" within the large amount of noise. The advantage of these methods is that there's no need for things like Bonferroni corrections, which is what the astronomers above are in effect using; the problem of avoiding detecting "noise" comes out automatically. Jim's site doesn't have the papers above, but it does have a citation to an earlier paper that James and Jim wrote: Scott, J. and Berger, J. (2005). An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136, 2144–2162.

I'll ask Jim or James or Susie for the recent citations, since we are all together for this 4-day party. I'll report back later.

Bill: I wasn't able to come to the Berger party, but I did write an article for it, which I think struck some Berger-like themes. Please tell Jim and others that I say hi.

It seems like you're already doing what I was suggesting, or at least starting to. By looking at the coherent harmonic summing, and other characteristics of your signal, you are effectively incorporating your knowledge of the problem into the technique for analyzing the data. It's an intermediate sort of modeling.

The next thing I might try would be to look at a collection of your FFT windows in which you found no pulsars, and try to build a model of your noise as some mixture of simple processes. For example, perhaps the general trends in spectral content are related to time of day, weather, or temperature. Perhaps the number of "noise spikes" and the distribution of frequency intervals between them has some consistent behavior (perhaps they can be modeled as Poisson processes with varying rates for varying frequency bands).

With a general background-noise model you could then ask a more sophisticated version of "is this frequency band at significantly higher power than background?" For example, you could ask whether the whole window, minus the general noise trend, shows a significantly higher rate of spikes than the background frequency-varying Poisson model would predict.

Once you've fit a general model for your background noise, comparing a new FFT result with the background noise model could be relatively fast.
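If the spike counts really are well modeled as Poisson, the comparison sketched above reduces to a tail probability, which is cheap to evaluate per window (the rates below are made-up numbers, purely for illustration):

```python
import math

def poisson_tail(k, lam):
    """P(N >= k) for N ~ Poisson(lam), by direct summation of the CDF."""
    cdf = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k))
    return 1.0 - cdf

# Hypothetical numbers: the fitted background model predicts 3 spikes per
# window in this band, and today's window shows 9.
background_rate = 3.0
observed_spikes = 9
p = poisson_tail(observed_spikes, background_rate)  # small => window is unusual
```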

Bill – looking forward to your report on Jim's birthday party.

Given that he has handed off, or is handing off, his duties at SAMSI, he might focus even more on his statistical research.

K

p.s. beware of those ping-pong games with strange rules about attire

I've gotten the URLs for the two papers by Scott, and Scott and Berger:

http://ftp.stat.duke.edu/WorkingPapers/08-10.html

http://ftp.stat.duke.edu/WorkingPapers/08-09.html

Enjoy!

I have a question related to this discussion: when adjusting p-values for a family of hypothesis tests as Anne is doing, how does one define "family"? There seems to be no formal definition. Is a family defined as all tests conducted on the same dataset? Why not extend the definition of "family" to also include *all* independent hypothesis tests conducted on different datasets? For that matter, I figure during my lifetime I'll conduct several thousand hypothesis tests, while it sounds like Anne may run several billion hypothesis tests on her pulsar data during her lifetime. Shouldn't we adjust the p-value for all of these tests? At this point the whole idea of adjusting p-values for multiple comparisons becomes nonsensical. What is the solution?

Robert:

The solution, I believe, is multilevel modeling, to replace ideas of exchangeability and grouping of tests with models of underlying continuous parameters. For more discussion and examples, see the links in the antepenultimate paragraph of my above blog entry.
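To make the multilevel suggestion concrete, here is a toy normal-normal partial-pooling sketch on simulated data (in a real problem the population parameters would come from a fitted hierarchical model; the method-of-moments step here is just the quickest stand-in):

```python
import numpy as np

# Many noisy measurements y_j of underlying effects theta_j ~ N(mu, tau^2),
# observed with known noise sigma. Rather than testing each y_j separately
# and adjusting the p-values, estimate the population of effects and shrink.
rng = np.random.default_rng(42)
J, sigma, tau_true = 1000, 1.0, 0.5
theta = rng.normal(0.0, tau_true, J)   # true effects
y = theta + rng.normal(0.0, sigma, J)  # noisy observations

# Method-of-moments estimates of the population parameters.
mu_hat = y.mean()
tau2_hat = max(y.var() - sigma**2, 0.0)

# Posterior mean under the normal-normal model shrinks each y_j toward
# mu_hat; the shrinkage factor comes from the model, not from any p-value.
shrink = tau2_hat / (tau2_hat + sigma**2)
theta_hat = mu_hat + shrink * (y - mu_hat)
```

The same shrinkage automatically damps the extreme estimates that a million-fold multiple-comparisons correction is trying to guard against.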

Thanks Andrew!