Skip to content
 

GPU Supercomputers

An example of a computer cluster

Image via Wikipedia

Computer games have been the driver behind large advances in 3D graphics hardware over the past decade. It turned out that the hardware developed for rotating, projecting and rendering many triangles can also be used for other purposes, and this is the notion of “general-purpose computing on graphics processing units” or just GPGPU.

The research and development community’s center is gpgpu.org. There are several environments for development of software on such hardware, CUDA, brook, OpenCL, libsh, CTM.

I have looked through their list of CUDA example applications, but couldn’t find any statistical applications. Some related ones in machine learning and Markov chains claim 50-fold speedups over conventional PC architectures, without the complexity of running a whole cluster of computers. Now, computation of likelihood and MCMC are inherently extremely paralelizable, and such hardware could make it easier to fit sophisticated models. This would be a good topic for a computationally minded PhD thesis.

13 Comments

  1. Denzel Li says:

    GPU is a SIMD(single instruction multiple data) model computing unit. The paradigm is good at jobs where computations can be separated into independent parts, such as generating random numbers, matrix multiplication. I can see algebra based task fit to this category. But Markov Chain has specific computation dependency through the chain, which is hard to parallel. Parallel multiple chain has never been difficult, but parallel one chain is hard.
    I would rather see particle filtering (or Sequential Monte Carlo) more suitable to parallel paradigm cause it is relied on large chains by nature. But for SMC to work fine, there may require resampling which require communication between chains.
    I agree with you that parallel is challenging but has great future.

  2. dk says:

    it turns out that the big constraint on performing simulations or otherwise crunching through lots of data is no longer processing speed but simply the time it takes to move large bodies of data out of memory into the CPU: whereas processors are lightening-quick because they are encased in silicon, disks are clunky because they remain electro-mechanical in design! Not a big deal for social scientists, who don't generally have massive datasets, but a big constraint in natural sciences, where multiterabyte datasets are now not uncommon. Bell et al. (2009), Science, 323, 1297-98.

  3. Kristian says:

    I have profitably run parallel bootstraps on multiple (ordinary) processor cores using the snow package in R.

    I guess that in the future, the department statistics machines will have a GPGPU cards on which I could do the same.

  4. Ben Lauderdale says:

    Single chain MCMC algorithms that employ a rejection mechanism might be greatly sped up using parallel processing. Each core works out a proposed new draw. Some of them will decide to reject the new move, but among the ones that accept, you then randomly choose one (uniform probability over all accepted moves, i think).

    In addition to being trivial to program for arbitrary numbers of cores, I see two benefits to doing this sort of "multiple proposal" rejection sampling. First, conditional on tuning, you end up accepting more often because you are computing many possible moves at once. Second, you end can tune algorithms like Metropolis to propose larger moves while maintaining a high acceptance rate.

    I am sure that there are more efficient ways to use multiple cores, but this one is at least very easy.

  5. Ben Bolker says:

    There's a presentation on 10-20 fold acceleration of computational linear algebra with GPUs at: http://www.nabble.com/file/p22114095/R3.ppt . Not parallel, but a non-trivial speed-up …

  6. Bob Carpenter says:

    Search for <<a href="http://www.google.com/search?q=machine+learning+gpu">machine learning GPU> on Google, and the first page of hits has an IEEE paper on two-layer neural networks, a paper on SVMs, on MCA (never heard of it), and this blog post. I know Bruce Carneal of UCSD has implemented k-means clustering this way, too.

    The biologists are all over this. There's even a <a href="http://folding.stanford.edu/English/Download">folding-at-home client that runs on GPUs.

    Ph.D. thesis? How about <a href="http://www.stanford.edu/class/cs229/proj2008/Madhavan-UsingGPUsToSpeedupSparseCodingAlgorithmsAppliedToSelfTaughtLearningProblems.pdf">Anand Madhavan's term paper for Andrew Ng's <a href="http://www.stanford.edu/class/cs229/">undergraduate machine learning class project at Stanford?

  7. Corey says:

    Would that scheme preserve the target distribution as the invariant distribution of the Markov Chain? My Magic 8-ball says, "Signs point to no," but I haven't done the math…

  8. Ben Lauderdale says:

    I am pretty sure that there are going to be multiple proposal, single accept rejection schemes (taking advantage of multiple cores to evaluate each proposal) that get you the right target distribution, but I am not sure exactly how they would work because I have not done the math either. It may be that choosing between the proposals is more complicated than I initially anticipated.

    Fortunately, while we may have different magic eight balls, we definitely have the same math…

  9. Iain Murray says:

    Ben, Corey: see Liu et al.'s Multiple-try Metropolis (MTM).

    I review MTM, give further references and introduce related schemes in sections 3.2, 3.4, 3.5 and 3.7 of my thesis.

  10. Austin Carpenter says:

    I released a CUDA Support Vector Machine implementation that can be called from Matlab:

    http://patternsonascreen.net/cuSVM.html

    (Self-Call)

  11. Marc Suchard says:

    Here's a paper describing GPU-enabled likelihood calculations within an MCMC sampler. We obtain nearly 100-fold speed-up in overall run-time through efficient computation caching and light-weight thread management.

    http://bioinformatics.oxfordjournals.org/cgi/cont…

  12. Anthony Lee says:

    We have done a case study on the speedups one can attain for a variety of population-based Monte Carlo methods (population-based MCMC, SMC samplers for static inference and particle filters for state-space models). The paper is available on arXiv at

    http://arxiv.org/abs/0905.2441

    and the code is available at our site:

    http://www.oxford-man.ox.ac.uk/gpuss/

    While we did not try it out, it is definitely the case that multiple-try Metropolis is amenable to computation in a many-core architecture. One would need only ensure that the number of tries is large enough so the time spent selecting a proposal does not cut into the speedup too much.

  13. kuan says:

    Hi Austin,
    I got a quick question regarding cuSVM. The choice of parameters seems a bit different with libsvm, of which I have a lot of experience. when you choose the penalty parameter C as a very large number, such as 5000, cusvm solves very quickly, under 1 second. but if C is chosen smaller, such as 50, cuSVM will hang up. libsvm does not have this problem.
    It also would be a great help, if you list your email contact on your webpage. I hope I'm not being a haggler…

    Kuan