There is some exciting development happening in data mining. In particular, several tools are offering data-flow interfaces for data analysis. The core idea of a data-flow interface is to graphically depict the operations on data: for example, we load data from a data source (specifying the file name, types of variables, and so on). We can examine the data by passing it into a scatter-plot module. Alternatively, we can pass the data into an imputation module. From the imputation module we pass the data, now without missing values, into a model testing module, along with the specifications of several statistical models (logistic regression, naive Bayes, SVM, and so on). Note that what flows in this case is the specification of the model, and not data or imputed data as before. The results of testing then flow into the ROC (receiver operating characteristic) module to be analyzed. I have done this in Orange, a freely available Python-based system for data analysis, and the whole program can be shown graphically:

The resulting diagram corresponds to a Python script, but it is easier to work with diagrams than with scripts, especially for beginning analysts. It is possible to connect Orange with R quite easily using the RPy module. I’ve been using it over the past few weeks with quite good results. It would be possible to create “widgets” for different R-based models.
As for other systems, one of the first tools with a data-flow interface was IBM’s Visualization Data Explorer, but it was discontinued by IBM and released as the open-source OpenDX. The data mining kit Weka has a data-flow interface, but I prefer to use their Explorer. Finally, SPSS Clementine is a commercial data mining system with a data-flow interface.
There is also the Statistical Lab http://www.statistiklabor.de/en/, which uses R as an underlying computational engine.
Of course, visual programming languages have been around forever. Matlab's Simulink is probably the most common (and Turing-complete!) one actually used by people. VisSim is another, and, of course, there is Quartz Composer on Apple's newest operating system.
I've often thought (and occasionally unlimber Squeak to think about it) that, for simple things, something like Automator or (for those who've used Adobe's products) the Action Stack would be a handy visual paradigm to have for analysis. This way you can easily see (and possibly reorganize or modify) the transformations used to get to your regression, scatterplot, and so on without getting stuck in some interminable wiring phase. The ability to round-trip, for those of us who prefer to type, would also be a nice touch.
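As a rough illustration of the action-stack idea, here is a minimal Python sketch. The `ActionStack` class, its methods, and the example transformations are all hypothetical and are not taken from Automator, Adobe's products, or Orange. The sketch only shows an ordered, named list of transformations that can be inspected, reordered, and rerun.

```python
class ActionStack:
    """Hypothetical ordered list of named data transformations."""

    def __init__(self):
        self.steps = []  # list of (name, function) pairs

    def push(self, name, func):
        # Append a transformation to the end of the stack.
        self.steps.append((name, func))
        return self

    def move(self, old_index, new_index):
        # Reorder: take the step at old_index and reinsert it at new_index.
        self.steps.insert(new_index, self.steps.pop(old_index))

    def describe(self):
        # The visible part of the stack: step names in execution order.
        return [name for name, _ in self.steps]

    def run(self, data):
        # Apply every transformation in order.
        for _, func in self.steps:
            data = func(data)
        return data


stack = ActionStack()
stack.push("drop missing", lambda xs: [x for x in xs if x is not None])
stack.push("square", lambda xs: [x * x for x in xs])
stack.push("keep under 10", lambda xs: [x for x in xs if x < 10])

raw = [1, None, 2, 3, 4]
before = stack.run(raw)

# Move the filter ahead of the squaring step and rerun.
stack.move(2, 1)
after = stack.run(raw)

print(stack.describe())
print(before, after)
```

Because the stack is just an ordered list, the same structure could be written out as a script and read back in, which is one way to get the round trip for people who prefer to type.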
Does Orange use algorithms that can handle large datasets?
Orange does use algorithms that are scalable, but it loads all the data into memory up front. For big datasets, you'd need a lot of memory.
Thanks to Byron for all the pointers! Visual programming has been around for a while, as have data-flow processing architectures, but they are not widely used. Orange Canvas is just an easy user interface to lower-level object-oriented APIs: there's nothing that can be done with the Canvas that couldn't be done with the scripts, and Orange doesn't even require the Canvas. The action stack is a very good idea.
No problem Aleks.
Unfortunately, I think the problem with data flow and other visual programming paradigms is that there really isn't a "natural" common notation. Written programming languages, the founding ones anyway, were generally constructed using common mathematical notation contingent on the ease of compiler/interpreter implementation. Even then, devising a programming language from scratch is still really hard (e.g. PHP). A decent formalism for visual languages appears to be harder still. In some cases a formalism exists, in electrical engineering for example, but generally speaking it seems to be quite difficult to get people to agree on what a picture should mean.
Take your example, for instance. At first glance I found it very confusing because the learning algorithms are just hanging out in space with no input data connected to their left-hand connection points. Reading your explanation helps (they're somehow being controlled by Test Learners), but it was still very confusing, especially since they appear to require input, having the same shape as Imputer, which takes the file input. I would have expected them to take an edge from Imputer, and ROC Analysis to take a number of inputs. This is just a simple example; imagine if we were looking at a complex piece of software.
Now, this isn't to say that there is no utility in visual interfaces. Obviously, statisticians tend to be very visual during exploratory analysis, a point Andrew tries to make on a regular basis.
Maybe, rather than abstract data flow, we should be looking at direct manipulation interfaces (something GGobi does to some extent), where analyses are built directly on the visualization of the analysis. I'm not sure what the workflow would be like. I seem to recall something done by, I think, UMass Amherst on plots you could manipulate in real time in ways more complex than brushing, etc.
Thinking about it, a formalism for visual data analysis languages would make an excellent PhD dissertation topic! There have been things done in the past, by Jock Mackinlay and Leland Wilkinson, but an update to this branch would be good to have.
Yes! Too bad I didn't think of it 6 years ago.
Clearly we need to find a first year to do the work for us…