Skip to content
 

Equation search, part 2

Some further thoughts on the Eureqa program which implements the curve-fitting method of Michael Schmidt and Hod Lipson:

The program kept running indefinitely, so I stopped it in the morning, at which point I noticed that the output didn’t quite make sense, and I went back and realized that I’d messed up when trying to delete some extra data in the file. So I re-ran with the actual data. The program functioned as before but moved much quicker to a set of nearly-perfect fits (R-squared = 99.9997%, and no, that’s not a typo). Here’s what the program came up with:

models.png

The model at the very bottom of the list is pretty pointless, but in general I like the idea of including “scaffolding” (those simple models that we construct on the way toward building something that fits better) so I can’t really complain.

It’s hard to fault the program for not finding y^2 = x1^2 + x2^2, given that it already had such a success with the models that it did find.

As I type, the program is continuing to run and coming up with variants such as y = 0.998904*x2 + (x1*x1 + x1*x1*x1*x1*x2)/(1.9284 + 1.9284*x1*x1*x2*x2 + (11.3707 + x1*x1 + x1*x1*x1*x1*x2)/(x1 + x2)). The latter is assigned a “size” (degrees of freedom?) of 49, which seems a bit of a stretch given that the model is being fit to 40 data points. In the output, the sample size is characterized as 40 in the training data and 24 in the validation data. I can’t figure out where those 24 validation data points come from!

Let’s see how the model fits. Luckily, I still have the R code that I used to simulate the data:

x1 <- runif (n=40, min=0, max=10) x2 <- runif (n=40, min=0, max=20) y <- sqrt (x1^2 + x2^2) So let's try simulating a new dataset and seeing how the models fit: models <- list (x2, .97+x2, x2+.28*x1, .43*x1+.91*x2, x2+x1*(x1/(9.75+x2)), x2+x1/(.86+(x2/(.60*x1))), 1.00*x2+.64*(x1/(.58+x2/x1)), x2+x1/(.89+x2/(.73*x1))-.04*x1, x2+x1*(x1/(1.94*x2+x1*(x1/(x1+x2)))), x1*(x1/(1.94*x2+x1*(x1/(x1+x2))))+x2-.00057, .01 + 1.00*x2 + 1.00*x1*(x1/(1.93*x2+x1*(x1/(x1+x2))))) model.size <- c (1,3,5,7,9,11,13,15,17,19,23) n.models <- length (models) mae <- rep (NA, n.models) for (i in 1:n.models){ mae[i] <- mean (abs (y - models[[i]])) } png ("eureqaplot.png", height=400, width=500) par (mar=c(3,3,2,0), mgp=c(1.7,.5,0), tck=-.01) plot (model.size, mae, ylim=c(0,1.05*max(mae)), yaxs="i", bty="l", xlab="Model size", ylab="Mean absolute error on new dataset", main="Cross-validation performance of models obtained from Eureqa") dev.off() eureqaplot.png

Let me blow up the right end of that scale:

png (“eureqaplot2.png”, height=400, width=500)
par (mar=c(3,3,2,0), mgp=c(1.7,.5,0), tck=-.01)
plot (model.size, mae, ylim=c(0,.1), xlim=c(10, max(model.size)), yaxs=”i”, bty=”l”,
xlab=”Model size”, ylab=”Mean absolute error on new dataset”,
main=”Cross-validation performance of models obtained from Eureqan(complicated models only)”)
dev.off()

eureqaplot2.png

(Yeah, yeah, I know, I should use ggplot2, or at least write my own plotting function so that I don’t have to keep fixing these defaults every time I make a plot. Also, I have difficulty getting the indenting to show up in html; my actual code looks prettier.)

The above is a best-case scenario: the data actually come from a mathematical formula, and we’re using the identical formula for cross-validation as for the original data. That said, the program fit the data well. In fact, I kept running longer, and even the super-complicated models did not overfit. For example, the model with size=51 had a reported error of .001 on the fitted data. When applied to the simulated cross-validation data, the mean absolute error was still only 0.003–outperforming the simpler models in the graph above. So the program is doing a good job of getting close to the curve.

Having succeeded in doing some fitting, I decided to look more carefully at the modeling menu. The default building blogs included in the fitting are “add,” “subtract”, “multiply”, “divide”, “sine,” and “cosine.” I’m surprised it didn’t find the true model (y^2 = x1^2 + x2^2), which can be formed by addition and squaring (which is a form of multiplication). I guess they allow these manipulations of the input variables but not of the outcome. If I re-run adding the “power” option, I assume it will find the fit (y = (x1^2 + x2^2)^(1/2)) right away. This is cheating, of course, but still worth a try. I just started it now, will let you know what happens…

Yep, after about 25 minutes it found it: y = (x1^2 + x2^2)^0.5. Then it kept going and came up with y=(0.0575728 + x1*x1 + x2*x2)^0.500012 – 0.00507274, but that’s ok, there’s unavoidable round-off error.

A bunch of (mostly) good little things about the program

1. It runs right out of the box on my computer.

2. It has a real-time display.

3. The display has an x-axis in seconds (rather than iteration #).

4. The x-axis is on the log scale, so you can see the fast change at the beginning and the slow change later on.

5. It figured out automatically to take my variables x1 and x2 and change them into x-subscript-1 and x-subscript-2. Those little things help!

6. It rounds off the numbers in the display.

7. It uses sound as well as visual feedback, emiting a little beep every time it . . . well, I’m not actually sure what happens every time the program beeps, but it must be telling me something!

8. On the downside, it had some problems when I tried to re-run it. It changed my dataset and my model without warning.

9. Also a minor criticism: when plotting the mean absolute error, the y-axis goes below zero, which is a mistake, I think.

10. For some mysterious reason, the program flaked out at the end and crashed while I was trying to copy over some formulas. No huge deal, I’d already saved the image with the formulas.

What I can learn from this method

Several commenters have informed me that these equation search methods are not new. But they’re new to me! I guess it would make sense to compare with BART and other nonparametric methods. (Eureqa isn’t literally nonparametric, but once you start bringing in functions such as y = 0.998904*x2 + (x1*x1 + x1*x1*x1*x1*x2)/(1.9284 + 1.9284*x1*x1*x2*x2 + (11.3707 + x1*x1 + x1*x1*x1*x1*x2)/(x1 + x2))., this is as nonparametric as I’m ever gonna get.) This stuff looks more sophisticated than much of the statistical literature on nonparametrics. I really don’t have much of a sense of what can be fit, but as noted above, I was impressed at how complicated a model could be constructed from 40 data points. Next, maybe I can try an example with actual noise (that is, variation in the y’s that cannot be predicted by the x’s).

What this method can learn from me

I’m guessing that some of my hierarchical/multilevel modeling ideas would be useful when considering grouped data or categorical predictors (for example, “state” as an input variable that has 50 levels). Also, the idea of running their algorithm from multiple starting points. Beyond this, I’m not sure.

What are the limitations of this method (for me)

I work on problems where there are no “natural laws” to be discovered, in the sense discussed in this article. In my own areas of application in social and environmental sciences, I’m never going to come across any conservation laws or invariances. The closest I’ll ever come is to find some “stylized facts.” Such stylized facts are important, and I can give you several examples of them, but they’re not going to look like the models that come out of this sort of program, I think. (Just as, to choose another example, I’m not particularly interested in Tetrad-type methods that attempt to discover conditional independence structures in data: on the problems I work on, there’s never any conditional independence except when it’s imposed in the research design.)

I can’t see how, for example, I could use this Eureqa-type methods to make this sort of map of public opinion given age, income, and state (and, ultimately, I’d like to include lots more factors) but maybe similar ideas could be used to develop discrete models that make sense and predict well. Perhaps what I need to do is finish my “Benchmarks” paper where I describe a few of my applied data analysis problems and package the data and models in a way that’s convenient enough that computer scientists and others could go and try out their own methods on my sorts of problems.

P.S. Igor Carron posts some comments on Eureqa and links to a skeptical article by Christopher Hillar and Friedrich Sommer. This latter paper is only 4 pages long and is written clearly enough, but I can’t really understand what they’re saying. Something about fitness functions and . . . whatever. It’s the usual language barrier. One good thing about having the software implementation is that I can skip over the theory and see how it works in examples.

20 Comments

  1. Aniko says:

    I think that the complex models work only because of the lack of random errors in the model: there is nothing to overfit. Any continuous function can be approximated as a linear combination of well chosen basis functions, so of course this worked. However adding random errors, or extending the range should show overfitting issues with the complex models.

  2. http://models.street says:

    To get proper indentation in code, you should use a "<pre>" tag which is the HTML way of saying "this is literal code, don't monkey with the formatting". If you have a "visual" editor in your blog software often there's a button that tells it to format the code as preformatted.

    As for the article by Hiller and Sommer, it basically is pointing out that the method used in the article about searching for natural laws is not a mathematically valid way to find Lagrangians and Hamiltonians reliably. Although it does seem to work, it should also work with a wide variety of other functions, whereas on the other hand the fitness function they give is specialized to finding Lagrangians and Hamiltonians.

    For those not up on the physics background, a Lagrangian is basically the difference between the total kinetic energy (energy of motion) of the system and the total potential energy (energy of position). A Hamiltonian is the total energy of the system. A Lagrangian is a more general concept, applying in a wide variety of coordinate systems, whereas a Hamiltonian is a concept that applies only in a special coordinate system that can be derived from a Lagrangian.

    Their article is specific to this physical problem, but the general idea, that you need to choose your fitness function properly, is valid in all sorts of optimizations.

    </pre>

  3. http://models.street says:

    arg, of course my <pre> tag got interpreted as html and goofed up my previous post because I didn't use the html entities for less than and greater than.

  4. Ian Fellows says:

    I think the real question if the creators are interested in uncovering natural laws is how well it generalizes past the training data range.

    using:

    x1

  5. Bill Tozier says:

    Oh dear. Did they do that bad a job explaining what is actually happening here?

    The system is evolving equations as binary trees of functions, literals and variables. The 49-point equation you are confused by is (approximately, since we don't have the literal unsimplified tree representation which might have a slightly different ordering of branches)

    <pre>
    add
    ..times
    ….0.998904
    ….x2
    ..divide
    ….add
    ……times
    ……..x1
    ……..x1
    ……times
    ……..x1
    ……..times
    ……….x1
    ……….times
    …………x1
    …………times
    …………..x1
    …………..x2
    ..add
    ….1.9284
    ….add
    ……times
    ……..1.9284
    ……..times
    ……….x1
    ……….times
    …………x1
    …………times
    …………..x2
    …………..x2
    ….divide
    ……add
    ……..11.3707
    ……..add
    ……….times
    …………x1
    …………x1
    ……….times
    …………x1
    …………times
    …………..x1
    …………..times
    …………….x1
    …………….times
    ………………x1
    ………………x2
    ……add
    ……..x1
    ……..x2
    </pre>

    as far as I can tell. That comes to 49 program points (leaves and branches together), so I'll assume I've made only a few transcription mistakes. I've used the indents to try to capture the terms in the order you've transcribed them, and broken apart the higher-order terms into binary functions.

    So
    <pre>
    times..0.998904..x2
    </pre>

    for example is the term (0.998904 * x2), and so on.

    Does that help explain the confusion about the "size" column? It's the number of program points in the underlying binary tree representation of the equation. So y=x1 has one point, y=x1+x2 would have three, and so forth.

  6. Alex says:

    Did you consider that the program seems to be looking for a function, and that the Pythagorean equation isn't a function? It was only when you told it that you wanted a power of 1/2 that it found it, and of course, with a power of a half, the program could then interpret that as a function. Perhaps?

  7. Bill Tozier says:

    Andrew, I would like to try to translate or provide some sort of better context for these techniques, but I confess I'm having trouble parsing your questions about the results. Forgive me if I misunderstand some of the things you've said above, but as somebody who has does this stuff for more than a decade I'm hopeful that I can clarify a bit.

    I've looked over the "User Guide" they provided with the program, and… well, they dumbed it down a lot, I have to admit. I suspect some poor choices of wording might be part of the problem. You might look over Poli, Langdon and McPhee's (free) Field Guide to Genetic Programming for a better introduction.

    The model at the very bottom of the list is pretty pointless, but in general I like the idea of including "scaffolding" (those simple models that we construct on the way toward building something that fits better)

    While I agree with what you're approving—the idea of including increments when presenting a complex model—that's not what's happening in this ParetoGP system. The random sample of models that form the initial population are probably mid-sized models with (I'll guess) 10 or 20 program points each. The little ones are derived from them, not the other way around.

    As these trees of middle size are chopped up randomly and recombined with one another to form subsequent generations, some offspring are inevitably recombinations of little chunks of earlier model trees.

    Because recombination doesn't pay attention to the meaning of the pieces it's chopping up, these smaller models are practically unrelated. Except for the effects of genetic drift, which will tend to propagate sub-trees that appeared in the initial population.

    The models you're seeing in the report are the best blobs (of that size) the search has found by randomly grinding up and gluing together bits of the initial random sample.

    The latter is assigned a "size" (degrees of freedom?) of 49, which seems a bit of a stretch given that the model is being fit to 40 data points.

    "Size" of an S-expression like Hod and Michael are using to represent these formulae is defined as the number of branches and leaves in the tree representation. In this language {+, ×, −, ÷} are binary branches (they take two inputs only), and all the leaves will be your variables and constants. So when you add up all the binary functions, constants and variables in your 49-point function, you'll see that's what they're counting. Same with the other models: count the number of (binary) functions, variables and constants.

    It doesn't have anything at all to do with your data. They're using a heuristic measure of the structural complexity of the function, is all.

    In fact, I kept running longer, and even the super-complicated models did not overfit. For example, the model with size=51 had a reported error of .001 on the fitted data.

    OK, I think I know what the problem is here. Let me see if I can explain it without drawing any charts :)

    Genetic programming is a population-based metaheuristic. In this implementation there are, I dunno, maybe a couple of hundred different equations ("trees") in the population at a given moment. Those models are not changed while they're being evaluated: your data are plugged into each of those couple hundred trees in turn, and each tree is associated with two scores: its size, and its error over the training dataset.

    Then, based on those scores, parents are selected at random, chopped up, and recombined into offspring. I don't believe there is any parameter fitting in this simple program, except possibly by mutation. The fact that you see constants repeating in your more complex trees makes me think they've arisen from recombination or duplication. Certainly not fitting.

    The models you're seeing reported over time are the nondominated set of models. That is, the set of trees such that no other tree has been discovered which has both lower error and smaller size.

    I'd love to understand why you don't consider the most complex models overfit. In my crowd, they'd be horrifying. We tend as a rule of thumb to seek models near the "knee" of the Pareto front, close to the origin, since these give a decent tradeoff between overfitted and "pretty pointless".

    In any case, there is no way you will ever be shown a model of the same size, whose error is higher, in this report. I'm sure thousands of models have arisen which are "overfit" in the sense used with parametric modeling. But they're culled and/or hidden here, just because you're shown only the nondominated set.

    A lot of the difficulties I've experienced through the years when statisticians (and Operations Research people) encounter GP for the first time come from the very different meanings of "fitting" in the two realms. Often it's not clear to outsiders that GP is simultaneously searching through the set of all model structures from a given language, and (rather inefficiently in most cases) also trying to find constants to reduce error of those structures.

    Most of the action in GP, and especially in this little program intended for popular consumption, involves structural changes of models, not parameters. Most of the experience of folks outside the field (and particularly in statistics) depends on a model or small family of models having been selected a priori, and parameters fit to that.

    I hope this helps rather than sounding annoyingly didactic. These are a very powerful suite of techniques which have been crippled for more than a decade by a quirky isolationism. I do think there are places you could use GP (not just symbolic regression) in your work; the technical challenges are far less than the cultural ones.

  8. Andrew Gelman says:

    Bill: Thanks much for the explanation. I'm pretty sure you're counting the operations in the right order, given that I cut-and-pasted the expression directly from the program. (That's another nice feature, that it allows the user to see the equations visually and also to copy them to other software such as R.)

    Regarding the overfitting question, my point is that these complex models also performed well on the second dataset that I constructed from the model. That impressed me. But maybe it's as Aniko said in the comment above, that it all only worked so well because my underlying model was smooth and deterministic. I'll have to try again with a noisier example.

    Alex: I don't appreciate your sarcasm. I was trying out the program as a user, seeing how it works. I don't mind that lots of people use the methods of Bayesian Data Analysis without fully understanding what's going on. Understanding is often gained by examples and does not occur all at once. To you it was obvious that y^2=x1^2+x2^2 is not a function; to me this was not obvious: I saw a list of operations and did not know if they could be applied to the left-hand-side variable as well as to the right-hand-side variable. What better way to learn than to try it out? And what better way to communicate my learning than to report what I found in the blog? Also, I wanted to see how the program performed using its default settings.

  9. Bill Tozier says:

    Andrew,

    Sorry about that; it did finally sink in after I had posted that you were testing on out-of-sample data in your graphs, and that statistically those are still quite good though complex solutions. Interestingly, within the GP community (where statistics are pretty scant) I suspect 90% of practitioners would consider most of the bigger functions "overfit" in their folk sense, which probably has something to do with Occam and Kolmogorov complexity or something.

    I wonder whether that fact that the program itself breaks your dataset into training and testing subsets affects what we should expect as the outcome. I don't know how (or why) they're doing it, but the interface does report two subsets.

    It may well be there is a selection against overfitting already in place, if they are (implicitly or explicitly) penalizing functions whose OOS result is worse. I'm reminded of some evolutionary algorithm variants in which parents are selected on the basis of their in-sample scores, but culling is based on OOS scores. Again, documentation would probably be more useful than me riffing more :)

    One last comment on your results from the posting above, hopefully helpful since I've actually thought about both the "overfitting" question and what hadn't sunk in for me about the defaults you used. If I'd been reading more carefully, I would have been less surprised. Most of the GP work we do involves a much more expressive function set.

    It's always helpful to explain the dynamics of GP in terms of not just selection but its other constraints, like the ranges of constants allowed or the function set. In this case, you've asked it to fit data with a function of x1 and x2, but as Alex suggests you haven't given it the tools you used originally to generate the data.

    That's what I like to call a "one hand tied behind your back" search. Flag but set aside for a moment the fact that with non-generated data you don't know what the underlying generating process is, or what tools are "really" used. It's important, but hold on a second.

    I often have students try a few one-hand-tied problems when teaching symbolic regression (and I'll probably add this one in future, actually!): fitting y=abs(x) with {+, ×, −, ÷}, is a good simple one. Students will often look at what happens and hypothesize that the GP search explicitly breaks the problem down into a stage where it "learns the algebra" and then "fits the data", but in fact there are no such explicit stages "in" the algorithm. Nonetheless, the GP system does in fact learn the algebra first: you'll see the arithmetic versions of the absolute value function cropping up very quickly in such searches, because those are the simplest way to model the data.

    The classic one-hand-tied project involves fitting data from a simple linear model, but without allowing the system to have any explicit constants but 0 and 1. Again: it works in time… but it does so by "first discovering number theory."

    I hope you can see where I'm going: Your one-hand-tied search was learning algebraic solutions to the square root. I'll go out on a limb and guess that given sufficient time your more complex results would look a lot like this. Indeed, it may be that some smart algebraist would be able to munge them around to look quite similar.

    The point being: including primitives like fractional powers (or transcendentals, which I bet would also have simplified your search via Euler's formula) are not "cheating" at all. As I mentioned above, this may feel like "cheating" because of the shadow of Occam hanging over one's shoulder.

    But as we all know, and as your homework assignment for your students points out: modeling real data is not an automated process in which one trusts the answer one gets. GP with its various powers may be the worst place to carry that assumption. Modeling is a dialog, in which we pose challenges to the dataset itself, examine the results statistically, and revise our hypotheses. The fundamental difference between symbolic regression and traditional modeling methods, as I've mentioned, is that in symbolic regression you aren't merely seeing how your specified model "does" with the data, you're testing a wide-ranging family.

    The "answers" it gives are thus qualitatively different from those from parametric modeling. One can learn a good deal more from them than just whether a particular model is a "good fit" or not. Just as in your case I suspect you were asking the system to first discover the arithmetic square root function, by under-powering a symbolic regression system one is "asking" for a very limited set of insights.

    Flor Castillo and colleagues from Dow realized this some years back, and used symbolic regression with a rich function set to discover variable transforms from real data that could be used in subsequent "classical" statistical models. Occam himself wouldn't have his feelings hurt by this process, I bet.

    And that gets me back to the cultural differences in modeling I mentioned. I appreciate your writing about this. There's a lot more to GP than symbolic regression, and I suspect one could find some useful application to help with your maps, and the field would certainly benefit from more benchmarks.

    Thanks much for having a look at it.

  10. zbicyclist says:

    Because Andrew's mucked around with the program [note: that's a compliment], and because of the high quality of the comments, I've learned a lot vicariously here. I look forward to the weekend when I can play around with the program myself. THANKS TO ALL!

    Bill Tozier: what would be a couple of your favorite places to learn a BIT more? (a couple of glasses of water for now, not a firehose of information)

  11. Alex says:

    "I don't appreciate your sarcasm. I was trying out the program as a user, seeing how it works. I don't mind that lots of people use the methods of Bayesian Data Analysis without fully understanding what's going on. Understanding is often gained by examples and does not occur all at once. To you it was obvious that y^2=x1^2+x2^2 is not a function; to me this was not obvious: I saw a list of operations and did not know if they could be applied to the left-hand-side variable as well as to the right-hand-side variable. What better way to learn than to try it out? And what better way to communicate my learning than to report what I found in the blog? Also, I wanted to see how the program performed using its default settings."

    I wasn't being sarcastic. I was suggesting a solution as to why it was finding the Pythagorean equation unless you explicitly told it to look for powers of 1/2. I'm just a layman, I've never used the program you've described, nor do I have much statistical knowledge. I was just making a suggestion as to what the problem was, but for all I knew, you had already considered this, or it did not get to the heart of the matter. I was also asking if my suggestion seemed reasonable, since at the end of my comment, I asked "Perhaps?"

    I honestly don't see what in my comment justified your response. Nowhere did I say that you shouldn't try the program out, and nowhere did I say that you shouldn't report your findings on your blog.

  12. Andrew Gelman says:

    Bill: I only said that including the power function was cheating because it was not in the default settings of the function. Here's what happened:

    1. I heard about this curve-fitting program.

    2. I took one of my favorite examples, an example where I thought the program would have a pretty good chance of performing well.

    3. I ran the program on its default settings.

    From this perspective, inserting anything between steps 2 and 3 is cheating. If the default settings had included the power function, I would've run it that way. What I wanted to do was run the program out of the box using its recommended settings.

    Alex: I apologize. I guess I've just rediscovered once again how difficult it is to read intonation from typed speech!

  13. Bill Tozier says:

    Favorites? Well, to be frank I tend to be mean and critical of most computer scientists' and engineers' ability to explain their work usefully. The working title of the workshop presentation I'm prepping for GPTP next year is "What Makes GP Difficult? (Hint: You)"

    I think Hod and Michael have linked to some very good general introductions to symbolic regression at the Eureqa site. The Wikipedia article has some nice explanatory stuff in it.

    Much of the Genetic Programming literature is flawed by its focus on symbolic regression, which has been the nominal "killer app" for a long time. As you should be able to tell from my tone, I tend to find it rather trivial and sometimes misleadingly so. GP proper is a way of searching through sets of structures, not merely arithmetic or mathematical models, but algorithms, engineering designs of physical objects, recipes: all sorts of abstract objects. Personally I'm hoping (with a big dose of work from me and my colleagues) to get us past the "GP is for fitting curves" stage and onto something better.

    Personally I'd suggest a look at any of the Humie Award winning papers, which I think can all be downloaded from the Humie Awards website. The point I was trying to make about the triviality of symbolic regression results comes across best, I think, in Lee Spector's gold-medal paper from last year, or Stephanie Forrest's from this year.

    With those caveats, for more info on symbolic regression as such I'd still unreservedly suggest Riccardo Poli, Bill Langdon and Nic McPhee's book (free download at Lulu) A Field Guide to Genetic Programming. And if you're serious about seeing a 12-year-old view of the breadth of techniques (most of which are still under-explored) look for a cheap copy of Banzhaf t al's Genetic Programming: An Introduction.

  14. Bill Tozier says:

    Andrew: Sorry—intonation problem again :). It wasn't a criticism, though your reaction does underline a very difficult design decision with these systems: the Goldilocks point.

    If for example a huge language of 120+ functions—including recursion and iteration—is "turned on" by default, the risk is that a new user will try it out on a well-known simple/toy (or homework) problem, it will work as designed, and they'll have perfectly well-formed models of their linear data in terms of arctangents and who-knows-what. It may not be overfit, but the expectation of the user will have been undermined because unwittingly they will have under-specified the problem.

    On the other hand, if the barest minimum is "turned on" as a default, and a new user tries it out on something a bit more complex (as you did), it will work as designed, and they'll have perfectly well-formed models of their nonlinear data in terms of the bare minimum linear functions. Again, it may not be overfit, but the expectations of the user may have been undermined.

    Interestingly in this case, though, is that contrary to my initial reactions it looks as if the system is doing a fine job working out of the box on the data you supplied. It's not yet overfit because it's still working out the appropriate continued fraction representation for the problem you posed to it. Me, I'm delighted at this result you shared; the more I look it over the more I think what it's doing in the few hours you've run it is reinventing some fancy algebra from the 18th century.

    I don't believe there is a single sweet spot in that range of too much power vs. too little power. I expect any general-purpose modeling methodology will suffer from the same problem.

    It's an interesting challenge faced by any designer (or popularizer) of tools, and why we so often get consumer-grade, hobbyist grade, prosumer and professional versions of essentially the same stuff.

    I'm still wondering (not having a Windows machine) whether this package you're playing with is in the first or second rank on that list. I'm more pleased all the time, though, that it seems to be robust and simple enough to point out some aspects of the approach that even advanced users (well, let's just say "people who submit papers to conferences") often fail to appreciate.

  15. Alex says:

    "I apologize. I guess I've just rediscovered once again how difficult it is to read intonation from typed speech!"

    No worries. We've all done it.

  16. Bill Tozier says:

    Just want to go on the record to say that I'm quite impressed with the modernity of the algorithms Michael and Hod have implemented "under the hood" in the Eureqa package. I get a sense there's a lot more adaptive power in the methods being used than one gets from vanilla, off-the-shelf symbolic regression.

  17. Bill Tozier says:

    Just a quick note in support of what I was only guessing earlier, based on running Eureqa here on my MacBook Pro using a similar synthetic dataset with 500 points. It is indeed "deriving" the continued fraction representation of length of the hypotenuse of a right triangle::

    <pre>
    f(x1,x2) =
    x1 + x2*x2 /
    (2 x1 + x2*x2 /
    (2 x1 + x2*x2 /
    (2 x1 + x2*x2 /
    (2 x1 + x2*x2 /
    … /
    (x1 + x2)
    )))…
    </pre>

    (or the equivalent where x1 and x2 are swapped)

    I've honestly never seen this happen so cleanly before in symbolic regression. I hope you don't mind if I cadge it as a great example of a one-hand-tied problem. It was beyond my long-lost freshman math ability to derive the continued fraction myself. But a spreadsheet seems to have confirmed the relation, which converges quickly for most values of x1 and x2 within your sample.

    Here in a couple of hours I ended up with five nested layers of this equation, with some miniscule rounding error on a couple of constants near the ends.

  18. Igor Carron says:

    Andrew,

    As you rightly point out, the data need to come from a smooth manifold. There are common occurrences when this is not the case such as image data taken from your average cameras as occlusion. Whenthe authors initially mentioned their original paper underlying eureqa, they made a specific reference to using data from movies.

    Igor.

  19. Bill Tozier says:

    Igor,

    Actually, tat's not true. As I mentioned in passing, Eureqa includes a quite limited set of primitives. Most symbolic regression folks will tend to include some form of IF:THEN conditionals as well; full-featured (Turing complete) GP languages typically include either recursion or iteration.

    With the addition of conditional expressions (or those others), smoothness is not a limitation… except when it comes to interpretation, of course. I suspect the limitation was a good one, since interpreting what makes a "good model" of data when there are conditionals present is a matter of some controversy. Clearly :)

  20. Sherman Dorn says:

    The "Pick Modeling Task" panel inside Eureqa allows one to pick operations, and "square root" is one of 'em. The current "best fit" when I've run the same data:

    y= sqrt(0.0522076 + x2*x2 + x1*x1 – 0.00793067*x2), which is pretty darned close,
    and y = sqrt(x1*x1 + x2*x2) isn't far behind. I'll assume the difference could easily be a result of rounding error.

    I've written more about this tonight, and the changing train/validation data set split that's annoying to you, Andrew, seems like a potential value to me. Cut to the chase: if someone can use Eureqa to "rediscover" the Brass logit relational model, I think it'll have proved its value in social science. Any enterprising demography grad students out there?