I ran into this nice genetic programming demo applet:
I took down the number of generations to solution 10 times:
As you can see, the distribution is fairly wild. Outside of this experiment, I saw one or two runs that took up to 1,000 runs, as well. (1,000 is the cut-off for this particular simulation. If the program gets to 500, it seems like it generally goes all the way to 1,000.)
The special thing about this kind of distribution of numbers, as Linas Vepstas discovered while tuning MOSES for Opencog, is that it means there is a failure to maintain diversity properly; in fact, such an extreme failure that we would be better off throwing away our work and starting over every few generations (because the expected run-time is dominated by overly long runs). In this case, we could re-start the algorithm every 2 generations, with a 1/10 chance of success... giving us an expected run-time of 20!
Of course, we don't know ahead of time what cut-off will be optimal, but that can be addressed to some extent with clever tricks such as starting with a small cutoff and increasing it over time. The point, as Linas discusses in his article, is that we should never run into a situation like this. The bias created by the search algorithm is doing more bad then good. What a good restart strategy does for us is effectively shift the algorithm towards a brute-force search which simply looks at possibilities randomly until it finds a solution. (If we used a cut-off of 1, we'd be doing exactly that.) The algorithm isn't learning about the search space over time; allowing it to "remember" very much is only causing it to get stuck in local maxima within the landscape.
Intuitively, it makes sense to suppose that the hill-climbing ability of the system is only good at exploring very "local" terrain in the fitness landscape. Using rapid re-starts means we are trusting the genetic algorithm to climb relatively small hills, but prefer a more brute-force strategy for the global search. A brute-force approach (trying random possibilities until we get a solution) would have a 1/n chance at each attempt, for a search space of size n; this gives an expected run-time of n.* If genetic-algorithm hill-climbing can efficiently search a small area of size a in the amount of time it would take to brute-check an area of size b, then we can improve our search time by a factor of a/b using a genetic algorithm that gets restarted at intervals of time a.
Because the chance of success, be it 1/n or b/an, is a constant for every time-step, the time-to-solution for this kind of algorithm follows an exponential distribution. (The chances that we're still searching at a particular time t decay exponentially.) What Linas is saying in his post is that if our waiting-time distribution is any worse than exponential, then something is wrong with our algorithm, because we can improve it (to an exponential) via re-starts.
Rather than resorting to re-starts, Linas added extra parameters to the algorithm which encouraged more exploration, tuning these parameters for the best apparent results. He got better than a/b improvements this way; he reduced the a/b ratio itself (the efficiency with which MOSES searched small areas) by a factor of about 3/10. However, the resulting algorithm could still benefit from restarts; although the improved waiting-time fits an exponential for most of its duration, it gets worse toward the end (meaning the longest runs still have some problem).**
How can we do better than that? How can we avoid the detrimental effects of a genetic algorithm with a longer memory? (How can we keep the bias from gathered knowledge from making us behave worse than a system with no knowledge at all?)
The problem here reminds me of the exploitation/exploration trade-off studied in reinforcement learning. (Sorry I didn't find a better reference... I should improve the wikipedia article on this...) Genetic algorithms (and other, similar algorithms) search in the areas where they expect the highest reward, but their knowledge is inherently incomplete (because they are still searching!). As a result, they get stuck looking in areas near the best solutions found so far. The only way to improve the search is to make them forget (as in restart-based solutions) or make them ignore their current knowledge (injecting randomness into the decisions, as Linus did). Reinforcement learning gives us more effective strategies for exploration, in which we use our current knowledge to explore. Rather than only looking in areas which have demonstrated high scores so far, we can look in areas which might demonstrate higher scores; in particular, areas which have demonstrated high variance or areas which we haven't searched significantly. Perhaps the problem with genetic algorithms is that they do not remember how thoroughly they have searched specific areas, which means they continue to be very interested in areas in which they've already done a lot of searching.
What this leads me to consider is the idea of comparing Monte Carlo Tree Search (which has a built-in exploration/exploitation model) to genetic algorithms. (Again, I see wikipedia needs to be improved! ...) MCTS has revolutionized board-game AI, especially Go.
I first started considering MCTS for automatic programming in the fall, as a (possibly) pragmatic way of implementing incremental Levin search. The idea in that case would be for the system to retain probabilities from all problems it has ever solved, so that it would eventually build up good programming skills.
In this case, the idea is a bit different. The sampling of the search space would initially be quite random, gradually settling on better areas (but continuing to bias toward under-explored areas more than well-explored areas). This might get past the problems which genetic programming seems to have. It might also give up the advantages of local search, since it would more gradually settle on local areas rather than choosing random areas to search in a hill-climbing fashion. I'm not sure about that, though; the nature of the tree-search might include that behavior.
In any case, I intend to modify the code of that applet to try some ideas... but we'll see. (Followers of this blog will know that I intend to do a lot of things, but don't usually finish!)
Do most genetic programming systems have worse-than-exponential behavior, or is it simply a sign of bad tuning? The applet which I used for my experiment was certainly not intended for industrial use.
Has anyone applied MCTS to general optimization problems, and to automatic programming in particular? Has there been a direct comparison between genetic algorithms (or genetic programming) and MCTS? (MCTS was originally a planning algorithm for markov decision problems.) A little searching suggests the answer is "no".
* Bonus points if you know how to prove that. It's also interesting to point out that a more systematic brute-force approach, where we try all the possibilities in order rather than choosing randomly, gives us an expected run-time of about n/2, since we have a max time of n and an equal chance of finding the solution at each point. However, this is usually considered less important than other aspects of the search (since it only gains us a factor of 2).
** In the appendix of his article, Linus suggests that an exponential distribution is the best we can hope for; because of the highly parallelizable nature of this kind of search, we simply cannot expect anything better. This is not strictly true; see previous comment. However, interpreted more broadly, we can understand it as a point about the shape of hard search spaces. Suppose that we could count on some regularity at each level of detail, from global to mid-range to local. Then the search would take something like log(n) time, since we could prioritize half the search space, then half of that, then half of that, and so on. This is what we expect if there are several sub-problems which can be optimized independently. In a highly coupled problem, we can't rely on this.