Q&A about the GEP Book
These are Q&A from my personal
correspondence with readers of my
book, both of the first and second editions. It’s been great getting
your feedback and your emails! Thank you all so much! I know the second
edition of my book is a little bit pricey (but if it’s any consolation,
it’s a huge, not to mention substantial :), book with 478 pages), but
for those of you who cannot afford it, I made the first edition freely
available
online in html. So please take a look and join in on the
discussions! As usual, for privacy sake, I removed any references that
could identify the people involved.
I am impressed how much comes out of the treatment of
constants. It seems you have taken them much farther than in GP. I can't
remember, but I don't think GP has any analogue of the neat optimization
techniques. After reading that and the decision tree chapter, I can
really see how important the "Dc" part of the chromosome can be!
I’m really glad you understood the importance of the Dc in GEP (most
people choose to very conveniently ignore it). To my knowledge GP has no
analogue of the optimization techniques I propose but there are people
that nevertheless claim that GP can be used to do parameter optimization
by running a typical symbolic regression run – a perfect example of a
Rube Goldberg machine, in my opinion.
In the chapter on numerical constants, I wondered why you
have both the R and C vectors. In other words, why not just put the C
values (or a subset of them) in the Dc part? Why the extra step of
referencing C using R? I can see that some extra variability may result,
but since you mutate C as well as R, the search space is larger. I'm
sure there is something I missed here.
I don’t remember having weighted the pros and cons of this
representation. I first tried to implement a system similar to the one
used in GP but I wasn’t satisfied with the results and had a feeling
that something better could be created. I then came up with this system
and now that you ask, I think the use of both R and C has indeed some
advantages. First, it gives more control over constant variability (the
Dc length depends both on the head size and max arity and therefore in a
direct representation the variability will be dictated by the Dc size).
Second, it allows the appearance of duplicated constants (it’s much
harder to evolve exactly the same floatingpoint constants with the
direct representation) which seems to be very important as it is common
to see the same constant being used in different places. And finally, I
think it is more elegant.
I wasn't sure to what degree your criticism of
homogenizing was directed properly at GP, though. GP does not use
mutation usually, but Koza points out that the exchange of subtrees
results in constant innovation in the population. For instance, even if
all individuals are identical, a GP cross can still produce an
individual that is different from the rest. What do you think?
That’s true, but there is however some homogenization going on there
otherwise they wouldn’t say that in GP it is useless to continue
evolving beyond 51 generations. I’ve never implemented a Kozastyle GP
so I cannot tell you exactly what goes on within all those huge numbers
of individuals that he usually uses, but one can have an idea of what
takes place there by running a simple experiment in GEP: by using just
gene recombination and gene transposition both at 100% and a chromosome
with say 10 genes, one is just recombining mathematically concise blocks
as is done in GP and since genes are moving around in the chromosome
you’ll also avoid chromosomes becoming equal. And indeed populations
also stop evolving after a few generations. So perhaps what happens in
GP is that innovation stops because it is rediscovering again and again
old combinations of building blocks.
I keep coming back to how in GEP the average fitness is
way below the best individual fitness, and your use of relatively high
mutation rates. In GA, one says that you can't mutate too fast or as you
approach the optimum, you'll "mutate back" (disadvantageously) more
often than forward, and won't reach the optimum. Is it because you use
elitism that you can mutate so fast, and also so disruptively?
In GEP the behavior of average fitness will depend, among other things,
on the genetic operators you are using and, yes, if you use all the
genetic operators at the rates I usually use, you will get populations
with best fitness considerably higher than average. But if you use just
recombination (and even add a little bit of mutation), you’ll get what
one usually does for a GA. Obviously elitism plays a small part in all
this, and of course the larger the percentage of individuals that are
preserved from generation to generation the closer the gap between
average and best fitness. And since in GEP I only clone one individual
(the very best), this might also contribute to accentuate the gap even
more. (I know Koza usually keeps a fair amount of individuals unchanged
from generation to generation as I guess he cannot afford to lose
genetic diversity, but I remember doing some early experiments and
concluding that in GEP it is a waste of resources to keep so many
unchanged individuals as the system works equally well by keeping just
one of the best.)
About “mutating back” in GAs I have my doubts as people seem to be so
fixated on crossover. From my experience with GAs, I don’t see that they
behave so much differently than GEP. In all systems, there is a point
where you start applying too much mutation and you are just delaying
things or making things almost random, which obviously is not the goal.
And the GA is no exception. One can call it “mutate back” but I guess
this is just common sense.
I guess what surprises me is that the average fitness,
after an initial rise, stays pretty much constant, whereas the best
fitness keeps rising. Why doesn't average fitness follow best fitness
up? I guess the answer is that there is so much mutation (of various
kinds). But then, why does the best keep rising? I guess that's because
it can't fall! (due to elitism) and is occasionally bettered. Well, I'm
not very expert on this.
You got that right! These systems are usually optimized to solve
problems quickly and efficiently and, thanks to elitism, one can use
mutation rates considerably higher than the mutation rates used in
nature as what counts in nature (and in some artificial evolutionary
systems) is the performance of the overall population. Here, one is just
interested in the performance of the best and the rest of the population
is always being sacrificed for the benefit of the best. Obviously,
without elitism you would have to be more careful and if you were to
design good solutions you no doubt would have to be more conservative
and use much smaller mutation rates (as is done in nature). In this
case, the best and average fitnesses would be very close and both would
slowly and painstakingly increase with time (as happens in nature).
I found the last chapter “Evolutionary Studies”
stimulating, and I'm not sure I've drawn all the implications yet. I
guess the analysis in Goldberg's recent book (The Design of Innovation) is based mainly on what he rather pompously calls
"selectorecombinative" GAs – i.e., ones with crossover only. And he
derives a lot of stuff having to do with the sizing of populations in
order to assure a solution is found. From your analysis, this would seem
to be simply that he wants the populations big enough so that there will
be a significant founder effect, in order that the solution is found
before the homogenizing process of crossoveronly takes over. Is this
how you see that?
I think you are correct. If he only uses recombination he will need big
populations both to find good solutions and avoid premature convergence.
The combinatorial optimization results are startling
indeed. Your point about the abandonment of inversion in GAs is
interesting.
Why on earth do they continue fixated on recombination? Is it the allure
of the schema theorem and the building block hypothesis? Or is this a
case of trying to make the facts fit the theory?
I found the section on neutrality very nice, as it
counters any possible objection to "wasted" elements in the chromosome.
I hope you are right because I’ve received the most astounding comments
from supposed specialists in the EC field who think they know everything
there is to know about noncoding sequences in the genome. This jewel is
from a reviewer to a paper I submitted to a conference:

“Also, it's difficult to consider the sequences at the end of
the GEP genes to be ‘junk’. In the biological sense (unless I'm
mistaken) the large amount of junk in gene sequences is really
that – either they have shown it doesn't code for any proteins,
or they have not managed to assign function to it yet. Whereas,
you say yourself that your junk sequences in GEP genes are
‘extremely important’." 

The beautiful thing about the noncoding regions of
GEP genes is that we know exactly why they exist, that is, they exist to
ensure syntactic closure. That they may provide other benefits – such as
provide places for safely exchanging genes (more precisely, ORFs or
Kexpressions) or places for the accumulation of neutral mutations – is
just an extra bonus. Indeed, like so many times in evolutionary systems,
natural or otherwise, a structure first appears to solve a particular
problem but because it is also handy at solving others it brings
unexpected benefits, making the system much more robust as a whole.
The only chapter I found a bit confusing was Chapter 7. I
didn't know anything about the KolmogorovGabor polynomials or
STROGANOFF. I couldn't understand where you were going – I felt left
out. Then I saw at the end how you solved those problems much more
simply with GEP, and I realized that was the point of the chapter.
This is also the chapter I like less and I intend to rewrite it
completely the next time around because this algorithm is very
interesting especially if one uses simpler polynomials requiring just
23 different coefficients (the complete KG series requires 6, making
things excessively complex). And since there is no limit to the variety
of functions one can implement, it is possible to create a varied enough
set and just play with them as one does with any other mathematical
function (you can even mix them all together and it works even better).
I don’t think I stressed this enough in my book but anyone could reach
this conclusion just by analyzing the kinds of functions that were being
selected to build the best models (I actually found this out while
rewriting this chapter for the 2nd edition but at the time I was
unfortunately short on time to once again rewrite it).
Why do you weight all the
functions in the function set twice in almost all classification
problems? And what is the criterion for choosing the function set?
Well, what I found to work better for a
classification problem such as this (diagnosis
of breast cancer with 9 variables) is to have a function set with at
least twice as many functions as there are variables. So, in this case,
weighting the 10 functions two times does more or less this. Usually if
overall we have less functions than terminals, we will be making the
evolutionary process less efficient as the evolved models will have a
tendency to become very simplistic since the terminals will have a
greater probability of being integrated not only during the creation of
the initial population but also during genetic modification. So, it is
important to find a balance between the number of functions and
variables. As for the functions chosen, my personal experience is that
most of the time a simple function set with only the basic arithmetic
operators (weighted a certain number of times) is a good starting point
for classification problems. But you know how people are (myself
included): we like to use more fancy functions and since
GeneXproTools deals with more complex functions equally
well, I just like to spice the evolutionary process a bit and see what
happens.
Which values should be
used for the 0/1 rounding threshold? Will the current model not adjust
itself to any value you have chosen?
You are right: the plasticity of GEP is
extraordinary and it can, indeed, find a way around this (you see the
same plasticity operating with different function sets, linking
functions, random numerical constants, program structures and so on).
But for most problems there is a certain 0/1 rounding threshold that
works best. It will depend mainly on three factors: (1) on the kind of
data you have (only positive values, only negative values, a mixture of
the two, normalized or nonnormalized data, etc.); (2) on the kind of
functions you are using (for instance, if you are using functions that
by themselves already round the values they return to 0 or 1, in this
case a rounding threshold between 0 and 1 will do); and (3) on the
presence of random numerical constants (when a diverse range of random
numerical constants is available it is always possible for the algorithm
to work its way around even the most adverse situations, such as a
negative rounding threshold combined with only positive inputs and a
function set with only addition and multiplication).
Concerning the
diagnosis of breast cancer example, you write that you got a 340
fitness value (number of hits) against the training set of 350 records
and a 173 fitness value against a 174 records testing set. Did you wait
until the model has provided the best training fitness to apply it
against the test set and get this optimum result, or did you evolve many
times, through a "stop and go again" process until the test set fitness
has reached its best?
As you know,
GeneXproTools allows you to stop the evolutionary process after
which you can take a look at the performance in the test set. And if you
are not satisfied with the generalization results you might choose to
continue by using the evolved model as seed. Well, I do this a lot
whenever I’m trying to find the best possible model to solve a
realworld problem but, in the
breast cancer example you mentioned, I did several runs from scratch
using different settings (different functions, different rounding
thresholds, different architectures and so on) as I went along and then
I chose the best model.
Questions &
Answers
***
