Dr. Ferreira's interview from March 26, 2001 for
Technology Research News.
Interview with Cândida Ferreira
By Kim Patch
What gave you the
idea to create Gene Expression Programming?
I developed the basic ideas of Gene Expression Programming (GEP)
in September and October of 1999 almost unaware of their uniqueness. I
was reading Melanie Mitchell’s book
An Introduction to Genetic Algorithms
and meticulously solving all the computer exercises provided at the end
of each chapter. When the Genetic Programming (GP) technique presented
to discover Kepler’s third law was introduced, I implemented an
algorithm in C++ to solve it. And, surprisingly, this new algorithm
found the solution in the first run I made using an initial population
of only 10 individuals and iterated for 10 generations. I was so
impressed and so happy with these results that I proceeded immediately
to apply it to more challenging problems. I bought John Koza’s first
book then (Genetic Programming: On the Programming of Computers by Means of Natural Selection ) and applied my algorithm to several
problems and the comparisons consistently showed that it tremendously
surpassed GP. By then the details of the GP technique were unknown to me
and I couldn’t understand the reasons for this difference in
performance. Only later did I realize that, as a biologist, I had
created a new, more powerful algorithm borrowing several new ideas from
biology, especially the creation of a perfect genotype-phenotype mapping.
How exactly does the
algorithm work? Can you describe it so I can picture it?
Gene Expression Programming starts by randomly creating
the chromosomes of the individuals of the initial population. Something
like this:
| |
Generation 0 or Initial
Population:
--+/*aaaaaa//+/*aaaaaa+a+--aaaaaa-[0] = 0.0305998
*+-aaaaaaaa-*-//aaaaaa/*-*-aaaaaa-[1] = 0
/+/--aaaaaa**-a-aaaaaa*a-/aaaaaaa-[2] = 0.0306187
--aa*aaaaaa/*-a-aaaaaa-*-aaaaaaaa-[3] = 0
-*+/-aaaaaa+/*-+aaaaaa*/a--aaaaaa-[4] = 0
/-*/*aaaaaa/a+-*aaaaaa+aaa/aaaaaa-[5] = 0.0307552
-/*+-aaaaaa*aa++aaaaaa//-*aaaaaaa-[6] = 0
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[7] = 0.5177571
/*a-aaaaaaa*+a/*aaaaaa+*+a+aaaaaa-[8] = 0.0330132
/*-++aaaaaa-//a-aaaaaa**a+aaaaaaa-[9] = 0 |
These strings code for nonlinear entities or expression trees (ETs). The
ETs are simple diagram representations and their structure is easily
inferred from the respective chromosome. So, after the creation of the
initial population, each chromosome is expressed as an ET (the phenotype
or the ‘organism’). The expression trees are in fact simple computer
programs and it is possible to test them against a selection environment
(the set of fitness cases). Using an appreciable number of individuals
and an appreciable number of fitness cases, the probability of finding
one or a few programs that do something useful is relatively high (in
the population above, the values after each chromosome are a measure of
their fitness). These are selected to reproduce, and the higher the
fitness the higher the probability of leaving more offspring. Therefore
the genetic information of the selected individuals is passed on to the
next generation. But here, as in nature, the genetic information suffers
a little mutation and recombination before it is transmitted and
therefore the descendants of these individuals, most of the time, are
not identical copies of their parents. For instance, all the individuals
of the population below are descendants of chromosomes 0, 2, 5, 7, and 8
above (note that chromosomes 1 and 4 are better than their parents):
| |
Generation 1:
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[0] = 0.5177571
***/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[1] = 7.3964260
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[2] = 0.5177571
+/*/aaaaaaa*+a/*aaaaaa*/aa*aaaaaa-[3] = 0.0328448
+**/*aaaaaa+**/aaaaaaa+a***aaaaaa-[4] = 0.6016823
/*a-aaaaaaa**+a*aaaaaa+*+a+aaaaaa-[5] = 0.0349026
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[6] = 0.5177571
*/-a+aaaaaa+a***aaaaaa*/aa*aaaaaa-[7] = 0.4814101
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[8] = 0.5177571
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[9] = 0.5177571 |
So, these new chromosomes are subjected to the same developmental
process: expression of the genetic information, confrontation of the
selection environment, followed by selection and reproduction with
modification. This process is repeated for a certain number of
generations or until a solution is found. For this simple example, after
10 generations a perfect solution with maximum fitness (1000 in this
case) was found
(chromosome 4):
| |
Generation 10:
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[0] = 888.98006
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[1] = 888.98006
*/aa*aaaaaa*+*-aaaaaaa+a**+aaaaaa-[2] = 0.0374567
*/a/*aaaaaa+*/a-aaaaaa+a***aaaaaa-[3] = 0.4814390
*/-/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[4] = 1000
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[5] = 888.98006
*/a/*aaaaaa*aaa+aaaaaa+a***aaaaaa-[6] = 0.5146930
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[7] = 888.98006
*/a/+aaaaaa*+*/aaaaaaa+a***aaaaaa-[8] = 666.66575
*/a/*aaaaaa*+-/aaaaaaa+a***aaaaaa-[9] = 0.4812073 |
You describe the algorithm as having
linear chromosomes that are afterwards expressed as nonlinear entities.
What is the advantage to this, and how does it make the algorithm
different from Genetic Algorithms and Genetic Programming?
The advantage is that individuals are encoded in simple structures
(chromosomes) and all the genetic modifications (mutation,
recombination, transposition and so on) that occur during reproduction
are made on these structures and, therefore, one only needs to pass on
these structures to the next generation. On the other hand, the
expression trees that develop from the chromosomes have an independent
existence and may assume various forms, and therefore assume various
functions. In nature, organisms are also encoded in linear chromosomes,
and when they reproduce only the genetic material is transmitted. Then
this material is expressed as a particular body that assumes various
forms and functions. The body is what selection sees, and according to
its fitness the individual might be chosen to reproduce or not. And
during reproduction, only the genetic material is transmitted to the
next generation. If not for the blueprint of the organism, the
reproduction of complex beings would involve the copy of all the body
constituents. Well, it’s not hard to see that this is unfeasible and, in
fact, it is known that for life to move beyond a very rudimentary stage
the phenotype threshold should be crossed. Imagine our DNA going around
naked: it could do a couple of things, but not certainly walk, fly or
think. The interplay of the genotype and the phenotype allows the
evolution of different functionalities and greater complexity.
This is the fundamental difference between GEP and GAs or GP, for they
use only one kind of entity (linear chromosomes in GAs and ramified
structures or parse trees in GP) that works both as genotype and
phenotype, that is, there is no distinction between genotype and
phenotype. The entity is the body that goes around doing things and when
it reproduces everything must be copied. In addition, and especially for
GP, the modifications that occur during reproduction are greatly
constrained because there are lots of impossibilities at this level (it
is impossible to make an orange tree produce mangos only by grafting and
pruning) and therefore evolution is greatly limited. This kind of
limitation is common in GP, for it exhibits already a certain structural
complexity. This is one of the reasons GP uses almost exclusively a
special kind of recombination that permits the exchange of structurally
concise branches or building blocks. And evolution is, in this case, the
result of moving these blocks around, never really creating new
material. In GAs there are no such problems as their structure is very
simple; but then they are also less versatile in terms of
functionalities.
So, in GEP, every conceivable modification is allowed in the genome (I
use currently eight different genetic operators, but others may be
easily implemented) always producing syntactically correct expression
trees (in nature there is no such thing as a syntactically incorrect
protein). And therefore no time or resources are lost in editing or
ensuring the validity of the programs as happens in all GP
implementations. In GEP the chromosomes are so simple that they are
modified and reproduced as easily as the linear chromosomes of GAs. And
the expression trees they code for are more complex than the parse trees
of GP because GEP chromosomes code for multiple genes, each gene encoding a
sub-expression tree. And besides, no forbidden paths exist in GEP when
it comes to exploring the solution space.
How exactly does the
algorithm create computer programs? What's the input, what's the output
and what happens in between? Can you take me step-by-step through a
simple example?
The input is a set of fitness cases (the selection environment), for
instance, the production data of several firms (each row is a firm):
| |
Labor |
Material |
Capital |
Production |
| |
0.542228376455 |
0.337519500366 |
0.405527109223 |
0.400102626024 |
| |
0.691221863614 |
0.488785181029 |
0.513851630110 |
0.548400414897 |
| |
0.716405370429 |
0.537658971190 |
0.600028325136 |
0.587799795213 |
| |
..... |
|
|
|
And you believe that the Production can be explained in
terms of Labor, Material, and Capital. So, you randomly create an
initial population of GEP chromosomes, express them and see which ones
return a value for Production within the limits you previously
established (say, a 10% error) for one or more fitness cases. These
individuals will have a fitness greater than zero and therefore they
might be selected to reproduce with modification. And this is all that
is necessary for getting the evolutionary process started. If you are
lucky, after a certain number of generations, you will end up with a
chromosome (an encoded computer program) that is a function of Labor,
Material, and Capital that satisfactorily explains Production. So, the
best chromosome (the chromosome with higher fitness) evolved in such an
experiment is the output, or in other words, the output is a computer
program written in Karva notation (the language of GEP genes) that can
be automatically converted into a more conventional language, such as
C++ or Visual Basic. Indeed, the latest releases of Gepsoft
(Automatic Problem Solver Advanced and Standard) already generate
automatically Visual Basic and C++ code from Karva code.
Can you give me quick
definitions for the ways the chromosomes are modified and how this works
in algorithm vs. biology – mutation, inversion, transposition, root
transposition, gene transposition, gene recombination and one-point and
two-point recombination?
Mutation – as in biology, mutations are point mutations (substitutions
of a symbol by another at a particular position) and occur throughout
the chromosomes. With the difference that GEP uses different alphabets
which are used in the different regions of genes (the heads, the tails,
and special domains for handling random numerical constants).
Inversion – as the name suggests, the inversion operator inverts
sequences in the genes. In nature organisms also explore inversion and
it is known that during evolution some sequences or even entire genes
were inverted.
Transposition (the three kinds) – the mechanism of GEP transposition is
considerably different from biological transposition. Nonetheless, they
have a couple of things in common: some sequences in the chromosomes
are activated and jump to another place in the chromosome and, as a
result, they might get copied in the process (except for gene
transposition where the transposon – an entire gene – is deleted at the
place of origin). This kind of operators is very interesting because
they modify the chromosomes using elements that were already present in
the chromosome.
Recombination (the three kinds) – the mechanism is very similar to
biological recombination, for in both cases parent chromosomes are
paired, cut at one or two points (exactly the same position in both
parents), and exchange some genetic material between them. But in GEP
the paired chromosomes are not homologous and therefore the daughter
chromosomes are most of the time completely different from the parents.
Of the three types of GEP recombination, the more conservative is gene
recombination, as in this case entire genes are exchanged between the
parent chromosomes. In one-point and two-point recombination though, the
parent chromosomes are split at random positions and the daughter
chromosomes are usually very different from their parents.
Your
paper seems to imply that the algorithm has more in common with biological
selection than Genetic Algorithms and Genetic Programming. Can you
explain this in detail?
Do you really mean biological selection here or do you mean biology?
Well, I use the same old roulette-wheel selection method used earlier
both in GAs and GP. So, nothing new or different here, and it mimics
nature quite accurately. But if you meant biology, I think that I have
already explained that (a complex genome composed of multiple genes and
the two separate entities – genotype and phenotype – that allow an
efficient adaptation and evolution of complex bodies encoded as simple
strings).
Can you give me quick
definitions for the computer problems you mentioned in your
paper – symbolic regression, sequence induction, block stacking and
density classification?
Symbolic regression is also called function finding and consists of
finding a mathematical expression that describes a certain property
(like the Production above). Sequence
induction is a special case of symbolic regression where the domain of
the independent variable are non-negative integers. Block stacking is a
toy planning problem much used in computer science. Cellular automata
rules for the density classification task is also a toy problem, but one
very complex and difficult for which only a few very powerful algorithms
could discover good rules. And there were also the 11-multiplexer and
the GP-rule problems which are complex problems of logic synthesis or
Boolean concept learning.
How does your work
fit into the body of work on evolutionary algorithms, and what is
different about it?
Like a GA or GP system, Gene Expression Programming is also a genetic
algorithm because it uses populations of individuals, selects them
according to fitness, and introduces genetic variation using one or more
genetic operators. However, it is closer to GP because the phenotype of
GEP individuals is structurally similar to the nonlinear entities of GP
(called parse trees in GP and expression trees in GEP). However, due to
the genotype/phenotype interplay, GEP is much more efficient: problems
that took weeks or months to solve using other learning algorithms can
now be solved in seconds or minutes by GEP. With GEP, thanks to the
multigenic genotype, it is also possible to evolve more complex
structures, for instance, multi-subunit expression trees composed of
several sub-expression trees. In their turn, these multigenic systems
are at the core of other, higher levels of organization such as the
unicellular and multicellular GEP systems for evolving automatically
defined functions. But there are many more GEP systems: systems for
parameter optimization that explore both the multigenic organization and
special gene domains for handling random numerical constants; systems
for inducing polynomials, neural networks, and decision trees, all of
which also explore the concerted action of different gene domains for
fine-tuning all kinds of numerical constants: polynomials' coefficients,
the weights and thresholds of neural networks, and the numerical
constants of decision trees with mixed/numeric attributes.
What are the next
steps in your research? What are you ultimately aiming for?
We are at Gepsoft developing
commercial software for symbolic regression and classification (several packages were
already released) and also for logic synthesis. Scientifically, I want to
publish other GEP related materials that I already have but I find it
very difficult to get published before the first GEP paper is
conventionally out. So I am currently gathering everything in a book
format to be published as soon as I find a publisher. Ultimately, I want
to help make GEP very well established in the field of evolutionary
computation.
How could this
research eventually be applied practically? What types of uses might it
have or lead to?
This algorithm is being already applied practically. The standard and
advanced editions of the Automatic
Problem Solver have been already released (the standard edition was
released a week ago, and the advanced just today) and several packages
were already sold. It is a very powerful tool for engineers, economists,
mathematicians, physicists, chemists, biochemists... for it generates
automatically Visual Basic and C++ code that can be immediately
incorporated in other applications. Another tool Gepsoft
is currently developing for logic synthesis will generate automatically
SystemC code and could be used by the thousands of engineers that use
SystemC for electronic circuit design.
And I believe that several GEP algorithms have been already implemented
at different universities and research centers worldwide because I made
my
GEP paper available on line last November and since then thousands
of people have downloaded the paper.
Can you give me a
ballpark estimate of when the research could be technically ready to be
applied practically? I'm looking for a rough estimate on the order of
less than two years or more than 10 years, etc.
See above. I hope to see GEP very well
established in a year because this algorithm is extremely simple and
very efficient and several tools are already available.
Who funded the
research?
This was a domestic (or garage if you prefer) project financed by
myself.
What is your title?
I have a PhD in Biology and am assistant professor at
The Azores University.
Is there anything else
you'd like to say?
My
GEP paper is being currently revised by the journal Complex
Systems.
When was the paper
accepted by Complex Systems and is there a publishing date yet?
My paper has not yet been accepted by Complex Systems. See
above.
After I published the paper online, Complex Systems contacted me
and offered to
revise it. A month ago I received the reviewer’s comments and a week
later I sent back the revised manuscript. But I am still waiting and
there is no guarantee that they will publish it. However, the IEEE
Transactions on Evolutionary Computation also contacted me and
promised a fair and fast review. Unfortunately this publishing process
is terribly slow and I am afraid GEP will be old news the day the first
paper is out.
This email interview was for a story on GEP for
Technology Research News, but I’m
sorry to say the story was never published. Kim Patch asked the
questions by email in March 2001.
***
|