Questions & Answers from a 2001 Interview

Q&A from a 2001 Interview

This email interview was done in March 26, 2001, but since it didn't see the light of day I took the liberty of updating some facts, you know, things like what-are-the-next-steps or what-are-you-hoping-to-achieve, now that they are done deals. And now is exactly March 8, 2007, six years after I was asked to answer these questions, which, by the way, were not modified:

What gave you the idea to create Gene Expression Programming?

I developed the basic ideas of Gene Expression Programming (GEP) in September and October of 1999 almost unaware of their uniqueness. I was reading Melanie Mitchell’s book An Introduction to Genetic Algorithms and meticulously solving all the computer exercises provided at the end of each chapter. When the Genetic Programming (GP) technique presented to discover Kepler’s third law was introduced, I implemented an algorithm in C++ to solve it. And, surprisingly, this new algorithm found the solution in the first run I made using an initial population of only 10 individuals and iterated for 10 generations. I was so impressed and so happy with these results that I proceeded immediately to apply it to more challenging problems. I bought John Koza’s first book then (Genetic Programming: On the Programming of Computers by Means of Natural Selection) and applied my algorithm to several problems and the comparisons consistently showed that it tremendously surpassed GP. By then the details of the GP technique were unknown to me and I couldn’t understand the reasons for this difference in performance. Only later did I realize that, as a biologist, I had created a new, more powerful algorithm borrowing several new ideas from biology, especially the creation of a perfect genotype-phenotype mapping.

How exactly does the algorithm work? Can you describe it so I can picture it?

Gene Expression Programming starts by randomly creating the chromosomes of the individuals of the initial population. Something like this:

 Generation 0 or Initial Population: --+/*aaaaaa//+/*aaaaaa+a+--aaaaaa-[0] = 0.0305998 *+-aaaaaaaa-*-//aaaaaa/*-*-aaaaaa-[1] = 0 /+/--aaaaaa**-a-aaaaaa*a-/aaaaaaa-[2] = 0.0306187 --aa*aaaaaa/*-a-aaaaaa-*-aaaaaaaa-[3] = 0 -*+/-aaaaaa+/*-+aaaaaa*/a--aaaaaa-[4] = 0 /-*/*aaaaaa/a+-*aaaaaa+aaa/aaaaaa-[5] = 0.0307552 -/*+-aaaaaa*aa++aaaaaa//-*aaaaaaa-[6] = 0 +**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[7] = 0.5177571 /*a-aaaaaaa*+a/*aaaaaa+*+a+aaaaaa-[8] = 0.0330132 /*-++aaaaaa-//a-aaaaaa**a+aaaaaaa-[9] = 0

These strings code for nonlinear entities or expression trees (ETs). The ETs are simple diagram representations and their structure is easily inferred from the respective chromosome. So, after the creation of the initial population, each chromosome is expressed as an ET (the phenotype or the ‘organism’). The expression trees are in fact simple computer programs and it is possible to test them against a selection environment (the set of fitness cases). Using an appreciable number of individuals and an appreciable number of fitness cases, the probability of finding one or a few programs that do something useful is relatively high (in the population above, the values after each chromosome are a measure of their fitness). These are selected to reproduce, and the higher the fitness the higher the probability of leaving more offspring. Therefore the genetic information of the selected individuals is passed on to the next generation. But here, as in nature, the genetic information suffers a little mutation and recombination before it is transmitted and therefore the descendants of these individuals, most of the time, are not identical copies of their parents. For instance, all the individuals of the population below are descendants of chromosomes 0, 2, 5, 7, and 8 above (note that chromosomes 1 and 4 are better than their parents):

 Generation 1: +**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[0] = 0.5177571 ***/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[1] = 7.3964260 +**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[2] = 0.5177571 +/*/aaaaaaa*+a/*aaaaaa*/aa*aaaaaa-[3] = 0.0328448 +**/*aaaaaa+**/aaaaaaa+a***aaaaaa-[4] = 0.6016823 /*a-aaaaaaa**+a*aaaaaa+*+a+aaaaaa-[5] = 0.0349026 +**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[6] = 0.5177571 */-a+aaaaaa+a***aaaaaa*/aa*aaaaaa-[7] = 0.4814101 +**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[8] = 0.5177571 +**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[9] = 0.5177571

So, these new chromosomes are subjected to the same developmental process: expression of the genetic information, confrontation of the selection environment, followed by selection and reproduction with modification. This process is repeated for a certain number of generations or until a solution is found. For this simple example, after 10 generations a perfect solution with maximum fitness (1000 in this case) was found (chromosome 4):

 Generation 10: */a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[0] = 888.98006 */a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[1] = 888.98006 */aa*aaaaaa*+*-aaaaaaa+a**+aaaaaa-[2] = 0.0374567 */a/*aaaaaa+*/a-aaaaaa+a***aaaaaa-[3] = 0.4814390 */-/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[4] = 1000 */a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[5] = 888.98006 */a/*aaaaaa*aaa+aaaaaa+a***aaaaaa-[6] = 0.5146930 */a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[7] = 888.98006 */a/+aaaaaa*+*/aaaaaaa+a***aaaaaa-[8] = 666.66575 */a/*aaaaaa*+-/aaaaaaa+a***aaaaaa-[9] = 0.4812073

You describe the algorithm as having linear chromosomes that are afterwards expressed as nonlinear entities. What is the advantage to this, and how does it make the algorithm different from Genetic Algorithms and Genetic Programming?

The advantage is that individuals are encoded in simple structures (chromosomes) and all the genetic modifications (mutation, recombination, transposition and so on) that occur during reproduction are made on these structures and, therefore, one only needs to pass on these structures to the next generation. On the other hand, the expression trees that develop from the chromosomes have an independent existence and may assume various forms, and therefore assume various functions. In nature, organisms are also encoded in linear chromosomes, and when they reproduce only the genetic material is transmitted. Then this material is expressed as a particular body that assumes various forms and functions. The body is what selection sees, and according to its fitness the individual might be chosen to reproduce or not. And during reproduction, only the genetic material is transmitted to the next generation. If not for the blueprint of the organism, the reproduction of complex beings would involve the copy of all the body constituents. Well, it’s not hard to see that this is unfeasible and, in fact, it is known that for life to move beyond a very rudimentary stage the phenotype threshold should be crossed. Imagine our DNA going around naked: it could do a couple of things, but not certainly walk, fly or think. The interplay of the genotype and the phenotype allows the evolution of different functionalities and greater complexity.

This is the fundamental difference between GEP and GAs or GP, for they use only one kind of entity (linear chromosomes in GAs and ramified structures or parse trees in GP) that works both as genotype and phenotype, that is, there is no distinction between genotype and phenotype. The entity is the body that goes around doing things and when it reproduces everything must be copied. In addition, and especially for GP, the modifications that occur during reproduction are greatly constrained because there are lots of impossibilities at this level (it is impossible to make an orange tree produce mangos only by grafting and pruning) and therefore evolution is greatly limited. This kind of limitation is common in GP, for it exhibits already a certain structural complexity. This is one of the reasons GP uses almost exclusively a special kind of recombination that permits the exchange of structurally concise branches or building blocks. And evolution is, in this case, the result of moving these blocks around, never really creating new material. In GAs there are no such problems as their structure is very simple; but then they are also less versatile in terms of functionalities.

So, in GEP, every conceivable modification is allowed in the genome (I use currently eight different genetic operators, but others may be easily implemented) always producing syntactically correct expression trees (in nature there is no such thing as a syntactically incorrect protein). And therefore no time or resources are lost in editing or ensuring the validity of the programs as happens in all GP implementations. In GEP the chromosomes are so simple that they are modified and reproduced as easily as the linear chromosomes of GAs. And the expression trees they code for are more complex than the parse trees of GP because GEP chromosomes code for multiple genes, each gene encoding a sub-expression tree. And besides, no forbidden paths exist in GEP when it comes to exploring the solution space.

How exactly does the algorithm create computer programs? What's the input, what's the output and what happens in between? Can you take me step-by-step through a simple example?

The input is a set of fitness cases (the selection environment), for instance, the production data of several firms (each row is a firm):

 Labor Material Capital Production 0.542228376455 0.337519500366 0.405527109223 0.400102626024 0.691221863614 0.488785181029 0.513851630110 0.548400414897 0.716405370429 0.537658971190 0.600028325136 0.587799795213 .....

And you believe that the Production can be explained in terms of Labor, Material, and Capital. So, you randomly create an initial population of GEP chromosomes, express them and see which ones return a value for Production within the limits you previously established (say, a 10% error) for one or more fitness cases. These individuals will have a fitness greater than zero and therefore they might be selected to reproduce with modification. And this is all that is necessary for getting the evolutionary process started. If you are lucky, after a certain number of generations, you will end up with a chromosome (an encoded computer program) that is a function of Labor, Material, and Capital that satisfactorily explains Production. So, the best chromosome (the chromosome with higher fitness) evolved in such an experiment is the output, or in other words, the output is a computer program written in Karva notation (the language of GEP genes) that can be automatically converted into a more conventional language, such as C++ or Visual Basic. Indeed, the latest release of GeneXproTools (a modeling tool from Gepsoft that uses Gene Expression Programming) creates computer programs in a total of 16 different programming languages, which, at the end of the day, are what you call the output.

Can you give me quick definitions for the ways the chromosomes are modified and how this works in algorithm vs. biology – mutation, inversion, transposition, root transposition, gene transposition, gene recombination and one-point and two-point recombination?

Mutation – as in biology, mutations are point mutations (substitutions of a symbol by another at a particular position) and occur throughout the chromosomes. With the difference that GEP uses different alphabets which are used in the different regions of genes (the heads, the tails, and special domains for handling random numerical constants).

Inversion – as the name suggests, the inversion operator inverts sequences in the genes. In nature organisms also explore inversion and it is known that during evolution some sequences or even entire genes were inverted.

Transposition (the three kinds) – the mechanism of GEP transposition is considerably different from biological transposition. Nonetheless, they have a couple of things in common: some sequences in the chromosomes are activated and jump to another place in the chromosome and, as a result, they might get copied in the process (except for gene transposition where the transposon – an entire gene – is deleted at the place of origin). This kind of operators is very interesting because they modify the chromosomes using elements that were already present in the chromosome.

Recombination (the three kinds) – the mechanism is very similar to biological recombination, for in both cases parent chromosomes are paired, cut at one or two points (exactly the same position in both parents), and exchange some genetic material between them. But in GEP the paired chromosomes are not homologous and therefore the daughter chromosomes are most of the time completely different from the parents. Of the three types of GEP recombination, the more conservative is gene recombination, as in this case entire genes are exchanged between the parent chromosomes. In one-point and two-point recombination though, the parent chromosomes are split at random positions and the daughter chromosomes are usually very different from their parents.

Your paper (Ferreira, C., 2001. Gene Expression Programming: A New Adaptive Algorithm for Solving Problems. Complex Systems, Vol. 13, issue 2: 87-129) seems to imply that the algorithm has more in common with biological selection than Genetic Algorithms and Genetic Programming. Can you explain this in detail?

Do you really mean biological selection here or do you mean biology? Well, I use the same old roulette-wheel selection method used earlier both in GAs and GP. So, nothing new or different here, and it mimics nature quite accurately. But if you meant biology, I think that I have already explained that (a complex genome composed of multiple genes and the two separate entities – genotype and phenotype – that allow an efficient adaptation and evolution of complex bodies encoded as simple strings).

Can you give me quick definitions for the computer problems you mentioned in your paper – symbolic regression, sequence induction, block stacking and density classification?

Symbolic regression is also called function finding and consists of finding a mathematical expression that describes a certain property (like the Production above). Sequence induction is a special case of symbolic regression where the domain of the independent variable are non-negative integers. Block stacking is a toy planning problem much used in computer science. Cellular automata rules for the density classification task is also a toy problem, but one very complex and difficult for which only a few very powerful algorithms could discover good rules. And there were also the 11-multiplexer and the GP-rule problems which are complex problems of logic synthesis or Boolean concept learning.

How does your work fit into the body of work on evolutionary algorithms, and what is different about it?

Like a GA or GP system, Gene Expression Programming is also a genetic algorithm because it uses populations of individuals, selects them according to fitness, and introduces genetic variation using one or more genetic operators. However, it is closer to GP because the phenotype of GEP individuals is structurally similar to the nonlinear entities of GP (called parse trees in GP and expression trees in GEP). However, due to the genotype/phenotype interplay, GEP is much more efficient: problems that took weeks or months to solve using other learning algorithms can now be solved in seconds or minutes by GEP. With GEP, thanks to the multigenic genotype, it is also possible to evolve more complex structures, for instance, multi-subunit expression trees composed of several sub-expression trees. In their turn, these multigenic systems are at the core of other, higher levels of organization such as the unicellular and multicellular GEP systems for evolving automatically defined functions. But there are many more GEP systems: systems for parameter optimization that explore both the multigenic organization and special gene domains for handling random numerical constants; systems for inducing polynomials, neural networks, and decision trees, all of which also explore the concerted action of different gene domains for fine-tuning all kinds of numerical constants: polynomials' coefficients, the weights and thresholds of neural networks, and the numerical constants of decision trees with mixed/numeric attributes.

What are the next steps in your research? What are you ultimately aiming for?

Continue to create as many new learning algorithms as I can and then make them known both through publications (in many forms – papers, books, and web publications) and software applications. Ultimately, I hope this will contribute to make Gene Expression Programming very well established in the field of Evolutionary Computation and Machine Learning.

How could this research eventually be applied practically? What types of uses might it have or lead to?

We are at Gepsoft developing commercial software for predictive modeling (software for Classification/Logistic Regression, Function Finding, Time Series Prediction, Logic Synthesis, and Parameter Optimization). GeneXproTools is our flagship application and is being used for mining huge databases with thousands of variables, creating almost instantaneously computer models in a total of 16 programming languages (Ada, C, C++, C#, Fortran, Java, Java Script, Matlab, Pascal, Perl, PHP, Python, Visual Basic, VB.Net, Verilog, and VHDL). Our mission is to continue to make this package a very sophisticated and efficient AI lab through the integration of a wide set of learning algorithms (all the algorithms of the GEP family, including evolvable neural networks and decision trees, polynomials and automatically defined functions, parameter optimization algorithms and many others). Version 4.0 of GeneXproTools was released in July of 2006 and implements already two different learning algorithms (the GEP and GEP-RNC algorithms) and supports Function Finding, Classification/Logistic Regression, Time Series Prediction, and Logic Synthesis.