GEP: A Quick Introduction to Gene Expression Programming


	A Quick Introduction to Gene Expression Programming
Home News Author Q&A Tutorials Downloads GEP Biblio Contacts Visit Gepsoft

A Quick Introduction to Gene Expression Programming

Introduction

Gene Expression Programming (GEP) is a learning algorithm and what it learns specifically is about relationships between variables in sets of data and then builds models to explain these relationships.

How learning algorithms build models or discover solutions to problems varies, with some simulating networks of neurons and others simulating evolution through natural selection. Gene Expression Programming belongs to the latter group, the so called evolutionary algorithms. And like all evolutionary algorithms, natural or otherwise, GEP uses populations of individuals (in this case, populations of candidate solutions to the problem at hand), selects and reproduces them according to fitness, and introduces genetic variation using one or more genetic operators such as mutation or recombination. And these are obviously the prerequisites for evolution to occur.

Gene Expression Programming can be used to evolve very different things: from conventional mathematical models [1] to sophisticated decision trees [2] and neural networks [3]. However, the strategies it uses to create all of them are similar and very easy to understand. For example, if we want to create mathematical models, then we start by creating an initial population of models and then allow it to evolve. This means that each generation, we measure the fitness of all the models of the population and then reproduce them accordingly. This reproduction, however, is not perfect and therefore slightly different models will start to appear and, given enough time, some of them will be better than the ones we started with.

So these are the general principles of all evolutionary algorithms and, in fact, there are several artificial evolutionary algorithms that explore them. Of special interest to the GEP technique are the Genetic Algorithms (GAs) and Genetic Programming (GP), for they serve to illustrate some of the fundamental characteristics of the GEP technique and why GEP surpasses the old GP technique by a factor of 100-60,000.

First of all, both GAs and GP are simple replicator systems, with the latter considerably more complex than the former. This means that the things or entities these systems evolve go about their business doing what has to be done and when their time comes, they must reproduce their bodies (hopefully with some variation) to perpetuate their line. But reproducing bodies with modification might not be an easy task, especially if the bodies are complex structures like the parse trees that GP evolves. Indeed, the canonical GP technique is limited to the use of very conservative operators (almost exclusively a tree-based crossover) that change the trees in a way reminiscent of the grafting and pruning familiar to gardeners and farmers around the world [4]. And when we compare the achievements of grafting and pruning (and I'm not saying they are not considerable) with the achievements of the DNA/protein system of life on Earth (the most sophisticated genotype/phenotype system known to science), it becomes clear why genotype/phenotype systems are much more powerful than simple replicator systems: firstly, in genotype/phenotype systems there are no restrictions concerning the type and number of modifications in the genome and therefore all the paths and crannies of the fitness landscape are accessible to meander through, and secondly, these systems are free to explore different levels of organization (for instance, genes are expressed into proteins; proteins form ribosomes, microtubules, membranes, etc.; these macromolecules and structures then form cells; cells get organized into tissues; tissues into organs and so forth, culminating with the organism itself). Obviously these higher levels of complexity are completely inaccessible to simple replicator systems, for no matter how complex, they continue to be a single structure, forever incapable of becoming a part of a much more complex being.

The GEP system is a full-fledged genotype/phenotype system with expression trees of different sizes and shapes encoded in linear chromosomes of fixed length. Also important is that GEP chromosomes are multigenic, encoding multiple expression trees or sub-programs that can be organized into a much more complex program. So, like the DNA/protein system of life on Earth, the gene/tree system of GEP can not only explore all the crannies and paths of the solution space but it's also free to explore higher levels of organization. But how is this achieved? In order to answer this question we need to take a closer look at the structures of the main players of GEP – the chromosomes and the expression trees – and see how they work together.

The Architecture of GEP Programs

So, there are two main players in GEP: the chromosomes and the expression trees (ETs), being the latter the expression of the genetic information encoded in the former. As in nature, the process of information decoding is called translation. And this translation implies obviously a kind of code and a set of rules. The genetic code of GEP is very simple: a one-to-one relationship between the symbols of the genes and the nodes they represent in the trees. The rules are also very simple: they determine the spatial organization of nodes in the expression trees and the type of interaction between sub-ETs. Therefore, there are two languages in GEP: the language of genes and the language of expression trees and, thanks to the simple rules that determine the structure of ETs and their interactions, it is possible to infer immediately the expression tree given the sequence of a gene, and vice versa. This means that we can choose to have a very complex program represented by its compact genome without losing in meaning. This unequivocal bilingual notation is called Karva language. Its details are explained in the remainder of this tutorial.

The structural organization of GEP genes is better understood in terms of open reading frames (ORFs). In biology, an ORF or coding sequence of a gene begins with the start codon, continues with the amino acid codons, and ends at a termination codon. However, a gene is more than the respective ORF, with sequences upstream of the start codon and sequences downstream of the stop codon. And although in GEP the start site is always the first position of a gene, the termination point does not always coincide with the last position of a gene. Consequently, it is common for GEP genes to have noncoding regions downstream of the termination point. These noncoding regions obviously do not interfere with expression but, nonetheless, they play a crucial role in evolution, for they alone allow the creation of valid programs no matter how profoundly their chromosomes are modified (we’ll get back to this later).

Consider, for example, the algebraic expression:

(1)

It can also be represented as a diagram or an expression tree:

where “Q” represents the square root function.

This kind of diagram representation is what is called the phenotype in Gene Expression Programming. And the genotype can be easily inferred from the phenotype as follows:

01234567
Q*-+abcd	(2)

which is the straightforward reading of the expression tree from left to right and from top to bottom (exactly as we read a page of text). The expression (2) is an ORF, starting at “Q” (position 0) and terminating at “d” (position 7). These ORFs were named K-expressions from Karva notation.

Consider another ORF, the following K-expression:

01234567890
Qb*+baQba	(3)

Its expression as an ET is also very simple and straightforward. In order to express the ORF correctly, we must follow the rules governing the spatial distribution of functions and terminals. First, the start of a gene corresponds to the root of the ET which is placed in the topmost line. Second, in the next line, below each function, are placed as many branch nodes as there are arguments to that function. Third, from left to right, the nodes are filled consecutively with the next elements of the K-expression. Fourth, the process is repeated until a line containing only terminals is formed. In this case, the following expression tree is obtained:

Just by looking at the structure of GEP K-expressions, it is difficult or even impossible to see the advantages of such a representation, except perhaps for its simplicity and elegance. However, when K-expressions are analyzed in the context of a gene, the advantages of this representation become obvious. As previously stated, GEP chromosomes have fixed length, and they are composed of one or more genes of equal length. Consequently, the length of a gene is also fixed. Thus, in GEP, what varies is not the length of genes which is constant, but the length of the K-expressions. Indeed, the length of a K-expression may be equal to or less than the length of the gene. In the former case, the termination point coincides with the end of the gene, and in the latter, the termination point is somewhere upstream of the end of the gene.

What is the function of these noncoding regions of GEP genes? We will see that they are the essence of Gene Expression Programming and evolvability, for they allow the modification of the genome using several genetic operators without restrictions, always producing syntactically correct programs. Thus, in GEP, the fundamental property of genotype/phenotype systems – syntactic closure – is intrinsic, allowing the totally unconstrained restructuring of the genotype and, consequently, an efficient evolution.

Let’s analyze then the structural organization of GEP genes in order to understand how they invariably code for syntactically correct programs and why they allow an unconstrained application of virtually any genetic operator.

The genes of Gene Expression Programming are composed of a head and a tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head h is chosen, whereas the length of the tail t is a function of h and the number of arguments n of the function with more arguments (also called maximum arity) and is evaluated by the equation:

t = h (n-1) + 1

(4)

Consider a gene for which the set of functions F = {Q, *, /, -, +} and the set of terminals T = {a, b}. In this case n = 2; if we chose an h = 15, then t = 15 (2 - 1) + 1 = 16; thus, the length of the gene g is 15 + 16 = 31. One such gene is shown below (the head is shown in blue):

0123456789012345678901234567890
*b+a-aQab+//+b+babbabbbababbaaa	(5)

It codes for the following ET:

In this case, the K-expression ends at position 7, whereas the gene ends at position 30.

Suppose now a mutation occurred at position 6, changing the “Q” into “*”. Then the following gene is obtained:

0123456789012345678901234567890
b+a-aab+//+b+babbabbbababbaaa	(6)

And its expression gives:

In this case, the termination point shifts one position to the right (position 8), changing slightly the daughter tree.

Consider another mutation in chromosome (5) above, the substitution of “a” at position 5 by “+”, resulting in the following chromosome:

0123456789012345678901234567890
*b+a-+Qab+//+b+babbabbbababbaaa	(7)

And its expression gives:

In this case, the termination point shifts 12 positions to the right (position 19), enlarging and changing significantly the daughter tree.

Obviously the opposite might also happen, and the daughter tree might shrink. For example, consider again gene (5) above, and suppose a mutation occurred at position 2, changing the “+” into “Q”, giving:

0123456789012345678901234567890
*bQa-aQab+//+b+babbabbbababbaaa	(8)

And now its expression results in the following ET:

In this case, the ORF ends at position 3, shortening the original ET in four nodes.

So, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, where the simplest is composed of only one node (when the first element of a gene is a terminal) and the largest is composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).

It is evident from the examples above, that any modification made in the genome, no matter how profound, always results in a structurally correct program. Now this is unique to GEP and is obviously at the heart of its superior performance: for instance, in the simple replicator system of GP, most modifications result in syntactically invalid programs (imagine, for instance, what would happen if a terminal in a GP tree is replaced by a function), which is why most GP implementations rely exclusively on inefficient tree-specific crossover to create genetic diversity [4].

Multigenic Chromosomes and Linking Functions

Genotype/phenotype systems are such well-oiled machines that their expansion into more complex systems is rather easy, and the introduction of multiple genes in Gene Expression Programming clearly illustrates this.

So, the chromosomes of Gene Expression Programming are usually composed of more than one gene of equal length. For each problem or run, the number of genes, as well as the size of the head, are a priori chosen. Each gene codes for a sub-ET and the sub-ETs interact with one another forming a more complex multi-subunit expression tree.

Consider, for example, the following chromosome with length 39, composed of three genes, each with length 13 (the heads are shown in blue):

012345678901201234567890120123456789012

QaQ+-Qbbaaaba+Q+ab+abababa*-**b+aabbaba

(9)

It has obviously three K-expressions and each one of them is expressed independently, resulting therefore in three different sub-ETs:

For the sake of simplicity, in the linear representation above, the start of each K-expression is always given by position 0; the end of each K-expression, though, is only evident upon construction of the corresponding sub-ET. As shown in the Figure above, the first K-expression ends at position 1; the second at position 7; and the last at position 10. Thus, the multigenic chromosomes of GEP contain multiple K-expressions of different sizes, each one of them coding for a structurally and functionally unique sub-ET.

Obviously, these sub-ETs or sub-programs must interact with one another and, in the canonical GEP system, they interact through special functions, the so called linking functions. These functions link the sub-ETs together one after the other in an orderly fashion. For instance, the linking by addition of all the three sub-ETs shown in the Figure above is illustrated below (the linking functions are shown in gray):

Note that the final program represented in the Figure above could be linearly encoded as the following K-expression:

01234567890123456789012
++Q+-aQ+*b+aab+abbaab	(10)

However, the use of multigenic chromosomes is more appropriate to evolve solutions to complex problems, for they permit the modular construction of complex, hierarchical structures, where each gene codes for a smaller and simpler building block. These smaller building blocks are separated from one another and are therefore free to evolve independently, allowing for the creation of concrete new units that might prove handy in a new situation.

GEP with Random Numerical Constants

The implementation of a facility for handling random numerical constants (RNCs) in Gene Expression Programming is another example of how easy it is to create and explore higher levels of complexity in genotype/phenotype systems and, indeed, RNCs are elegantly and efficiently implemented in GEP.

This elegant construct involves an extra terminal “?” that is used for representing the RNCs and an additional domain Dc at the end of GEP genes. Structurally, the Dc comes after the tail, has a length t equal to the length of the tail, and is composed of the symbols used to represent the random constants. Therefore, another region (besides the head and the tail) with defined boundaries and its own alphabet is created in the gene.

Consider the single-gene chromosome with a head size h = 7 (the Dc is shown in blue):

01234567890123456789012
+?+?*aaa??aaa68083295	(11)

where the terminal “?” represents the random constants. The expression of this kind of gene is exactly done as explained above for genes with just the head/tail construct, giving:

Then the ?’s in the expression tree are replaced from left to right and from top to bottom by the symbols (numerals, for simplicity) in Dc, obtaining:

The values corresponding to these symbols are kept in an array. For simplicity, the number represented by the numeral indicates the order in the array. For instance, for the 10 elements zero-based array:

A = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}

chromosome (11) above gives:

Obviously, the GEP-RNC algorithm can also be used to evolve highly sophisticated programs composed of multiple sub-programs through the use of multigenic chromosomes. Obviously in this case the genes of the multigenic GEP-RNC system carry the extra Dc domain for encoding the random numerical constants.

Higher Levels of Complexity

When the mapping is perfect, genotype/phenotype systems are extremely efficient, and Gene Expression Programming is the only GP-style algorithm with such a mapping. And this means obviously that these full-fledged systems can grow: not à la GP which can only bloat, but rather grow in order to explore higher levels of complexity. And we've already seen here different levels of organization: unigenic systems, both with and without a Dc domain for handling numerical constants, and multigenic systems also with and without Dc domains. But there are many more GEP systems: unicellular and multicellular systems for evolving automatically defined functions; systems for parameter optimization that explore both the multigenic organization and the Dc domain; systems for inducing polynomials, neural networks, and decision trees, which also explore the concerted action of different gene domains for fine-tuning all kinds of numerical constants: polynomials' coefficients, the weights and thresholds of neural networks, and the numerical constants of decision trees with mixed/numeric attributes. You can learn all about these systems in Ferreira's book Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence [2].

References

Ferreira, C., 2001. Gene Expression Programming: A New Adaptive Algorithm for Solving Problems. Complex Systems, Vol. 13, issue 2: 87-129.
Ferreira, C., 2006. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. 2nd Edition, Springer-Verlag, Germany.
Ferreira, C., 2006. Designing Neural Networks Using Gene Expression Programming. In A. Abraham, B. de Baets, M. Köppen, and B. Nickolay, eds., Applied Soft Computing Technologies: The Challenge of Complexity, pages 517-536, Springer-Verlag.
Koza, J. R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.

Last modified: August 27, 2007

Cite this as:

Ferreira, C. "A Quick Introduction to Gene Expression Programming." From GEP Tutorials: A Gepsoft Web Resource.
https://www.gene-expression-programming.com/tutorial001.htm