Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence

Fisher’s irises

In this classification problem the goal is to classify irises based on four measurements: sepal length, sepal width, petal length, and petal width. The iris dataset contains fifty examples each of three types of iris: Iris setosa, Iris versicolor, and Iris virginica.

GEP can also solve classification problems with more than two classes, but the data must first be rearranged. Classifying data into n distinct classes C1, ..., Cn requires recasting the data as n separate 0/1 classification problems, as follows:

 1. C1 versus NOT C1
 2. C2 versus NOT C2
 ...
 n. Cn versus NOT Cn

Then n different models are evolved separately and afterwards combined in order to make the final decision.
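The rearrangement described above can be sketched in C as follows. This is only an illustration: the function names `make_binary_targets` and `count_positives`, and the convention of storing each sample's class as an integer index, are assumptions made for the example and do not come from the text.

```c
#include <stddef.h>

/* Hypothetical helper: builds the 0/1 target column for the sub-problem
   "class_id versus NOT class_id". labels[i] holds the class index of
   sample i (an assumed encoding); targets[i] receives 1 or 0. */
void make_binary_targets(const int *labels, size_t n_samples,
                         int class_id, int *targets)
{
    for (size_t i = 0; i < n_samples; ++i) {
        targets[i] = (labels[i] == class_id) ? 1 : 0;
    }
}

/* Hypothetical helper: counts the positive (1) samples of one sub-problem. */
size_t count_positives(const int *targets, size_t n_samples)
{
    size_t count = 0;
    for (size_t i = 0; i < n_samples; ++i) {
        count += (size_t)targets[i];
    }
    return count;
}
```

Calling `make_binary_targets` once per class yields the n separate 0/1 datasets on which the n models are evolved.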

For the iris data, the problem is decomposed into three separate 0/1 classification problems: Iris setosa versus NOT Iris setosa, Iris versicolor versus NOT Iris versicolor, and Iris virginica versus NOT Iris virginica.

For this problem the function set was F = {+, -, *, /}, and the terminal set included all four attributes, represented by d0 to d3 and corresponding, respectively, to sepal length, sepal width, petal length, and petal width. The 0/1 rounding threshold was set to 0.5 and the fitness was evaluated by equation (4.28).
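As a rough illustration of how a 0/1 rounding threshold turns a raw model output into a class, consider the sketch below. The names `round_to_class` and `count_hits` are hypothetical, and the hit count is only a simple stand-in for scoring a model: it is not equation (4.28), which is not reproduced here.

```c
#include <stddef.h>

/* Converts a raw model output into a 0/1 class: outputs at or above the
   rounding threshold map to 1, everything else to 0. */
int round_to_class(double output, double threshold)
{
    return (output >= threshold) ? 1 : 0;
}

/* Hypothetical scoring helper (NOT equation (4.28)): counts the samples
   whose thresholded output equals the 0/1 target. */
size_t count_hits(const double *outputs, const int *targets,
                  size_t n_samples, double threshold)
{
    size_t hits = 0;
    for (size_t i = 0; i < n_samples; ++i) {
        if (round_to_class(outputs[i], threshold) == targets[i]) {
            ++hits;
        }
    }
    return hits;
}
```

With the threshold of 0.5 used in this problem, an output of exactly 0.5 is rounded to class 1.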

For all the sub-problems, I started with three-genic chromosomes with a head length of h = 8 and sub-ETs linked by addition. The first dataset (setosa versus NOT setosa) was classified almost instantaneously without errors, and I soon found that a very simple structure is enough to classify this dataset correctly. The model below perfectly classifies all the irises into setosa and NOT setosa:

 double APSCfunction(double d[])
 {
      double dblTemp = 0;
      dblTemp += (d[1]-d[2]);
      return (dblTemp >= 0.5 ? 1 : 0);
 }
 (4.31)

As you can see, only the difference between the sepal width and the petal length is relevant to distinguish Iris setosa from the other two irises.
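To see the rule at work, the sketch below applies it to one flower of each species. It assumes the terminal ordering given earlier (d[1] is sepal width, d[2] is petal length); the function name `is_setosa` is hypothetical, and the three sample rows are taken from the commonly distributed copy of Fisher's iris data, which is an assumption about the dataset version used.

```c
/* Setosa versus NOT setosa, as described in the text: the difference
   between sepal width (d[1]) and petal length (d[2]) is thresholded at 0.5. */
int is_setosa(const double d[4])
{
    double dblTemp = d[1] - d[2];
    return (dblTemp >= 0.5) ? 1 : 0;
}

/* One sample per species, in the order: sepal length, sepal width,
   petal length, petal width (assumed to match the dataset used). */
static const double sample_setosa[4]     = {5.1, 3.5, 1.4, 0.2};
static const double sample_versicolor[4] = {7.0, 3.2, 4.7, 1.4};
static const double sample_virginica[4]  = {6.3, 3.3, 6.0, 2.5};
```

For these samples the differences are 2.1, -1.5, and -2.7, so only the setosa sample reaches the 0.5 threshold.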

The classification of the remaining datasets was also extremely accurate, but in both cases only 149 out of 150 samples were correctly classified. The model below distinguishes Iris versicolor from the other two irises:

 double APSCfunction(double d[])
 {
      double dblTemp = 0;
      dblTemp += (d*(((d*d)-d)+((d*d)-d)));
      dblTemp += (((d-(d+d))-(d/d))/d);
      dblTemp += (((d-(d*d))*(d-d))*d);
      return (dblTemp >= 0.5 ? 1 : 0);
 }
 (4.32)

And the next model distinguishes Iris virginica from setosa and versicolor:

 double APSCfunction(double d[])
 {
      double dblTemp = 0;
      dblTemp += (d/(d*(d/d)));
      dblTemp += (d-d);
      dblTemp += ((d-(((d+d)/d)/d))-d);
      return (dblTemp >= 0.5 ? 1 : 0);
 }
 (4.33)

So, by combining the three models above and representing them by y1, y2, and y3, the following classification rules are obtained:

 IF (y1 = 1 AND y2 = 0 AND y3 = 0) THEN setosa;
 IF (y1 = 0 AND y2 = 1 AND y3 = 0) THEN versicolor;
 IF (y1 = 0 AND y2 = 0 AND y3 = 1) THEN virginica;
 (4.34)

which correctly classify 149 out of 150 irises. This model, with a classification accuracy of 99.33% and a classification error of 0.667%, is therefore one of the best models ever obtained by machine learning algorithms for this problem.
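The combination rules (4.34) and the accuracy figure can be sketched in C as follows. The names `iris_class`, `combine_rules`, and `accuracy_percent` are hypothetical, and so is the `IRIS_UNDECIDED` value: the text does not say what happens when none of the three rules fires, so returning a separate "undecided" result is purely an assumption of this sketch.

```c
/* Possible outcomes of the combined decision. IRIS_UNDECIDED is an
   assumption: the rules in the text cover only three output patterns. */
typedef enum {
    IRIS_UNDECIDED  = -1,
    IRIS_SETOSA     = 0,
    IRIS_VERSICOLOR = 1,
    IRIS_VIRGINICA  = 2
} iris_class;

/* Applies the three IF ... THEN rules to the 0/1 outputs y1, y2, y3 of
   the setosa, versicolor, and virginica models. */
iris_class combine_rules(int y1, int y2, int y3)
{
    if (y1 == 1 && y2 == 0 && y3 == 0) return IRIS_SETOSA;
    if (y1 == 0 && y2 == 1 && y3 == 0) return IRIS_VERSICOLOR;
    if (y1 == 0 && y2 == 0 && y3 == 1) return IRIS_VIRGINICA;
    return IRIS_UNDECIDED; /* no rule matched (assumed fallback) */
}

/* Classification accuracy as a percentage of correctly classified samples. */
double accuracy_percent(unsigned correct, unsigned total)
{
    return 100.0 * (double)correct / (double)total;
}
```

With 149 correct classifications out of 150, `accuracy_percent` gives 100 * 149 / 150, about 99.33%, and the remaining 1 / 150 is the 0.667% classification error.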
