Fisher’s irises


ISBN: 9729589054

Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence

Fisher’s irises

In this classification problem the goal is to classify irises based on four measurements: sepal length, sepal width, petal length, and petal width. The iris dataset contains fifty examples each of three types of iris: Iris setosa, Iris versicolor, and Iris virginica.

Classification problems with more than two classes can also be solved by GEP, but the data must be rearranged. Classifying data into n distinct classes C requires rearranging the data into n separate 0/1 classification problems, as follows:

1. C₁ versus NOT C₁
2. C₂ versus NOT C₂
...
n. Cₙ versus NOT Cₙ

Then n different models are evolved separately and afterwards combined in order to make the final decision.
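The rearrangement described above can be sketched as follows. This is only an illustration: the function names binaryTargets and oneVersusRest are hypothetical, and encoding the classes as integer indices 0 to n-1 is an assumption made for the example.

```cpp
#include <vector>

// Build the 0/1 target vector for the sub-problem "C_k versus NOT C_k".
// labels[i] holds the class index (0 .. n-1) of sample i; this integer
// encoding is an assumption of the sketch, not part of the source.
std::vector<int> binaryTargets(const std::vector<int>& labels, int k)
{
    std::vector<int> targets;
    targets.reserve(labels.size());
    for (int label : labels) {
        targets.push_back(label == k ? 1 : 0);
    }
    return targets;
}

// Produce all n target vectors, one per 0/1 classification problem.
std::vector<std::vector<int>> oneVersusRest(const std::vector<int>& labels, int n)
{
    std::vector<std::vector<int>> problems;
    problems.reserve(n);
    for (int k = 0; k < n; ++k) {
        problems.push_back(binaryTargets(labels, k));
    }
    return problems;
}
```

Each of the n target vectors is paired with the same unchanged attribute values, so one model can be evolved per vector.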

For the iris data, we are going to decompose the problem into three separate 0/1 classification problems: the first is Iris setosa versus NOT Iris setosa; the second is Iris versicolor versus NOT Iris versicolor; and the last is Iris virginica versus NOT Iris virginica.

For this problem, the function set was F = {+, -, *, /}, and the set of terminals included all four attributes, represented by d₀ to d₃, which correspond, respectively, to sepal length, sepal width, petal length, and petal width. The 0/1 rounding threshold was set to 0.5, and the fitness was evaluated by equation (4.28).
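As a rough illustration of how the 0/1 rounding threshold turns a raw model output into a class, consider the sketch below. The names applyThreshold and countHits are hypothetical. Equation (4.28) is not reproduced here, so countHits is only a simple stand-in that counts correct classifications and should not be read as the fitness function itself.

```cpp
#include <cstddef>
#include <vector>

// Convert a raw model output into a 0/1 class using the rounding threshold.
int applyThreshold(double rawOutput, double threshold = 0.5)
{
    return rawOutput >= threshold ? 1 : 0;
}

// Count the samples whose thresholded prediction matches the 0/1 target.
// NOTE: a stand-in measure for illustration only, not equation (4.28).
int countHits(const std::vector<double>& rawOutputs,
              const std::vector<int>& targets,
              double threshold = 0.5)
{
    int hits = 0;
    const std::size_t n = rawOutputs.size() < targets.size()
                              ? rawOutputs.size()
                              : targets.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (applyThreshold(rawOutputs[i], threshold) == targets[i]) {
            ++hits;
        }
    }
    return hits;
}
```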

For all the sub-problems, I started with three-genic chromosomes with a head length h = 8 and sub-ETs linked by addition. The first dataset (setosa versus NOT setosa) was almost instantaneously classified without errors, and I soon found out that a very simple structure is enough to classify this dataset correctly. The model below perfectly classifies all the irises into setosa and NOT setosa:

double APSCfunction(double d[])
{
    double dblTemp = 0;
    dblTemp += (d[1]-d[2]);
    return (dblTemp >= 0.5 ? 1 : 0);
}	(4.31)

As you can see, only the difference between the sepal width and the petal length is relevant to distinguish Iris setosa from the other two irises.
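To make this concrete, here is a small sketch that evaluates the same expression on individual samples. The name isSetosa is hypothetical, and the measurements mentioned afterwards are illustrative values in centimetres, chosen to be typical of a setosa and a versicolor flower.

```cpp
// Same expression as model (4.31); the name isSetosa is hypothetical.
// d[0..3] = sepal length, sepal width, petal length, petal width.
int isSetosa(const double d[])
{
    double dblTemp = 0;
    dblTemp += (d[1] - d[2]);   // sepal width minus petal length
    return (dblTemp >= 0.5 ? 1 : 0);
}
```

For a sample measuring (5.1, 3.5, 1.4, 0.2) the difference is 3.5 - 1.4 = 2.1, which is above the threshold, so the model returns 1. For a sample measuring (7.0, 3.2, 4.7, 1.4) the difference is 3.2 - 4.7 = -1.5, so the model returns 0.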

The classification of the remaining datasets was also extremely accurate, but in both cases only 149 out of 150 samples were correctly classified. The model below distinguishes Iris versicolor from the other two irises:

double APSCfunction(double d[])
{
    double dblTemp = 0;
    dblTemp += (d[3]*(((d[0]*d[3])-d[1])+((d[1]*d[2])-d[2])));
    dblTemp += (((d[2]-(d[2]+d[2]))-(d[0]/d[0]))/d[3]);
    dblTemp += (((d[0]-(d[2]*d[3]))*(d[2]-d[1]))*d[0]);
    return (dblTemp >= 0.5 ? 1 : 0);
}	(4.32)

And the next model distinguishes Iris virginica from setosa and versicolor:

double APSCfunction(double d[])
{
    double dblTemp = 0;
    dblTemp += (d[1]/(d[0]*(d[0]/d[3])));
    dblTemp += (d[2]-d[1]);
    dblTemp += ((d[2]-(((d[0]+d[3])/d[1])/d[3]))-d[2]);
    return (dblTemp >= 0.5 ? 1 : 0);
}	(4.33)

So, by combining the three models above and representing them by y₁, y₂, and y₃, the following classification rules are obtained:

IF (y₁ = 1 AND y₂ = 0 AND y₃ = 0) THEN setosa;
IF (y₁ = 0 AND y₂ = 1 AND y₃ = 0) THEN versicolor;
IF (y₁ = 0 AND y₂ = 0 AND y₃ = 1) THEN virginica;	(4.34)

which correctly classify 149 out of 150 irises. This combined model, with a classification accuracy of 99.33% and a classification error of 0.667%, is therefore one of the best models ever obtained by machine learning algorithms.
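The combination rules (4.34) and the accuracy arithmetic can be sketched as follows. The names classifyIris and accuracyPercent are hypothetical. Returning "unclassified" when none of the three rules fires is an assumption of this sketch, since the rules themselves say nothing about that case.

```cpp
#include <string>

// Apply the rules of (4.34) to the 0/1 outputs of the three models.
// Returning "unclassified" when no rule fires is an assumption of this
// sketch; the source lists only the three rules.
std::string classifyIris(int y1, int y2, int y3)
{
    if (y1 == 1 && y2 == 0 && y3 == 0) return "setosa";
    if (y1 == 0 && y2 == 1 && y3 == 0) return "versicolor";
    if (y1 == 0 && y2 == 0 && y3 == 1) return "virginica";
    return "unclassified";
}

// Classification accuracy in percent; total must be greater than zero.
double accuracyPercent(int correct, int total)
{
    return 100.0 * correct / total;
}
```

With 149 correct classifications out of 150, accuracyPercent gives approximately 99.33, and the complementary error, 100 minus that value, is approximately 0.667, matching the figures quoted above.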
