|
|
Evocosm 3.3.1
Related Links
Documents |
Evocosm
A C++ Framework for Evolutionary Computing
I'm fascinated by machine learning through the evolutionary discovery of algorithms; I've also discovered some interesting ways to use genetic algorithms for testing software. As such, I write a number of different kinds of evolutionary algorithms, all of which follow certain patterns. And when I see patterns, I see the need for a framework -- thus was born libevocosm, a collection of tools for creating a wide variety of evolutionary algorithms.
This library is the foundation of Acovea, a tool for the analysis of compiler optimization.
A language-neutral, pseudo-code approach is valuable for describing algorithms in a generic fashion -- but it also limits my ability to show how evolutionary algorithms work. I need real code that people can download, execute, muck with, criticize, and feel. So these articles will include real code.
The multi-language approach presents a set of challenges, not the least of which is the time it takes for me to write the same program several times. Some principles and concepts that can't be easily expressed in certain programming languages. For example, Java lacks generics (templates), and neither it nor Python is designed for high-performance code. A Fortran 95 compiler can produce fast numerical code, but the language currently lacks object-oriented facilities for organizing code. Add the fact that much of my existing code base is written in C++, and the choice becomes clear.
My goal is to provide portable code for experimentation and demonstration. I've no interest in language wars. Trying to be everything to everyone will keep me from every finishing even a single article; based on time constraints and reader requests, I'll provide examples in other programming languages. But for the moment, libevocosm is a C++ project.
Evocosm is a set of classes that abstract the fundamental components of an evolutionary algorithm. I'll list the components here with a bit of introduction; you can review the details of the classes by downloading the code archives or by reviewing the online documentation (see the menu at the article's beginning for code and documentation links.) All class documentation was generated from source code comments using doxygen. These docs have not been thoroughly proofread, so they may contain a few typos and minor errors. Self-publishing has taught me the value of a good proofreader... ;}
Evolutionary algorithms come in a variety of shapes and flavors, but at their core, they all share certain characteristics: populations that reproduce and mutate through a series of generations, producing future generations based on some measure of fitness. An amazing variety of algorithms can be built on that general framework, which leads me to construct a set of core classes as the basis for future applications.
The classes include:
- Random Numbers
Evocosm relies on psuedorandom number generators implemented in the Brahe library. I generally use the Mersenne Twister or Marsaglia's CMWC4096. - OpenMP
As of version 3.3.0, Evocosm can be built with OpenMP, allowing it to automatically take advantage of multiprocessor and multicore computers. The inner loop of Evocosm performs fitness testing; the more time-consuming your fitness function, the more effective OpenMP support will be. - Floating- Point Chromosomes
Evcosom supports the crossover and mutation of IEEE-754 floating-point numbers, using an algorithm I invented in the mid-1990s. This topic is covered in detail here. - Roulette Wheels
The roulette_wheel class implements the concept of a "software roulette wheel" for Evocosm. This is a tool for natural selection, wherein the fitness of an organism determines the width of its "slot" on an imaginary roulette wheel. - Organisms
Think of an "organism" as an answer to a problem posed by a fitness landscape; "genes" define its behavior and an associated fitness value is assigned by an evocosm during testing. Evocosm provides the freedom to define organisms as almost anything: bit strings, floating-point numbers, finite state machines, LISP programs, or external robots controlled via radio waves. In A Complexity of Options, I used an Evocosm-derived GA to determine the gcc options that produce the faster code. - Fitness Landscapes
A "fitness landscape" defines the environment where organisms "live" or a problem that they are tested against. The landscape is intimately tied to the nature of the organism; think of an organism as a potential solution to a problem implemented by the landscape. A floating-point organism, for example, could be tested by a fitness landscape that represents a function to be maximized. Or, an organism describing the shape of wing could be tested by a landscape that simulates a wind tunnel. - Evocosms
The evocosm class binds a population of organisms to a set of objects that define the rules of survival and reproduction. An evocosm will have one or more populations, which will evolve against population-unique and shared (common) fitness landscapes; breeding is controlled by a set of class objects from the following classes. - Fitness Scaling
As a population converges on an "answer", the difference between fitness values often becomes very small; this prevents the best solutions from having a significant advantage in reproduction. Fitness scaling solves this problem by adjusting the fitness values to the advantage of the most-fit chromosomes. Evocosm includes a variety of fitness scaling algorithms. - Migration
A migrator removes individuals (via "emigration") from a population of organisms, transferring them to another population (via "immigration"). The only concrete implementation of this interface is random_pool_migrator, which defines a specific number of organisms that may migrate from each population to another. When creating a random_pool_migrator, specify the number of organisms that can migrate from each population. Migration is, of course, meaningless in any application that has only one population. - Selecting Survivors
A selector decides which organisms survive from one generation to the next. Some evolutionary algorithms will not use a selector; other will. In general, it is effective to keep the "best" organisms from one generation to the next, so that good genes do not become lost at random. This is, of course, an improvement on nature, where being "the best" doesn't guarantee survival. - Reproduction
In most cases, a reproducer generates new organisms using parents selected (by fitness) from an existing population. In some singular (and probably rare) cases, a reproducer might generate new, random organisms in order to keep diversity high. Reproduction techniques can include crossover and asexual, sexual and (my favorite) try-sexual models. - Mutation Operators
A mutator applies mutations (random, usually small changes) to a set of organisms. Mutation is highly dependent on the type of organism. In traditional genetic algorithms, a mutation flips one or more bits in an integer (i.e., chromosome). Evolving a path for the Traveling Salesman Problem involves complex mutations that maintain valid permutations of destination points; in the case of floating-point numbers, I've provided utilities for mutating and crossing IEC-60559 (IEEE- 754) float and double types.
My original designs defined many of these classes internal to evocosm. While that approach resulted in strong encapsulation, it also caused indigestion for several compilers and one debugger -- and it made the source code more difficult to read. In a fit of refactoring, I defined all the classes inside the LibEvocosm namespace, using separate header files for each conceptual group. I am not a fan of huge, monolithic header files, preferring instead to have small, focused headers that ease compilation and simplify understanding of the code.
After describing my algorithm for floating-point chromosomes, I'll present a complete implementation of a genetic algorithm based on the Evocosm classes.
The majority of genetic algorithms work on pure bit strings, converting those strings to the desired types for fitness testing. In Lawrence Davis' book Handbook of Genetic Algorithms, he transforms a 44-bit string into two floating point values via a series of operations. I've seen similar techniques elsewhere, and I find them a bit cumbersome.
In theory, a GA should have no knowledge of the format of the data it is modifying; however, natural chromosomes do encode some structure in their sequence; for example, DNA encodes "stop" sequences between definitions of amino acids. Crossover appears to take place in specific positions along natural chromosomes -- and while mutation doesn't care about the chromosome's structure, but its does change that structure.
In context of a computer program, the structure of a chromosome isn't so important as the ability to logically modify its bits through crossover and mutation. I decided to build tools for the mutation and crossover of encoded floating-point values of types float and double. Floating-point numbers contain scaled values that may have a fractional part. On most modern computer systems -- including the popular Intel and AMD processors -- the C/C++/Java float and double types implement the single- precision and double-precision floating-point formats defined by the Institute of Electrical and Electronic Engineers (IEEE) standard 754-1985. A float is a 32-bit value, and a double is a 64-bit. These bits in a floating-point value are divided into three components: A sign bit, an exponent, and a mantissa. The following table shows the internal format of the float and double types.
|
|
31 |
30
23 |
22
0 |
|
|
float |
|
|
|
|
|
|
s |
exp |
mantissa |
|
|
|
|
|
|
|
|
|
63 |
62
52 |
51
|
0 |
|
double |
|
|
|
|
|
|
sign |
exponent |
mantissa |
|
The highest-order bit in a floating-point value is the sign bit. If the sign bit is one, the value is negative; if the sign bit is zero, the value is positive. The exponent of a float occupies 8 bits and the mantissa uses the remaining 23 bits; a double has a 52-bit mantissa and an 11-bit exponent. In addition, the mantissa has an implicit high order bit of 1.
The mantissa holds a binary fraction greater than or equal to 1 (because of the implied high bit being one) and less than 2. The number of bits in the mantissa affects the accuracy of the floating-point value. A float has 6 decimal digits of accuracy, and a double (with its longer mantissa) is accurate to 15 decimal digits. Since the mantissa is a binary fraction, and it can't always exactly reflect a decimal value you've tried to store in it. For example, there is no binary fraction that can exactly represent the values 0.6 or 1/3. Floating-point numbers represent an approximation of a decimal value; this is where rounding errors come from.
The exponent is a binary number representing the number of binary digits the mantissa is shifted left (for a positive actual exponent) or right (for a negative actual exponent). The exponent is a biased value; you calculate the actual exponent value by subtracting a bias value from the exponent stored in the value. The bias for a float is 127; the bias for a double is 1023. Thus, a float value with an exponent of 150 would represent a number with an exponent of 23. The constants FLT_MIN, FLT_MAX, DBL_MIN, and DBL_MAX define the minimum and maximum values for floating point numbers, in the Standard C header file float.h. Two other relevant float.h constants are FLT_EPSILON and DBL_EPSILON, which represent the smallest possible difference between two float and double values.
In Standard C++, the numeric_limits template (defined in the limits header) is specialized to describe the characteristics of each numeric data type.
Floating-point numbers can take on some unusual values. It's possible for a value number to represent positive and negative infinity, for example. Or, a floating-point value may be in a special format that doesn't represent a valid number. Any routines that generate or change floating-point numbers must avoid producing these unusual values.
A floating-point value represents infinity when the bits in the exponent are all one and the bits in the mantissa are all zero. When both the mantissa and exponent are zero, the floating-point number is zero. Infinity, as well as zero, can have a sign. Positive and negative zero operate identically in calculations and comparisons.
When is a number not a number? When its exponent is all ones and its mantissa contains any set of bits that is not all zeros (which would indicate an infinity). A value in this format is known as a NaN (Not a Number). The sign bit for a NaN is irrelevant.
Unusual floating-point values often have undesirable consequences in calculations, so I wanted to avoid the creation of unusual numbers through floating-point reproduction. Looking at the above descriptions, we can see an obvious commonality between the troublesome NaNs and infinities: both types have exponents filled with ones. This characteristic provides a way of identifying "weird" floating-point number when performing mutations and crossover.
Mutation and crossover randomly change bits in the three components of a floating-point number: the sign bit, exponent, and mantissa. Changing the exponent and sign have the most dramatic affect on a floating-point value, since the change of one bit can dramatically alter the magnitude of a number. Assuming that all bits have an equal chance of mutation, we get the following probabilities that a random bit change will affect a specific component:
| sign bit | exponent | mantissa | |
| float | 3.1% | 25.0% | 71.9% |
| double | 1.6% | 17.1% | 81.3% |
Depending on the application, I've found that those intrinsic percentages don't
always allow effective mutations. The exponent, in
particular, is so likely to be changed that numbers fluctuate wildly under mutation.
To solve that problem, I created a system for the
roulette-wheel selection of the component to be mutated, allowing me to weight
mutation in favor of changing the mantissa. The relative chance of mutating each
component is configurable. My experiments suggest limiting the mutability of the
exponent to under 15%, keeping the sign bit mutation rate at about two or three
percent.
Note that my floating-point mutation code assumes 32-bit long values; on a 64-bit system, you'll need to adjust the definitions of variables used to map floating-point values to integers for bit manipulation. In the next version of this library, I'm going to define integer types via typedef for portability between 32- and 64-bit platforms.
The .tar.gz archives contain source code, makefiles, and doxygen configurations. The makefile defines several targets:
- make clean should be rather obvious
- make WHAT=option will build a static library, libevocosm.a, according to rules based on option:
- release builds an optimized library using gcc
- debug builds a debug library using gcc
- profile builds a library designed for profiling with gprof
- intel builds an optimized library using Intel C++
- make docs will build the documentation if you have doxygen installed
- make install copies the library to /usr/local/lib and the header files to /usr/local/include/libevocosm; you'll want to be running as root when you do this, and the base directory (/bin/local) can be changed by setting the PREFIX variable in the makefile.
If this all sounds very "Unix" -- you're correct. This code was developed under Linux using GCC and a variety of free software tools. I have tested this code under Windows 2000 using the initial release Microsoft Visual Studio.Net (MS C++ 13.0). However, I no longer include Visual Studio projects, as various versions of Microsoft's development environment use different project file formats.
Computer scientists often encounter problems that are not amenable to numerical methods of solution. A common problem is that of finding input value that produces a minimum or maximum output from a function. It's quite easy to optimize a function that maps a single high or low value; things become quite a bit more difficult when a function generates several such values. Consider this two-dimensional function:
![]()
Mapped onto an x-y plane from (1,-1)-(1,-1), the function in question produces this image:

This fitness landscape is somewhat challenging, with several peaks, including two very close in height. Yes, you might be able to generate a much more detailed picture and see where the highest peak is -- but not all equations can be visualized so easily. Rather than rely on the human eye or a "good" guess, we can use a computer program to find the actual answer.
How should a program go about finding the maximum for f(x,y)? The graph shows several local maxima in the upper quadrant, including two tall spikes in close proximity. Traditional approaches to finding function maxima or minima -- a process known as optimization -- use a variety of techniques that rely on the ability to climb "upward" to a solution. In optimizing f(x,y), most optimization techniques (hill climbing, for example) become "trapped" on the smaller "hills"-unless they begin looking for maxima in just the right place.
A genetic algorithm begins with a set of randomly-selected points from which it selects the best performers through fitness testing. Crossover combines the best attributes from the most successful members of the population, and random mutation introduces new characteristics that may produce better solutions. Subsequent generations build upon the success of their "ancestiors", evolving toward a solution that best "fits" the given niche (in this case, x and y values that produce the highest peak).
The funcopt classes implement a generic function optimizer. Optimizing functions is arguably the most ubiquitous application of genetic algorithms; it seems an appropriate example for demonstrating how to use libevocosm.
To build an evolutionary algorithm from the Evocosm classes, you need to define classes for the following components:
- An organism class, derived from the libevocosm::organism
template.
In the case of funcopt, this class is named function_solution, which derives from libevocosm::organism< vector<double> >. In other words, the organism uses a vector<double> as its "genes". I use a vector<double> to support functions containing more than one variable; in terms of f(x,y), the vector contains two doubles corresponding to the values of x and y for a given organism. - A mutator class, derived from the libevocosm::mutator
template.
The mutator handles random changes in organisms; it is tied by the libevocosm::mutator template argument to the type of organism undergoing evolution. For funcopt, I use an instance of an evoreal object to mutate double values in organisms. - A reproducer class, derived from the libevocosm::reproducer
template.
This class defines how reproduction takes place between two organisms of the same type. The function_reproducer uses an evoreal object to implement crossover between an function_solution's floating-point values. - A fitness landscape class, derived
from the libevocosm::landscape template.
A landscape defines the test applied to organisms. The function_landscape class implements a constructor that requires a function pointer argument. That function, of type t_function, must have a single vector<double> argument and must return a double value. The supplied function defines the formula being optimized; the example driver program, ecopt, implements f(x,y) as given in the equation above. - A reporter class, derived from the libevocosm::reporter
template.
This class is tasked with reporting information about each generation. - A binder class, derived from organism_factory
and landscape_factory.
I encapsulate instances of the above objects, plus instances of objects for handling fitness scaling, migration, selection modes (for which Evocosm provides concrete implementations of common algorithms). Include any other objects and values associated with the application. In most cases (including funcopt), my binder classes implements a configuration constructor and a run method. The configuration constructor's arguments define various parameters for a given run of the evolution algorithm -- mutation rate, for example. The run method creates an libevocosm::evocosm object using the encapsulated Evocosm components, and then iterates for a specific number of times, or it loops (calling libevocosm::evocosm::run_generation) until a specific result is attained. - A driver application -- in this case, ecopt.
The ecopt application defines a function to be optimized, creates a function_optimizer object, and runs a hundred generations to find the maximum value for f(x,y).
If you're looking for six to eight digits of precision, the peak is usually found in a few hundred generations. Being a stochastic process, the genetic algorithm doesn't always produce identical performance, even when applied to the same problem. Play a bit with the configuration, selecting a variety of parameters to gain a feel for how they affect performance. The best way to learn about genetic algorithms is to write them.
The function optimizer is a basic genetic algorithm, implemented in a few simple classes; it demonstrates how Evocosm works on a fundamental level. Other more advanced features of Evocosm -- particularly multiple intermigrational populations and state machines -- fall into the realm of complex applications that will be my focus in other articles.
As always, I'm interested in your comments.
- Scott

