grams.
  • Support for other programming languages, including Fortran 95.
  • Compiler configurations for other versions of GCC and other processors.
  • Even more documentation (but then again, what program doesn't need better docs?)
    As always, I look forward to considered comments.

    -- Scott

    




    © 2009
    Scott Robert Ladd
    All rights reserved.
    Established 1996


    The grey-and-purple dragon logo, the blue coyote logo, Coyote Gulch Productions, Itzam, Evocosm, and Acovea are all Trademarks of Scott Robert Ladd.


     

    .0.0

    Results for Opteron and Pentium 4

    30 May 2004
     

    

     


    If you find this article useful, please consider supporting the author's free software efforts with a donation, no matter how small.

    This article has been superseded by results obtained using Acovea 5.0.0. I'm leaving the original article online for historical reasons.

    Abstract

    ACOVEA (Analysis of Compiler Options via Evolutionary Algorithm) implements a genetic algorithm to find the "best" options for compiling programs with the GNU Compiler Collection (GCC) C and C++ compilers. "Best", in this context, is defined as those options that produce the fastest executable program from a given source code. Acovea is a C++ framework that can be extended to test other programming languages and non-GCC compilers.

    This article describes results obtained by running Acovea on Pentium 4 and Opteron workstations. When applied to several example benchmark programs, Acovea identified optimization sets that reduced individual benchmark run times by as much as 40%, when compared against code generated using the compiler's predefined -On optimization sets. The 40% figure is a bit atypical, however; in most instances, Acovea found "improved" optimizations sets that reduced run times by 20 percent or less.

    In two test cases, Acovea could not find any improvement over the standard -O options — although Acovea did produce useful information by proving, in one of those cases, that the best optimization was an unadorned -O1 (and not -O2 or -O3, as might be expected.)

    Table 1: Benchmark Systems

    Characteristic  Corwin                              Tycho
    --------------  ----------------------------------  ------------------------------------------------------
    Motherboard     Tyan K8W 2885                       Intel D850EMV
    Processors      dual AMD Opteron model 240, 1.4GHz  single Pentium 4 2.8GHz Northwood, HyperThread-enabled
    RAM             2GB PC3200 (1GB/cpu, NUMA)          512MB PC800 RDRAM
    Hard Drive      120GB ATA/133                       80GB ATA/100
    Linux distro    Gentoo-amd64                        Debian "sid"
    Linux Kernel    2.6.3 NUMA/SMP                      2.6.3 SMP/HT aware
    glibc version   2.3.2                               2.3.2
    binutils        2.14.90.0.8                         2.14.90.0.7

    Acovea Changes and Test Systems

    The original article appeared in late 2003, and covered my experiments with Acovea on a Pentium 4 workstation. At that time, Acovea defined compiler options and commands with C++ classes. That design grew both unwieldy and difficult to extend — people often asked if I could change Acovea to use XML configuration files. And that is exactly what I've done in version 4.0.0, upon which I've based this article.

    In mid-March 2004, I purchased the dual Opteron system (Corwin) described in Table 1, which complements my year-old Pentium 4 (Tycho) machine. I chose the Opteron for many reasons, none of them particularly religious. If you're interested, check out the hows and whys of my foray into AMD processors.

    This article is not a comparison of the Opteron and Pentium processors. Herein, I am writing about Acovea as it performs on the computers I own. While I have definite opinions about the Opteron and Pentium processors, those opinions belong elsewhere. Suffice it to say that both Corwin and Tycho have proven to be excellent workstations for my day-to-day work.

    GCC versions

    All tests were performed on a recent snapshot of GCC 3.4. GCC 3.3 is the current and stable release; any discoveries I might make will not result in any improvement to a stable compiler, and one of my goals was to inspire improvements in the next versions of GCC. My previous Acovea tests resulted in some minor improvements to GCC 3.4, and it is there that I focus my efforts for this article.

    However, in recognition of reader interest, Acovea 4.0.0 provides predefined compiler definitions for both GCC 3.4 and 3.3. If time permits, I'll run 3.3 tests myself. Once tree-ssa merges into mainline GCC development, I'll be testing it (and its new Fortran 95 compiler) as well.

    Benchmark tests

    The current benchmark suite consists of seven algorithm-specific programs, all written to the 1999 ISO C Standard. I plan to add a couple of new benchmarks to that set, and create additional implementations in Fortran 95, C++, and other languages. Such grand plans are, of course, predicated on my finding some of that mythical "spare time" I keep hearing about...

    The individual tests are:

    It might seem that these tests have many things in common -- but, as I'll describe below, Acovea found these benchmarks to have very different optimization characteristics.

    Compiler Options

    For the purpose of this article, I evolved the best set of options for the seven benchmark programs listed above; I also created a composite program that invokes all seven benchmarks to produce a single, overall result. At left is a chart showing the relative performance of each benchmark as compiled with different optimization settings.

    Acovea evolves optimization sets from the -O1 level. -O1 must be included if any optimization is to occur; I have included switches in the GCC configuration files to turn off various optimizations (e.g., -fno-merge-constants) implied by -O1, thus allowing evolution to remove options implied at the base level. For GCC 3.4 on the Pentium 4, Acovea evolves optimization sets selected from 64 options, some of which include multiple possibilities (e.g., -mfpmath).
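
    The mapping from a candidate option set to a compile command can be pictured with a minimal Python sketch. It is an illustration only, not Acovea's implementation: the four-entry pool, the names OPTION_POOL, BASE, and build_command, and the gene encoding (None for "leave out", otherwise an index into the option's alternatives) are all assumptions made for the example.

```python
# Hypothetical option pool. Each entry lists the mutually exclusive
# spellings of one option; most options have just one spelling.
OPTION_POOL = [
    ["-fno-merge-constants"],          # switches off something -O1 implies
    ["-funroll-loops"],
    ["-fomit-frame-pointer"],
    ["-mfpmath=387", "-mfpmath=sse"],  # one option with several settings
]

# Every candidate starts from the -O1 level (simplified base command).
BASE = ["gcc", "-O1"]


def build_command(genes, source, output):
    """Turn one candidate (a list of genes) into a gcc argument list.

    genes[i] is None when option i is left out, otherwise the index of
    the chosen spelling within OPTION_POOL[i].
    """
    if len(genes) != len(OPTION_POOL):
        raise ValueError("need exactly one gene per option")
    flags = [
        choices[gene]
        for choices, gene in zip(OPTION_POOL, genes)
        if gene is not None
    ]
    return BASE + flags + ["-o", output, source]


# Example: disable constant merging and pick SSE floating-point math.
command = build_command([0, None, None, 1], "huff.c", "huff")
```

    Because -fno-merge-constants sits in the pool as an ordinary switch, a search over such gene lists can remove a default that -O1 would otherwise enable.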

    Recent discussions on the GCC mailing list (thread 1, thread 2, thread 3) revealed different perceptions about the implications of the -ffast-math option. For the tests shown here, I included -ffast-math in the evolutionary mix. I've performed experiments showing that -ffast-math does not reduce (and in some cases improves) accuracy when used with industry-standard benchmarks like Paranoia. The "floating-point accuracy" story is very complicated, but it deserves its own, to-be-written-when-I-have-time article.

    In the following discussion, I refer to "optimisms" and "pessimisms", which are short-hand terms for compiler options that improve or degrade the speed of executable code output from the compiler.

    Intel Pentium 4 (Northwood core)

    Astute readers will note that Tycho is running the "unstable" release of Debian GNU/Linux. I ran all tests from the bash prompt, in text mode, no X server, with a minimum of daemons present.

    For every test compile, Acovea uses gcc -lrt -lm -std=gnu99 -O1 -march=pentium4; the -march=pentium4 option implies -mcpu=pentium4, -msse, -msse2, and -mmmx, so I don't explicitly specify those options.

    Back to the chart! For each benchmark, five bars depict relative performance as compared to the fastest run-time. For example, on huff, Acovea identified a set of options that produced the fastest code; the shorter lines show how much slower huff ran with the default optimization levels. From fastest to slowest, the optimization settings for huff ranked as follows: acovea, -O1, -O3, -O2, and no optimization at all.
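
    A minimal Python sketch of one plausible way to scale such bars follows. The timings are invented for illustration and are not measured results, and dividing the fastest run time by each run time is an assumption about how the bars are normalized, chosen so that the fastest build gets the longest bar.

```python
def relative_performance(run_times):
    """Scale each run time against the fastest one.

    Returns a value in (0, 1] per setting: 1.0 for the fastest build,
    smaller values (shorter bars) for slower builds.
    """
    fastest = min(run_times.values())
    return {name: fastest / seconds for name, seconds in run_times.items()}


def rank_fastest_first(run_times):
    """List the settings from shortest to longest run time."""
    return sorted(run_times, key=run_times.get)


# Invented timings in seconds, for illustration only.
example_times = {"-O0": 8.0, "-O1": 2.5, "-O2": 5.0, "-O3": 4.0, "acovea": 2.0}

bars = relative_performance(example_times)
ranking = rank_fastest_first(example_times)
```

    With these made-up numbers the ranking comes out in the same order as described for huff, with the Acovea set first and the unoptimized build last.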

    Such a result might seem odd at first; shouldn't -O2 (and -O3, for that matter) produce a faster program than does -O1? Yes, by intent — but in reality, more optimization does not always equal faster code. Sometimes, higher optimization levels produce code that is too large for execution in the processor's on-board cache; in other cases, "optimizations" can be pessimistic, especially in interaction with other options. Acovea identifies and discards pessimisms by weeding them out through natural selection. See the Acovea overview for more details on Acovea's statistical reporting.
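
    The idea of weeding out pessimisms through selection can be sketched with a toy genetic algorithm in Python. This is not Acovea's algorithm: the per-flag effects in EFFECTS are invented, simulated_run_time stands in for actually compiling and timing a benchmark, and the population size, single-point crossover, mutation rate, and keep-the-better-half selection are all assumptions chosen to keep the example short.

```python
import random

# Invented effect of each flag on run time, in seconds. Negative values
# are "optimisms", positive values are "pessimisms".
EFFECTS = [-0.30, 0.40, -0.10, 0.25, -0.20, 0.00]
BASE_TIME = 3.0


def simulated_run_time(genes):
    """Stand-in for compiling and timing a benchmark with these flags."""
    return BASE_TIME + sum(e for e, on in zip(EFFECTS, genes) if on)


def evolve(generations=30, pop_size=20, mutation_rate=0.05, seed=1):
    """Evolve on/off flag sets; return the best set and a history of
    the best simulated run time seen in each generation."""
    rng = random.Random(seed)
    n = len(EFFECTS)
    population = [
        [rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=simulated_run_time)
        history.append(simulated_run_time(population[0]))
        # The faster half survives unchanged, so the best is never lost.
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)          # single-point crossover
            child = mom[:cut] + dad[cut:]
            # Occasionally flip a flag so new combinations keep appearing.
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = survivors + children
    best = min(population, key=simulated_run_time)
    return best, history


best_flags, best_times = evolve()
```

    In this toy model each flag's effect is independent, so the pessimisms simply tend to die out; real options interact with one another, which is why timing actual builds matters.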

    For the seven benchmarks on the Pentium 4, Acovea produced the following results:

    Using the corresponding Acovea-generated option sets for each benchmark reduced overall compile time by about 10%, as compared to a compile using -O3 -ffast-math; the resulting composite Acovea program ran about 7% faster than did one compiled with -O3 -ffast-math.

    AMD Opteron

    After reviewing Acovea's results for the Pentium 4, you might be less than impressed; I was somewhat disappointed, even though Acovea produced faster option sets for three benchmarks while also identifying a nasty pessimism for huff.

    My evolutionary algorithm was much more successful, however, when applied to the Opteron. On five of seven benchmarks, Acovea identified optimization sets that produced faster code than did the -On options.

    Corwin's Opterons run the first release of Gentoo's AMD64 port of Linux; the system is operating entirely on the 64-bit level. That means 64-bit libraries and versions of GCC that generate 64-bit x86_64 (or AMD64, if you like) object code. While Corwin does have dual processors, the current benchmarks are all serial. I intend to produce some parallel benchmarks when (yes, that phrase is coming) I have the time. :)

    For every test compile on Corwin, Acovea used gcc -lrt -lm -std=gnu99 -O1 -march=opteron; the -march=opteron (which I'm told may be redundant for the AMD64 compiler) implies -msse, -msse2, -m3dnow, and -mmmx.

    For the seven benchmarks on the Opteron, Acovea produced the following results:

    Using the corresponding Acovea-generated option sets for each benchmark reduced overall compile time by about 10%, as compared to a compile using -O3 -ffast-math; the resulting composite Acovea program ran about 10% faster than did one compiled with -O3 -ffast-math.

    Conclusions

    The results I've presented describe the effects of different optimizations on a set of relatively straightforward programs that perform well-defined tasks. My conclusions so far have been these:

    Acovea identifies weaknesses in GCC.
    In the case of huff on the Pentium 4, Acovea shows that something is clearly wrong with the predefined optimization sets. During my other tests, natural selection uncovered a half-dozen "internal compiler error" bugs, many related to the -fnew-ra option. As a QA tool for testing GCC, Acovea has proven useful.

    Some processor-specific options still do not appear to be a major factor in producing fast code.
    Much to my surprise, I have yet to find any consistent evidence that options like -mfpmath=sse improve program performance. Thus Acovea bears out my personal experience, though it does not explain why so many people continue to suggest that I should use -mfpmath=sse to generate floating-point code. If someone could suggest a good "-mfpmath=sse", I'd appreciate seeing it.

    More optimizations do not guarantee faster code.
    This should be obvious, but it isn't. The complex interactions of many different optimization techniques seem very likely to conflict; the complexity is simply too much to understand. Acovea can test these interactions, and produce better optimization sets that avoid conflicts and mutual pessimisms.

    Different algorithms are most effective with different optimizations.
    Again, an obvious pearl of wisdom that is sometimes forgotten. Real (i.e., non-benchmark) programs do many things; using a profiler, the critical loops can be identified, and their algorithms refined as much as possible by hand. Once the algorithms are tuned, a tool like Acovea will identify the best possible optimization settings. Applying -O3 to an entire program is not likely to produce the fastest program; using algorithm-specific optimizations to compile specific, critical routines is likely to produce faster code.
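
    Compiling critical routines with their own option sets might look like the following minimal Python sketch. The file names, the flag sets, and the names PER_FILE_FLAGS, DEFAULT_FLAGS, and compile_command are hypothetical; real entries would come from profiling and from a per-routine search such as the one Acovea performs.

```python
import os

# Fallback flags for source files without a tuned option set.
DEFAULT_FLAGS = ["-O2"]

# Hypothetical per-file option sets for performance-critical routines.
PER_FILE_FLAGS = {
    "huffman.c": ["-O1"],
    "fft.c": ["-O1", "-funroll-loops", "-ffast-math"],
}


def compile_command(source):
    """Build the gcc command that compiles one source file to an object
    file, using its tuned flags when it has any."""
    flags = PER_FILE_FLAGS.get(source, DEFAULT_FLAGS)
    obj = os.path.splitext(source)[0] + ".o"
    return ["gcc", "-std=gnu99", *flags, "-c", source, "-o", obj]


commands = [compile_command(s) for s in ("huffman.c", "fft.c", "main.c")]
```

    The same per-file idea can be expressed in a makefile or other build system; the dictionary here only shows the shape of the mapping.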

    Future Directions

    I've considered a number of extensions to the current Acovea system, including:

    As always, I look forward to considered comments.

    -- Scott

    


