Acovea 5.0

Analysis of GCC 3.4.3 for Opteron

27 April 2005
 

 

Acovea Logo

Downloads
acovea-gtk-1.0.1.tar.gz
libacovea-5.1.1.tar.gz
libevocosm-3.1.0.tar.gz
libcoyotl-3.1.0.tar.gz

About Acovea
Acovea Overview
A GUI for Acovea: Acovea-GTK
Configuration Files
FAQ
About the Genetic Algorithm

Analyses
Acovea 5.0, GCC 4.0, Opteron
Acovea 5.0, GCC 4.0, Pentium 4
Acovea 5.0, GCC 3.4, Opteron
Acovea 4.0, GCC 3.4, P4, AMD64 (May-04)
Acovea 3.3, GCC 3.x, P4 (Dec-03)

Licensing
GNU General Public License (GPL)
Commercial License

If you find this article useful, please consider supporting the author's free software efforts with a donation, no matter how small.

Abstract

This article describes results obtained by running Acovea 5.0.0 to test GCC 3.4.3 on an AMD Opteron workstation. When applied to several example benchmark programs, Acovea identified optimization sets that reduced all individual benchmark run times by between 8 and 17%, when compared against code generated using the compiler's predefined -On optimization sets.

Past articles have combined results from testing on both Opteron and Pentium 4 processors. This article focuses solely on the Opteron, to avoid the non sequitur and perhaps misleading comparison of processors that has accompanied past articles. There will be a Pentium 4 companion article available soon after the release of Acovea 5.0.

Herein, I am writing about Acovea as it performs on the dual Opteron workstation I own, Corwin. This system is still running the original Opteron 240 processors; it will be upgraded as soon as I obtain a pair of AMD's new dual-core processors.

All tests were performed using GCC 3.4.3, the version of GCC in common use at the time of this writing.

Results obtained with Acovea 5.0 were much superior to those from older versions. This is due to better statistical analysis; the underlying algorithm is the same. In every case, Acovea suggested a set of options that produced faster code than any of the preset -On options. Also, some code generation bugs in older versions of GCC were fixed, eliminating a couple of cases where Acovea produced exceptional results by avoiding broken optimizations. I did not find any egregious bugs in GCC 3.4.3, at least in terms of Acovea testing.

Benchmark tests

The current benchmark suite consists of six algorithm-specific programs, all written to the 1999 ISO C Standard. I intend to update the test suite following the release of Acovea 5.0. Such grand plans are, of course, predicated on my finding some of that mythical "spare time" I keep hearing about...

The individual tests are:

  • alma
    Calculates the daily ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos().
  • evo
    A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math.
  • fft
    Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math.
  • huff
    Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks.
  • lin
    Solves a large linear equation via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization.
  • mat1
    Multiplies two very large matrices using the brute-force algorithm; tests loop optimization.

Past articles included a test called "tree" which is not present here. The reason: It broke, and I didn't want to spend time hunting bugs in a benchmark I'll be replacing soon. The "tree" test was a quick hack that I'll replace with a new benchmark based on the production-quality code in Itzam.

Compiler Options

For the purpose of this article, I evolved the best set of options for the seven benchmark programs listed above; I also created a composite program that invokes all seven benchmarks to produce a single, overall result. At left is a chart showing the relative performance of each benchmark as compiled with different optimization settings.

  • O1, O2, O3 - baseline compiles with those collective options defined by GCC.
  • O3 fast math - baseline compiles with -O3 -ffast-math
  • acovea - compiled with options evolved by Acovea, with -O1 as the base level of optimization (though it implements switches to turn off options implied by -O1)

Acovea evolves optimization sets from the -O1 level. -O1 must be included if any optimization is to occur; I have included switches in the GCC configuration files to turn off various optimizations (e.g., -fno-merge-constants) implied by -O1, thus allowing evolution to remove options implied at the base level. For GCC 3.4 on the Opteron, Acovea evolves optimization sets selected from 60-some options, some of which include multiple possibilities (e.g., -mfpath). The -fnew-ra option causes compiler crashes and is disabled for these tests.

Discussions on the GCC mailing list (thread 1, thread 2, thread 3) revealed different perceptions about the implications of the -ffast-math option. For the tests shown here, I included -ffast-math in the evolutionary mix. I've performed experiments that show how -ffast-math does not reduce (and in some cases, improves) accuracy when used with industry-standard benchmarks like Paranoia. The "floating-point accuracy" story is very complicated, but it deserves its own, to-be-written-when-I-have-time article. And no, I haven't had time in the year since I last wrote that sentence!

Results

Acovea 5.0 was very successful when applied to the Opteron. For every single benchmark, Acovea produced a set of options that generated faster code than any of the built-in -On sets. You can download the complete set of test runs here. These are test files generate by the command-line and GUI drivers, and include the full sets of optimizations evolved by Acovea.

Corwin's Opterons run Gentoo's AMD64 Linux; the system is operating entirely on the 64-bit level. That means 64-bit libraries and versions of GCC that generate 64-bit x86_64 (or AMD64, if you like) object code. While Corwin does have dual processors, the current benchmarks are all serial. I intend to produce some parallel benchmarks when (yes, that phrase is coming) I have the time. :) Note that a huge infusion of cash would increase my free time to work on these projects. :)

For every test compile on Corwin, Acovea used gcc -lrt -lm -std=gnu99 -O1 -march=opteron; the -march=opteron (which I'm told is redundant for the AMD64 compiler) implies -msse, -msse2, -m3dnow, and -mmmx.

For each test, I provide Acovea's top options set along with lists of strongly pessimistic and optimistic flags. An "optimistic" option was common in "good" solutions during the entire evolutionary process. A "pessimistic" option, on the other hand, was absent from the "good" solutions, either because it hurt run times or it simply made no difference at all and was ignored as a "junk gene". It's important to realize that Acovea's best solution may contain options that had no practical effect on code generation; they didn't hurt, didn't help, and were just "there." Such it is human genes for producing tails; most of us have the genes (option), but it doesn't express itself in any way.

For the seven benchmarks on the Opteron, Acovea produced the following results:

  • alma
    Acovea's solution:
    -fno-cprop-registers -fno-delayed-branch -fno-crossjumping -fstrength-reduce -fcaller-saves -fforce-mem -fschedule-insns -fstrict-aliasing -fdelete-null-pointer-checks -freorder-functions -falign-labels -funit-at-a-time -falign-functions -finline-functions -frename-registers -fweb -fmove-all-movables -freduce-all-givs -ftracer -funswitch-loops -funroll-loops -mieee-fp -mno-push-args -maccumulate-outgoing-args -mno-align-stringops -minline-all-stringops -mfpmath=sse,387 -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -finline-limit=600


    Strongly Optimistic Options:
    -fschedule-insns -fweb -mieee-fp -funsafe-math-optimizations


    Strongly Pessimistic Options:
    -fno-guess-branch-probability -fschedule-insns2 -fregmove -ffloat-store -mfpmath=387 -mfpmath=sse


    Acovea found an option set that was 10% faster than the default options.
     
  • evo
    Acovea's solution:
    -fno-merge-constants -fno-defer-pop -fno-omit-frame-pointer -fno-cprop-registers -fno-if-conversion -fno-if-conversion2 -fno-delayed-branch -fno-crossjumping -fexpensive-optimizations -fforce-mem -fschedule-insns2 -fdelete-null-pointer-checks -fsched-interblock -freorder-functions -finline-functions -fprefetch-loop-arrays -fmove-all-movables -fno-inline -ftracer -mfpmath=sse,387 -funsafe-math-optimizations -fno-trapping-math -fno-signaling-nans -finline-limit=600


    Strongly Optimistic Options:
    -fno-defer-pop -fno-cprop-registers -fno-if-conversion2 -frerun-loop-opt -falign-labels


    Strongly Pessimistic Options:
    -fno-thread-jumps -fpeephole2 -fstrict-aliasing -funroll-loops -funroll-all-loops -mfpmath=387


    On this test, the -On options all turned in similar performance; Acovea's suggestion beat them by 10%. Contrary to implications, -O1 performed better than -O2 and -O3 on evo.
     
  • fft
    Acovea's solution:
    -fno-omit-frame-pointer -fno-if-conversion -fno-delayed-branch -foptimize-sibling-calls -fcse-follow-jumps -fcse-skip-blocks -fexpensive-optimizations -fstrength-reduce -fcaller-saves -fpeephole2 -fschedule-insns -fschedule-insns2 -fregmove -fstrict-aliasing -fdelete-null-pointer-checks -freorder-blocks -funit-at-a-time -finline-functions -frename-registers -fweb -fprefetch-loop-arrays -fmove-all-movables -fno-inline -fpeel-loops -fbranch-target-load-optimize2 -mno-push-args -minline-all-stringops -ffinite-math-only -fno-signaling-nans


    Strongly Optimistic Options:
    -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays


    Strongly Pessimistic Options:
    -fno-loop-optimize -funroll-all-loops -fbranch-target-load-optimize -mfpmath=sse -mfpmath=sse,387


    The solution above generates code that is 9% faster than that generated by GCC's compsite options.
     
  • huff
    Acovea's solution:
    -fno-defer-pop -fno-cprop-registers -fno-if-conversion2 -fno-delayed-branch -fno-crossjumping -foptimize-sibling-calls -fcse-skip-blocks -fgcse -fexpensive-optimizations -frerun-cse-after-loop -fcaller-saves -fforce-mem -fpeephole2 -fschedule-insns2 -fregmove -fdelete-null-pointer-checks -freorder-blocks -fsched-interblock -fsched-spec -freorder-functions -falign-loops -falign-labels -funit-at-a-time -falign-functions -fmove-all-movables -fpeel-loops -funroll-all-loops -mieee-fp -mno-push-args -maccumulate-outgoing-args -mno-align-stringops -mfpmath=387 -fno-math-errno -ffinite-math-only -fno-signaling-nans


    Strongly Optimistic Options:
    -fno-crossjumping -funroll-all-loops


    Strongly Pessimistic Options:
    -fno-guess-branch-probability -fno-if-conversion -fstrength-reduce -funroll-loops -fbranch-target-load-optimize -mfpmath=sse,387


    Acovea shines on this benchmark, improving performance by 17% as compared to the next-best solution, -O3. As with evo, -O1 performed better than -O2 and -O3 on huff.
     
  • lin
    Acovea's solution:
    -fno-merge-constants -fno-guess-branch-probability -fno-if-conversion2 -fno-delayed-branch -fgcse -fstrength-reduce -frerun-cse-after-loop -frerun-loop-opt -fforce-mem -fschedule-insns -fschedule-insns2 -fregmove -fstrict-aliasing -fdelete-null-pointer-checks -freorder-blocks -fsched-interblock -fsched-spec -falign-labels -funit-at-a-time -fweb -fprefetch-loop-arrays -fmove-all-movables -fno-inline -mno-align-stringops -minline-all-stringops -mfpmath=sse -fno-trapping-math -fno-signaling-nans -finline-limit=700


    Strongly Optimistic Options:
    -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays


    Strongly Pessimistic Options:
    -fno-loop-optimize -funroll-all-loops -fbranch-target-load-optimize2 -mfpmath=sse -mfpmath=sse,387 -ffinite-math-only


    Acovea found an option set for producing code that is produced code more than 8% faster than any -On option.
     
  • mat1
    Acovea's solution:
    -fno-thread-jumps -fno-omit-frame-pointer -fno-if-conversion2 -fno-delayed-branch -fno-crossjumping -fcse-follow-jumps -fexpensive-optimizations -fstrength-reduce -frerun-cse-after-loop -fforce-mem -fschedule-insns -fschedule-insns2 -fregmove -fdelete-null-pointer-checks -freorder-blocks -fsched-interblock -fsched-spec -freorder-functions -falign-loops -falign-functions -finline-functions -frename-registers -fweb -fpeel-loops -ftracer -funroll-all-loops -fbranch-target-load-optimize2 -minline-all-stringops -ffinite-math-only


    Strongly Optimistic Options:
    -fstrength-reduce -fschedule-insns -frename-registers -fweb


    Strongly Pessimistic Options:
    -fno-guess-branch-probability -fno-loop-optimize -ffloat-store -fbranch-target-load-optimize -mfpmath=sse,387


    Another good result from Acovea, generating 11% faster code.
     

Acovea found and overall 14% improvement in generated code speed, as compared to the -On options.

Conclusions

The results I've presented describe the effects of different optimizations on a set of relatively straight-forward programs that perform well-defined tasks. My conclusions so far have been these:

Acovea 5.0 is a substantial improvement over older versions.
With version 5, Acovea identified a "better" solution in every single case. And the improvements found are significant; for a compute-bound application, gaining even 8% is A Good Thing.

A higher "-O" level does not guarantee faster code.
By implication, -O3 should produce code that is faster than -O1, but it doesn't. The complex interactions of many different optimization techniques seem very likely to conflict; the complexity is simply too much to understand. Acovea can test these interactions, and produce better optimization sets that avoid conflicts and mutual pessimisms.

Different algorithms are most effective with different optimizations.
Again, an obvious pearl of wisdom that is sometimes forgotten. Real (i.e., non-benchmark) programs do many things; using a profiler, the critical loops can be identified, and their algorithms refined as much as possible by hand. Once the algorithms are tuned, a tool like Acovea will identify the best possible optimization settings. Applying -O3 to an entire program is not likely to produce the fastest program; using algorithm-specific optimizations to compile specific, critical routines is likely to produce faster code.

As always, I look forward to considered comments.

-- Scott