Acovea 5.0

Analysis of GCC 4.0.0 for Opteron

29 April 2005
 

 

Acovea Logo

Downloads
acovea-gtk-1.0.1.tar.gz
libacovea-5.1.1.tar.gz
libevocosm-3.1.0.tar.gz
libcoyotl-3.1.0.tar.gz

About Acovea
Acovea Overview
A GUI for Acovea: Acovea-GTK
Configuration Files
FAQ
About the Genetic Algorithm

Analyses
Acovea 5.0, GCC 4.0, Opteron
Acovea 5.0, GCC 4.0, Pentium 4
Acovea 5.0, GCC 3.4, Opteron
Acovea 4.0, GCC 3.4, P4, AMD64 (May-04)
Acovea 3.3, GCC 3.x, P4 (Dec-03)

Licensing
GNU General Public License (GPL)
Commercial License

If you find this article useful, please consider supporting the author's free software efforts with a donation, no matter how small.

Abstract

This article describes results obtained by running Acovea 5.0.0 to test GCC 4.0.0 on an AMD Opteron workstation. When applied to several example benchmark programs, Acovea identified optimization sets that reduced all individual benchmark run times by between 5 and 20%, when compared against code generated using the compiler's predefined -On optimization sets. In one mathematical benchmark, Acovea found a solution that improves the run time by 57%, by discovering an inconsistency in GCC's default floating-point options.

Past articles have combined results from testing on both Opteron and Pentium 4 processors. This article focuses solely on the Opteron, to avoid the non sequitur and perhaps misleading comparison of processors that has accompanied past articles.

Herein, I am writing about Acovea as it performs on the dual Opteron workstation I own, Corwin. This system is still running the original Opteron 240 processors; it will be upgraded as soon as I obtain a pair of AMD's new dual-core processors.

All tests were performed using GCC 4.0.0, a major new version released in mid-April 2005.

Results obtained with Acovea 5.0 were much superior to those from older versions. This is due to better statistical analysis; the underlying algorithm is the same. In every case, Acovea suggested a set of options that produced faster code than any of the preset -On options. Also, some code generation bugs in older versions of GCC were fixed, eliminating a couple of cases where Acovea produced exceptional results by avoiding broken optimizations. I did not find any egregious bugs in GCC 4.0.0, at least in terms of Acovea testing.

Benchmark tests

The current benchmark suite consists of six algorithm-specific programs, all written to the 1999 ISO C Standard. I intend to update the test suite following the release of Acovea 5.0. Such grand plans are, of course, predicated on my finding some of that mythical "spare time" I keep hearing about...

The individual tests are:

  • alma
    Calculates the daily ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos().
  • evo
    A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math.
  • fft
    Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math.
  • huff
    Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks.
  • lin
    Solves a large linear equation via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization.
  • mat1
    Multiplies two very large matrices using the brute-force algorithm; tests loop optimization.

Past articles included a test called "tree" which is not present here. The reason: It broke, and I didn't want to spend time hunting bugs in a benchmark I'll be replacing soon. The "tree" test was a quick hack that I'll replace with a new benchmark based on the production-quality code in Itzam.

Compiler Options

For the purpose of this article, I evolved the best set of options for the seven benchmark programs listed above; I also created a composite program that invokes all seven benchmarks to produce a single, overall result. At left is a chart showing the relative performance of each benchmark as compiled with different optimization settings.

  • O1, O2, O3 - baseline compiles with those collective options defined by GCC.
  • O3 fast math - baseline compiles with -O3 -ffast-math
  • acovea - compiled with options evolved by Acovea, with -O1 as the base level of optimization (though it implements switches to turn off options implied by -O1)

Acovea evolves optimization sets from the -O1 level. -O1 must be included if any optimization is to occur; I have included switches in the GCC configuration files to turn off various optimizations (e.g., -fno-merge-constants) implied by -O1, thus allowing evolution to remove options implied at the base level. For GCC 3.4 on the Opteron, Acovea evolves optimization sets selected from 60-some options, some of which include multiple possibilities (e.g., -mfpath). The -fnew-ra option causes compiler crashes and is disabled for these tests.

Discussions on the GCC mailing list (thread 1, thread 2, thread 3) revealed different perceptions about the implications of the -ffast-math option. For the tests shown here, I included -ffast-math in the evolutionary mix. I've performed experiments that show how -ffast-math does not reduce (and in some cases, improves) accuracy when used with industry-standard benchmarks like Paranoia. The "floating-point accuracy" story is very complicated, but it deserves its own, to-be-written-when-I-have-time article. And no, I haven't had time in the year since I last wrote that sentence!

Results

Acovea 5.0 was very successful when applied to the Opteron. For every single benchmark, Acovea produced a set of options that generated faster code than any of the built-in -On sets. You can download the complete set of test runs here. These are test files generated by the command-line and GUI drivers, and include the full sets of optimizations evolved by Acovea.

Corwin's Opterons run Gentoo's AMD64 Linux; the system is operating entirely on the 64-bit level. That means 64-bit libraries and versions of GCC that generate 64-bit x86_64 (or AMD64, if you like) object code. While Corwin does have dual processors, the current benchmarks are all serial. I intend to produce some parallel benchmarks when (yes, that phrase is coming) I have the time. :) Note that a huge infusion of cash would increase my free time to work on these projects. :)

For every test compile on Corwin, Acovea used gcc -lrt -lm -std=gnu99 -O1 -march=opteron; the -march=opteron (which I'm told is redundant for the AMD64 compiler) implies -msse, -msse2, -m3dnow, and -mmmx.

For each test, I provide Acovea's top options set along with lists of strongly pessimistic and optimistic flags. An "optimistic" option was common in "good" solutions during the entire evolutionary process. A "pessimistic" option, on the other hand, was absent from the "good" solutions, either because it hurt run times or it simply made no difference at all and was ignored as a "junk gene". It's important to realize that Acovea's best solution may contain options that had no practical effect on code generation; they didn't hurt, didn't help, and were just "there." Such it is human genes for producing tails; most of us have the genes (option), but it doesn't express itself in any way.

For the seven benchmarks on the Opteron, Acovea produced the following results:

  • alma
    Acovea's solution:
    fno-merge-constants -fno-defer-pop -fno-delayed-branch -floop-optimize2 -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-sra -ftree-ch -fmerge-constants -fexpensive-optimizations -fstrength-reduce -frerun-loop-opt -fpeephole2 -fschedule-insns -fregmove -fdelete-null-pointer-checks -freorder-blocks -fgcse-lm -freorder-functions -funit-at-a-time -falign-functions -falign-labels -finline-functions -funswitch-loops -fgcse-after-reload -fprefetch-loop-arrays -ftracer -funroll-loops -fno-function-cse -fgcse-sm -fgcse-las -freschedule-modulo-scheduled-loops -maccumulate-outgoing-args -minline-all-stringops -mfpmath=387 -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -fno-signaling-nans -finline-limit=700


    Strongly Optimistic Options:
    -fschedule-insns -finline-functions -mfpmath=387 -fno-math-errno -funsafe-math-optimizations -finline-limit=700


    Strongly Pessimistic Options:
    -ftree-ccp -fforce-mem -fschedule-insns2 -ffloat-store -fno-inline -mfpmath=sse -mfpmath=sse,387


    Acovea found an option set that was an amazing 57% faster than the default options. This is a remarkable achievement. Being the suspicious sort, I investigated, and found that -funsafe-math-optimizations (a component of -ffast-math) is only effective when it is coupled with -mfpmath=387. For Opteron processors, the default settings is -mfpmath=sse — and SSE mathematics do not include all of the intrinsic mathematical functions that can be inlined via -funsafe-math-optimizations. This is an example of Acovea can identify and help solve serious code generation problems.

    This discovery doubles the speed of a high-performance floating-point application on AMD64 hardware. GCC should strongly consider changing the default for AMD64 to -mfpmath=387 when -funsafe-math-optimizations is enabled.
     
  • evo
    Acovea's solution:
    -fno-omit-frame-pointer -fno-guess-branch-probability -fno-if-conversion -fno-if-conversion2 -fno-delayed-branch -ftree-dce -ftree-dominator-opts -ftree-ter -ftree-lrs -ftree-sra -fcrossjumping -foptimize-sibling-calls -fcse-follow-jumps -fstrength-reduce -fforce-addr -fpeephole2 -fregmove -fstrict-aliasing -fthread-jumps -fgcse-lm -fsched-spec -freorder-blocks -falign-functions -ftree-pre -finline-functions -ftracer -funswitch-loops -funroll-loops -fno-function-cse -fgcse-las -fvariable-expansion-in-unroller -mno-push-args -maccumulate-outgoing-args -mno-align-stringops -minline-all-stringops -mfpmath=sse -funsafe-math-optimizations -fno-signaling-nans


    Strongly Optimistic Options:
    -fno-guess-branch-probability -funit-at-a-time -falign-functions -ftracer


    Strongly Pessimistic Options:
    -fno-loop-optimize -floop-optimize2 -fstrength-reduce -fforce-mem -ffloat-store -fbranch-target-load-optimize -fbranch-target-load-optimize2 -mfpmath=387


    On this test, the -On options all turned in similar performance; Acovea's suggestion beat them by only 5%. Contrary to implications, -O1 performed better than -O2 and -O3 on evo.
     
  • fft
    Acovea's solution:
    -momit-leaf-frame-pointer -fno-cprop-registers -ftree-ccp -ftree-dce -ftree-dominator-opts -ftree-ter -ftree-lrs -ftree-sra -ftree-copyrename -fmerge-constants -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fexpensive-optimizations -fstrength-reduce -fpeephole2 -fschedule-insns -fregmove -fstrict-aliasing -fdelete-null-pointer-checks -freorder-blocks -fthread-jumps -fsched-interblock -freorder-blocks -freorder-functions -falign-functions -falign-jumps -falign-labels -funswitch-loops -fgcse-after-reload -fprefetch-loop-arrays -funswitch-loops -funroll-loops -fmodulo-sched -fgcse-sm -freschedule-modulo-scheduled-loops -ftree-loop-im -fvariable-expansion-in-unroller -mieee-fp -mfpmath=sse -fno-trapping-math -ffinite-math-only -fno-signaling-nans -fcx-limited-range -finline-limit=700


    Strongly Optimistic Options:
    -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays -fno-signaling-nans


    Strongly Pessimistic Options:
    -fno-loop-optimize -floop-optimize2 -fforce-mem -fforce-addr -ffloat-store -funroll-all-loops -mfpmath=387 -mfpmath=sse -fno-trapping-math


    The solution above generates code that is 20% faster than that generated by GCC's composite options.
     
  • huff
    Acovea's solution:
    -fno-merge-constants -fno-defer-pop -fno-if-conversion2 -fno-delayed-branch -ftree-ccp -ftree-dce -ftree-lrs -ftree-sra -ftree-copyrename -ftree-fre -fcse-follow-jumps -fcse-skip-blocks -fgcse -fexpensive-optimizations -frerun-cse-after-loop -fforce-mem -fpeephole2 -fschedule-insns -fstrict-aliasing -fdelete-null-pointer-checks -freorder-blocks -fgcse-lm -fsched-interblock -fsched-spec -freorder-blocks -funit-at-a-time -falign-functions -falign-jumps -funswitch-loops -fgcse-after-reload -ffloat-store -fno-inline -ftracer -funswitch-loops -fgcse-sm -freschedule-modulo-scheduled-loops -ftree-loop-ivcanon -fivopts -mieee-fp -mno-push-args -maccumulate-outgoing-args -minline-all-stringops -mfpmath=sse,387 -fno-math-errno


    Strongly Optimistic Options:
    -fcse-skip-blocks -freorder-blocks -ftree-pre -ftracer -freschedule-modulo-scheduled-loops -maccumulate-outgoing-args -minline-all-stringops


    Strongly Pessimistic Options:
    -momit-leaf-frame-pointer -fno-guess-branch-probability -fno-if-conversion -fno-loop-optimize -floop-optimize2 -fforce-addr -funroll-loops -mfpmath=387 -mfpmath=sse


    Acovea shines on this benchmark, improving performance by 13% as compared to the next-best solution, -O3. As with evo, -O1 performed better than -O2 and -O3 on huff.
     
  • lin
    Acovea's solution:
    -fno-defer-pop -fno-guess-branch-probability -fno-cprop-registers -fno-if-conversion2 -ftree-dce -ftree-ter -ftree-sra -ftree-ch -fcrossjumping -fcse-follow-jumps -fstrength-reduce -fcaller-saves -fpeephole2 -fschedule-insns2 -fregmove -fstrict-aliasing -fdelete-null-pointer-checks -fgcse-lm -fsched-interblock -fsched-spec -freorder-blocks -funit-at-a-time -falign-loops -falign-labels -ftree-pre -fprefetch-loop-arrays -ftree-loop-ivcanon -mieee-fp -mno-push-args -mno-align-stringops -mfpmath=sse -fno-math-errno -funsafe-math-optimizations -ffinite-math-only -fno-signaling-nans -fcx-limited-range


    Strongly Optimistic Options:
    -fno-guess-branch-probability -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays -fpeel-loops


    Strongly Pessimistic Options:
    -momit-leaf-frame-pointer -fno-loop-optimize -floop-optimize2 -fforce-addr -ffloat-store -fbranch-target-load-optimize -fgcse-sm -mfpmath=387 -mfpmath=sse,387


    Acovea found an option set for producing code that is produced code that is 16% faster than any -On option.
     
  • mat1
    Acovea's solution:
    -fno-merge-constants -fno-thread-jumps -momit-leaf-frame-pointer -floop-optimize2 -ftree-dse -ftree-ter -ftree-lrs -ftree-copyrename -ftree-ch -foptimize-sibling-calls -fcse-follow-jumps -fexpensive-optimizations -frerun-loop-opt -fschedule-insns2 -fstrict-aliasing -fdelete-null-pointer-checks -fgcse-lm -falign-functions -fprefetch-loop-arrays -ftracer -funswitch-loops -funroll-loops -fbranch-target-load-optimize -fgcse-sm -ftree-loop-im -ftree-loop-ivcanon -fvariable-expansion-in-unroller -minline-all-stringops -mfpmath=387 -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -fno-signaling-nans


    Strongly Optimistic Options:
    -foptimize-sibling-calls -fforce-mem -fvariable-expansion-in-unroller -funsafe-math-optimizations


    Strongly Pessimistic Options:
    -fno-omit-frame-pointer -fforce-addr -ffloat-store -fbranch-target-load-optimize -mfpmath=sse -mfpmath=sse,387


    Another good result from Acovea, generating 11% faster code.
     

Acovea found and overall 22% improvement in generated code speed, as compared to the -On options.

Conclusions

The results I've presented describe the effects of different optimizations on a set of relatively straight-forward programs that perform well-defined tasks. My conclusions so far have been these:

Acovea 5.0 is a substantial improvement over older versions.
With version 5, Acovea identified a "better" solution in every single case. And the improvements found are significant; for a compute-bound application, gaining even 8% is A Good Thing.

A higher "-O" level does not guarantee faster code.
By implication, -O3 should produce code that is faster than -O1, but it doesn't. The complex interactions of many different optimization techniques seem very likely to conflict; the complexity is simply too much to understand. Acovea can test these interactions, and produce better optimization sets that avoid conflicts and mutual pessimisms.

Different algorithms are most effective with different optimizations.
Again, an obvious pearl of wisdom that is sometimes forgotten. Real (i.e., non-benchmark) programs do many things; using a profiler, the critical loops can be identified, and their algorithms refined as much as possible by hand. Once the algorithms are tuned, a tool like Acovea will identify the best possible optimization settings. Applying -O3 to an entire program is not likely to produce the fastest program; using algorithm-specific optimizations to compile specific, critical routines is likely to produce faster code.

As always, I look forward to considered comments.

-- Scott

 
Send E-mail

Consulting Services
Scott's CV

FAQ
Scott's Books
Reviews
Bibliography

Privacy Policy
Legal Stuff



©  2008
Scott Robert Ladd
All rights reserved.
Established 1996