Acovea 5.0
Analysis of GCC 4.0.0 for Pentium 4
1 May 2005
If you find this article useful, please consider supporting the author's free software efforts with a donation, no matter how small.
Abstract
This article describes results obtained by running Acovea 5.0.0 to test GCC 4.0.0 on an Intel Pentium 4 workstation. When applied to several example benchmark programs, Acovea identified optimization sets that reduced three individual benchmark run times by between 2 and 8%, when compared against code generated using the compiler's predefined -On optimization sets. In no case did Acovea fail to find a solution that at least equalled the performance of the -On options
Past articles have combined results from testing on both Opteron and Pentium 4 processors. This article focuses solely on the Pentium 4, to avoid the non sequitur and perhaps misleading comparison of processors that has accompanied past articles.
Herein, I am writing about Acovea as it performs on the Pentium 4 workstation I own, Tycho. All tests were performed using GCC 4.0.0, a major new version released in mid-April 2005.
Results obtained with Acovea 5.0 were clearer than those from older versions. This is due to better statistical analysis; the underlying algorithm is the same. In every case, Acovea suggested a set of options that produced faster code than any of the preset -On options.
Benchmark tests
The current benchmark suite consists of six algorithm-specific programs, all written to the 1999 ISO C Standard. I intend to update the test suite following the release of Acovea 5.0. Such grand plans are, of course, predicated on my finding some of that mythical "spare time" I keep hearing about...
The individual tests are:
-
alma
Calculates the daily ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos(). -
evo
A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math. -
fft
Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math. -
huff
Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks. -
lin
Solves a large linear equation via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization. -
mat1
Multiplies two very large matrices using the brute-force algorithm; tests loop optimization.
Past articles included a test called "tree" which is not present here. The reason: It broke, and I didn't want to spend time hunting bugs in a benchmark I'll be replacing soon. The "tree" test was a quick hack that I'll replace with a new benchmark based on the production-quality code in Itzam.
Compiler Options
For the purpose of this article, I evolved the best set of options for the seven benchmark programs listed above; I also created a composite program that invokes all seven benchmarks to produce a single, overall result. At left is a chart showing the relative performance of each benchmark as compiled with different optimization settings.
- O1, O2, O3 - baseline compiles with those collective options defined by GCC.
- O3 fast math - baseline compiles with -O3 -ffast-math
- acovea - compiled with options evolved by Acovea, with -O1 as the base level of optimization (though it implements switches to turn off options implied by -O1)
Acovea evolves optimization sets from the -O1 level. -O1 must be included if any optimization is to occur; I have included switches in the GCC configuration files to turn off various optimizations (e.g., -fno-merge-constants) implied by -O1, thus allowing evolution to remove options implied at the base level. For GCC 3.4 on the Pentium 4, Acovea evolves optimization sets selected from 60-some options, some of which include multiple possibilities (e.g., -mfpath). The -fnew-ra option causes compiler crashes and is disabled for these tests.
Discussions on the GCC mailing list (thread 1, thread 2, thread 3) revealed different perceptions about the implications of the -ffast-math option. For the tests shown here, I included -ffast-math in the evolutionary mix. I've performed experiments that show how -ffast-math does not reduce (and in some cases, improves) accuracy when used with industry-standard benchmarks like Paranoia. The "floating-point accuracy" story is very complicated, but it deserves its own, to-be-written-when-I-have-time article. And no, I haven't had time in the year since I last wrote that sentence!
Results
Acovea 5.0 was somewhat successful when applied to the Pentium; the results were not as dramatic as they were when I applied Acovea to the Opteron. For every single benchmark, Acovea produced a set of options that generated faster code than any of the built-in -On sets, although in three cases the improvement was trivial. You can download the complete set of test runs here. These are test files generated by the command-line and GUI drivers, and include the full sets of optimizations evolved by Acovea.
For every test compile on Corwin, Acovea used gcc -lrt -lm -std=gnu99 -O1 -march=pentium4; the -march=pentium4 implies -msse, -msse2, -m3dnow, and -mmmx.
For each test, I provide Acovea's top options set along with lists of strongly pessimistic and optimistic flags. An "optimistic" option was common in "good" solutions during the entire evolutionary process. A "pessimistic" option, on the other hand, was absent from the "good" solutions, either because it hurt run times or it simply made no difference at all and was ignored as a "junk gene". It's important to realize that Acovea's best solution may contain options that had no practical effect on code generation; they didn't hurt, didn't help, and were just "there." Such it is human genes for producing tails; most of us have the genes (option), but it doesn't express itself in any way.
For the seven benchmarks on the Opteron, Acovea produced the following results:
-
alma
Acovea's solution:
-fno-defer-pop -fno-thread-jumps -ftree-ccp -ftree-dce -ftree-dse -ftree-ter -ftree-lrs -ftree-sra -ftree-fre -ftree-ch -fcrossjumping -foptimize-sibling-calls -fcse-follow-jumps -fexpensive-optimizations -fcaller-saves -fforce-addr -fschedule-insns -fschedule-insns2 -fregmove -fstrict-aliasing -fdelete-null-pointer-checks -fgcse-lm -freorder-functions -funit-at-a-time -falign-functions -falign-jumps -finline-functions -fomit-frame-pointer -fpeel-loops -funroll-all-loops -freschedule-modulo-scheduled-loops -ftree-loop-im -fivopts -fvariable-expansion-in-unroller -mieee-fp -mno-push-args -mno-align-stringops -minline-all-stringops -mfpmath=387 -funsafe-math-optimizations -fno-trapping-math -finline-limit=700
Strongly Optimistic Options:
ftree-fre -finline-functions -fpeel-loops -funsafe-math-optimizations -finline-limit
Strongly Pessimistic Options:
-floop-optimize2 -momit-leaf-frame-pointer -ffloat-store -fno-inline -mfpmath=sse -mfpmath=sse,387 -fno-math-errno
Acovea turned in its best performance on this benchmark, gaining 8% over the -On options.
-
evo
Acovea's solution:
-fno-thread-jumps -fno-cprop-registers -fno-if-conversion2 -floop-optimize2 -ftree-dominator-opts -ftree-lrs -ftree-fre -ftree-ch -fmerge-constants -fcrossjumping -fgcse -fcaller-saves -fforce-mem -fschedule-insns -fschedule-insns2 -fregmove -fstrict-aliasing -fthread-jumps -fsched-interblock -freorder-blocks -freorder-functions -funit-at-a-time -falign-functions -falign-loops -falign-labels -finline-functions -funswitch-loops -fgcse-after-reload -fpeel-loops -funswitch-loops -fbranch-target-load-optimize2 -fmodulo-sched -freschedule-modulo-scheduled-loops -maccumulate-outgoing-args -minline-all-stringops -fno-math-errno -fno-trapping-math
Strongly Optimistic Options:
-funswitch-loops -fmodulo-sched
Strongly Pessimistic Options:
-fforce-addr -fpeephole2 -momit-leaf-frame-pointer -ffloat-store -funroll-all-loops -mfpmath=387 -mfpmath=sse -mfpmath=sse,387 -funsafe-math-optimizations
Results for this benchmark are just plain odd. The fastest code was turned in by -Os, although Acovea was able to match it.
-
fft
Acovea's solution:
-fno-merge-constants -fno-thread-jumps -fno-guess-branch-probability -fno-delayed-branch -ftree-dce -ftree-dominator-opts -ftree-ter -ftree-lrs -ftree-sra -ftree-copyrename -foptimize-sibling-calls -fcse-follow-jumps -fcse-skip-blocks -fgcse -frerun-cse-after-loop -frerun-loop-opt -fcaller-saves -fregmove -fstrict-aliasing -freorder-blocks -fthread-jumps -fsched-interblock -fsched-spec -fgcse-after-reload -fno-inline -funswitch-loops -funroll-all-loops -fmodulo-sched -fgcse-sm -freschedule-modulo-scheduled-loops -ftree-loop-im -fivopts -fvariable-expansion-in-unroller -mno-push-args -maccumulate-outgoing-args -minline-all-stringops -funsafe-math-optimizations -ffinite-math-only -fno-signaling-nans -fcx-limited-range -finline-limit=600
Strongly Optimistic Options:
-fstrict-aliasing (3.219) -maccumulate-outgoing-args (1.647) -funsafe-math-optimizations
Strongly Pessimistic Options:
-fno-guess-branch-probability -fno-loop-optimize -finline-functions -momit-leaf-frame-pointer -ffloat-store -fbranch-target-load-optimize -mfpmath=387 -mfpmath=sse,387
Acovea's solution was no better the -O2 and -O3. Overall, there seemed to be little room for optimizing this algorithm at all on the Pentium 4.
-
huff
Acovea's solution:
-fno-defer-pop -fno-guess-branch-probability -fno-if-conversion2 -fno-loop-optimize -ftree-ccp -ftree-ter -ftree-lrs -ftree-sra -ftree-copyrename -ftree-fre -ftree-ch -fmerge-constants -fcrossjumping -fcse-skip-blocks -fgcse -fexpensive-optimizations -fschedule-insns2 -fregmove -freorder-blocks -fthread-jumps -fgcse-lm -freorder-blocks -freorder-functions -funit-at-a-time -falign-functions -falign-jumps -ftree-pre -fgcse-after-reload -fomit-frame-pointer -ffloat-store -fprefetch-loop-arrays -fmodulo-sched -fno-function-cse -fgcse-sm -freschedule-modulo-scheduled-loops -ftree-loop-im -mieee-fp -minline-all-stringops -mfpmath=sse,387 -funsafe-math-optimizations -fno-trapping-math
Strongly Optimistic Options:
-ftree-sra -freorder-blocks -fsched-interblock -freorder-blocks -fomit-frame-pointer -fgcse-sm
Strongly Pessimistic Options:
-fno-if-conversion -fforce-addr -momit-leaf-frame-pointer -ftracer -funroll-all-loops -fbranch-target-load-optimize2 -mfpmath=387 -mfpmath=sse
Acovea improved generate code performance by 3% as compared to the next-best solution, -O3.
-
lin
Acovea's solution:
-fno-merge-constants -fno-cprop-registers -fno-if-conversion2 -ftree-ter -ftree-lrs -ftree-sra -ftree-ch -fexpensive-optimizations -fstrength-reduce -frerun-cse-after-loop -fcaller-saves -fpeephole2 -fschedule-insns -fstrict-aliasing -fdelete-null-pointer-checks -freorder-blocks -fsched-spec -freorder-blocks -falign-loops -falign-labels -ftree-pre -finline-functions -momit-leaf-frame-pointer -fno-inline -fpeel-loops -ftracer -funroll-all-loops -fmodulo-sched -freschedule-modulo-scheduled-loops -ftree-loop-im -fivopts -ftree-vectorize -mieee-fp -maccumulate-outgoing-args -mno-align-stringops -minline-all-stringops -funsafe-math-optimizations -fno-trapping-math -fno-signaling-nans
Strongly Optimistic Options:
-fstrict-aliasing -fno-inline -funroll-all-loops
Strongly Pessimistic Options:
-fno-loop-optimize -floop-optimize2 -falign-jumps -fomit-frame-pointer -ffloat-store -funroll-loops -mfpmath=387 -mfpmath=sse -mfpmath=sse,387
Acovea found an option set for producing code that is produced code that is 2% faster than any -On option.
-
mat1
Acovea's solution:
-fno-merge-constants -fno-defer-pop -fno-guess-branch-probability -fno-cprop-registers -fno-delayed-branch -floop-optimize2 -ftree-ccp -ftree-dominator-opts -ftree-ter -ftree-lrs -ftree-copyrename -fmerge-constants -fcrossjumping -foptimize-sibling-calls -fcse-follow-jumps -fgcse -fexpensive-optimizations -fcaller-saves -fforce-mem -fpeephole2 -fschedule-insns -fschedule-insns2 -fregmove -fdelete-null-pointer-checks -freorder-blocks -fsched-interblock -freorder-functions -falign-jumps -falign-loops -falign-labels -finline-functions -funswitch-loops -fgcse-after-reload -fprefetch-loop-arrays -fno-inline -fpeel-loops -funswitch-loops -fgcse-las -fivopts -maccumulate-outgoing-args -minline-all-stringops -fno-trapping-math -fno-signaling-nans -finline-limit=500
Strongly Optimistic Options:
-ftree-ccp -fgcse
Strongly Pessimistic Options:
-fforce-addr -momit-leaf-frame-pointer -ffloat-store -funroll-loops -funroll-all-loops -fbranch-target-load-optimize -mfpmath=sse -mfpmath=sse,387
This is the single test where Acovea did not beat the -On options. Looking at the numbers, I'm not surprised: The fastest code was generated by -O1 alone, beating both -O2 and -O3. In other words, adding more options to -O1 doesn;t improve performance, so Acovea had nothing to work with.
The results I've presented describe the effects of different optimizations on a set of relatively straight-forward programs that perform well-defined tasks. My conclusions so far have been these:
Acovea 5.0 is an improvement over older versions.
With version 5, Acovea identified a "better" solution in every single case, though in three cases the improvement was negligible, and in one case (mat1), -O1 came out ahead of Acovea, -O2 and -O3. However, the results for Pentium 4 were not as significant as those Acovea identified for the Opteron, a difference that can likely be attributed to the relative maturity of the two code generators. The Pentium 4 has been around longer, and I'm not surprised that it's code generator is more stable than is the Opteron back end.
A higher "-O" level does not guarantee faster code.
By implication, -O3 should produce code that is faster than -O1, but it doesn't always do so in practice.
The complex interactions of many different optimization techniques seem very likely to conflict;
the complexity is simply too much to understand. Acovea can test these interactions, and produce
better optimization sets that avoid conflicts and mutual pessimisms.
Different algorithms are most effective with different optimizations.
Again, an obvious pearl of wisdom that is sometimes forgotten. Real (i.e., non-benchmark)
programs do many things; using a profiler, the critical loops can be identified, and their
algorithms refined as much as possible by hand. Once the algorithms are tuned, a
tool like Acovea will identify the best possible optimization settings. Applying
-O3 to an entire program is not likely to produce the fastest program; using
algorithm-specific optimizations to compile specific, critical routines is likely to
produce faster code.
As always, I look forward to considered comments.
-- Scott

