Linux C and C++ Compilers
A comparison via benchmarks on Opteron and Pentium
by Scott Robert Ladd, 18 September 2004
Once more, unto the breach, dear friends, once more...
Henry V (as voiced by Shakespeare) may have been talking about war, but the statement seems apropos when producing a set of benchmarks. Theoretically, benchmarks should provide clear, unequivocal information that guides people in making choices about software and hardware.
Reality is, alas, somewhat less than ideal; benchmarks are quite subjective, prone to interpretation, and rarely show a clear picture. Benchmarking is always a tricky business, especially when it comes to compilers: A reviewer selects a limited suite of benchmarks that demonstrate specific aspects of code generation, thus predicting general compiler performance from a limited data set. Not terribly scientific, to be sure.
So why do benchmarks at all? Because we can still learn something about the relative performance of different tools, by comparing results in a controlled environment. Benchmarks are guidelines, not absolute answers. And to be valid, benchmark source code must be available, and the testing conditions clearly stated.
This benchmarking article marks the beginning of an ongoing project to track the quality of programming tools for the Linux environment. As significant events occur, I will update this article, shifting past results to archive pages.
The Compilers
At this time, I'm testing two Linux-based C and C++ compilers: GNU's Compiler Collection (GCC), and Intel's C++ product. GCC is, of course, released under the GNU General Public License, and I own a commercial license for the Intel compiler. I realize that other commercial compilers exist, but I do not own them and, as such, cannot test them. If other compiler vendors wish to be included in these ongoing comparisons, I'd be more than happy to talk to them.
GNU's Compiler Collection
GCC (which includes C, C++, Objective-C, Fortran, and Ada compilers) is arguably the most important tool for the creation of free software; without a free-as-in-speech and -as-in-beer compiler, it is unlikely that Linux (and perhaps BSD and OS X) would exist. I have an abiding interest in the quality of GCC. In all fairness, I do some peripheral work on GCC as time permits, so I am not an entirely unbiased observer.
The next major release of GCC will be version 4.0.0. The decision to make a major version change (the current release is 3.4.2) was handed down by the GCC Steering Committee after considerable and heated debate. Making a major version number change sends a signal to users that something "big" has happened. The last jump, from 2.x to 3.x in 2001, signaled a major change in GCC's development process and maturity. Going to 4.x is a more subtle change, being more evolutionary and less revolutionary. Among the new features are a modern Fortran 95 compiler (replacing g77) and a completely new optimization framework.
The internal changes may not be obvious to most GCC users, but they mark a significant architectural change with great potential. "Tree-ssa" is about the future: improved code generation and (eventually) simplified parallel programming (OpenMP). I've had solid success using GCC 4.0.0 (from CVS) for my daily in-house development. The addition of Fortran 95 will be a boon for scientific work.
GCC 4.0 is still being worked on, and the numbers here represent the current state of a program undergoing rapid development. I include GCC 4.0 to spur its improvement, and to dispel certain rumors and assumptions that have appeared in some forums.
With each new version of GCC, the GNU team adds many features, ranging from improved compliance with Standards to new code generation options. Many users have expressed concern about compile times and generated code quality; different versions of GCC behave in disparate ways. As such, I test several releases: 3.3, 3.4, and the upcoming 4.0, building the compilers from source code extracted from the GNU CVS repository.
Intel's C++ Compiler
Intel produces C, C++, and Fortran 95 compilers for Windows and Linux. Under Linux, the Intel compilers are available with a non-commercial license, meaning that anyone can download and use the full compiler for non-profit work. The Intel non-commercial license is not the same thing as the GNU General Public License (GPL); the Intel compilers are not "free-as-in-speech" software -- however, they are excellent tools that can be used for working on free software and non-commercial projects.
Intel's command-line options vary from GCC's, so it won't operate as a drop-in replacement for the GNU compiler. You can compile the Linux kernel with ICC if you so desire, but I'm more comfortable sticking with GCC for building Linux-based systems, since that is their "native" environment.
By supporting GCC's extensions to C and C++, the Intel compiler is clearly trying to attract users to their development tools -- and thus to Intel processors. Intel also offers features that GCC does not, including OpenMP for simplified parallel programming. If you check forums dedicated to scientific work, video encoding, or source-based Linux distributions, you'll likely find people using and talking about Intel's compiler.
Why don't I use the Intel compiler on my Opteron system? The simple answer is that Intel's compiler does not produce code that takes advantage of the Opteron's features. Intel's compiler also produces executable code that disables some of its optimizations on non-Intel hardware. It's unrealistic to expect that Intel will create a compiler that works well on their competitor's hardware.
Intel's most recent compiler is version 8.1, which is quite different from 8.0 and earlier releases. Most significantly, ICC 8.1 detects your local GCC installation, using the associated libraries and include files. In the past, Intel defaulted to using the Dinkumware libraries it included with its compiler. This change caused quite a bit of annoyance when trying to run 8.1 on my Gentoo-based machines, because Gentoo's GCC is somehow incompatible with ICC's expectations. I'm investigating this further; for now, just be aware that Intel's 8.1 compiler requires a "standard" GCC install.
Intel C++ 8.1 has an additional surprise: It includes IBM's Eclipse, a Java-based development environment with C and C++ support. I find this a rather odd choice, when they could have provided integration with standard GUI IDEs like Anjuta and KDevelop. Even stranger, Intel delivers Eclipse 2.1.3, when the current version is 3.1; Intel also provides the JRockit Java runtime. I did not install Intel's GUI development tools, and I can't see them being popular with Linux developers, who tend to disdain Java. I use Eclipse 3.1 for my Java development, and it's a fine (if slow) tool; I've never considered it seriously, though, for C and C++ development.
Test Methods
I performed testing on my two Linux systems, as described below. All benchmarks are command-line programs, run from bash, with a minimum of daemons in residence. I use the Gentoo GNU/Linux distribution. Your performance may (and likely will) vary in some details, based on your hardware and Linux installation.
Corwin (Homebrew)
Gentoo AMD64 GNU/Linux, kernel 2.6 SMP
Dual Opteron 240, Tyan K8W 2885
120GB Maxtor 7200 RPM ATA-133 HD
2GB PC2700 DRAM
Radeon 9200 Pro, 128MB, HP f1903 DFP
Tycho (Homebrew)
Dual Boot: Windows XP Professional
Gentoo x86 GNU/Linux, kernel 2.6 SMP
2.8GHz Pentium 4 w/HT, Intel D850EMV2, 533MHz FSB
2x80GB Maxtor D740X 7200 RPM ATA-100 HD
512MB PC800 RDRAM
Radeon 9200 Pro, 128MB, NEC FE990
No matter which compiler options I choose, someone is likely to send me e-mail telling me I got it all wrong. Ranging from the polite to the insistent to the rude, these e-mails contain contradictory suggestions for producing fast code. In the vast majority of cases, such anecdotal assertions lack any formal proof of their validity, and, more often than not, the suggested "improvement" is ineffective or detrimental. One example: Many people insist that I use -mmmx and -msse options when specifying -march=pentium4 for GCC -- when, in fact, the -march=pentium4 option implies those special instruction sets. For -march=opteron, -mfpmath=sse is the default. I appreciate help from the audience, but some people need to give their favored settings a reality check. The GCC documentation has recently improved, and now has a much more accurate list of implied options.
Some folk may object to my use of -ffast-math -- however, in numerous accuracy tests, -ffast-math produces code that is both faster and more accurate than code generated without it. Yes, -ffast-math has other aspects that make for interesting debate; however, such discussions belong in another article.
This article is not a comparison of the Pentium 4 and Opteron processors; my test systems are far too different for any such comparison to have meaning. And please do not ask me to test on systems I don't own, unless you're willing to send me hardware.
So, without further ado, here are the compiler options I selected:
- for all GCC versions on Pentium 4:
  -march=pentium4 -mfpmath=sse -fomit-frame-pointer -ffast-math -O3
- for GCC 3.4 and 4.0 on Opteron:
  -march=opteron -ffast-math -O3
- for GCC 3.3 on Opteron:
  -march=athlon-xp -ffast-math -O3
- for Intel C++ on Pentium 4:
  icc -O3 -xN -tpp7 -ipo (no -ipo on POV-Ray or LAME due to linker errors)
The Benchmarks
In the case of a compiler, code generation benchmarks give us an empirical comparison of products that serves to guide our choices of tools. If I'm developing a number-crunching application, I appreciate knowing that compiler "A" produces faster code than compiler "B". In my experience, benchmarking serves as a guide, a filter that shows trends and identifies areas of concern.
Each benchmark is accompanied by a pair of tables containing performance data for Opteron and Pentium 4 tests. I've highlighted the "best" value in each column in green; a red value is the "worst" result in a column.
I've chosen the following set of benchmarks based on the types of work I do and my desire to analyze the computational "muscle" of code generated by different compilers. Based on experience and reader response, I may change the benchmark suite from time to time.
POV-Ray
Ray tracing
POV-Ray is a venerable tool for generating photorealistic images via ray tracing. This computationally-intensive application has a well-known benchmark test. Two of my daughters are nascent computer artists, and fast rendering is important for their work and enjoyment.
| Opteron (64-bit Linux) | run time (optimized) | size (optimized) | compile time (optimized) | compile time (-O0) |
| gcc 4.0.0 20040915 | 35:28 | 1,458,654 | 2:35 | 1:09 |
| gcc 3.4.3 20040915 | 33:24 | 1,457,032 | 1:23 | 0:49 |
| gcc 3.3.5 20040915 | 38:12 | 1,369,130 | 1:12 | 0:43 |

| Pentium 4 (32-bit Linux) | run time (optimized) | size (optimized) | compile time (optimized) | compile time (-O0) |
| gcc 4.0.0 20040915 | SEG FAULT | 1,398,843 | 1:37 | 0:50 |
| gcc 3.4.3 20040915 | 33:57 | 1,375,796 | 1:25 | 0:47 |
| gcc 3.3.5 20040915 | 36:01 | 1,304,528 | 1:12 | 0:47 |
| icc 8.1 20040803Z | 32:38 | 1,878,096 | 1:25 | 0:41 |
On the Opteron, GCC 4.0 loses some ground to version 3.4 in terms of code generation. A more serious problem is the segmentation fault that occurs on my Pentium 4 system for code compiled with 4.0. The SIGSEGV occurs for any code compiled with -O1 (and -O2 and -O3) optimization; POV-Ray does compile and run correctly with all optimizations disabled. After a search of the GCC Bugzilla database and considerable experimentation, I have yet to identify the specific optimization that is giving the compiler indigestion. Time permitting, I'll dig for the answer to aid the GCC developers; should an answer present itself, I'll post the information.
In terms of production compilers, ICC is marginally better than any version of GCC on the Pentium 4.
LAME
MP3 Encoding
LAME is a popular tool for encoding digital music to MP3 format. For these benchmarks, I've used LAME to encode a 520MB sound file from WAV to MP3, sending the output to /dev/null to reduce the effect of file I/O.
| Opteron (64-bit Linux) | run time (optimized) | size (optimized) | compile time (optimized) | compile time (-O0) |
| gcc 4.0.0 20040915 | 5:52 | 494,134 | 1:08 | 1:08 |
| gcc 3.4.3 20040915 | 5:52 | 493,127 | 0:43 | 0:44 |
| gcc 3.3.5 20040915 | 6:09 | 411,454 | 0:36 | 0:36 |

| Pentium 4 (32-bit Linux) | run time (optimized) | size (optimized) | compile time (optimized) | compile time (-O0) |
| gcc 4.0.0 20040915 | 4:32 | 472,364 | 0:42 | 0:41 |
| gcc 3.4.3 20040915 | 4:18 | 456,612 | 0:41 | 0:41 |
| gcc 3.3.5 20040915 | 4:34 | 384,142 | 0:37 | 0:37 |
| icc 8.1 20040803Z | 4:43 | 524,911 | 0:29 | 0:22 |
For the Opteron, GCC 4.0 compiles LAME slowly in comparison to 3.4 (1:08 versus 0:43), but both compilers produce code that is faster than that generated by version 3.3 (5:52 versus 6:09). The Pentium 4 results show that GCC 4.0 has lost a bit of code generation quality in comparison to 3.4, but all versions of GCC out-perform ICC.
Coyote Benchmarks, 0.9.5
Number Crunching, Data Compression, and More
My benchmark suite is still in development, and isn't packaged as nicely as I'd like for general distribution. And I'm debating the final mix of algorithms; I may replace some well-known tests with esoteric challenges. If you want the benchmark source code, or have any questions about these tests, please e-mail me. The tests are:
- alma: Calculates the daily planetary ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos().
- evo: A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math.
- fft: Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math.
- huff: Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks.
- lin: Solves a large linear equation via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization.
- tree: Creates and modifies a large B-tree in memory; tests integer looping, and dynamic memory management.
| Opteron (64-bit Linux) | alma | arco | evo | fft | huff | lin | tree | composite | size |
| gcc 4.0.0 20040915 | 28 | 25 | 25 | 28 | 23 | 31 | 30 | 190 | 44,124 |
| gcc 3.4.3 20040915 | 44 | 25 | 25 | 28 | 24 | 30 | 38 | 213 | 44,181 |
| gcc 3.3.5 20040915 | 45 | 25 | 68 | 32 | 29 | 30 | 42 | 271 | 44,205 |

| Pentium 4 (32-bit Linux) | alma | arco | evo | fft | huff | lin | tree | composite | size |
| gcc 4.0.0 20040915 | 24 | 27 | 62 | 27 | 11 | 19 | 25 | 194 | 40,174 |
| gcc 3.4.3 20040915 | 41 | 34 | 63 | 28 | 18 | 19 | 28 | 229 | 37,115 |
| gcc 3.3.5 20040915 | 42 | 25 | 62 | 27 | 23 | 19 | 32 | 229 | 36,395 |
| icc 8.1 20040803Z | 13 | 20 | 27 | 31 | 17 | 19 | 29 | 156 | 114,360 |
On both test systems, GCC 4.0 is a distinct improvement over its predecessors. While GCC has not quite reached the performance of its commercial competitor, results on these benchmarks suggest that the new "tree-ssa" architecture holds great promise for improved code generation.
These "coyote" benchmarks provide an excellent example of the advantages of "open" software development. GCC 4.0 crashed with an internal compiler error, but the problem was easily solved -- I went to the bug list, found a pending patch, and applied it; no more bug! Of course, patching and recompiling source code is not something most casual users can do; my point is that the open process makes solving problems easier for those people who do possess the requisite skills.
SciMark 2.0
Scientific Number Crunching
SciMark 2.0 is a C benchmark invented by Roldan Pozo and Bruce Miller at the U.S. National Institute of Standards and Technology. Originally written in Java for the purpose of comparing virtual machine performance, the suite was translated into ANSI C. Bigger numbers result from faster code, as this benchmark reports results using MIPS (millions of instructions per second).
| Opteron (64-bit Linux) | FFT (MIPS) | SOR (MIPS) | MC (MIPS) | Sparse (MIPS) | LU (MIPS) | composite (MIPS) | optimized size |
| gcc 4.0.0 20040915 | 371 | 319 | 153 | 423 | 435 | 340 | 22,614 |
| gcc 3.4.3 20040915 | 359 | 314 | 157 | 458 | 493 | 356 | 22,544 |
| gcc 3.3.5 20040915 | 345 | 321 | 148 | 460 | 481 | 351 | 22,422 |

| Pentium 4 (32-bit Linux) | FFT (MIPS) | SOR (MIPS) | MC (MIPS) | Sparse (MIPS) | LU (MIPS) | composite (MIPS) | optimized size |
| gcc 4.0.0 20040915 | 280 | 465 | 89 | 424 | 431 | 338 | 18,042 |
| gcc 3.4.3 20040915 | 294 | 463 | 98 | 434 | 396 | 337 | 17,322 |
| gcc 3.3.5 20040915 | 337 | 412 | 152 | 874 | 801 | 515 | 16,848 |
| icc 8.1 20040803Z | 347 | 1009 | 521 | 572 | 1408 | 772 | 34,981 |
SciMark measures the performance of number-crunching code used in "typical" scientific and engineering applications. It consists of five computational kernels: a Fast Fourier Transform, a Gauss-Seidel relaxation, a sparse matrix-multiply, a Monte Carlo integration, and a dense LU factorization. The code is straight ANSI C, without any abstractions or the use of C++ features. I've found this benchmark reflects the performance I can expect in my own numerical applications.
This is a benchmark where Intel's compiler shines, producing code that runs much faster than what GCC emits. Some people suggest that Intel has "cooked" their compiler to specifically optimize well for a "known" benchmark program; I have no idea if that is true or not. I do know that Intel generates very fast programs for my non-benchmark numerical work.
Another area of concern is GCC's diminishing performance on this benchmark; version 3.3 is clearly superior to its descendants on the Pentium 4. Opteron results show a somewhat different story, with 3.4 winning over 3.3; GCC 4.0 exhibits noticeable code generation regressions as compared to 3.4. I'll look into this more as time permits.
Conclusions
So which compiler is better?
Like Einstein, I have to say the answer is relative. If you use systems based on the Pentium 4 architecture, Intel C++ is an excellent choice. If you need OpenMP, Intel is your only choice given GCC's lack of this feature. But, as I said before, Intel's compiler is not a drop-in replacement for GCC.
Compiling code is a complicated business, and it isn't humanly possible to write a perfect compiler that digests everything programmers throw at it. I do have some reports of code that compiles incorrectly with Intel C++; on the other hand, I have such reports and experiences with every compiler I've ever used, including GCC. While people complain that corporations are slow to fix bugs, the "pay up or do it yourself" attitude of many GCC developers can be equally frustrating. And not everyone is comfortable digging through bug databases looking for patches.
If anything, these tests show how free software rivals -- and sometimes exceeds -- the quality of its commercial counterparts. Perhaps GCC's greatest strength is its cross-platform portability; it is arguably the most ubiquitous piece of software in the world, running on everything from mainframes to embedded systems. For obvious reasons, Intel's compiler is specific and limited to their processors.
As for the religious war over free and proprietary software: I've written a few million lines of code over the decades, and only under GNU/Linux have I had source code for my compiler. Perusing the GCC source can be very educational -- but in the end, many developers don't have the time to do compiler hacking when they're trying to write code for customers. So long as GCC exists and is free, I don't see any problem with companies like Intel (and Borland) producing closed-source tools we can use to develop free projects.
Your mileage and religious fervor on this issue may, of course, vary. I'm just glad we have a choice when it comes to development tools.
"Choice" is the key word here -- choice is good, be it in democracy or software. Intel provides a useful alternative to GCC for development on ia32 systems. One compiler might have a great environment for developing GUI code; another compiler might generate fast code. GPL-like freedom may be important -- or not -- as individual circumstances dictate.
As always, I look forward to considered comments.
-- Scott

