GCC 4.0

A Review for AMD and Intel Processors


by Scott Robert Ladd, 2 May 2005


 

Charles Darwin

If you find this article useful, please consider supporting the author's free software efforts with a donation, no matter how small.

If any one piece of software is the foundation of Free Software, it is the GNU Compiler Collection. The release of version 4.0 in mid-April brings many changes and new features. In this review, I compare the newly-released 4.0 with 3.4.3, using a few real-world applications in C and C++.

I won't be discussing Objective-C, Java, or Ada, since I don't use those aspects of GCC. I will be talking a bit about the new Fortran 95 compiler, though.

Another item I won't be covering herein is Intel's (or anyone else's) commercial compiler. At this point in time, information from older articles holds true, in that Intel's compiler generally produces faster code than does any version of GCC. I've found that Intel's EM64T compiler works just fine on my Opteron systems, and its code is also excellent. But this article is about the evolution of GCC, leaving debates over commercial products for other times and places.

GCC Overview


GNU's Compiler Collection

GCC (which includes C, C++, Objective-C, Fortran, Java, and Ada compilers) is arguably the most important tool for the creation of free software; without a free-as-in-speech and -as-in-beer compiler, it is unlikely that Linux (and perhaps BSD and OS X) would exist. I have an abiding interest in the quality of GCC. In all fairness, I do some peripheral and probably insignificant work on GCC as time permits, so I am not an entirely unbiased observer.

On 20 April 2005, GNU released GCC 4.0.0. The decision to make a major version change was handed down by the GCC Steering Committee after considerable and heated debate. Making a major version number change sends a signal to users that something "big" has happened. The last jump from 2.x to 3.x — in 2001 — signaled a major change in GCC's development process and maturity. Going to 4.x is a more subtle change, being more evolutionary and less revolutionary.

Among the new features are a Fortran 95 compiler and the "Tree SSA" optimization framework. The C++ compiler is now faster, particularly on code heavy with templates; the Standard C++ library has improved stream and strings classes, and it includes preliminary new classes from the recent C++ Technical Report 1 (an update to Standard C++).

The internal changes may not be obvious to most GCC users, but they mark a significant architectural change with great potential. "Tree SSA" is about the future, providing an infrastructure for improved code generation. Much of that potential has yet to be realized, as you'll see from the benchmarks below.

Fortran 95

Even GCC seems to treat the new Fortran 95 compiler, gfortran, as an afterthought. It gets only a single paragraph in GCC's NEWS file, for example. This isn't surprising, given that the GCC user community is, by vast majority, a collection of C and C++ programmers. But the introduction of a Fortran 95 compiler will have a tremendous impact on many users who've relied on (or been stuck with, as the case may be) the old g77 compiler.

Programmers still use an awful lot of Fortran in scientific and engineering applications. Vast libraries of time-tested code provide a reliable foundation for numerical computations. On "supercomputers", Fortran remains the language of choice for a substantial number of applications. Unfortunately, the only free-as-in-liberty Fortran compiler, GNU's g77, was mired in decades-old technology, missing many improvements found in later standards. If you doubt Fortran's importance, consider that there are more companies producing commercial Fortran compilers than are selling C and C++ compilers.

Fortran 95 is a modern, modular programming language with elegant facilities for parallel processing and array manipulation. Now that GCC supports a recent standard (there is also a Fortran 2003, with "object-oriented" features), this will help those institutions who've wanted to modernize their code, but have been locked into either g77 or a commercial compiler. While gfortran still lacks certain mainstream features, such as automatic parallelization, it does bring a good free implementation of Fortran 95 to those who need it.

Test Methods

I performed testing on my two Linux systems, as described below. I use the Gentoo GNU/Linux distribution. Your performance may (and likely will) vary in some small details, based on your hardware, installed libraries, and Linux installation.

Corwin (Homebrew)
Gentoo AMD64 GNU/Linux, kernel 2.6 SMP
Dual Opteron 240, Tyan K8W 2885
120GB Maxtor 7200 RPM ATA-133 HD
2GB PC2700 DRAM
Radeon 9200 Pro, 128MB, HP f1903 DFP

Tycho (Homebrew)
Dual Boot: Windows XP Professional
Gentoo x86 GNU/Linux, kernel 2.6 SMP
3.06GHz Pentium 4 w/HT, Intel D850EMV2, 533MHz FSB
2x80GB Maxtor D740X 7200 RPM ATA-100 HD
512MB PC800 RDRAM
Radeon 9200 Pro, 128MB, NEC FE990

No matter which compiler options I choose, someone is likely to send me e-mail telling me I got it all wrong. Ranging from the polite to the insistent to the rude, these e-mails contain contradictory suggestions for producing fast code. In the vast majority of cases, such anecdotal assertions lack any formal proof of their validity, and, more often than not, the suggested "improvement" is ineffective or detrimental. One example: Many people insist that I use -mmmx and -msse options when specifying -march=pentium4 for GCC — when, in fact, the -march=pentium4 option implies those special instruction sets. For -march=opteron, -mfpmath=sse is the default. I appreciate help from the audience, but some people need to give their favored settings a reality check. The GCC documentation has recently improved, and now has a much more accurate list of implied options.
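Claims about implied options are easy to check rather than argue about. GCC predefines macros such as __MMX__, __SSE__, and __SSE2__ when the corresponding instruction sets are enabled, so a few lines of C can report what a given -march setting actually turned on. What follows is only a minimal sketch: the function name simd_features and the HAS_* constants are made up for this illustration and are not part of GCC.

```c
#include <assert.h>

/* Hypothetical bit flags for this illustration only. */
enum { HAS_MMX = 1, HAS_SSE = 2, HAS_SSE2 = 4 };

/* Report which x86 SIMD extensions the compiler enabled for this
   translation unit, based on GCC's predefined macros. On non-x86
   targets none of the macros exist and the result is 0. */
unsigned simd_features(void)
{
    unsigned mask = 0;
#ifdef __MMX__
    mask |= HAS_MMX;
#endif
#ifdef __SSE__
    mask |= HAS_SSE;
#endif
#ifdef __SSE2__
    mask |= HAS_SSE2;
#endif
    /* Enabling SSE2 also enables SSE, so SSE2 alone would be a surprise. */
    assert(!(mask & HAS_SSE2) || (mask & HAS_SSE));
    return mask;
}
```

If -march=pentium4 really does imply the MMX and SSE options, building this once with -march=pentium4 alone and once with -mmmx and -msse added should give the same mask both times.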

Some folk may object to my use of -ffast-math — however, in numerous accuracy tests, -ffast-math produces code that is both faster and more accurate than code generated without it. Yes, -ffast-math has other aspects that make for interesting debate; however, such discussions belong in another article. If you don't use -ffast-math, you're ignoring many of your processor's most powerful features.

This article is not a comparison of the Pentium 4 and Opteron processors; my test systems are far too different for any such comparison to have meaning. And please do not ask me to test on systems I don't own, unless you're willing to send me hardware.

I built both GCC versions using the following options:

  • Opteron:
    --enable-shared --enable-threads=posix --enable-__cxa_atexit --disable-checking --enable-languages=c,c++,f95 --disable-multilib
  • Pentium 4:
    --enable-shared --enable-threads=posix --enable-__cxa_atexit --disable-checking --enable-languages=c,c++,f95

For the benchmark compiles, the pertinent options I selected were:

  • Opteron:
    -march=opteron -ffast-math -O3
  • Pentium 4:
    -march=pentium4 -mfpmath=sse -fomit-frame-pointer -ffast-math -O3

Some folk may quibble with the choices above, pointing to my own Acovea project as proof that -O3 is not necessarily the best choice in all cases. I discuss Acovea analysis elsewhere; for the purpose of the benchmarks, however, I wanted to use GCC as most users will, by selecting the "best" optimization, which is -O3.

The Benchmarks

Reality is, alas, somewhat less than ideal; benchmarks are quite subjective, prone to interpretation, and rarely show a clear picture. Benchmarking is always a tricky business, especially when it comes to compilers: A reviewer selects a limited suite of benchmarks that demonstrate specific aspects of code generation, thus predicting general compiler performance from a limited data set. Not terribly scientific, to be sure.

So why do benchmarks at all? Because we can still learn something about the relative performance of different tools, by comparing results in a controlled environment. Benchmarks are guidelines, not absolute answers. And to be valid, benchmark source code must be available, and the testing conditions clearly stated.

In the case of a compiler, code generation benchmarks give us an empirical comparison of products that serves to guide our choices of tools. If I'm developing a number-crunching application, I appreciate knowing that compiler "A" produces faster code than compiler "B". In my experience, benchmarking serves as a guide, a filter that shows trends and identifies areas of concern.

Each benchmark is accompanied by a pair of tables containing performance data for Opteron and Pentium 4 tests. I've highlighted the "best" value in each column in green; a red value is the "worst" result in a column.

I've chosen the following set of benchmarks based on the types of work I do and my desire to analyze the computational "muscle" of code generated by different compilers. Based on experience and reader response, I may change the benchmark suite from time to time.


POV-Ray 3.6.1
Ray tracing

POV-Ray is a venerable tool for generating photorealistic images via ray tracing. Written in C++, this computationally-intensive application has a well-known benchmark test. Two of my daughters are nascent computer artists, and fast rendering is important for their work and enjoyment.

 

Opteron (64-bit)      optimized    optimized    compile time    compile time
                      run time     size         -O3...          -O0
gcc 4.0.0             37:11        1,421,469    1:25            0:43
gcc 3.4.3             34:19        1,454,262    1:27            0:49

Pentium 4 (32-bit)    optimized    optimized    compile time    compile time
                      run time     size         -O3...          -O0
gcc 4.0.0             34:36        1,355,394    1:41            1:07
gcc 3.4.3             35:30        1,273,697    1:47            1:13

 

On the Opteron, GCC 3.4.3 produced the faster program (34:19 versus 37:11), although by the figures above the GCC 4.0 binary is slightly smaller. I tried a number of combinations with GCC 4.0, looking for (and failing to find) a magic set of options to improve performance.

The Pentium 4 has the opposite story in terms of performance: The GCC 4.0-generated program is faster than the one created by GCC 3.4.3. And GCC 4.0 had marginally-faster compile times, too.
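To put those differences into numbers, the run times in the tables can be converted to seconds and compared as percentages. The sketch below is illustrative arithmetic only; the helper names mmss_to_seconds and percent_change are invented for this example. Fed the POV-Ray run times above, it works out to GCC 4.0 being roughly 8.4% slower than 3.4.3 on the Opteron (2231 versus 2059 seconds) and about 2.5% faster on the Pentium 4 (2076 versus 2130 seconds).

```c
#include <assert.h>
#include <stdio.h>

/* Convert a "minutes:seconds" string such as "37:11" into seconds.
   Returns -1 if the text does not have that shape. */
int mmss_to_seconds(const char *text)
{
    int minutes = 0, seconds = 0;

    if (sscanf(text, "%d:%d", &minutes, &seconds) != 2)
        return -1;
    if (minutes < 0 || seconds < 0 || seconds > 59)
        return -1;
    return minutes * 60 + seconds;
}

/* Percentage by which `candidate` differs from `baseline`.
   A positive result means the candidate took longer. */
double percent_change(int baseline, int candidate)
{
    assert(baseline > 0);
    return 100.0 * (candidate - baseline) / baseline;
}
```

For example, percent_change(mmss_to_seconds("34:19"), mmss_to_seconds("37:11")) gives the Opteron figure, with GCC 3.4.3 as the baseline.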


LAME 3.96.1
Music Encoding

LAME is a popular tool, written in C, for encoding digital music to MP3 format. For these benchmarks, I've used LAME to encode a 520MB sound file from WAV to MP3, sending the output to /dev/null to reduce the effect of file I/O.

 

Opteron (64-bit)      optimized    optimized    compile time    compile time
                      run time     size         optimized       -O0
gcc 4.0.0             6:21         364,455      41.4            16.9
gcc 3.4.3             5:57         350,742      35.2            16.8

Pentium 4 (32-bit)    optimized    optimized    compile time    compile time
                      run time     size         optimized       -O0
gcc 4.0.0             4:01         333,610      49.0            40.9
gcc 3.4.3             3:48         317,123      36.0            21.8

 

When compared to GCC 3.4.3, GCC 4.0 takes longer to produce larger and slower programs from the LAME source. This is true on both test systems.


SciMark 2.0
Scientific Number Crunching

SciMark 2.0 is a C benchmark invented by Roldan Pozo and Bruce Miller at the U.S. National Institute of Standards and Technology. Originally written in Java for the purpose of comparing virtual machine performance, the suite was translated into ANSI C. Bigger numbers result from faster code, as this benchmark reports results using MIPS (millions of instructions per second).

 

Opteron (64-bit)      FFT     SOR     MC      Sparse   LU      composite   optimized
                      MIPS    MIPS    MIPS    MIPS     MIPS    MIPS        size
gcc 4.0.0             365     321     176     450      471     357         23,737
gcc 3.4.3             365     320     159     457      500     360         23,071

Pentium 4 (32-bit)    FFT     SOR     MC      Sparse   LU      composite   optimized
                      MIPS    MIPS    MIPS    MIPS     MIPS    MIPS        size
gcc 4.0.0             360     454     165     636      864     496         17,042
gcc 3.4.3             352     454     164     764      1157    578         17,007

 

SciMark measures the performance of number-crunching code used in "typical" scientific and engineering applications. It consists of five computational kernels: a Fast Fourier Transform, a Gauss-Seidel relaxation, a sparse matrix-multiply, a Monte Carlo integration, and a dense LU factorization. The code is straight ANSI C, without any abstractions or the use of C++ features. I've found this benchmark reflects the performance I can expect in my own numerical applications.
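For readers who haven't met these kernels, here is roughly what the relaxation one looks like. This is a minimal sketch written for illustration, not SciMark's actual source; the function name sor_sweep and its parameters are assumptions. It performs one in-place successive over-relaxation sweep over the interior of a grid, and with omega set to 1.0 it reduces to a plain Gauss-Seidel update. Tight loops of floating-point arithmetic over arrays like this one are exactly where code generation quality shows up.

```c
#include <assert.h>

/* One in-place SOR sweep over the interior points of a rows x cols
   grid stored row-major. Boundary values are left untouched.
   omega is the relaxation factor; 1.0 gives plain Gauss-Seidel. */
void sor_sweep(double *grid, int rows, int cols, double omega)
{
    assert(rows >= 3 && cols >= 3);

    const double quarter_omega = omega * 0.25;
    const double keep = 1.0 - omega;

    for (int i = 1; i < rows - 1; ++i) {
        double *up = grid + (i - 1) * cols;
        double *row = grid + i * cols;
        double *down = grid + (i + 1) * cols;

        for (int j = 1; j < cols - 1; ++j) {
            /* Blend the four-neighbour average with the old value. */
            row[j] = quarter_omega * (up[j] + down[j] + row[j - 1] + row[j + 1])
                   + keep * row[j];
        }
    }
}
```

A benchmark would call a routine like this many times over a larger grid and divide the operation count by the elapsed time.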

Again we see GCC 4.0 losing to the "older" compiler, on both Opteron and Pentium 4 systems.


Linux 2.6.11.8
The Kernel

Linux is a large piece of code, and Linux developers have been some of the most vocal critics of GCC's compile time and code size. Does GCC 4.0 address these concerns? While the kernel does not lend itself to timed performance tests, it's certainly possible to see how quickly different versions of GCC compile the code base. I downloaded the complete kernel tarball from kernel.org immediately prior to making these tests, and timed the duration of a make bzImage.

GCC 4.0 did compile a working kernel; I'm running it even as I type this sentence. The road to success was a bit bumpy; GCC 4.0 produces more warnings than did its predecessors, and I had to make trivial changes in two function prototypes from the file include/linux/i2c.h. GCC 4.0 objected to a declaration like this:

extern int i2c_transfer(struct i2c_adapter *adap, struct i2c_msg msg[],int num);

I replaced the above with the following:

extern int i2c_transfer(struct i2c_adapter *adap, struct i2c_msg * msg,int num);
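One plausible reading of the complaint, offered as an assumption since the diagnostic itself isn't quoted here: at the point of the prototype, the structure has only been forward-declared, and C does not allow an array whose element type is incomplete, even in a parameter list. The pointer form carries no such restriction. The sketch below uses a hypothetical struct msg and count_nonempty function, not kernel code, to show the pattern that compiles.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration only: the type is incomplete at this point. */
struct msg;

/* Pointer form: valid C even while struct msg is incomplete.
   Writing the parameter as `struct msg m[]` here would declare an
   array of incomplete element type, which is presumably what the
   newer compiler refused to accept. */
int count_nonempty(const struct msg *m, int num);

/* The complete definition arrives later, as it often does in headers. */
struct msg {
    int len;
};

/* Count the messages whose length is greater than zero. */
int count_nonempty(const struct msg *m, int num)
{
    int n = 0;

    assert(m != NULL || num == 0);
    for (int i = 0; i < num; ++i)
        if (m[i].len > 0)
            ++n;
    return n;
}
```

Callers are unaffected by the change, since an array argument decays to a pointer either way.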

GCC 4.0's warnings were mostly "pointer targets in assignment differ in signedness". As for the test compiles themselves, I used my personal .config files for the target systems, so that the resulting kernel could be booted. I build a very system-specific kernel with few modules, as given by corwin.config and tycho.config.

 

Opteron (64-bit)      Compile    vmlinux.bin
                      time       size
gcc 4.0.0             6:25       5,247,984
gcc 3.4.3             5:54       5,182,448

Pentium 4 (32-bit)    Compile    vmlinux.bin
                      time       size
gcc 4.0.0             7:13       4,481,820
gcc 3.4.3             7:46       4,563,740

 

The results tell slightly different stories depending on the target system. For the Pentium 4, GCC 4.0 was quicker, producing a smaller kernel than did GCC 3.4. The Opteron system had the opposite result, with GCC 4.0 taking a bit more time and generating a larger kernel.


Conclusions

Is GCC 4.0 better than its predecessors?

In terms of raw numbers, the answer is a definite "no". I've tried GCC 4.0 on other programs, with similar results to the tests above, and I won't be recompiling my Gentoo systems with GCC 4.0 in the near future. The GCC 3.4 series still has life in it, and the GCC folk have committed to maintaining it. A 3.4.4 update is pending as I write this.

That said, no one should expect a "point-oh-point-oh" release to deliver the full potential of a product, particularly when it comes to a software system with the complexity of GCC. Version 4.0.0 is laying a foundation for the future, and should be seen as a technological step forward with new internal architectures and the addition of Fortran 95. If you compile a great deal of C++, you'll want to investigate GCC 4.0.

Keep an eye on 4.0. Like a baby, we won't really appreciate its value until it's matured a bit.

As always, I look forward to considered comments.

-- Scott






© 2008
Scott Robert Ladd
All rights reserved.
Established 1996


The grey-and-purple dragon logo, the blue coyote logo, Coyote Gulch Productions, Itzam, Evocosm, and Acovea are all Trademarks of Scott Robert Ladd.
