Linux Number Crunching
Benchmarking Compilers and Languages for ia32
by Scott Robert Ladd
18 May 2003
22 August 2004: In the last year, this article has become somewhat obsolete, given new releases of compilers. I'm working on a more extensive article that uses a combination of tests to compare performance. This original article remains online for historical purposes.
If you're the impatient sort who needs to see the "conclusions" before reading the details, please feel free to link immediately to The Ultimate Truth (or not!)
In January 2003, I published a numerical benchmark for three programming languages -- C++, Fortran and Java -- using both free and commercial compiler for Linux. That article, based on a single benchmark, generated considerable interest; based on reader responses and my own requirements, I've expanded the benchmark suite and investigated new topics (such as accuracy). This article is the result; I will be updating the material here as new releases, tools, and platforms become available.
Standard Caveats
I do not own an AMD-based system, and thus can not test it. While I'm grateful for offers to run the tests on someone else's AMD system, I'm prefer to keep my testing in-house, so I can verify results. People are free, of course, to download the benchmark source code and see for themselves how the code performs on their systems.
If a compiler does not appear, I do not have it to test. I can't run code on a nonexistent AMD system, and I can't test compilers I don't own. I'm a small, independent developer and writer with a family to feed and a limited budget. I am willing to accept any and all freebies, however! ;)
I do not plan to run benchmark tests for Windows compilers. The topic simply doesn't interest me! I am not anti-Windows, and I'm not taking part in the Linux-versus-Windows war. I've done considerable development under Windows, and numeric performance has never been a critical issue in document management systems, library cataloging systems, or .Net programming. When I do hardcore number-crunching, it is on a Unix-style system.
My standard benchmark caveat applies: Theoretically, benchmarks should provide clear, unequivocal information that guides people in making choices about software and hardware. Reality is, alas, somewhat less than that ideal; benchmarks are easily skewed, prone to interpretation, and rarely show a clear picture. Benchmarking is always a tricky business, especially when it comes to compilers. The entire process is subjective: A reviewer selects a limited suite of benchmarks that demonstrate specific aspects of code generation, thus predicting general compiler performance from a limited data set. Not terribly scientific, to be sure, and prone to interpretation and second guessing by the author's audience.
So why do benchmarks at all? Because we can still learn something about the relative performance of different tools, by comparing their performance in a controlled environment. Benchmarks are guidelines, not absolute answers. And to be valid, benchmark source code must be available, and the testing conditions clearly stated.
| Table 1: Benchmark Systems | ||
| characteristic | Copernicus | Tycho |
| Motherboard | IBM 6889 | Intel D850EMV2 |
| Processors | dual Pentium III 600MHz |
single Pentium 4 2.8GHz, HT |
| RAM | 384MB (PC100) | 512MB (PC800) |
| FSB | 100MHz | 533MHz |
| Hard Drive | 9GB SCSI | 80GB ATA/100 |
| GNU/Linux distro | Debian "sid" | Debian "sid" |
| Linux Kernel | 2.4.20 SMP | 2.5.66 SMP |
| glibc version | 2.3.1 | 2.3.1 |
| binutils | 2.13.90 | 2.13.90 |
Test Systems
I used two Linux-based systems for these tests, as shown in Table 1. Just prior to writing this article, I reinstalled Linux on Copernicus using Debian 3.0r1 and the "unstable" "sid" archives. Astute readers will note that Tycho is running a "development" kernel; it's a long story involving bizarre circumstances that are beyond the scope of this review.
Speed vs. Accuracy
The conventional wisdom states that fast code is also inaccurate code -- in other words, trhe optimizations that increase speed also reduce accuracy.
Methods
I ran all tests from the bash prompt, in text mode, no X server, with a minimum of daemons present.
For C, C++, and Fortran 95, I provide two results: one for a "speed" optimized compile, and another using switches that ensure the greatest accuracy and speed. Each test was run ten times, and the fastest times are reported in the tables, along with the options used for that specific test
If you see any obvious errors or problems in my selections of options, please let me know. These compilers are terribly complex tools, with dozens of potentially-pertinent options that interact in myriad combinations. The GCC compilers are particularly "flexible", and I've not had much luck getting clear advice from the GCC development team about the most effective combination of options. I have been using a genetic algorithm to "evolve" an optimal set of optimizations for a given test; this work is still rather theoretical, however, and I want to be sure of my methodology before making any absolute pronouncements.
To obtain a test time, I run the applications 10 times, and report the average, mean, and standard deviation for that set of runs. Benchmarks use language-specific timing functions that enclose the inner loop; start-up and initialization time is ignored. I report times to the nearest second; even with minimal system overhead, the standard deviation was usually high enough to mask the value of reporting finer measurements.
How do I know that the programs are producing the correct output? I have unit-test versions of each benchmark that verify the results; to clarify the code, I remove the unit tests with a Python script before running final numbers. The benchmark source code published here is the "clean" version.
A few people have complained that I should report results using graphs, not tables of numbers. In my experience, graphic comparisons can be very misleading; I prefer to see raw numbers.
Parallelism and Hyper-Threading
Recent web articles (Tom's Hardware, Infoworld, Personal Computer World) talk about Intel's "Hyper-Thread-enabled compilers" without explaining exactly what that means. Intel's compilers support Hyper-Threading through automatic parallelization and OpenMP. The former involves analyzing code for loops that can be divided between available processors; OpenMP is a standard for explicitly defining the parallelization of code through the use of pragmas and directives. This isn't the best place for me to expound on the nature of SMP, Hyper-Threading, and related topics -- if you need a technical introduction to these technologies, you'll find excellent articles at Ars Technica and 2CPU.
Writing parallel code is an art form found in scientific computing centers and national laboratories -- but it is rather scarce in the business world. Parallel code means different things on different platforms; the code I write for a Beowulf cluster is distinctly different from that used in multithreading an office application for Windows. Current "threading" methodologies, such as those enshrined in Java, focus on allowing a program to smoothly perform background tasks without interfering with the performance of user-interface. True "parallel" code, on the other hand, is aimed at breaking complex computation tasks into separate execution units that produce a composite result.
For this article, parallel code enters the picture as a tool for making number crunching faster.
Benchmark Descriptions
This article, like most others on this site, reflects the answer I found when asking a question. In this case, the question is: "Which programming language provides the fastest tool for number crunching under Linux?" Given all the hype, arguing, and general misinformation that accompanies such questions in public forums, I decided to find the answers for myself by developing a simple benchmark that could easily be translated between platforms and programming languages.
The original number-crunching article used a single benchmark, alambench, based on limited time and my focus at that moment. For this review, I've added three more benchmarks: fftbench, huffbench, and
almabench
The project that started this whole process is my development of a high-performance tool for computational astronomy. At its core, the library needs to work well on both workstations and high-performance clusters; this led me to investigate several technologies.
The vast majority of extant astronomical code is in one of two programming languages: Fortran 77 or Basic. Fortran 77 is the backbone of scientific computing; it provides a very powerful and reliable tool backed by years of experience and time-tested libraries of powerful mathematical code. Object-oriented programming adds very little to the core functionality of a number-crunching application -- and performance is adversely affected by the overhead entailed in objects, exception handling, and other object-oriented facilities. And in many cases, function-oriented code exceeds the clarity of corresponding object-oriented code. There's not much point in using objects to define a trigonometric function (like sin or atan2). ;)
Basic, in many forms, is common in articles published in popular astronomy magazines and books aimed at amateurs, based on inaccurate and out-of-date ideas about the availability of Basic and its accessibility for amateur programmers. I've tried convincing magazines and publishers to consider Java as a demonstrative language, to no avail.
Thus was born almabench, a program that calculates ephemeris for eight planets (excluding Pluto) using a mathematical series. Essentially, the program computes the Right Ascension (RA) and Declination (Dec), producing a set of numbers similar to those found in publications like the Astronomical Almanac. For almabench, I compute these values using an algorithm invented by a group of French Astronomers; you can find a citation in the benchmark source code. I use a different algorithm in my production library, based on the Jet Propulsion Laboratory's highly-accurate ephemeris data. JPL's data was painstakingly developed from years of observation, and requires an external data file. For almabench, I wanted to eliminate file I/O entirely. Regardless of the programming language used, almabench imports only one thing from standard libraries: a few mathematical functions (sin, cos, fabs, atan2, etc.). Good compilers optimize these functions to native processor instructions.
(I did, of course, include console I/O functions while testing various implementations of almabench, so I could verify my code. All such I/O is commented out, however, for the purpose of testing.)
Again, the point of almabench is to determine floating-point performance; any other functionality obscures that goal. The benchmark uses double-precision math exclusively. While games may use a lot of single-precision math, almost all scientific applications live and die by the performance of double-precision calculations.
I found it remarkably easy to translate almabench into Fortran 95, C++, and Java. The C++ code does not use any object-oriented or exception-handling features of C++; essentially, this is a C program with minor C++ convenience features. I chose Fortran 95 over Fortran 77 because the former is more modern, cleaner, and better suited to certain applications. I can't think of any special comments for the Java version. If you have suggestions for improving or optimizing the source code, please let me know. The important matter is that the programs remain semantically identical.
fftbench
huffbench
lpbench
subtilis
This is my accuracy benchmark, based on ideas expressed by the William Kahan, a prominent expert...
At the moment, subtilis includes two tests: Kahan's quadratic test, and my own simple trigonometry test.
Future versions of subtilis will add more accuracy tests.
Future Benchmarks
I'm working on a few more benchmark programs, when time permits. One of the forthcoming tests will implement a problem in fluid dynamics; another will simulate the flight of an asteroid through the solar system using an N-body algorithm.
Results & Analyses
I've divide this review into language-oriented sections, with a final section collecting all the data into a single table for comparing all of the products reviewed
|
Table 2: C and C++ run time (in seconds) ( lower is better) |
|||||||||
| compiler | options | almabench | fftbench | huffbench | lpbench | ||||
| P4 | P3 | P4 | P3 | P4 | P3 | P4 | P3 | ||
| GNU gcc/g++ v3.3 |
speed | ||||||||
| accuracy | |||||||||
| Intel v7.1 () |
speed | ||||||||
| accuracy | |||||||||
C and C++
Finally, I'll point out that Intel's C++ compiler produces code that is more than twice as fast as that emitted by GNU's g++. The Intel compiler performs automatic vectorization of loops, and it appears to have superior intrinsic implementations of trigonometric functions. See my comparison of gcc and Intel C++ for more of my deliberations on this topic; in general, gcc is very competitive with Intel.
As you can see from Table 3, using OpenMP directives nearly doubles the speed of almabench on a dual-processor system. On the Hyper-Thread-enabled Pentium 4 computer, OpenMP code was almost 30% faster than its serial equivalent. Neither result surprised me; this is exactly what OpenMP should accomplish. Hyper-Threading shows itself to be a "Good Thing" -- but it is not the equivalent of real multiple processors.
It does not appear that Intel's automatic parallelization (the -parallel option) was able to divide almabench's code into separate tasks. This is not surprising; in general, the best parallel code is hand-tuned by a human who understands the algorithm and its dependencies.
|
Table 3: Java run time (in seconds) ( lower is better) |
|||
| compiler | options | almabench | |
| P4 | P3 | ||
| Sun v1.3.1_06 |
speed | ||
| accuracy | |||
| Sun v1.4.1_01 |
speed | ||
| accuracy | |||
| Sun v1.4.2 beta |
speed | ||
| accuracy | |||
| IBM v1.3.1 |
speed | ||
| accuracy | |||
| IBM v1.4 |
speed | ||
| accuracy | |||
| Blackdown v1.3.1_02 |
speed | ||
| accuracy | |||
| Blackdown v1.4.1_01 |
speed | ||
| accuracy | |||
| GNU gcj v3.3 |
speed | ||
| accuracy | |||
Java
I recompiled almabench with the corresponding javac for each JDK, so the JVMs were executing code generated by their corresponding compiler. The number of Java options is rather daunting, reminding me of the heady days when I could review a dozen C compilers for a programming magazine like Dr. Dobb's Journal. There's a lot of competition in the world of Java -- and that's a good thing.
Java's performance is the result of a change between versions 1.3 and 1.4. The latest specification allows the Math.* functions to use their StrictMath equivalents by default; StrictMath implements well-defined algorithms to produce identical results, independent of platform. Those algorithms, however, do not support processor-specific instructions that substantially improve performance. Given Java's focus on portability, the existence of StrictMath is important -- although it does result in a classic trade of performance for consistency.
The problem with Java's performance is not my code or my lack of Java skills -- the real problem is that Java 1.4 is slow. Both the Sun and IBM JVMs lost significant performance in the move from 1.3.1 to 1.4, whether due to a new language requirement or other factors. My faith in Java is severely shaken when applications lose significant performance by "upgrading" to the current release of Java. Portability involves more than operating systems and chips; the change to StrictMath changed the meaning of programs as well as their performance.
Given the nature of the problem, my conclusions about Java stand -- albeit softened due to the performance of IBM's solid product. By Sun's own definition, JDK 1.3.1 is obsolete; the fact that it performs better than the most current JDK is indicative of a serious problem with Sun's "improvements" to their language. Since Java 1.4.1 is what Sun is promoting, that is what I base my conclusion on. I can say that IBM's product is superior, and have already set it is my default JDK. It's no wonder Sun is upset about IBM "usurping" Java -- IBM is producing better tools.
Some people asked if my Java results were biased by the amount of time it takes to load and initialize the JVM. I've tested several empty and near-empty applications; a "Hello, world" program, for example, loads and runs in less than 2/10ths of a second -- hardly significant. The start-up time increases with the number of classes loaded during execution -- but almabench.java has no significant imports. At the behest of some readers, I timed Java execution using the System.currentTimeMillis() function -- and again, it showed that start-up time was less than a quarter of a second. I'll continue to use the external time for testing, given its consistency across languages and the trivial impact of JVM start-up time.
I did not include any commercial Java compilers. Most, like the oft-cited Excelsior JET, are Windows-only; this article is about Linux. I don't benchmark Visual C++ or C# for the same reason (although I will look at Mono and C# some time in the future). The free version of Borland C++ does not include a complete optimizer, so I don't think it "fits" in this review.
|
Table 4: Fortran 95 run time (in seconds) ( lower is better) |
|||
| compiler | options | almabench | |
| P4 | P3 | ||
| Intel v7.1 () |
speed | ||
| accuracy | |||
| Lahey
v6.5 |
speed | ||
| accuracy | |||
Fortran 95
I am looking at other Fortran compilers for future updates. As for GNU g77 -- I wrote the code in Fortran 95 because I find Fortran 77 to be annoying. I wrote piles of Fortran 77 back in my CDC and VAX days, but these days I'm writing for environments where Fortran 95 is more appropriate. Believe it or not, Fortran 95 is a very clean, orderly language that eliminates many Fortran 77 idiosyncrasies while adding features important for high-performance coding.
Combined Results
Intel's C++ produced code that was just as fast as the Fortran! One of my oldest assumptions about computing has fallen into darkness -- a hazard of actually verifying one's assumptions. I assume Intel is using the same code optimizer for both compilers, which accounts for the nearly-identical results.
In the original version of this review, gcj performed poorly. Several readers provided further insight that improved gcj's performance dramatically, to the point where it compares quite favorably with interpreted byte code. Not that Sun's Java is all that impressive, of course. I was not surprised to see that, compiled C++ and Fortran is at least twice as fast as Java byte code. Performance is very dependent on the chosen JVM; IBM's 1.3.1 implementation, for example, exceeds the performance of C++ code compiled with gcc! Intel-generated native code, however, rules the day over any of the Javas. Do not over-interpret my previous statement; you'll find more comments on this topic below.
Feedback
The original article generated an exceptional collection of interesting and helpful of responses. In this section, I'll run down the important points people made.
How am I timing the results? With the Linux time command. Table 3 reports the "real" value reported by time (the elapsed time of execution.) Embedding timers in the actual code is fraught with problems; for example, each language implements different time scales and abilities. I'm sure someone will tell me that time is full of problems too, but it works for me and is consistent across all programming languages.
Amid the barrage of Java-related comments, a few people actually noticed the Fortran code. Additional benchmarks are coming.
Conclusions
No benchmark tells the entire story of performance. I wrote this article to clarify specific issues surrounding numerical applications; it is a guide, not a declaration of truth. I will continue to use Fortran 95 for the astronomical project I'm developing; while C++ performed very well, Fortran offers intrinsic matrix operations, venerable libraries of tested numerical code, and powerful facilities for manipulating arrays. In the same vein, I continue to write Java for some of my clients, whose applications revolve around interfaces and databases, not numbers and matrices.
It is very dangerous to make assumptions about the affects of various compiler options; -O3 is not necessarily better than -O1, and "obvious" optimizations may be detrimental to code performance. documentation is often imprecise about how options interact and
I'm not impressed by Java's much-vaunted "run-time optimizations" -- at least when it comes to double-precision math. Timing individual loops shows that the JIT stops improving performance after the first couple of passes through the code. Even when the Intel compilers are forced to follow IEEE-754 specifications with the -mp option (approximating Java's partial IEEE restrictions), they still produce results that are much faster than Sun's latest Java machine. Older JVMs may be faster, but they aren't what Sun is selling, and they don't give you nice new features like java.nio. I've written many financial and engineering applications, and it has always been clear that Java is inferior to native code applications in terms of raw computational performance, when you have good native code compilers. I sincerely doubt Sun will pursue a highly-optimized numeric programming strategy for Java, given their focus is on business applications and communications.
On the bright side, IBM's implementation is superb. I have moved all of my in-house Java work and development to the IBM 1.4 JDK.
Beyond simple code performance, Java is generally ill-suited to the world of high-performance computing and Beowulf clusters. Java's threading mechanism is designed for improving the response of user interfaces; to perform well in a multiprocessor environment, something akin to OpenMP or MPI is required. Yes, I am aware of attempts at supporting OpenMP and MPI in Java -- but I see no point in building a supercomputer only to program it in an interpreted language. I've built interfaces and control programs in Java -- but the core number-crunching code remains in Fortran and C.
These benchmarks do not imply that Java is useless, or that Fortran is the world's best programming language, or that compiled languages are superior in all cases to byte-code interpreters. I would no more write a sophisticated graphical application in Fortran than I would analyze physics data in Java. Now, I would write a Fortran program to acquire and manipulate data, with a Java application for presenting the Fortran-generated results in graphical form. Java is a terrific tool for building user interfaces; I enjoy working with Swing, and am in the midst of developing a large vertical-market database application in Java. Be that as it may, I'm exploring number crunching in this article, not buttons and mouse movements.
For a long time, I strongly advocated changing Java to support numerical work. In recent months, I've decided that I was mistaken. We already have several good general-purpose programming languages; programmers would best be served by the refinement of existing tools for their featured environments. Rather than create a vast gray-area of similar languages, we need tools for creating powerful and focused solutions. Better interoperability between languages is critical in a world where no one language can handle all programming tasks. Rather than change fundamental aspects of Java, Sun should improve Java's ability to communicate with C and Fortran. Someone needs to develop an efficient object model; SOAP is a fine effort, but its XML foundation is not efficient.
Depending on reader interest (and my time), I may extend this article to include benchmarks of I/O and integer math performance. I may also include other languages (particularly Python) as time and inclination allow.
As always, I look forward to considered comments.
-- Scott

