

Performance measurements

In this chapter we present some benchmarks of our changes. It should be stressed that optimization itself was not the major goal of our project, which concentrated primarily on preparing GCC for easy integration of new optimizers; still, improving the generated code was the main motivation and we implemented several new optimization passes, so we should test the benefits.

Methodology

Andreas Jaeger has kindly tested our code on the industry-standard SPECint2000 benchmark suite. It contains several commonly used programs--for instance the Perl interpreter, an old version of GCC itself, the chess-playing program crafty, the gzip and bzip2 compression programs, and more. Detailed information about the product can be found at http://www.spec.org.

The SPECint2000 suite is highly memory bound, as the programs usually operate on data sets of about 200MB in size and the code segments are also quite large, making these results a conservative estimate of our contribution.

Since we integrated our changes into the mainline GCC tree, which contains changes from many other developers, and also merged all changes happening in mainline into our development tree, there is no version of GCC without our changes that is directly comparable with our cfg-branch version.

We therefore decided to benchmark only the benefits of the profile-based optimizations, as these can be easily disabled. This is again just a conservative approximation of our work.

We also benchmarked GCC with two different sets of options: -O2 -march=athlon, commonly used by developers, and with aggressive optimizations enabled, -O3 -fomit-frame-pointer -march=athlon -funroll-all-loops -fstrict-aliasing -malign-double -fprefetch-loop-arrays. These flags were chosen to exercise as many features of GCC as possible, not to produce the best performing code, and they have been used by Andreas' automated tester to monitor GCC performance for almost a year now.

Results

The relative results for each SPECint2000 benchmark are shown in figures [*] and [*]. Each benchmark has four values, all relative to the performance of mainline GCC with profile estimation disabled: two for the mainline tree and two for the cfg-branch. Each tree is benchmarked first without profile feedback (i.e. with the profile guessed by our branch prediction methods), referred to as ``static'', and then with profile feedback, referred to as ``profile''.

Andreas used an AMD Athlon 1.133GHz system with 496MB of memory running the GNU/Linux operating system for benchmarking.

Figure: SPECint2000 comparison with standard flags. (results.ps)
Figure: SPECint2000 comparison with aggressive optimization. (results2.ps)

As can easily be seen, with profile feedback both mainline and the cfg-branch perform consistently better than with estimation disabled, bringing speedups of up to $ 14\%$. In the geometric mean over all benchmarks, profile feedback improves performance by about $ 3\%$ in the mainline and by about $ 4\%$ in the cfg-branch.

The static profile estimators are successful enough to produce better code on average than no prediction at all, but for crafty, parser and twolf we lose. By studying the analyze_branches output for these benchmarks we found that in the first two cases the hitrate is only about $ 63\%$. Crafty contains common loops that iterate exactly once, and in parser the majority of pointers are NULL. Both cases correspond to a single function in the source program. We may try to teach developers to use the __builtin_expect feature to help the compiler get good performance in such cases.

Twolf is special: the hitrate of our branch predictor is very good--$ 78\%$--so we may want to investigate what really goes wrong. This may be just a random effect of code layout changes, such as a cache coloring conflict. On modern architectures like the Athlon it is very difficult to get consistent results in all cases because of the number of factors affecting final performance.

We shall also mention that the cfg-branch aggressive optimization results already contain our new loop optimizer replacement, which makes them less comparable to mainline. The feature set of our new implementation is much smaller than that of the old unroller, but as can be seen from the results, it works better than the old code in the majority of cases. We started work on the new unroller because we saw that the old code had to be replaced, but we did not expect our unroller to outperform the old code so soon.

Without estimation we measured $ 386.00$ SPECints, with estimation $ 391.82$ SPECints (a $ 1.3\%$ improvement), and with profile feedback $ 402.08$ SPECints (a $ 4.1\%$ improvement). The majority of the benefit can already be measured in the mainline tree, as most of our new optimizations are not enabled.

With aggressive optimization we measured $ 399.22$ SPECints without estimation, $ 411.75$ with estimation (a $ 3.1\%$ improvement) and $ 415.92$ with profile feedback (a $ 4.1\%$ improvement).

The $ 1.3\%$ improvement for static profile estimation with standard settings may look small, but we shall mention that it is already more than the benefit derived from the instruction scheduling or loop optimization passes, which have many times higher complexity and (in the first case) compilation time requirements.

A benefit of $ 4.1\%$ compares realistically to the improvements for profile feedback reported by other compiler teams. The DEC Alpha compiler team reported that their longer and more involved approach brought a $ 17\%$ speedup on the SPEC95 testsuite. The majority of their benefit ($ 10\%$) came from function inlining heuristics we could not implement yet, since a profile cannot be gathered at the Abstract Syntax Tree level. Once the work done on the ast-branch is integrated we plan to focus on it, but we cannot implement it at the moment without duplicating the work being done by other developers.

Other major benefits reported by the DEC team came from the tracer ($ 3\%$) and the code reordering pass ($ 4\%$), which we implemented too, but our measured benefits are lower ($ 0.3\%$ and $ 1.7\%$) even though we implemented more sophisticated algorithms. This can be explained by architectural differences between the Alpha and Athlon chips. The Athlon is designed to execute code optimized for the different architecture of Intel's chips and has a lot of built-in optimization logic, making it much less responsive to (and dependent on) compiler optimizations. It also has on-chip scheduling, which eliminates the major benefits of the tracer, and the i386 architecture has much more compact code, which partly eliminates the benefits of code layout algorithms. These reasons make us believe that the benefits will be much higher on other platforms. For instance, profile estimation is practically a requirement for proper compilation on modern EPIC architectures, making our contribution more appealing in the near future. Unfortunately we cannot run benchmarks on such machines at the moment.

To illustrate how nontrivial it is to reach a speedup in an optimizing compiler, we shall mention that according to the results of Andreas' tester, GCC 3.0.0 improved over GCC 2.95 by only $ 3\%$ after two years of intensive effort. GCC 3.1.x will likely be about $ 6\%$ faster, and we believe an important portion of that is due to our changes. The difference between the -O1 and -O3 optimization levels is only $ 6.4\%$, as the collaborative result of 25 new optimizations enabled at that level.

In this light we see our results as a success.

Jan Hubicka 2003-05-04