Goals and Motivation

The GNU Compiler Collection (GCC) is a major free-software (or open-source) compiler. It is an old and large project that gained a lot of popularity because of its simple retargetability and easy extensibility on the front-end side. Since the early 1980s, when the project officially started, it has been targeted to over 90 CPU architectures1.1 and many more operating systems, CPU flavors (the 32-bit and 64-bit versions of popular architectures are counted as a single target) and assembler/object file formats. Today GCC is commonly the compiler of choice for embedded targets and is used as the major compiler in a number of popular operating systems.

On the front-end side, in addition to C (currently mostly compliant with the C99 standard), GCC has been extended to support C++ (GCC 3.0 is one of the first compilers supporting the ISO C++ standard and the IA-64 exception handling specification), Ada, Fortran 77 (a Fortran 9x front end is being worked on), Chill and, in a separate tree, Pascal (both ISO Pascal and Borland Pascal-like dialects).

We found GCC development a challenging task, and since the sources are publicly available, we decided to participate in it. While choosing the goals of a GCC-related software project, we had to take into account a number of factors:

For years, the major problem of GCC was its limited extensibility in terms of new optimization passes. The classical optimization passes -- jump optimization, common subexpression elimination, the instruction combiner, the loop optimizer and the priority-based register allocator -- have evolved over the years into very powerful optimizers. For instance, the common subexpression elimination pass now does de facto global value numbering, and the instruction combiner utilizes a powerful algebraic simplifier of expressions. At the same time, the design has aged and further extensions have become difficult.

Adding new optimization passes was not very successful; in fact, only a few new passes were added in the last decade. One of the more successful additions was the scheduler pass in 1992, but it took a long time until it was adopted by machine descriptions.

In 1997 the GCC project went through a restructuring of its development model and became much more open, with a public CVS1.2 tree, weekly snapshots and a group of maintainers reviewing and applying patches. This new scheme allowed more rapid development, and some new optimization passes were integrated. A considerable fraction of these may be considered unsuccessful.

A nice example of these unsuccessful integrated optimization passes is the global common subexpression elimination pass (GCSE) dating from 1997. It produced code that required more registers and ran slower than without the new pass, which was attributed to the busy code motion algorithm used to implement partial redundancy elimination. Lazy code motion code came in 1998, but GCSE was still not powerful enough to change the performance of common benchmarks in a positive way. The problem was the limitations of register allocation and the limited alias analysis support in the compiler.

After the 1999 rewrite of the i386 back-end, used by all x86 platforms such as the i386, Pentium or Athlon, GCSE became mostly disabled for this platform, as it did not understand the more detailed intermediate representation of the back-end. In 2000 work started on Static Single Assignment (SSA)1.3 support in GCC, which was expected to make GCSE (and other optimizations) easier to implement, but this effort was mostly dropped after a failure to use SSA to implement a simple dead code removal pass. The intermediate language used by GCC at present, the Register Transfer Language (RTL), contains some constructs making an SSA representation of the program impossible1.4.

In 2001 GCSE was partially modified to handle the new representation of i386 instructions, and support for load/store motion was integrated. Even today in the official tree, GCSE is unable to do partial redundancy elimination on i386 instructions modifying flag registers (the majority of them), and the store motion implementation is disabled due to design problems, so the old pseudo-global CSE and GCSE are both run, resulting in a large compile-time overhead.

This cannot be attributed to a wrong design of GCSE, which by itself is a nice and clean piece of code, but to the fact that GCC contains almost no support for the optimization passes and that the RTL language used is too low-level, which makes the implementation of optimizers unacceptably difficult. (The author believes that RTL is a good intermediate language that fits perfectly the purposes of the late passes it was designed for, such as instruction combining, register allocation or instruction scheduling.)

In our project we decided not to concentrate on adding new optimization features, as that looked like a lost battle, but on modernizing the compiler itself. We believe that if things are done properly, a lot of the effort of implementing new optimizers can be saved. We had to find our place among the other ongoing projects of that time:

Loop optimizer rewrite, Michael Hayes
 
Michael was rewriting the old loop optimizer, which used information about loop constructs provided by the front end, and had problems because earlier passes mangled that information in such a way that the loop optimizer often had to skip the loop.

His new loop implementation uses natural loop discovery code on the control flow graph to find the loop constructs. He has contributed the natural loop discovery code and a new DU/UD analysis module, but has not yet finished the loop optimizer rewrite itself.

Mid-level RTL, Jeff Law
 
The Register Transfer Language, the intermediate program representation used by GCC, is extremely low-level. It allows describing many details of target platforms and is the source of GCC's easy retargetability; however, it is very complicated for optimizers to handle and makes it impossible to convert the program into the SSA form used by many modern algorithms. Jeff Law worked on a higher-level RTL that looked syntactically identical to the low-level RTL but hid all unnecessary target-specific features.

Abstract Syntax Tree optimizers, Diego Novillo
 
A recent rewrite of the C++ front end made it possible to keep the representation of a whole function in abstract syntax tree format, instead of lowering the function statement by statement to the RTL representation. This made it possible to write optimizations at the tree level, making the implementation of high-level optimizers much easier. Mark Mitchell has donated a new implementation of the function inlining code based on trees, and Red Hat and CodeSourcery developers have been working on modifying the C front end to use the same representation.

Diego has started a separate development branch aiming to implement an infrastructure for high-level optimizers -- a control flow graph over the tree representation. He plans to implement static single assignment form, alias analysis, dependency analysis, high-level loop optimizers and more.

DFA scheduler, Vladimir Makarov
 
Vladimir has implemented a new automaton-based hazard recognizer for the scheduler, allowing a much closer description of the target architecture. On a newly created branch, some machine descriptions have been converted to the new format. He now plans to implement a proper modern global scheduling pass.

New register allocator branch, Daniel Berlin and Michael Matz
 
Pending patents on Chaitin's graph coloring idea and on most other modern register allocation algorithms forced GCC to stay with the priority-based register allocator from the late 1980s. Daniel and Michael have implemented a new algorithm based on iterated register coalescing. The patent aspect of this work is currently being sorted out.

Several front-end and retargeting projects
 
Just for the sake of completeness, we should mention that several projects were active in the front-end area and several ports were in development. These basically did not influence our work, except for the Fortran 9x project, which introduced the notion of multiple entry points to a single function.

Concerning the control flow graph (CFG), one of the basic preexisting pieces of code needed for our project, each pass had its own CFG implementation until 1998. Some passes, such as reg-stack1.5, even did their own liveness analysis and nontrivial CFG transformations, such as critical edge splitting.

In 1998 Richard Henderson provided a new, more flexible CFG implementation for the data-flow module and converted register allocation and related passes to it. Several other passes, the scheduler and reg-stack among them, were converted later, but the majority of passes still did not use it and, even worse, invalidated it by modifying the instruction stream without updating the CFG data structure. Thus these structures had to be rebuilt before every use.

We decided to concentrate on pushing the control flow graph (CFG) data structure further into the GCC implementation and to attempt to increase the sharing of infrastructure across GCC passes. Just for illustration, before the project started, GCC contained eleven different implementations of a function to remove an instruction from the chain, and many places in the compiler did the work manually.

Since merely cleaning up half a million lines of code is a dreadful task, we decided to give our work a limited but important and hopefully fruitful goal -- pushing profile information (see chapter [*]) into the GCC optimization passes. The Impact, DEC and Intel compiler teams have reported that profile-based optimizations bring substantial speedups (see for instance [2]), and profile information is best stored directly in the CFG itself, making the effort of converting existing passes to the CFG more appealing.

Thus we decided to extend the lifetime of the control flow graph over the majority of optimization passes and to update the existing code for profiling applications so that the measured data can be read back into the compiler. Doing so required reimplementing some optimization passes, cleaning up others and, most importantly, developing an infrastructure that lets other GCC developers easily manipulate the CFG representation and the underlying RTL instruction chain.

To derive more fruit from our work, we decided to develop some new optimization passes based on the new CFG and profile code as a proof of concept. We also decided to redesign and improve the existing static profile estimation pass and the profiler itself, as described later in this document.

Previously, a team of compiler developers at Intel extended GCC to better support their Intel i960 chip. Their extensions included profile-based optimizations, trace scheduling and register allocation, to name the most important. To minimize the amount of changes, their design decision was to store the profile information inside the instruction stream itself instead of in the flow graph structure, which is in sharp contrast with our implementation. Intel's implementation was never integrated into the official GCC tree.

Our effort was to implement similar optimizations (and more), but in a form acceptable to the GCC maintainers. We created a branch called cfg-branch on the official CVS server holding the GCC sources, and all installed patches were sent to the official mailing list, so the GCC maintainers had a chance to briefly review the code and point out important problems, when present.

In our development protocol, all important patches were first reviewed by another developer on the internal mailing list, and once ready they were sent to the official mailing list and installed into the cfg-branch tree. Comments from other GCC developers and maintainers sometimes requested changes, which were integrated, and once the code had proven to be useful and working, it was submitted for integration into the mainline tree. This scheme is similar to the primary and secondary review model used by major GCC companies. Bug fixes were allowed to go into the tree without approval.

Due to the limited manpower of the GCC maintainers, the infrastructural changes had priority over new optimization passes, to allow other developers to use our work. Additionally, our tree was synchronized with the mainline tree on a weekly basis, allowing us to more easily prepare and test patches for the GCC mainline and avoiding the risk of conflicts with other development projects integrated into the mainline.

We also tested our changes by bootstrapping the compiler1.6, running the GCC testsuite and benchmarking daily using SPECint2000, with a more comprehensive testing run weekly.

Jan Hubicka 2003-05-04