
Summary

We believe that in our two-semester effort we accomplished our original goal and greatly exceeded it, as our infrastructure improvements allowed us to implement several successful optimization passes (and a few not so successful ones, see the discussion in [*] and [*]) in a shorter time than before. The overall benefits we benchmarked are competitive with the results reported by other compiler teams.

While the original reorganization of the compiler to use the CFG was done mainly by Jan Hubicka in the early stages of the project, the other participants (Zdenek Dvorák, Josef Zlomek and Pavel Nejedlý) were able to familiarize themselves with the necessary parts of the compiler in less than one month and use the CFG infrastructure to develop new optimization passes, significantly improve GCC's ability to predict the profile statically, provide a framework for measuring the efficiency of individual branch prediction heuristics, and increase the usability and robustness of the basic block profiler implementation, finding and fixing latent GCC implementation problems where needed. They also provided important feedback on the design of the CFG code and its documentation, and implemented several extensions (such as the new natural loop tree representation, accurate debugging output support routines and robust profiler code).

A number of optimizations were implemented, and we believe most of them are successful -- among the most important are the register allocation changes, code placement and the new loop unrolling code. All of the new passes are significantly shorter and easier to maintain than the older GCC code.

For instance, the CFG code is shared across all optimizations. The code layout and basic block duplication module is implemented on top of it and is reused in several optimization passes (block reordering, the loop optimizer, the tracer). We implemented the natural loop discovery code in 1100 lines and the loop body duplication code, used by loop peeling, unrolling and unswitching, in 900 lines. Our loop unroller is just 700 lines long, as opposed to the 4000 lines of monolithic unrolling code in the old loop optimizer, which additionally relied on 3000 lines of instruction chain duplication code borrowed from the function inlining module. Our new loop optimizer, even though it is feature-wise much poorer than the old code, already outperforms the old code on the SPEC2000 benchmark at a much lower code size expense, and we got loop peeling and unswitching de facto for free.
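To illustrate the underlying algorithm (this is not the actual GCC source, just a self-contained sketch with a made-up example CFG), natural loop discovery works the way the following program does: an edge is a back edge when its destination dominates its source, and the loop body is collected by a backward walk from the latch that stops at the header. Dominators are computed here by a simple iterative data-flow pass; the real code uses a faster algorithm but the same definitions.

/* Sketch of natural loop discovery on a tiny hard-coded CFG.  */
#include <stdio.h>

#define N 6                        /* number of basic blocks, entry is 0  */
static const int succ[N][2] = {    /* example CFG, -1 means no successor  */
  {1, -1}, {2, 4}, {3, -1}, {1, -1}, {5, -1}, {-1, -1}
};

static unsigned dom[N];            /* dom[b]: bit i set iff block i dominates b */

static void
compute_dominators (void)
{
  int changed = 1, b, i, s;

  for (b = 0; b < N; b++)
    dom[b] = b == 0 ? 1u : (1u << N) - 1;
  while (changed)
    for (changed = 0, b = 1; b < N; b++)
      {
        unsigned d = (1u << N) - 1;

        /* Intersect the dominator sets of all predecessors of B.  */
        for (i = 0; i < N; i++)
          for (s = 0; s < 2; s++)
            if (succ[i][s] == b)
              d &= dom[i];
        d |= 1u << b;
        if (d != dom[b])
          dom[b] = d, changed = 1;
      }
}

static void
mark_loop (int header, int latch)
{
  unsigned body = (1u << header) | (1u << latch);
  int changed = 1, i, s;

  /* Walk backwards from the latch, never passing through the header.  */
  while (changed)
    for (changed = 0, i = 0; i < N; i++)
      for (s = 0; s < 2; s++)
        if (succ[i][s] >= 0 && succ[i][s] != header
            && (body & (1u << succ[i][s])) && !(body & (1u << i)))
          body |= 1u << i, changed = 1;
  printf ("loop: header %d, latch %d, body bitmap 0x%x\n", header, latch, body);
}

int
main (void)
{
  int b, s;

  compute_dominators ();
  for (b = 0; b < N; b++)
    for (s = 0; s < 2; s++)
      if (succ[b][s] >= 0 && (dom[b] & (1u << succ[b][s])))
        mark_loop (succ[b][s], b);   /* back edge b -> succ[b][s] */
  return 0;
}

On the example CFG above the program reports the single loop with header 1, latch 3 and body {1, 2, 3}, exactly the natural loop of the back edge 3 -> 1.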

We have found our implementations to be significantly shorter than those present in Open64 and competitive with the Impact compiler, the other two large compiler projects with publicly available sources. We also believe our implementations are easier to understand, but we should probably leave that judgment to an independent reader.

We have successfully merged major infrastructural changes into the GCC mainline tree. A significant amount of work is already present in the 3.1 branch, to be released in 5 days (April 15th). Version 3.1 will be the first official GCC release supporting profile feedback and doing a limited amount of profile-based optimizations -- register allocation, code alignment and simplistic basic block reordering.

More changes have been merged into the development tree after it was unfrozen last month, and only some of the new optimizations are still waiting to be integrated. Merging of the remaining changes is scheduled after the 3.2.0 release, because the GCC maintainers are now focusing on pushing out the release.

Our special thanks go to Richard Henderson, one of the most active GCC maintainers, who kindly reviewed the majority of our patches and approved them for inclusion into mainline, provided useful comments and ideas, and even developed and improved our code in areas we were unsure about; and to Andreas Jaeger, who tested and benchmarked our work using the SPEC2000 benchmark suite.


Future Plans

While our project is officially about to finish, we plan to continue working on GCC. Of course we plan to maintain the changes we made, attempt to merge the rest of them into the mainline GCC tree for the next major release (3.2), and fix problems as they arise (we expect a number of them on the more exotic embedded platforms we cannot test directly).

As shown earlier, our results can be seen as satisfactory, but some optimizations do not perform as well as we hoped. The most important is the tracer, which brings much smaller speedups ($ 0.3\%$) than reported by the DEC Alpha compiler team ($ 3\%$). Partly this can be attributed to the differences between the Athlon and Alpha architectures, as discussed already, but partly it can be due to the architecture of GCC itself. Many optimization passes are basic block based and thus do not benefit from the extended superblock size. We want to update as many of the optimizations as possible to be either superblock based or global.
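The following made-up fragment (not taken from any benchmark) illustrates the problem. Before tracing, the join block has two predecessors, so a pass limited to a single basic block cannot combine the hot-path computation with the join block; after the tracer duplicates the join block, the hot path forms a superblock that a superblock-based or global pass could optimize as a whole.

int
before_tracing (int p, int a, int b, int c)
{
  int x, y;

  if (p)           /* B0: conditional branch, assume 90% taken          */
    x = a + b;     /* B1: hot path                                      */
  else
    x = c;         /* B2: cold path                                     */
  y = x + b;       /* B3: join block with two predecessors; a pass that
                      only looks inside one block cannot combine B1 and
                      B3 along the hot path                             */
  return y;
}

int
after_tracing (int p, int a, int b, int c)
{
  int x, y;

  if (p)
    {              /* hot trace B0-B1-B3': a superblock, single entry   */
      x = a + b;
      y = x + b;   /* a superblock or global pass can now fold this to
                      y = a + 2*b without looking at the cold path      */
    }
  else
    {              /* the cold path keeps its own copy of the join block */
      x = c;
      y = x + b;
    }
  return y;
}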

We also plan to test the superblock scheduler, as it is one of the passes deriving the most benefit from trace formation. We have not yet integrated a working prototype into the source tree, and we want to see whether some speedups can be gained from it. The Impact compiler uses exactly such a scheme with great success on the Itanium architecture. We will attempt to merge the superblock scheduler into the mainline in case the DFA branch, which targets a more powerful global scheduler, is not able to deliver it for the next release.

On the AMD Athlon, GCC's current scheduler implementation brings about a $ 0.9\%$ speedup. This is much lower than on other chips, because the Athlon has an on-chip scheduler that mostly replaces the compiler pass. We expect that more global scheduling can bring larger benefits, as the on-chip scheduler is not able to perform it.

Important extensions to our loop optimizer pass are also desirable. We should re-implement the strength reduction pass and add more sophisticated unrolling heuristics and induction variable discovery. We also plan to add dependency analysis and array prefetch code generation. This work needs to be closely synchronized with the effort on the AST branch, as the loop optimizing framework should be distributed between the high-level optimizers available there and the low-level RTL optimizers we are working on. For instance, we do not plan to re-implement the induction variable discovery code in the full strength present in the old loop optimizer; instead we will use the high-level optimizer to canonicalize loops to make our job easier. At the high level, some important information is available (such as whether the counter overflows) that we cannot derive from the RTL representation, so the low-level optimizers cannot do as good a job as the high-level ones in this case.
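The following made-up example sketches the division of labor we have in mind: the high-level optimizer canonicalizes the loop into a simple counted form, using information such as the absence of counter overflow that is still available at the tree level, so the RTL unroller only has to deal with a single canonical induction variable.

void
original_loop (int *a, int n)
{
  int i;

  for (i = 0; i < n; i++)          /* induction variable i, exit test i < n */
    a[i] = 0;
}

void
canonicalized_loop (int *a, int n)
{
  int *p = a;
  unsigned count = n > 0 ? n : 0;  /* trip count precomputed at the tree
                                      level, where the absence of overflow
                                      is known                              */

  while (count--)                  /* single canonical count-down counter;
                                      the RTL unroller only needs to divide
                                      the trip count by the unroll factor   */
    *p++ = 0;
}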

An important step will be extending our profile implementation to the tree representation, which is necessary to implement function inlining heuristics without requiring the user to profile the program twice. Profile-based inlining has been reported as the most successful optimization among the profile-based changes made by the DEC Alpha compiler team. Again, we need to wait for the AST branch to mature enough to make this task possible.

Similarly, once an inter-procedural framework is available, we plan to extend our profile estimation to the inter-procedural level, allowing us to predict which functions belong to the hot spots of the program and which do not. This has been reported as reliable in the paper by Wu and Larus [4].
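For reference, the core of the scheme described by Wu and Larus [4], as we understand it, combines the predictions of the individual branch heuristics using the Dempster-Shafer rule, so that two heuristics predicting the same edge taken with probabilities $ p_1$ and $ p_2$ yield the combined probability $ p = p_1 p_2 / (p_1 p_2 + (1-p_1)(1-p_2))$. The resulting local block frequencies are then propagated over the call graph, the inter-procedural (global) frequency of a block being its local frequency multiplied by the estimated invocation count of its function.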

A lot of work needs to be done on the mid-level RTL representation. While our prototype works reliably on the i386 machine, the other machine descriptions need to be updated to handle it. We also need to implement some corner cases, such as string representation and function calls, which we do not represent in mid-level RTL yet. We plan to continue working on making mid-level RTL more high-level to simplify the RTL generation pass, and to add multiple passes that gradually lower it, similarly to the SGI MipsPro compiler design. For instance, we would like to have array subscripts represented directly in the mid-level RTL in the early stages to simplify dependency analysis.
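A made-up example of why we want the subscripts kept explicit:

void
explicit_subscripts (int *a, int n)
{
  int i;

  for (i = 0; i < n; i++)
    a[i] = a[i + 4];   /* with the subscripts explicit, dependency analysis
                          only has to compare i and i + 4 and sees a constant
                          distance of 4 between the read and the write       */
}

/* After conventional lowering the same loop is roughly just a pair of memory
   references through incremented pointers, something like *q = *(q + 16)
   with q advanced by 4 bytes each iteration, and the constant distance has
   to be reconstructed from the address arithmetic, which often fails.       */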

An interesting direction for future research would be implementing path profiling in GCC. Path profiling is a scheme able to measure whether the outcome of a given branch depends on the path by which it has been reached. In case it does, the code can be duplicated (specialized) to assist the hardware branch predictor and improve optimizations. Path profiles are also easier to update after other code specializing optimizations, such as loop unrolling and loop peeling. Similarly, we plan to add code to measure histograms of the number of iterations of each loop, so we can better decide which loops to peel or unroll and by how much.
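As a concrete illustration of the loop histograms (a hypothetical instrumentation sketch, not the existing gcov counter format), each instrumented loop would get a small counter array indexed by the number of iterations a single execution of the loop performed:

/* Sketch of loop iteration histogram instrumentation: bucket k counts how
   often the loop ran exactly k iterations, the last bucket collects the
   rest.  */
#include <stdio.h>

#define HIST_SIZE 16
static unsigned long loop_hist[HIST_SIZE];   /* one array per instrumented loop */

static void
record_iterations (unsigned long iters)
{
  loop_hist[iters < HIST_SIZE - 1 ? iters : HIST_SIZE - 1]++;
}

static int
work (int n)                      /* the loop being instrumented */
{
  int i, sum = 0;
  unsigned long iters = 0;

  for (i = 0; i < n; i++)
    {
      sum += i;
      iters++;                    /* inserted counter update     */
    }
  record_iterations (iters);      /* inserted at the loop exit   */
  return sum;
}

int
main (void)
{
  int k;

  for (k = 0; k < 100; k++)
    work (k % 5);                 /* iteration counts 0..4       */
  for (k = 0; k < HIST_SIZE; k++)
    if (loop_hist[k])
      printf ("%d iterations: %lu times\n", k, loop_hist[k]);
  return 0;
}

A histogram peaked at small iteration counts would suggest complete peeling of the loop, while a flat histogram with large counts would suggest unrolling instead.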

We also plan to replace the GCC register move pass by extending our register coalescing pass, and to implement more optimizers. Our existing optimizers will also definitely benefit from some extra tuning and benchmarking, especially on non-i386 architectures.

Important cleanups to the control flow graph representation are possible as well. For instance, once all passes are updated to use the control flow graph information, we can unlink independent basic blocks from the instruction chain. We should also avoid having notes and jump tables present between the basic blocks. These are high-volume changes, so we plan to synchronize them with mainline development to avoid having to maintain a large patch for too long.
