Branch Predictions

Although profile feedback gives very accurate information about program behavior, it has one major drawback -- it is difficult to use. The majority of users are not willing to construct train runs for their programs and compile twice, and in some cases (such as interactive programs, programs whose behavior depends on configuration, or embedded systems) this is impossible.

An attractive alternative to profile feedback is static branch estimation. Ball and Larus [3] describe a set of simple heuristics that predict the direction of a conditional branch (taken or not taken) correctly in about $ 75\%$ of executed branches. This is a remarkable result, given that the upper bound (a perfect predictor magically knowing the actual program behavior in advance) is about $ 91\%$.

Wu and Larus [4] extended this method to estimate conditional branch outcomes (the probability that a branch is taken) and to propagate this information over the intra-procedural control flow graph, computing the expected number of executions of each basic block. They got satisfactory results, predicting the majority of hot spots in the program well.

The original branch prediction code based on the Ball and Larus heuristics was integrated into GCC as part of the IA-64 porting project by Stan Cox and Jason Eckhart. It later turned out to be seriously broken in our experiments: it predicted loop branches in the wrong direction, resulting in only about $ 55\%$ successful predictions compared to $ 75\%$, and it implemented just a subset of the predictions described in the paper.

Infrastructure

The goal of our implementation was to allow the front-end and early optimization passes to participate in the branch prediction process. This differs from the Ball and Larus approach, which used the final assembly exclusively for the heuristics, most probably because the sources of the compiler used were not available to the researchers. Because more information is available to the compiler than can easily be derived from the resulting code, we can predict better. We can also reuse the analyzers implemented in existing optimizer passes.

Our approach has three major passes. In the first pass, a number of heuristics are tried on each conditional jump, and if some of them predict the direction of the jump (taken or not taken), the information is stored in the conditional jump instruction itself.

In the second pass, we combine all the gathered information into a single value, the outcome -- the estimated probability that the given conditional jump will be taken.

In the last pass, we propagate the information into block coverage (see chapter [*]), constructing data equivalent to those measured by the profiler. Optimization passes may then handle estimated and feedback-based profiles equivalently.

Implementing and Verifying Heuristics

We added the ability to attach a REG_BR_PRED note to each conditional branch in the RTL instruction stream. The note contains two values -- the identifier of the predictor algorithm, for later analysis, and the outcome the branch is supposed to have.

Each compilation pass run before the profile estimation pass may freely predict any conditional jump in the code. For instance, Ball and Larus suggest predicting a conditional jump around a loop as not taken. This heuristic comes from the observation that most compilers copy the test of while(cond)-type loops in front of the loop construct and that loops tend to iterate. In our implementation, the loop test duplication pass simply predicts each test it copies, resulting in better characteristics of this particular heuristic.

We found it difficult to tell whether a given heuristic is good or not. To allow easy evaluation of new heuristics, we implemented the tool analyze_branches to summarize information about the behavior of all GCC heuristics based on real program behavior read from profile feedback. It can be used as follows:

  1. compile program with instrumentation
  2. execute program on train set
  3. recompile program with profile feedback and debugging output (using -db command line option)
  4. run analyze_branches *.bp on produced files to summarize information about heuristics

The following information is gathered about each of the implemented prediction heuristics:

number of branches
Number of branches in the compiled programs the given heuristic applies to
dynamic hitrate
Probability that the direction of a given branch in the program is predicted correctly, weighted by the number of executions of each branch
predictability
The hitrate of a perfect predictor on the branches predicted by the given heuristic. Some heuristics predict well-predictable branches, while others predict data-dependent branches. This value represents the upper bound of the dynamic hitrate the given heuristic can reach.
coverage
Number of executions of the branches predicted by the given heuristic

It is obvious that, in order to be successful, a heuristic should have dynamic hitrate and coverage as high as possible.
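The two key metrics above can be illustrated by a small helper; the record layout here is illustrative, not the actual format analyze_branches reads:

```c
#include <assert.h>

/* Hypothetical per-branch record for one heuristic: how often the
   branch executed and how often the heuristic's predicted direction
   matched the actual outcome.  */
struct branch_stat
{
  long executions;  /* dynamic execution count of the branch */
  long correct;     /* executions where the prediction was right */
};

/* Dynamic hitrate: correct predictions weighted by execution counts,
   as a percentage.  Coverage is the total execution count.  */
double
dynamic_hitrate (const struct branch_stat *b, int n, long *coverage)
{
  long exec = 0, hit = 0;
  for (int i = 0; i < n; i++)
    {
      exec += b[i].executions;
      hit += b[i].correct;
    }
  *coverage = exec;
  return exec ? 100.0 * hit / exec : 0.0;
}
```

Note that the weighting matters: a heuristic that is right on one hot branch executed a million times contributes more to the dynamic hitrate than one that is right on a thousand cold branches.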

High Level Heuristics Support

Many run-time properties of a program can be guessed from its source code, but some of these properties are either completely lost or at least heavily mangled by translation into the low-level RTL representation. For example, use of the goto statement usually indicates some exceptional situation, like error handling; however, we cannot distinguish a jump created by a goto statement from any other unconditional jump in the program. We can easily detect this statement during RTL generation, but using the information immediately would be hard and cumbersome.

To enable the use of these properties, we instead pass the information to later stages by emitting a special note (NOTE_INSN_PREDICTION) into the insn stream. The note also carries additional information determining the kind of prediction and, optionally, some other attributes. In an early stage of the compilation (to avoid the carried information being influenced by optimizations), we transform these notes into REG_PREDICTION notes placed directly on the conditional jumps that control whether the predicted code is reached (this property can be expressed as being at the end of the nearest basic block that dominates the block containing the note but is not postdominated by it).

Combining Heuristics

Once all predictions are done, the prediction notes on each instruction must be combined into the overall outcome of the branch. When no heuristic applies to the branch, we predict it with outcome $ 50\%$; for purposes of debug output, we refer to this as the ``no prediction'' heuristic.

When a single prediction applies, the situation is simple as well. The situation is complicated when multiple heuristics apply to a single instruction. The method used in the Ball and Larus paper is to use the first matching heuristic in a fixed priority order. While this method has been proven to work pretty well, it is difficult to set up the order of heuristics, making development somewhat difficult. The Wu and Larus paper suggests using the Dempster-Shafer theory of evidence:

We may treat each prediction as a hypothesis about the statement that the given branch is taken. When there are two predictions with two different hitrates, $ p_1$ and $ p_2$, the combined probability can be computed as follows:

$\displaystyle p={p_1 p_2 \over p_1 p_2+(1-p_1)(1-p_2)}$

As the operation is associative, multiple predictions may be combined to get the final outcome. This method is elegant, but it has been mentioned as problematic in follow-up articles [10], as it expects the predictors to be independent of each other, which is, of course, not the case.
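The combination rule above translates directly into a small helper (an illustrative sketch, not GCC's actual code):

```c
#include <assert.h>
#include <math.h>

/* Dempster-Shafer combination of two branch-taken probabilities,
   as in the formula above: p = p1*p2 / (p1*p2 + (1-p1)*(1-p2)).  */
double
ds_combine (double p1, double p2)
{
  return p1 * p2 / (p1 * p2 + (1.0 - p1) * (1.0 - p2));
}
```

Two properties worth noting: $ 0.5$ is the neutral element (combining with a 50% predictor changes nothing), and combining two predictors that agree pushes the result toward certainty, e.g. combining $ 0.8$ with $ 0.8$ gives about $ 0.94$.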

Our method uses a combination of both approaches. For strong heuristics, such as the loop heuristics, we use the first-match method, and for weak heuristics we combine the outcomes. When at least one strong heuristic applies to the branch, we use the first one; when none applies, we use the Dempster-Shafer based method to combine the values. This appears to bring the best of both worlds.

The perfect solution is not possible, as there are too many heuristics to measure outcomes for all possible combinations. [10] suggests using a neural network based method to combine the values, but we decided not to use it due to the large complexity of the solution giving relatively little extra value. GCC needs to be understood by its developers, and we guess most of them are not experts in artificial intelligence.

Andreas Jaeger has set up a tester (http://www.sue.de/~aj/SPEC) that runs the SPEC2000 benchmark suite weekly on the current GCC tree and stores the results of analyze_branches in a log file. We use those results as hitrate values for the prediction combining algorithm. predict.def describes all heuristics, their names, hitrates and prediction combination methods.

The exact implementation is described in algorithm [*].


[Algorithm: Combining of branch predictions -- the pseudocode was lost in conversion; the surviving fragment shows that abnormal edges are assigned probability 0.]
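The combining pass just described can be sketched as follows, under simplified assumptions (the types and names are illustrative, not GCC's; strong predictions are assumed to arrive sorted by priority):

```c
#include <assert.h>
#include <math.h>

/* Illustrative model of one prediction note attached to a branch.  */
enum strength { STRONG, WEAK };

struct prediction
{
  enum strength strength;
  double hitrate;  /* measured probability the prediction is right */
};

/* Combine all predictions on one branch into a single outcome:
   first match among strong heuristics wins outright; otherwise
   weak heuristics are merged with the Dempster-Shafer rule,
   starting from the neutral 50% (``no prediction'') outcome.  */
double
combine_predictions (const struct prediction *p, int n)
{
  for (int i = 0; i < n; i++)
    if (p[i].strength == STRONG)
      return p[i].hitrate;

  double prob = 0.5;
  for (int i = 0; i < n; i++)
    {
      double q = p[i].hitrate;
      prob = prob * q / (prob * q + (1.0 - prob) * (1.0 - q));
    }
  return prob;
}
```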

Table [*] lists the heuristics we have implemented together with the parameters for the branch combining algorithm.


Table: Parameters of heuristics.

  name                    method  priority  hitrate
  Loop iterations         F/M     10        computed
  User supplied           F/M     9         98%
  Loop branch             F/M     8         89%
  Loop exit               F/M     7         90%/num_exits
  Continue                D/S     -         56%
  Noreturn                D/S     -         99%
  Loop header             D/S     -         64%
  Pointer                 D/S     -         81%
  Opcode positive         D/S     -         79%
  Opcode nonequal         D/S     -         71%
  Floating point opcode   D/S     -         90%
  Call                    D/S     -         70%
  Goto                    D/S     -         70%
  Constant return         D/S     -         95%
  Negative return         D/S     -         96%
  NULL return             D/S     -         90%
  Early return            D/S     -         67%


Heuristics implemented

High level heuristics

High level heuristics are those supplied by the front-end and based on the source language. Currently we have extended only the C and C++ front-ends to do so, but in the future we plan to add heuristics to the Java and Fortran front-ends as well, to handle, for instance, array bounds checking code or exception handling.

goto

The Linux kernel commonly uses goto statements to enter rarely executed paths of code. It can be argued that a good programmer usually uses the goto statement only to handle exceptional situations, for example error handling and recovery. Branches ending with goto statements are predicted as not taken.
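The kernel-style idiom in question looks like the following (an illustrative example, not actual kernel code); the conditional branches leading to the gotos are the cold path:

```c
#include <assert.h>
#include <stdlib.h>

/* Typical goto-based error handling: on the hot path the gotos are
   never taken, so branches ending with goto are predicted not taken.  */
int
process (int fd, char **bufp)
{
  char *buf = malloc (64);
  if (!buf)
    goto fail;          /* predicted not taken */
  if (fd < 0)
    goto fail_free;     /* predicted not taken */
  *bufp = buf;
  return 0;             /* hot path */

fail_free:
  free (buf);
fail:
  return -1;            /* cold, rarely executed path */
}
```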

Continue

Continue statements create inner loops; these loops generally behave differently from real loops -- they tend to roll less. Our experiments show that a loop constructed by a continue statement rolls approximately $ 1.1$ times, while normal loops roll about $ 9$ times. Disqualifying the loops constructed by continue statements from the loop heuristics allows us to increase the hitrate of those heuristics.

Negative value return

Negative values are used in C to represent error states, so functions usually do not take a path leading to a return statement with a negative constant.

NULL return

NULL plays a similar role for pointers as negative constants do for integers, so a path leading to a return statement returning the NULL constant is predicted as not taken.

Constant return

The remaining constants are also returned less frequently than computed values, so a path leading to a return statement with a constant not predicted by any of the heuristics above is predicted as not taken.

User supplied

As a GCC extension, the built-in function __builtin_expect allows the programmer to explicitly state the outcome of a given conditional.
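__builtin_expect takes the expression and its expected value; the common idiom wraps it in likely/unlikely macros (the macros are a widespread convention, not part of GCC itself):

```c
#include <assert.h>

/* __builtin_expect (expr, expected) evaluates to expr while telling
   the compiler which value to expect; the double negation turns any
   truth value into 0 or 1.  */
#define likely(x)   __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)

int
checked_div (int a, int b)
{
  if (unlikely (b == 0))   /* user-supplied prediction: not taken */
    return 0;
  return a / b;
}
```

The user supplied heuristic has the highest hitrate after loop iterations in the table above (98%), which is why it is applied first-match ahead of the measured heuristics.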

Loop Based Heuristics

The edges belonging to natural loops are among the most predictable edges in the program.

The Ball and Larus paper suggests predicting all edges leaving a loop as not taken, using a single loop heuristic based on the simple observation that a program must loop in order to do useful work and consume CPU time.

We have decided to split this single heuristic into multiple special cases in order to increase the hitrate on common loop branches, which in turn increases the predicted number of iterations in each loop, allowing us to optimize more aggressively.

Loop Branch

The loop branch heuristic predicts the last branch in a natural loop as taken, so that the loop rolls. We have split it from the loop exit heuristic below because we measured different behavior of this very last branch compared to earlier exits: this particular branch usually belongs to the loop test statement written by the programmer, while the others are commonly caused by break statements in the loop body.

We bypass this heuristic for edges marked by the continue heuristic, as discussed earlier.

Loop Exit

Any early exit from a natural loop is predicted as not taken. We divide the probability of this prediction by the number of exits of the loop to avoid making the predicted number of iterations dependent on the number of exits, as that resulted in predicting very few iterations for common loops and producing poor code.

Loop Iterations

GCC contains code to analyze loops and compute the number of iterations of well-behaved loops with an iteration variable and constant bounds. In such cases, we use the loop iterations heuristic, specifying the exact probability of the exit test.
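For a loop known to iterate $ n$ times, the exit test succeeds once per $ n$ executions, so its probability can be stated exactly (a simplified model; the function name is illustrative, not GCC's):

```c
#include <assert.h>
#include <math.h>

/* For a loop with a compile-time-known iteration count, the exit
   test is taken once out of n executions, so the back edge is taken
   with probability (n-1)/n and the exit with probability 1/n.  */
double
exit_test_taken_probability (long n_iterations)
{
  return 1.0 / (double) n_iterations;
}
```

This is why the table above lists the hitrate of the loop iterations heuristic as ``computed'' rather than as a fixed percentage.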

Nonloop Heuristics

Loop header

In the case of while loops, the test conditional is duplicated in front of the loop to eliminate the jump into the body and to allow more loop optimizations. The resulting branch is usually not taken, as loops tend to iterate. We predict branches created during loop exit test duplication as not taken.

Unlike the Ball and Larus heuristic, we do not mark arbitrary branches around loops, as the explicit conditionals are often limiting checks. This allows us to increase the hitrate of the predictor at the expense of reducing its coverage.

Pointer

For pointers a and b, we predict a==b and a==NULL to be false. The negations of these conditions are predicted to be true.

Positive

Most values in a program are positive, so for an integer a, we predict a>0, a>=0, a>=1 and a>=-1 as true. The negations of these conditions are predicted to be false.

Opcode nonequal

Similarly to pointers, even for integers a==b is usually false. We disqualify comparisons against 0 and 1 from this heuristic, as those constants are used in boolean expressions, which would artificially lower the hitrate of the heuristic.

Floating point opcode

We predict floats to be non-equal and non-NaN.

Call

As suggested by the Ball and Larus paper, we predict a conditional jump around a call as taken, as calls are commonly used for error reporting.

Early Return

We predict branches causing the function to return before reaching the last basic block, and not predicted by the constant/negative/NULL return heuristics, as taken. Ball and Larus suggest the reverse heuristic, claiming that early returns are used for error handling; but we take care of error handling with the high level heuristics, which allowed us to reverse the direction of this one. Our heuristic can take care of fast paths through a function, such as in a hash table lookup, where a single condition usually guards an infrequently executed loop.
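The hash table lookup pattern mentioned above looks roughly like this (illustrative code, not from GCC): the guarding condition usually lets the function return immediately, so the early return is the hot path and the loop behind it runs rarely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct entry
{
  const char *key;
  int value;
  struct entry *next;
};

/* Fast path: most buckets are empty, so the early return is taken
   far more often than the chain-walking loop below it executes.  */
int
lookup (const struct entry *bucket, const char *key, int *value)
{
  if (!bucket)                  /* early return: predicted taken */
    return 0;
  for (; bucket; bucket = bucket->next)   /* infrequently executed */
    if (!strcmp (bucket->key, key))
      {
        *value = bucket->value;
        return 1;
      }
  return 0;
}
```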

Estimating Profile

Once the probabilities of all edges in the CFG are known, we produce the estimated profile. In an acyclic graph, this is easily done by giving the entry node frequency 1 and walking the graph downward, computing the frequency of each node as the sum of the frequencies (once known) of its predecessors multiplied by the probabilities of the corresponding edges.
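The acyclic propagation can be sketched on a tiny diamond-shaped CFG (entry branches to A with probability p and to B with 1-p, both rejoin at exit); the representation is illustrative, not GCC's:

```c
#include <assert.h>
#include <math.h>

#define N 4   /* entry=0, A=1, B=2, exit=3 */

/* Propagate frequencies over an acyclic CFG whose nodes are already
   numbered in topological order (every edge goes from a lower to a
   higher index): each node's frequency is the sum of its
   predecessors' frequencies scaled by the edge probabilities.  */
void
propagate (double prob[N][N], double freq[N])
{
  freq[0] = 1.0;                    /* entry node frequency */
  for (int v = 1; v < N; v++)
    {
      freq[v] = 0.0;
      for (int u = 0; u < v; u++)   /* all predecessors precede v */
        freq[v] += freq[u] * prob[u][v];
    }
}
```

Since all edge probabilities out of a node sum to 1, the exit node of the diamond ends up with frequency 1 again, as expected.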

In a cyclic graph, exact propagation is difficult and time consuming. We adopt the simple algorithm described in [4] -- walk the loop tree from the innermost to the outermost loop, and at each level compute the estimated number of iterations by determining the probability that the loop-back edge is reached when the header block is executed. This can be done by a simple modification of the algorithm above -- when the header of an inner loop is reached, its estimated frequency is multiplied by the already predicted number of iterations.

We found this algorithm fast and its results satisfactory, as the majority of functions in practice have reducible flow graphs (i.e. non-natural loops do not exist).

Jan Hubicka 2003-05-04