More on phase order: Loop Optimizations of the C2 compiler

In the C2 compiler, loops are optimized during the Machine Independent Optimizations super-phase . Loops is where most of the execution time is spent, so the compiler gives great deal of attention to them.

As mentioned elsewhere, Machine Independent Optimizations are performed modifying the Ideal Graph. Loop optimizations are implemented in the PhaseIdealLoop class and executed several times during the Machine Independent Optimization super-phase. This post describes the phase order of these optimizations.

I’m assuming you have some knowledge on the ideal graph. This post can help you otherwise.

Loop optimizations phase order

Loop optimizations are executed invoking the PhaseIdealLoop class constructor. The constructor’s body contains a single call to the build_and_optimize method. There, the IdealLoopTree data structure is built to represent all loops of the method. The IdealLoopTree structure contains metadata relative to loops  useful for the optimizations performed in this phase. The build_and_optimize method drives the order of the phase, as depicted in the following picture:

PhaseOrderLoopOpts2

The process starts by building the IdealLoopTree, which is the data structure used by loop optimizations to store metadata regarding all loops in a method. The IdealLoopTree leafs represent the innermost loops, while the root of the tree is matched with the Root node of the method. Each node of the IdealLoopTree contains information regarding one specific loop, such as whether it is a counted loop or not, the number of times the loop has been unrolled, if it is a counted loop, the increment node, etc.

Once the IdealLoopTree is built, the loops in the Ideal Graph are ‘beautified’. In this phase, the structure of the loops can changed to match canonical structures that can be optimized more efficiently. Also, all the Region nodes representing cycles are changed for Loop nodes. The Ideal Graph created by the parser contains only Region nodes. During the Beautify Loop phase, these Region nodes are converted into Loop nodes. Marking these Region nodes as Loop nodes allows the optimization algorithms to detect loops easily. If during the beautification the structure changes, then a rebuild of the IdealTreeLoop is needed.

Afterwards, the phase continues with the following steps:

1. Build dominators: The dominators of each node are found to remove redundant null checks. If an access A to a nullable location dominates an access B to the same location, then there is no need to null check B. Dominators are also used to find loops and nodes belonging to loops by determining control dependencies. Read More on Control Dependencies here (section 3.1). Dominators also play an important role in the scheduling of data nodes of the graph (Global Code Motion, Global Value Numbering)

2. Place data nodes into loops (build_loop_early): This steps determine which data nodes belongs to the body of the loop. Not all nodes in the Ideal Graph are directly connected to a Region (i.e. basic block). These nodes are called data nodes and its execution order is defined by their data dependencies.  This steps tries to find which data nodes belongs to the Region node defining each loop. This is better described in the paper by Cliff Global Code Motion, Global Value Numbering  section 2.2.

3. Try finding counted loops: A counted loop:

for ( int i = 0; i < N; i++) {}

has a very specific signature in the Ideal Graph, which is shown in the  following picture, (taken from Thomas Würthinger master these):

countedloop

This type of loops allows some powerful loop optimizations. The compiler search for this structures within the existing loops and when found, it marks the loop, substituting the Loop region node for a CountedLoop region nodes. The final If node of the loop is replaced by a CountedLoopEnd node.

4. Find ideal loop placement (build loop late): At this point, there is extra information to better determine which data nodes belong to the loop.  This step determines if there is a better place to assign a data node to the one assigned in the step 2 and changes the graph accordingly. This is better described in the paper by Cliff Global Code Motion, Global Value Numbering  section 2.3.

5. Eliminate useless predicates: To perform loop predication optimization the compilers needs that the parser include some predicates while constructing the ideal graph. Sometimes this predicates become useless after several transformations, therefore they are eliminated at this point.

6. Re-associate invariants: This step performs an arithmetic association on the mathematical operations inside the loop to further improve the loop invariant motion, for example: invariant1 + (x + invariant2)  becomes  ( invariant1 + invariant2) + x , allowing to remove the sum of the invariants out of the loop.

7. Split if: The graph is simplified to avoid unnecessary control structures, for example two consecutive ‘ifs’ checking the same thing.

8. Loop predication: Performs the loop predication optimization as described here.

9. Do Intensify fill: This step pattern match array initialization loops and replace them with an assembler stub.

10. Loop Unrolling / switching / peeling: Performs Loop unrolling, loop peeling, loop unswitching and range check elimination. As seen in the picture, whether these optimizations are performed or not will depend on the loop itself and the options passed to the virtual machine. In the image the annotated edges with a question mark (?) represents the execution of a policy. A policy executes to determine whether a loop can be optimized or not by a particular optimization.

11. Super scalar loop optimizations: Performs super-scalar loop optimizations.

 

Further reading:

The Hostpot Wiki

 

Leave a Reply

Your email address will not be published. Required fields are marked *