
Add framework for iterative compilation to llvm
Needs ReviewPublic

Authored by zoran.jovanovic on Aug 20 2015, 10:35 AM.

Details

Summary

This patch adds the necessary parts to implement iterative compilation in
llvm: decision trees and three passes (ModuleDecisionTreeProxy,
FunctionFitness and ICSetCurrentFunc). Further, the inliner, LICM and the
register allocator are modified to support the iterative approach.

Diff Detail

Event Timeline

zoran.jovanovic retitled this revision from to Add framework for iterative compilation to llvm.
zoran.jovanovic updated this object.

This is a follow-up work on the iterative compilation framework for clang/llvm.
Initial discussion as well as the introduction to this approach has been presented at:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-December/thread.html#68784

Previous patch revisions (same as suggestions to create new revision for this patch) can be found at:

http://reviews.llvm.org/D4723

Here is a short explanation of how the decision tree is formed.

First, some passes need to be modified to call the getDecision function.
If this function returns false, the pass behaves exactly as the unmodified code would.
If the return value is true, the alternative path is taken.
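As a sketch of this pattern (with hypothetical names; the patch's actual proxy classes differ), a modified pass queries a decision source at each choice point and keeps its original behavior whenever the answer is false:

```cpp
#include <vector>

// Hypothetical stand-in for the framework's decision source: a
// precomputed sequence of answers for the current iteration.
struct DecisionProxy {
  std::vector<bool> decisions; // one entry per getDecision call, in order
  unsigned next = 0;

  // Past the recorded prefix every answer is false, i.e. the default path.
  bool getDecision() {
    if (next >= decisions.size())
      return false;
    return decisions[next++];
  }
};

// A pass with one binary choice point, e.g. picking an inlining threshold.
int chooseInlineThreshold(DecisionProxy &DP) {
  if (DP.getDecision())
    return 50;  // alternative path
  return 225;   // exactly the behavior of the unmodified pass
}
```

In the first iteration the proxy holds no recorded decisions, so every query falls through to false and the pass is byte-for-byte equivalent to the unmodified one.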

The getDecision function always returns false in the first iteration,
so the compiler works exactly as if there were no iterative framework.

Let us assume that the getDecision function is called four times during the first iteration of iterative compilation.
After the end of the first iteration, the decision tree will look like this:

o
 \
  o
   \
    o
     \
      o
       \
        o

In the second iteration, exactly one decision is changed: the compiler works as in the first iteration until it reaches that particular decision, and then takes the alternative path at that point.
From there on, the getDecision function returns false until the compilation iteration is finished. Let us assume that we take the alternative at the third decision and that there are three more calls
to getDecision afterwards. Note that different compilation paths may involve different numbers of getDecision calls.

o
 \
  o
   \
    o
   / \
  o   o
   \   \
    o   o
     \
      o

Every new iteration adds one new branch to the decision tree.
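The replay scheme described above can be sketched like this (a simplification, not the patch's data structures): an iteration is fully described by the index of the single flipped decision, because everything before it replays the default answers and everything after it answers false.

```cpp
#include <vector>

// Build the recorded decision prefix for an iteration that flips the
// decision at `flippedIndex` (0-based). All earlier answers replay the
// default (false); calls past the prefix also answer false, so only
// this prefix needs to be stored to replay the iteration exactly.
std::vector<bool> decisionsForIteration(unsigned flippedIndex) {
  std::vector<bool> prefix(flippedIndex + 1, false);
  prefix[flippedIndex] = true; // the single changed decision
  return prefix;
}
```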

At the end of each iteration, the fitness of the generated code is evaluated. In the last iteration, the compiler takes the path with the best fitness.

To guide the selection of the node where the alternative decision will be taken, the getDecision function takes one parameter.
This parameter is interpreted as a priority, and the node with the highest priority is selected for the next alternative decision.

Machine learning approach

The formed decision tree can be used to train a binary classifier that facilitates the existing heuristics. The nodes where the decision tree branched provide training examples for the classifier. We could replace existing heuristics with this trained classifier and potentially get better code even without the iterative approach.

This may be a good approach for a JIT compiler, but it requires adding the machine learning machinery, which is planned for the future.

N-ary decisions are not planned for now because every n-ary decision can be replaced with n-1 binary decisions. For example, if we have to decide on a number from zero to three, we can ask three yes/no questions:

(is the number zero, or greater than zero?)
  /                            \
 0          (is the number one, or greater than one?)
              /                      \
             1               (is the number 2 or 3?)
                                /          \
                               2            3
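The three questions above decode into a value as follows (a minimal illustration, not code from the patch):

```cpp
// Decode a value in {0,1,2,3} from three binary decisions, mirroring
// the question tree above: each question is one binary decision point.
int decodeChoice(bool greaterThanZero, bool greaterThanOne, bool isThree) {
  if (!greaterThanZero)
    return 0;
  if (!greaterThanOne)
    return 1;
  return isThree ? 3 : 2;
}
```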

If some decisions are made irrelevant by later decisions, that is merely an inefficiency of the compilation process; the nodes stay in the decision tree because nodes are never deleted. Each node represents a decision point in the history, and the path from the root of the tree to a leaf carries enough information to exactly replay a compilation iteration.

This patch includes:

Code refactoring.

Rebase
This version of patch is rebased to current trunk.

Decision points in pre-codegen and in codegen phase
This version introduces support for decision points in both the
pre-codegen and codegen phases.

This feature led to a lot of changes in the code because the
ImmutablePass implementation is such that an ImmutablePass is destroyed
between the pre-codegen and codegen phases. The issue is resolved with
temporary JSON files that pass information between the two phases. We
also had to merge the results from these phases.

We find this behaviour of ImmutablePass and the boundary between the
pre-codegen and codegen phases a bit strange, and we see room for
improvements that would simplify this patch and the implementation of
similar features in llvm.

New optimization point - RegAllocGreedy
Decision points in the codegen phase enabled us to implement a new
optimization point where iterative compilation can take an alternative
path. In this version of the patch, iterative compilation can take
alternative paths in register allocation, next to LICM and the inliner,
which were present in the old version.

Added unit tests for DecisionTree, FileUtils and ModuleDecisionTreeProxies.

Bugfixes
During development we fixed several bugs; some were introduced by the
new changes and some existed in the old version as well.

Testing procedure and current status
We are targeting the MIPS architecture and the current testing procedure
looks like this:

  • create the CSiBE test configuration
  • preprocess the files with gcc
  • run all tests with -Oz -target mipsel-unknown-linux -fiterative-comp=n -S
  • assemble the output with the MIPS gcc toolchain
  • measure the results with size
Our results show improvements of up to 8.78% in code size, which is
promising; we are working on getting such results in more cases and on
improving them further.

Our results also show that taking alternative decisions in LICM, the
inliner, or the register allocator alone can improve code size by up
to 6%, 5% and 8%, respectively. These results confirm our assumption
that these optimization points are good places for taking an
alternative compilation path.

Significant improvements are seen as early as 10 iterations, but
100 iterations lead to even better results. As expected, the
improvement does not grow linearly with the iteration count.

RFC
We would like to see this feature (iterative compilation) merged into
the LLVM/Clang core, and we would like to hear what we should change
and improve to make this possible.

We would also like to hear from you if you find this approach
interesting, and we welcome any suggestions on where to steer research
and experiments. Any suggestions as to which passes have potential for
taking an alternative path?

Currently, there are a few issues we are working on, among which the
most interesting are:

  • The register allocator has higher priority than the other
    optimization points, so we will change the implementation so that
    all optimization points get a chance to take an alternative
    direction.

  • Our decision whether to take an alternative path is based on a
    fitness function which can be improved. We will be working on
    that, too.

Please feel free to experiment with this patch and to contact us for
any help or information.

This patch needs to be applied along with the patch for clang.
See http://reviews.llvm.org/D12200 for more details.

vsk added a subscriber: vsk.Aug 22 2015, 9:01 PM

*Places for improvement*

Our research identified a couple of places where improvements can be made:

  • DecisionTreeNodes priorities can be tuned better
  • Tree exploration can be made faster

*Implemented solutions*
This patch addresses the first issue.

The original idea is to find spots where the heuristic's result is not
confident. Those values should be inspected carefully and normalized,
which we can still improve. Because of this, our implementation was
biased toward exploring alternative decisions of only one optimization
point (RegAllocGreedy).

For now, this problem is solved by introducing a "round robin" tree
exploration, which masks out nodes of already-used optimization points
until all optimizations are exhausted or there are no available nodes
of unused optimizations left. When all optimizations have been used, or
there are no available nodes to choose, we restart the masking process.
(An optimization point is considered used if one of its nodes was
chosen for making an alternative decision and producing an alternative
path.)
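A sketch of this round-robin masking (hypothetical types; the patch's bookkeeping differs): nodes whose optimization point has already produced an alternative path this round are skipped, and the mask resets once every point has had a turn or nothing unmasked remains.

```cpp
#include <set>
#include <string>
#include <vector>

struct CandidateNode {
  std::string optPoint; // e.g. "Inliner", "LICM", "RegAllocGreedy"
  int priority;
};

// Pick the highest-priority node whose optimization point is not masked;
// if everything is masked, clear the mask and retry. Returns an index
// into `nodes`, or -1 if there are no nodes at all.
int pickNodeRoundRobin(const std::vector<CandidateNode> &nodes,
                       std::set<std::string> &usedOptPoints) {
  if (nodes.empty())
    return -1;
  int best = -1;
  for (int i = 0; i < (int)nodes.size(); ++i) {
    if (usedOptPoints.count(nodes[i].optPoint))
      continue; // masked for this round
    if (best < 0 || nodes[i].priority > nodes[best].priority)
      best = i;
  }
  if (best < 0) { // every optimization point used: restart the round
    usedOptPoints.clear();
    return pickNodeRoundRobin(nodes, usedOptPoints);
  }
  usedOptPoints.insert(nodes[best].optPoint);
  return best;
}
```

Successive calls then cycle through the optimization points in priority order before any point gets a second turn.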

This solution enabled us to vary the position in the decision tree
where we make the alternative decision, which led to significant
improvements.

Another improvement in this patch is randomization of the node choice.
Let us define the best node as the node with the lowest function
fitness (the node lies on the path leading to the smallest code size)
and the highest priority. The issue is that there is usually more than
one such node in the decision tree for every optimization. In previous
revisions, only the first best node was considered for making the
alternative decision. In this revision, we collect all equivalent best
nodes and randomly choose one of them.
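The tie collection can be sketched as follows (hypothetical types, not the patch's code); the random pick is left to the caller so the selection itself stays deterministic:

```cpp
#include <vector>

struct ExploredNode {
  double fitness; // lower is better (smaller code on this path)
  int priority;   // higher is better
};

// Collect the indices of all nodes tied on (lowest fitness, highest
// priority). The caller then picks one of them at random, e.g. with
// best[rand() % best.size()].
std::vector<int> equivalentBestNodes(const std::vector<ExploredNode> &nodes) {
  std::vector<int> best;
  for (int i = 0; i < (int)nodes.size(); ++i) {
    if (best.empty()) {
      best.push_back(i);
      continue;
    }
    const ExploredNode &b = nodes[best.front()], &n = nodes[i];
    if (n.fitness < b.fitness ||
        (n.fitness == b.fitness && n.priority > b.priority))
      best.assign(1, i); // strictly better: new champion
    else if (n.fitness == b.fitness && n.priority == b.priority)
      best.push_back(i); // tie: keep as an equivalent choice
  }
  return best;
}
```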

This randomization led to even faster tree exploration and greater
improvements.

*Current results*
We continued benchmarking the LLVM IC framework on the CSiBE test
suite. Our results show that with this patch, 10 iterations give
results as good as 100 iterations of the old version! As expected,
this version also outperformed the old version at 100 iterations.

*Further work*
To achieve the improvements mentioned at the beginning, we will try to
split the decision tree so that every function has its own decision
tree. That way every function would have a smaller tree, and separate
functions could be optimized in parallel. With this implemented, we
expect to get even smaller code while keeping the number of iterations
reasonably small.

Priority tuning still stays open for suggestions and further work.

This version introduces new algorithm for exploring decision trees.
As stated in the last patch, splitting one big decision tree into multiple trees, one per function in the compilation unit, can improve results.
Furthermore, separate decision trees are needed for each optimization phase (interprocedural, pre-codegen and codegen).

The algorithm spends 1/3 of the iterations searching for the best combination of alternative decisions across the trees of each phase for every function in parallel. We got code-size improvements on 5% more test cases, which adds up to roughly 25% of test cases where our algorithm achieved smaller code size than the non-iterative version.

These results show that parallel exploration enables deeper exploration of the decision tree for the same number of iterations.

Comments are welcome.

Just to PING.
Any comments on this work?

zoran.jovanovic updated this revision to Diff 54639.EditedApr 22 2016, 6:27 AM

New patch version rebased to r256194.
Any comments on this work?

Hi Zoran,

One of my students and I started to play with this patch. Unfortunately, however, it does not seem to work. Using a minimal example:

$ cat /tmp/test.c
int main() {}

we see the following error:

bin/clang -S -fiterative-comp=2 /tmp/test.c -c
ITERATIVE COMPILATION ERROR:

The problem seems to be that only one file is written during module finalization:

if (ICPhase != FnNameAndPhase::CodeGenPhase) {
  resultsFile = new ICResultsFile(FicResultsFile);
} else {
  resultsFile = new ICResultsFile(FicResultsFile2);
}

but both files are unconditionally read by clang:

+int loadResultFiles(llvm::ModuleDecisionTrees &MDT,
+                    std::vector<llvm::MdtResults> &merged) {
+  llvm::ICResultsFile results1 = MDT.getICResultsFile();
+  llvm::ICResultsFile results2 = MDT.getICResultsFile2();
+
+  int r1 = results1.read();
+  int r2 = results2.read();
+
+  if(r1 || r2)
+    return 1;
+
+  llvm::MdtResults::mergeMdtResults(merged,
+      results1.getResults(), results2.getResults());
+
+  return 0;
+}

It is currently unclear what the expected behavior is. Do you happen to have a minimal example that would allow us to use this IC patch with clang and see at least two different inlining choices being explored?

To use the iterative compilation algorithm for a target other than MIPS, it is necessary to implement an appropriate function fitness pass that evaluates the generated code. In our patch the criterion used is code size, and the pass is implemented in lib/Target/Mips/MipsFunctionFitness.cpp.
For X86 the code-size calculation is more complex because of the variable-length machine instruction encoding.
We are looking into issuing a more informative error message until the architecture is supported.