Page MenuHomePhabricator

Add "RPRED" mechanism for relocation predicates. (5% speedup for `ld.lld -O0`)
AbandonedPublic

Authored by silvas on Nov 26 2016, 9:09 PM.

Details

Summary

In various places in LLD's hot loops, we have expressions of the form "E == R_FOO || E == R_BAR || ..." (E is a RelExpr).

Some of these expressions are quite long, and even though they usually go just a very small number of ways[1] and so should be well predicted, they can still occupy branch predictor resources harming other parts of the code, or they won't be predicted well if they overflow branch predictor resources or if the branches are too dense and the branch predictor can't track them all (the compiler can in theory avoid this, at a cost in text size). And some of these expressions are so large and executed so frequently that even when well-predicted they probably still have a nontrivial cost.

This patch aims to accelerate these operations by assigning the RelExpr enum values in such a way that these predicates can be written as a single expression E & RPRED_FOO. The patch also converts a couple use cases in the hot path as a proof of concept (it is easy to add or remove predicates, so we can decide later which predicates to accelerate, based on benchmarking).

The predicates are encoded in the high bits of the RelExpr value. There is one high bit of the RelExpr value per predicate. So each predicate can become a single bit test.

This speedup should be pretty portable. The cost of these simple bit tests is independent of:

  • the target we are linking for
  • the distribution of RelExpr's for a given link (which can depend on how the input files were compiled)
  • what compiler was used to compile LLD (it is just a simple bit test; hopefully the compiler gets it right!)
  • adding new target-dependent relocations (e.g. needsPlt doesn't pay any extra cost checking R_PPC_PLT_OPD on x86-64 builds)

Rafael, I tried running this through your benchmarks that I grabbed from your google drive link, but it uses a bunch of stuff that seem to be local to your machine. I was able to get clang-fsds working though. I did some rough measurements on clang-fsds and this patch gives over about 4% speedup for a regular -O1 link, about 2.5% for -O3 --gc-sections and over 5% for -O0. Sorry, I don't have my current machine set up for doing really accurate measurements right now; Rafael, if you could run it through your full benchmark suite that would be really nice.

[1] see e.g. the quick analysis in https://reviews.llvm.org/F2621745 (derived from the trace in https://reviews.llvm.org/F2621770)

Diff Detail

Event Timeline

silvas updated this revision to Diff 79344.Nov 26 2016, 9:09 PM
silvas retitled this revision from to Add "RPRED" mechanism for relocation predicates. (5% speedup for `ld.lld -O0`).
silvas updated this object.
silvas added reviewers: ruiu, rafael, davide.
silvas set the repository for this revision to rL LLVM.
silvas added a subscriber: llvm-commits.
silvas updated this revision to Diff 79346.Nov 26 2016, 9:17 PM
silvas removed rL LLVM as the repository for this revision.

Add a comment. More comments are probably necessary but this one was particularly bad to have missing.

silvas updated this revision to Diff 79347.Nov 26 2016, 9:19 PM

This time actually update the diff.

ruiu edited edge metadata.Nov 26 2016, 9:31 PM

4% speedup in -O1 sounds like a nice improvement.

ELF/Relocations.def
22–24

Are these numbers correct? It seems that

PRED_RELAX_MARKER == PRED_ALWAYS_CONSTANT | PRED_REFERS_TO_GOT

holds, which is odd. Don't you need power-of-twos?

silvas added inline comments.Nov 26 2016, 10:04 PM
ELF/Relocations.def
22–24

The shifting is done in construction of the enum. Sorry, I should have some comments in this file.

enum RelExprPredicate {
#define PREDICATE(Name, PredicateIndex)                                        \
  Name = (1 << (PredicateIndex + NumRelExprIndexBits)),
#include "Relocations.def"
};
silvas updated this revision to Diff 79349.Nov 26 2016, 10:30 PM
silvas edited edge metadata.

Simplify the definition of the predicates to be easier to follow. There is no longer any need to have the predicate definitions in Relocations.def

Rui, let me know if you like this approach better (or have suggestions for a simpler approach).

silvas updated this revision to Diff 79350.Nov 26 2016, 10:32 PM

This time with the right patch.

silvas marked 2 inline comments as done.Nov 26 2016, 10:35 PM

Mark some comments as done. They aren't relevant with the way that the predicates are currently defined.

Do the use-sites need to be updated? Won't the compiler already simplify such expressions (just concerned that the use-sites may be come less legible)

Also, for future source code readers, you may want a comment at the definition site of the enum that it is a regular (exclusive) enum and not some kind of flags/bit-field type thing.

Do the use-sites need to be updated? Won't the compiler already simplify such expressions (just concerned that the use-sites may be come less legible)

The use sites are updated by this patch (search for the expressions that look like & RPRED_FOO).

The expressions that are being replaced are of the form x == 7 || x == 9 || x == 35 || x == 12 || .... The best the compiler could do is infer range information (based on the C language rules for enums) and make a lookup table, but I'm not aware of any compiler that will currently do that for this kind of expression.

This patch is a type of data representation optimization that a compiler can't be expected to do, even with whole-program analysis.

The use sites might become less legible, but ideally we can hide each of these operations behind a helper with a useful name, such as refersToGotEntry, needsPlt, etc. so I don't expect it to matter much. I'm not as much of an expert on relocations to come up with good names for each of them, so some of them I have just left as a raw bit test. Hopefully Rafael and others can help identify good names for these (and for the RPRED_* enum values).

Also, for future source code readers, you may want a comment at the definition site of the enum that it is a regular (exclusive) enum and not some kind of flags/bit-field type thing.

Once we settle down on the design, I'll need to do another pass through the code and add some strategic comments.

joerg added a subscriber: joerg.Nov 27 2016, 3:17 AM

I'm not convinced that a better switch encoding logic by the compiler can't be done. In fact it seems to be quite reasonable to expect

if (x == C1 || x == C2 || ... || x == C_N)

to be turned into

if (x>=0 && x < 64 && (1ULL << (x&63)) & ((1ULL << C1) | (1ULL << C2) | ....| (1ULL << C_N))

as a general optimisation if all C_i<64.

Joerg, that transformation is interesting (and Hans' new switch lowering has something like that). However, I don't think we should assume that compilers implement that, nor that we can rely on it to generate code as good as the one with this RPRED mechanism.

One issue that I see with that transformation is that it is not a form of canonizalization and so won't kick in until later (or in the extreme case like Hans' switch lowering, in the backend).
So there is a phase ordering issue because, for example, JumpThreading would like to see the single combined expression in order for needsPlt or refersToGotEntry to get properly jump-threaded in the hot relocation loop.
Even worse, profile-guided branch reordering could easily mangle these expressions, causing them to not CSE appropriately (again, the hot relocation loop has precisely this situation; e.g. needsPlt and refersToGotEntry are duplicated).

With this RPRED mechanism we can get reliable results from any compiler.

joerg added a comment.Nov 27 2016, 6:20 AM

Your change is IMO a non-trivial uglification of the code and will only one shift at runtime. I'm not sure that justifies this complexity. I would expect it to be trivial to transform the suggested codegen change into a constexpr function with array argument to do the same (and include an assert about the value range of x where necessary.

ruiu added a comment.Nov 27 2016, 9:31 AM

It is a legitimate concern that the new code is harder to read, especially when compared to the original simple-minded one.

I wonder if a table lookup will work here. A RelExpr is a small integer, so we can do something like this

static bool refersToGotEntry(RelExpr Expr) {
  return RefersToGotEntry[static_cast<int>(Expr)];
}

where RefersToGotEntry is defined as

bool RefersToGotEntry[R_MAX] = {
  .R_GOT = true, .R_GOT_OFF = true, .R_MIPS_GOT_LOCAL_PAGE = true, ...
}
In D27145#606205, @ruiu wrote:

It is a legitimate concern that the new code is harder to read, especially when compared to the original simple-minded one.

I don't think it actually is so bad. These predicates are rarely modified and usually have a larger semantic meaning like needsPlt or refersToGotEntry. I don't think that most people reading this code actually look at the specific set of RelExpr's being checked very often.

I wonder if a table lookup will work here. A RelExpr is a small integer, so we can do something like this

static bool refersToGotEntry(RelExpr Expr) {
  return RefersToGotEntry[static_cast<int>(Expr)];
}

where RefersToGotEntry is defined as

bool RefersToGotEntry[R_MAX] = {
  .R_GOT = true, .R_GOT_OFF = true, .R_MIPS_GOT_LOCAL_PAGE = true, ...
}

If you want I can try an approach like that. However, the syntax you have there isn't possible in C++. It would probably end up looking something like:

RelExprPredicateTable RefersToGotEntry([](RelExpr Expr) {
  return Expr == R_GOT || Expr == R_GOT_OFF || Expr == R_MIPS_GOT_LOCAL_PAGE ||
         Expr == R_MIPS_GOT_OFF || Expr == R_MIPS_GOT_OFF32 ||
         Expr == R_MIPS_TLSGD || Expr == R_MIPS_TLSLD ||
         Expr == R_GOT_PAGE_PC || Expr == R_GOT_PC || Expr == R_GOT_FROM_END ||
         Expr == R_TLSGD || Expr == R_TLSGD_PC || Expr == R_TLSDESC ||
         Expr == R_TLSDESC_PAGE;
});

The code for RelExprPredicateTable would be some annoying constexpr metaprogramming that calls the lambda on each possible RelExpr value to build the table (and then we have to worry about which compilers support the necessary C++ features to do it; if not all of them do, then it may not be possible or it may have to be done in an ugly way). I'm not sure that would actually be that much more readable.

Do you want me to try it?

Also, it may not be as fast (hard to beat a single AND), but that will have to be measured.

joerg added a comment.Nov 27 2016, 3:47 PM

You can build the preciate functions with a constexpr helper? E.g.

static bool refersToGotEntry(RelExpr Expr) {
  if (uint64_t(Expr) > 63)
    return false;
  return (uint64_t(1) << Expr) & buildMask({R_GOT, R_GOT_OFF, ...});
}

where buildMask computes (uint64_t(1) << R_GOT) | (uint64_t(1) << R_GOT_OFF) | ... via constexpr.

I don't think precomputing the shift and passing that around is much help -- it's a single cheap assembler instruction without any memory access. It can also be trivially CSEed. The first check can likely be replaced by an assert, assuming input is sanitzed already here.

You can build the preciate functions with a constexpr helper? E.g.

static bool refersToGotEntry(RelExpr Expr) {
  if (uint64_t(Expr) > 63)
    return false;
  return (uint64_t(1) << Expr) & buildMask({R_GOT, R_GOT_OFF, ...});
}

where buildMask computes (uint64_t(1) << R_GOT) | (uint64_t(1) << R_GOT_OFF) | ... via constexpr.

I don't think precomputing the shift and passing that around is much help -- it's a single cheap assembler instruction without any memory access. It can also be trivially CSEed. The first check can likely be replaced by an assert, assuming input is sanitzed already here.

Oh, nice. Sorry, I misunderstood what you meant. That sounds really nice actually.

Okay, here's what I've come up with:
https://godbolt.org/g/NNSthI

If this looks good then I'll go back and just use this.

joerg added a comment.Nov 27 2016, 4:27 PM

That looks good.

I have posted D27156 as an alternative to this patch. The approach that Joerg suggested turned out really well actually. PTAL.

wow, that is neat.
I wonder if it would be legal to teach the compiler to do this if both sides of the expression were enum values of the same type (or possibly more legal if they were scoped enums) ?

grimar added a subscriber: grimar.Nov 27 2016, 11:41 PM
joerg added a comment.Nov 28 2016, 2:32 AM

The compiler would generally either have to know the overflow behavior of the left shift or add the one conditional I mentioned earlier. But otherwise I don't see why it can't create the transform.

ruiu added a comment.Nov 28 2016, 7:00 AM

It is unfortunate that we cannot write like that. I wish we had the C99 desginated initializer in C++.

ELF/Relocations.h
42–43

I think I prefer writing the definitions inline here in a macro-expanded form rather than in an included file. The file is not included more than once (except for static_assert), so the #include hack does not make much sense. I also want to make code less exciting by writing constants inline.

This code

enum RelExprPredicateBit {
  RPRED_NEEDS_PLT       = 1 << 31,
  RPRED_ALWAYS_CONSTANT = 1 << 30,
  RPRED_REFERS_TO_GOT   = 1 << 29,
  RPRED_RELAX_MARKER    = 1 << 28,
  RPRED_RELATIVE        = 1 << 27,
};

is boring but I believe it's easier to read. Likewise, I prefer a style like this because it is obvious what it is doing without looking at other places.

enum RelExpr {
  R_ABS = 1,
  R_GOT = 2 | RPRED_REFERS_TO_GOT,
  R_GOTONLY_PC = 3,
  R_GOTONLY_PC_FROM_END = 4,
  R_GOTREL = 5 | RPRED_RELATIVE,
  R_GOTREL_FROM_END = 6 | RPRED_RELATIVE,
  R_GOT_FROM_END = 7 | RPRED_ALWAYS_CONSTANT | RPRED_REFERS_TO_GOT,
  ...
};
silvas abandoned this revision.Dec 1 2016, 12:13 AM

Abandoning now that D27156 has landed.