In various places in LLD's hot loops, we have expressions of the form "E == R_FOO || E == R_BAR || ..." (E is a RelExpr).
Some of these expressions are quite long. Even though they usually go just a small number of ways[1] and so should be well predicted, they still occupy branch predictor resources, harming other parts of the code; and they will be predicted poorly if they overflow the predictor's resources or if the branches are packed too densely for it to track them all (the compiler can in theory avoid this, at a cost in text size). Some of these expressions are so large and executed so frequently that they probably have a nontrivial cost even when well predicted.
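To make the problem concrete, here is a minimal sketch (the enumerators and the helper are invented for illustration; they are not LLD's actual definitions) of the kind of chained equality test the hot loops contain today, where each "||" is a potential branch competing for predictor resources:

```cpp
#include <cassert>

// Hypothetical RelExpr values, for illustration only.
enum RelExpr { R_ABS, R_PLT, R_PLT_PC, R_PPC_PLT_OPD, R_PLT_PAGE_PC, R_GOT };

// The chained-comparison form of a predicate: every clause is a separate
// comparison, and potentially a separate branch.
static bool needsPlt(RelExpr E) {
  return E == R_PLT || E == R_PLT_PC || E == R_PPC_PLT_OPD ||
         E == R_PLT_PAGE_PC;
}
```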
This patch aims to accelerate these operations by assigning the RelExpr enum values in such a way that these predicates can be written as a single expression "E & RPRED_FOO". The patch also converts a couple of use cases on the hot path as a proof of concept (it is easy to add or remove predicates, so we can decide later which ones to accelerate, based on benchmarking).
The predicates are encoded in the high bits of the RelExpr value, one high bit per predicate, so each predicate check becomes a single bit test.
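A minimal sketch of the encoding (again with invented names and bit assignments, not the actual patch): each accelerated predicate claims one high bit, the low bits distinguish the individual relocation expressions, and each enumerator ORs in the bits of the predicates it satisfies:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding for illustration: one high bit per predicate.
enum RelExpr : uint32_t {
  RPRED_PLT = 1u << 31,
  RPRED_GOT = 1u << 30,

  // Low bits give each expression a distinct value; high bits record
  // which predicates it satisfies.
  R_ABS     = 0,
  R_PLT     = 1 | RPRED_PLT,
  R_PLT_PC  = 2 | RPRED_PLT,
  R_GOT     = 3 | RPRED_GOT,
  R_GOT_PLT = 4 | RPRED_GOT | RPRED_PLT,
};

// "E == R_PLT || E == R_PLT_PC || E == R_GOT_PLT" collapses to one AND.
static bool needsPlt(RelExpr E) { return (E & RPRED_PLT) != 0; }
static bool needsGot(RelExpr E) { return (E & RPRED_GOT) != 0; }
```

Note that a value can carry several predicate bits at once (R_GOT_PLT above satisfies both predicates), so adding a new expression to a predicate is just ORing in one more bit.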
This speedup should be pretty portable. The cost of these simple bit tests is independent of:
- the target we are linking for
- the distribution of RelExprs for a given link (which can depend on how the input files were compiled)
- what compiler was used to compile LLD (it is just a simple bit test; hopefully the compiler gets it right!)
- adding new target-dependent relocations (e.g. needsPlt doesn't pay any extra cost checking R_PPC_PLT_OPD on x86-64 builds)
Rafael, I tried running this through the benchmarks I grabbed from your Google Drive link, but they use a bunch of stuff that seems to be local to your machine. I was able to get clang-fsds working, though. Some rough measurements on clang-fsds show this patch giving a bit over 4% speedup for a regular -O1 link, about 2.5% for -O3 --gc-sections, and over 5% for -O0. Sorry, I don't have my current machine set up for really accurate measurements right now; Rafael, if you could run it through your full benchmark suite, that would be really nice.
[1] see e.g. the quick analysis in https://reviews.llvm.org/F2621745 (derived from the trace in https://reviews.llvm.org/F2621770)
I think I prefer writing the definitions inline here in macro-expanded form rather than in an included file. The file is not included more than once (except for the static_assert), so the #include hack does not buy much. I also want to make the code less exciting by writing the constants inline.
This code is boring, but I believe it's easier to read. Likewise, I prefer this style because it is obvious what the code is doing without having to look at other places.