Investigation of PR49480 showed that D95121 caused about a 5.0% speed
regression when linking chromium_framework. That diff introduces a (very
useful) additional layer of abstraction over relocations, so the perf
overhead is not too surprising. The diff is pretty large, and perf
didn't give me any great hints, so I just went with optimizing the
likely candidate -- getRelocAttrs(). I managed to claw back about
1.4% of perf this way. Making the relocAttrsArray a global and
devirtualizing getRelocAttrs() gave most of the win; I also marked
the array range check with LLVM_UNLIKELY for good measure.
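
To illustrate the shape of the change, here is a minimal standalone sketch, not the actual lld code. The `RelocAttrs` fields, the table contents, the `SKETCH_UNLIKELY` macro (a stand-in for `LLVM_UNLIKELY`), and all type names other than `getRelocAttrs()` are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for LLVM_UNLIKELY (assumption: falls back to a no-op hint
// on compilers without __builtin_expect).
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical attribute record; the real fields may differ.
struct RelocAttrs {
  const char *name;
  uint32_t bits;
};

static const RelocAttrs invalidRelocAttrs{"INVALID", 0};

// The attribute table lives at namespace scope instead of inside a
// function, so lookups do not pay for a function-local static guard.
static const std::array<RelocAttrs, 3> sketchRelocAttrsArray{{
    {"UNSIGNED", 1}, {"SUBTRACTOR", 2}, {"BRANCH", 4}}};

// Before: every lookup is a virtual call on the target object.
struct VirtualTarget {
  virtual ~VirtualTarget() = default;
  virtual const RelocAttrs &getRelocAttrs(uint8_t type) const = 0;
};

struct SketchVirtualTarget final : VirtualTarget {
  const RelocAttrs &getRelocAttrs(uint8_t type) const override {
    if (type >= sketchRelocAttrsArray.size())
      return invalidRelocAttrs;
    return sketchRelocAttrsArray[type];
  }
};

// After: the base class keeps a pointer to the global table, so the
// lookup is a plain, inlinable member function.
struct Target {
  Target(const RelocAttrs *attrs, size_t size)
      : relocAttrs(attrs), relocAttrsSize(size) {}

  const RelocAttrs &getRelocAttrs(uint8_t type) const {
    // Out-of-range relocation types are rare, so hint the branch.
    if (SKETCH_UNLIKELY(type >= relocAttrsSize))
      return invalidRelocAttrs;
    return relocAttrs[type];
  }

  const RelocAttrs *relocAttrs;
  size_t relocAttrsSize;
};
```

Both versions return the same entries; the second just avoids the indirect call on a path that runs once per relocation.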
The numbers above are quoted for chromium_framework (from the tarball in
PR48657).

      N           Min           Max        Median           Avg        Stddev
  x  20           4.5          4.66          4.56        4.5715   0.044871161
  +  20          4.42          4.61           4.5        4.5075   0.053001986
  Difference at 95.0% confidence
          -0.064 +/- 0.0314295
          -1.39998% +/- 0.68751%
          (Student's t, pooled s = 0.0491052)

I also measured v8_unittests:

      N           Min           Max        Median           Avg        Stddev
  x  20          0.62          0.65          0.64        0.6355  0.0082557795
  +  20          0.61          0.63          0.62         0.616  0.0059824304
  Difference at 95.0% confidence
          -0.0195 +/- 0.00461426
          -3.06845% +/- 0.726084%
          (Student's t, pooled s = 0.00720928)

The v8 difference is likely larger because it doesn't use an order file,
so relocation handling presumably accounts for a bigger share of its link
time. Symbol ordering is actually one of the most expensive steps when
linking chromium_framework, and is probably a good target for further
optimization.
The other hotspot is the assignment of relocations to subsections. I'm
curious as to whether replacing the RB-tree in std::map with a radix
trie would be an improvement...
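
For context, the kind of lookup in question can be sketched as below. This is a simplified illustration under assumed names (`SubsectionMap`, `findSubsection`, integer subsection ids), not the actual lld data structures: each relocation offset is mapped to the subsection that starts at or before it, using an ordered `std::map`.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical: subsections keyed by their start offset in a section,
// mapped to a subsection id.
using SubsectionMap = std::map<uint32_t, int>;

// Returns the subsection containing `offset`, i.e. the entry with the
// largest start offset that is <= `offset`.
inline int findSubsection(const SubsectionMap &subsecs, uint32_t offset) {
  auto it = subsecs.upper_bound(offset); // first start offset > offset
  assert(it != subsecs.begin() && "offset precedes the first subsection");
  return std::prev(it)->second;
}
```

Each such lookup walks O(log n) tree nodes, which is the cost a different container would have to beat.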