This is an archive of the discontinued LLVM Phabricator instance.

AArch64: compress jump tables to minimum size needed to reach destinations
ClosedPublic

Authored by t.p.northover on Apr 26 2017, 3:17 PM.

Details

Summary

This patch adds a pass after branch relaxation to check whether the span of each jump table is small enough that its offsets can be encoded in 8 or 16 bits, and emits the appropriate sequences if so.
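As a rough illustration of the size decision (a Python sketch, not the patch's actual C++; the scaling of entries by 4 is an assumption based on AArch64's 4-byte instruction alignment):

```python
def jump_table_entry_size(max_span_bytes):
    """Pick the smallest entry width whose range covers the table's span.

    Hypothetical model of the pass: offsets are measured from a base label
    and stored in instruction-word units (bytes / 4), since AArch64
    instructions are 4-byte aligned.
    """
    span_words = max_span_bytes // 4
    if span_words < (1 << 8):
        return 1   # byte entries
    if span_words < (1 << 16):
        return 2   # half-word entries
    return 4       # fall back to 32-bit entries
```

For example, a switch whose furthest destination is 900 bytes from the base still gets 1-byte entries (900 // 4 = 225 < 256), which is why so few tables end up needing the 32-bit form.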

This improves binary size quite significantly for some tests (it turns out not a single table in the test-suite needs 32 bits of range, and two thirds need only 8 bits).

I'm a little divided on when to enable this and whether to continue to support 64-bit absolute entries (the code sequence is slightly simpler). In the end I decided to go for simplicity on the grounds that switches probably aren't performance critical and will probably be dominated by mispredicts anyway. I'd be open to gating it on MinSize or something though.

What do you think?

Diff Detail

Repository
rL LLVM

Event Timeline

t.p.northover created this revision.Apr 26 2017, 3:17 PM

If I'm following correctly, for PIC small-code-model code (which is what mostly matters these days), the comparison is this:

  1. On trunk we take 4 instructions to compute the destination for the jump: two to load the address of the jump table, one to load from the jump table, one to add the jump-table-relative offset to the address of the jump table.
  2. With this technique, we take 5 instructions: two to load the address of the jump table, one to load from the jump table, one to load the current PC, and one to add the PC-relative offset to the PC.

Is that right?

If you're looking to save size, we could put the jump table into the text segment just after the branch, and compute the jump destination in three instructions: one to load the address of the table, one to load the offset from the table, and one to add the offset to the address of the table. Is there some reason we shouldn't do that?

eastig added a subscriber: eastig.Apr 27 2017, 4:37 AM
joerg added a subscriber: joerg.Apr 27 2017, 5:51 AM

My primary concern here is that it can make it impossible to hand-patch the resulting assembly. I'm not sure if anyone is doing that, but if I have learned one lesson from pkgsrc development and the set of Linux support patches, it is to expect "yes, someone is really doing that" as the most likely answer. As such, I would hope for some global option to disable the pass.

About putting the jump table into a separate section: that's kind of what GCC does. I don't think a jump table necessarily has only one user, but placing it inside the function still looks helpful. That wouldn't conflict with -ffunction-sections or the like, either.

On 5/8/2017 9:36 AM, Tim Northover wrote:

On 26 April 2017 at 16:33, Eli Friedman via Phabricator
<reviews@reviews.llvm.org> wrote:

If you're looking to save size, we could put the jump table into the text segment just after the branch, and compute the jump destination in three instructions: one to load the address of the table, one to load the offset from the table, and one to add the offset to the address of the table. Is there some reason we shouldn't do that?

To get that first load down to 1 instruction we'd need to guarantee
the jump table was within 128MB of the code, and I think we
technically support larger functions. So we'd have to add something
like ARM's full Constant Island pass instead of the newly generic
branch relaxation we currently do. I don't think that's worth the
maintenance burden; it's a constant source of bugs on ARM.

For the second offset addition, again you're compromising distance.
The jump table itself then has to be within 256 or 65536 bytes of
every basic block used. I suspect that even *with* complex placement
that would be tough to arrange.
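To make Tim's constraint concrete, here is a small Python sketch (names hypothetical) of the reachability check an inline-table scheme would need, treating each entry as an unsigned table-relative byte offset:

```python
def all_reachable(table_addr, dest_addrs, entry_bytes):
    # With table-relative entries, every destination must lie within
    # 2**(8*entry_bytes) bytes of the table itself: 256 bytes for byte
    # entries, 65536 for half-word entries.
    limit = 1 << (8 * entry_bytes)
    return all(0 <= d - table_addr < limit for d in dest_addrs)
```

With byte entries, a destination 512 bytes past the table is already out of reach, which is why even clever table placement struggles.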

We manage for Thumb (see, for example, test/CodeGen/Thumb2/thumb2-tbb.ll). That said, I hadn't really considered the lack of existing infrastructure to implement this sort of thing on AArch64; you're right, it probably isn't worthwhile.

jfb added a subscriber: jfb.Sep 18 2018, 10:41 AM

Thread necromancy! I realised that I got distracted and never finished this discussion.

So I've ported the diff to current LLVM. Generally I still think it's the right approach.

jfb added inline comments.Sep 21 2018, 9:40 AM
llvm/lib/Target/AArch64/AArch64AsmPrinter.cpp
466

Range-based for loop here.

489

Make it a shr here, since that's what you'll want to emit anyways?

llvm/lib/Target/AArch64/AArch64CompressJumpTables.cpp
116

Are there ever cases where min is large, but the range [min, max) would fit within 8 or 16 bits? I'm wondering if you want another optimization where you pack the jump offsets, and always add min (here you're using a base of 0).

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
108

lol this comment

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
79

drop this

t.p.northover marked an inline comment as done.Sep 24 2018, 4:35 AM
t.p.northover added inline comments.
llvm/lib/Target/AArch64/AArch64CompressJumpTables.cpp
116

Probably. A slight wrinkle is that if the PC-label isn't local it has a limit of +/-1MB, but I suspect hitting that is significantly less likely than hitting the limit that would be lifted. I'll update the patch with that approach.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
108

I have a horrible suspicion it'll have been one of mine.

Tidy-ups suggested by JF, and (also JF's idea) a switch to an offset from the lowest-addressed basic block instead of the actual branch, to increase the number of candidates.

vsk added a subscriber: vsk.Oct 8 2018, 10:54 AM

Can you please provide some figures on the code size saved and on the performance impact of this change?

Thank you.

As you'd expect, size is pretty good. Over the test-suite (including externals) nothing regresses. The total benefit over all the code is 0.5%, with notable highlights of 25% in 401.bzip2, 7% in 403.gcc, and 4% in 177.mesa.

Time is obviously fuzzier, but I think it's in the noise. Overall there was a 0.5% regression by geomean, but when I limited it to tests that had actually changed, that dropped to 0.05%. Other slices I took to try to make the numbers more meaningful actually turned it into an improvement (including the total runtime!).

As you'd expect, size is pretty good. Over the test-suite (including externals) nothing regresses. The total benefit over all the code is 0.5%, with notable highlights of 25% in 401.bzip2, 7% in 403.gcc, and 4% in 177.mesa.

Quite expected, but good to know the actual numbers. :)

Time is obviously fuzzier, but I think it's in the noise. Overall there was a 0.5% regression by geomean, but when I limited it to tests that had actually changed, that dropped to 0.05%. Other slices I took to try to make the numbers more meaningful actually turned it into an improvement (including the total runtime!).

Sounds like noise to me.

This is good data. However, I'd like to evaluate this patch a bit further on Exynos, if I may.

jfb added a comment.Oct 9 2018, 11:22 AM

This is good data. However, I'd like to evaluate this patch a bit further on Exynos, if I may.

Do you have specific worries? Or at least a timeline? Seems the patch can be reviewed without waiting for your exploration.

In D32564#1259155, @jfb wrote:

Do you have specific worries? Or at least a timeline? Seems the patch can be reviewed without waiting for your exploration.

Yes, I do. Exynos limits the size of jump tables, resulting in a daisy chain of smaller jump tables. This patch changes the jump table code, so I'd like to evaluate the performance impact, unless this pass is gated by -Os.

llvm/lib/Target/AArch64/AArch64AsmPrinter.cpp
519

It seems to me that this instruction could be moved after the load below to promote macro fusion.

Yes, I do. Exynos limits the size of jump tables, resulting in a daisy chain of smaller jump tables. This patch changes the jump table code, so I'd like to evaluate the performance impact, unless this pass is gated by -Os.

Is that in trunk? I can't see any obvious candidates with a git grep -i exynos. If not, it's probably not relevant to this review.

t.p.northover added inline comments.Oct 9 2018, 3:23 PM
llvm/lib/Target/AArch64/AArch64AsmPrinter.cpp
519

The comment above explains why this has to be first.

Is that in trunk? I can't see any obvious candidates with a git grep -i exynos. If not, it's probably not relevant to this review.

Yes, see ExynosM{1,3} under AArch64Subtarget::initializeProperties().

jfb added a comment.Oct 9 2018, 4:33 PM
In D32564#1259155, @jfb wrote:

Do you have specific worries? Or at least a timeline? Seems the patch can be reviewed without waiting for your exploration.

Yes, I do. Exynos limits the size of jump tables, resulting in a daisy chain of smaller jump tables. This patch changes the jump table code, so I'd like to evaluate the performance impact, unless this pass is gated by -Os.

Do you have an ETA?

If that ETA is too far out, then Tim, do you think doing -Os first (with a flag to force it, say -O2 -mcompress-jump-tables) is OK, with other optimization settings enabled later (when the Exynos results come back)?

Could also be an architectural flag, given that this is particular behaviour of Exynos. Then a simple splitsJumpTables() or whatever check would be enough.

Rebase to master

In D32564#1259703, @jfb wrote:

Do you have an ETA?

Later today. Thank you.

I'm getting mixed results on Exynos, with significant improvements and regressions. I'd like to run more tests, but, at the moment, I'd rather this feature be gated, either as @jfb or @rengolin suggested.

Fair enough, I've disabled it for Exynos processors except under MinSize conditions. OK to commit?

rengolin added inline comments.Oct 12 2018, 1:34 AM
llvm/lib/Target/AArch64/AArch64CompressJumpTables.cpp
146

Can we not add more CPU name comparisons? It shouldn't be too hard to create a feature and associate it with these two cores. Later on, Samsung or any other vendor would be able to fine-tune it for other cores.

Rather than a feature in AArch.td, I'd prefer a line in AArch64Subtarget::initializeProperties().

Rather than a feature in AArch.td, I'd prefer a line in AArch64Subtarget::initializeProperties().

I think I disagree. Putting it in AArch64Subtarget.cpp means that someone has to recompile clang to test this out on their CPU. I'd reserve initializeProperties for unfortunate situations where you can't easily set something on the command-line or in AArch64.td (i.e. non-bool options so far).

Putting it in AArch64Subtarget.cpp also means you can't even in principle set it on a per-function basis without changing your CPU (which is icky).

Isn't this change always enabled for -Os? So it should be easy to test it or to enable down to a single function, shouldn't it?

Isn't this change always enabled for -Os? So it should be easy to test it or to enable down to a single function, shouldn't it?

-Oz, but even if it were -Os, that perturbs things in other ways that are probably not desirable for anyone who cared enough to annotate a function for it. But that was mostly a throwaway comment; I think the attitude that .cpp initialization is an unfortunate necessity is more compelling.

evandro accepted this revision.Oct 24 2018, 11:59 AM
This revision is now accepted and ready to land.Oct 24 2018, 11:59 AM
t.p.northover closed this revision.Oct 24 2018, 1:41 PM

Thanks. Committed as r345188.