This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] [CodeGenPrepare] Split large offsets from base addresses
ClosedPublic

Authored by luismarques on Apr 4 2019, 4:42 PM.

Details

Summary

This patch addresses the following issue that came up in the context of RISC-V benchmarks, but which affects other targets. Suppose you have several loads/stores that access array elements or struct fields with large offsets:

void foo(int *x, int *y) {
    y[0] = x[0x10001];
    y[1] = x[0x10002];
    y[2] = x[0x10003];
    ...
}

In a target such as RISC-V you cannot add 0x10001 to the address of x in a single instruction (the constant doesn't fit the 12-bit signed immediate), so the generated code is more directly reflected by something like this:

void foo(int *x, int *y) {
    y[0] = *(x+0x10000+1);
    y[1] = *(x+0x10000+2);
    y[2] = *(x+0x10000+3);
    ...
}

But you can fold the +1, +2, etc. into an immediate offset of the load/store instructions, so you can effectively have something like this:

void foo(int *x, int *y) {
    int *base = &x[0x10001];
    y[0] = base[0];
    y[1] = base[1];
    y[2] = base[2];
    ...
}

That optimization can only be performed, though, if the +1, +2, etc. are split from the 0x10000. Fortunately, there is already a target hook that indicates such an address split is desired: shouldConsiderGEPOffsetSplit. When that hook returns true, CodeGenPrepare.cpp adds the GEPs with large offsets to a list of GEPs to be split, and ::splitLargeGEPOffsets splits them, in a process clearly illustrated in that method's comments. Unfortunately, the split currently only occurs when the base and the GEP are in different BBs, since the DAGCombiner would just recombine those in the same BB anyway.
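Conceptually, the split boils down to peeling off the part of the offset that fits the target's load/store immediate. A minimal sketch, with a hypothetical helper name and hard-coded 12-bit range (the real ::splitLargeGEPOffsets works on GEP instructions and queries the target, so treat this as an illustration only):

```c
#include <stdint.h>
#include <assert.h>

/* Split a byte offset into a base adjustment plus a residue that fits a
 * 12-bit signed immediate ([-2048, 2047]), so the residue can be folded
 * into the load/store instruction itself. */
static void split_offset(int64_t offset, int64_t *base_part, int64_t *imm_part) {
    /* Sign-extend the low 12 bits of the offset. */
    int64_t imm = ((offset & 0xFFF) ^ 0x800) - 0x800;
    *imm_part = imm;
    *base_part = offset - imm; /* always has zero low 12 bits */
}
```

For the example above, 0x10001 splits into a base part of 0x10000 and an immediate of 1, which is exactly the shape the load/store can consume.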

This patch intends to:

  1. make the split also occur in the cases where the base and the GEP are in the same BB (that's often the case);
  2. ensure that the DAGCombiner doesn't reassociate them back again.

To achieve that second step, the patch adds a check before the reassociation of add instructions: it looks at whether the sum is used by loads or stores, and whether reassociating could break a reg+imm addressing mode for those loads/stores. This strategy seems to work, as shown in the tests.
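The gist of that check can be sketched as a toy predicate. The 12-bit range below is RISC-V-specific and the function name is made up; the actual DAGCombiner code asks the target via its addressing-mode hooks rather than hard-coding a range:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

static bool fits_imm12(int64_t v) { return v >= -2048 && v <= 2047; }

/* Reassociating (base + c1) + c2 into base + (c1 + c2) is harmful when
 * c2 on its own could be folded into the load/store immediate but the
 * combined constant c1 + c2 cannot. Toy model of the patch's check. */
bool reassociation_breaks_addr_mode(int64_t c1, int64_t c2) {
    return fits_imm12(c2) && !fits_imm12(c1 + c2);
}
```

So for the running example, c1 = 0x10000 and c2 = 1: the +1 folds into the load/store, but 0x10001 does not, and the reassociation should be blocked.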

A possible alternative would be to add a RISC-V specific pass to split the addresses, but solving this problem in a more generic fashion is probably preferable, as it avoids duplication of functionality and can benefit other targets.

(It might be possible to address https://bugs.llvm.org/show_bug.cgi?id=24447 by making the address mode checks more stringent for X86, etc.)

Diff Detail

Repository
rL LLVM

Event Timeline

luismarques created this revision.Apr 4 2019, 4:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2019, 4:42 PM
luismarques edited the summary of this revision. (Show Details)Apr 4 2019, 4:45 PM
luismarques edited the summary of this revision. (Show Details)Apr 4 2019, 4:48 PM

Should we care about the number of uses of N0 here? It should generally be "safe" to reassociate in cases where it only has one use.

Given that we're splitting the GEPs before SelectionDAG, the only way to preserve the optimization is to avoid re-merging them... and I can't think of any reasonable approach for that other than something along the lines of this patch. Well, I guess we could just completely disable folding without regard for whether the addressing mode is legal, but that probably interacts badly with type legalization.

The alternative, as you note, is to do something after isel. Arguably that would be more effective. At that point you have a better idea of what constants are actually necessary, and you could integrate it with other similar optimizations like optimizing related integer constants. This is sort of along the lines of RISCVMergeBaseOffset.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
980 ↗(On Diff #193789)

This computation of AccessTy is weird: it's supposed to be the type of the load, not the type of the pointer.

How you get the right access type is sort of awkward, of course... I guess you could traverse the use list.

asb added a comment.Apr 11 2019, 3:48 AM

Given that we're splitting the GEPs before SelectionDAG, the only way to preserve the optimization is to avoid re-merging them... and I can't think of any reasonable approach for that other than something along the lines of this patch. Well, I guess we could just completely disable folding without regard for whether the addressing mode is legal, but that probably interacts badly with type legalization.

The alternative, as you note, is to do something after isel. Arguably that would be more effective. At that point you have a better idea of what constants are actually necessary, and you could integrate it with other similar optimizations like optimizing related integer constants. This is sort of along the lines of RISCVMergeBaseOffset.

My feeling is that if there are no major objections to disabling the fold in this case (e.g. an undocumented requirement that adds with constants are always canonicalised as they currently are), then it seems worth adding logic like this to disable the re-merging, on the basis that 1) it's not a huge amount of code, and it serves to preserve work that's already done in CodeGenPrepare anyway, and 2) it's easier for other backends to make use of it. I agree there may later be scope for a pass that runs after ISel and is wider in scope. But for this transformation, it seems a shame to have a situation where CodeGenPrepare does the necessary analysis+transformation, the DAGCombiner undoes that work, and then a very similar analysis+transformation is done again on the SelectionDAG.

asb added inline comments.Apr 11 2019, 5:39 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
980 ↗(On Diff #193789)

Traversing the use list as callers to canFoldInAddressingMode do seems sensible. This function can identify the first load/store operation and use the type from that. Bail out if none of the uses are load/stores. By doing this we can also get the right address space. It might be worth exploring if we can just call canFoldInAddressingMode rather than replicating similar logic here.

lebedev.ri added a subscriber: lebedev.ri.

Passing-by comment: can/will this do anything to e.g. the testcase https://reviews.llvm.org/D59535#1458118 which resulted in the revert of D59535?
The problem being solved here seems /related/.

Not sure if it's relevant to what this patch is trying to do, but x86 has a similar example described here:
https://bugs.llvm.org/show_bug.cgi?id=24447

asb added a comment.Apr 11 2019, 6:26 AM

Not sure if it's relevant to what this patch is trying to do, but x86 has a similar example described here:
https://bugs.llvm.org/show_bug.cgi?id=24447

Thanks for the link. This has the potential to help in the case where the large constants were produced by getelementptrs. Except that the logic here and in CodeGenPrepare's splitgep routines checks isLegalAddressingMode, which may not be selective enough for X86, given that its addressing modes are so unrestricted.

Hello. This looks like an interesting patch. Thanks for working on it. I ran some numbers and on Thumb1 targets (where resources are generally very constrained) this looks like a nice improvement.

Things didn't look as good on Thumb2 (and AArch64), but that might be because something isn't tuned correctly, or something is just going wrong. I'll try and take a look (in the morning). Don't feel that that should block you here. I think something odd might be going on with floating point constants in soft-fp? I'm not sure yet.

Hello. This looks like an interesting patch. Thanks for working on it. I ran some numbers and on Thumb1 targets (where resources are generally very constrained) this looks like a nice improvement.
Things didn't look as good on Thumb2 (and AArch64), but that might be that something isn't tuned correctly, or something that's just going wrong. I'll try and take a look (in the morning). Don't feel that that should block you here, I think something odd might be going on with floating point constants in soft-fp? I'm not sure yet.

My reply through the email seems to have bounced, so I'm quoting it here:
Thanks for the feedback and encouragement. I'm finishing an improved version of the patch; it would be great if you could rerun those analyses on the updated version and share whether anything important changes. The new version properly checks whether the transformation would impact the loads/stores that use the constants. I can see that check having an impact on some real-world scenarios not covered by the existing tests.

The patch now checks whether doing the reassociation could break any of the loads/stores that use the added constants. The value type and address space are now obtained from the loads/stores. These changes slightly reduced the set of failing tests:

Failing Tests (5):

LLVM :: CodeGen/AMDGPU/salu-to-valu.ll
LLVM :: CodeGen/SystemZ/int-add-08.ll
LLVM :: CodeGen/SystemZ/int-sub-05.ll
LLVM :: CodeGen/ARM/misched-fusion-aes.ll
LLVM :: CodeGen/ARM/vector-spilling.ll
luismarques added a comment.EditedApr 13 2019, 7:14 AM

Should we care about the number of uses of N0 here? It should generally be "safe" to reassociate in cases where it only has one use.

Let's say we are doing *(x + 0x10000 + 2) and we don't use x + 0x10000 anywhere else (that's the N0, right?). In an arch like RISC-V, if we want to add 0x10002 to x, it takes two instructions to materialize the constant, while 0x10000 takes only one. We can avoid that second instruction by folding the +2 into the load/store. Am I thinking about this wrong?
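To make the cost difference concrete, here is a toy model of RISC-V constant materialization: LUI supplies the upper 20 bits and ADDI the sign-extended lower 12. Real codegen has more tricks (and this sketch assumes constants far from the int32 boundaries), so this is illustrative only:

```c
#include <stdint.h>
#include <assert.h>

/* How many instructions a RISC-V-like target needs to materialize a
 * 32-bit constant using only LUI and ADDI. Illustrative sketch. */
int materialization_cost(int32_t c) {
    int32_t lo = ((c & 0xFFF) ^ 0x800) - 0x800; /* sign-extended low 12 bits */
    int32_t hi = c - lo;                        /* what LUI must supply */
    if (hi == 0 || lo == 0) return 1;           /* ADDI alone or LUI alone */
    return 2;                                   /* LUI + ADDI */
}
```

Under this model 0x10000 costs one instruction and 0x10002 costs two, so folding the +2 into the load/store saves an instruction per distinct constant.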

piotr added a subscriber: piotr.Apr 15 2019, 1:03 AM

I wasn't thinking in terms of keeping the addressing mode legal, just avoiding destroying the work of constant hoisting. Constant hoisting won't split a constant in the way you're suggesting. And it's relatively easy to write patterns to split "load x+c" in the most efficient way if "c" has a single use. I guess the more restrictive version would allow splitting single-use constants as a DAGCombine, or earlier? Not sure why you'd want to, though.

This updates the patch to:

  1. Add a check for N0.hasOneUse(), as suggested by Eli Friedman;
  2. Change the shuffling indices of the ARM vector-spilling.ll test, to ensure the desired multi-register vector spills and restores are generated;
  3. Update the ARM misched-fusion-aes.ll AES fusion test checks to account for the new instruction scheduling. We still seem to have the desired number of fusable instructions, just in a different order.

Those two tests would remain broken even if you always gave the OK to reassociate in reassociationCanBreakAddressingModePattern. With these changes the remaining failing tests should be:

Failing Tests (3):

LLVM :: CodeGen/AMDGPU/salu-to-valu.ll
LLVM :: CodeGen/SystemZ/int-sub-05.ll
LLVM :: CodeGen/SystemZ/int-add-08.ll

I'll look more closely into those tests. If necessary we can tweak the reassociation gating for those.

luismarques added subscribers: uweigand, tstellar.

Fix the remaining failing unit tests (for AMDGPU and SystemZ).

Those test case changes represent an actual improvement here, so this looks good. Thanks!

test/CodeGen/SystemZ/int-add-08.ll
53 ↗(On Diff #196024)

It would be preferable to keep verifying the base register here, i.e.

lay [[BASE:%r[1-5]]], 524280(%1)
alg {{%r[0-5]}}, 8([[BASE]])
test/CodeGen/SystemZ/int-sub-05.ll
58 ↗(On Diff #196024)

Same here.

With rL358845 in, the remaining results I ran look good. Consider us out of your way.

luismarques retitled this revision from [DAGCombiner] [CodeGenPrepare] WIP/RFC Splitting large offsets from base addresses to [DAGCombiner] [CodeGenPrepare] Split large offsets from base addresses.
luismarques edited the summary of this revision. (Show Details)
  • Fixes a minor whitespace issue;
  • Updates the SystemZ tests to preserve the base register check;
  • Updates the summary to reflect the current patch/review status.

Thanks for the review comments!

luismarques marked 4 inline comments as done.Apr 24 2019, 3:55 AM

With rL358845 in, the remaining results I ran look good. Consider us out of your way.

Great! Thanks for looking into this.

uweigand added inline comments.Apr 24 2019, 5:28 AM
test/CodeGen/SystemZ/int-sub-05.ll
58 ↗(On Diff #196413)

Should be %r[1-5] here as well, register 0 cannot be used for address generation. Otherwise, the SystemZ changes LGTM now. Thanks!

luismarques edited the summary of this revision. (Show Details)
  • Fix SystemZ's int-sub-05.ll test register matching.
luismarques marked 2 inline comments as done.Apr 24 2019, 5:40 AM
luismarques added inline comments.
test/CodeGen/SystemZ/int-sub-05.ll
58 ↗(On Diff #196413)

Oops. Thanks!

uweigand added inline comments.Apr 24 2019, 12:19 PM
test/CodeGen/SystemZ/int-sub-05.ll
58 ↗(On Diff #196413)

Perfect, thanks!

luismarques marked an inline comment as done.
luismarques added a subscriber: arsenm.

Rebased on master.
@arsenm: could you please review the AMDGPU changes? They were a bit fiddly, so you may prefer to tweak the prefixes in a different way, etc.

asb added inline comments.Jun 5 2019, 10:05 PM
lib/CodeGen/CodeGenPrepare.cpp
4203 ↗(On Diff #201194)

This comment is now out-of-date, and needs updating.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1040 ↗(On Diff #201194)

Can you not check for a MemSDNode here, and avoid worrying about whether it's a LD or a ST?

test/CodeGen/RISCV/split-offsets-1.ll
3 ↗(On Diff #201194)

Should probably check RV64 too

4 ↗(On Diff #201194)

A comment should explain what this test is intending to check

test/CodeGen/RISCV/split-offsets-2.ll
1 ↗(On Diff #201194)

This doesn't need to be in a separate file to the other test

asb requested changes to this revision.Jun 5 2019, 10:05 PM
This revision now requires changes to proceed.Jun 5 2019, 10:05 PM

Addresses the remaining review concerns.

luismarques marked 5 inline comments as done.Jun 7 2019, 6:44 AM
asb accepted this revision.Jun 7 2019, 11:56 PM

Thanks, this looks good to me.

This revision is now accepted and ready to land.Jun 7 2019, 11:56 PM
This revision was automatically updated to reflect the committed changes.
craig.topper added inline comments.
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1076

Doing some archaeology.

Should this be checking the uses of the (add (add x, offset1), offset2) expression? It seems to be checking the uses of (add x, offset1).

luismarques added inline comments.Apr 19 2022, 7:43 AM
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1076

It seems to be checking the uses of (add x, offset1).

Yes, you're right.

Should this be checking the uses of the (add (add x, offset1), offset2) expression?

Yes, I think that would actually be the correct thing to do. It seems that, because the (add x, offset1) was also being used in the loads/stores, in practice this worked well enough for the test coverage we had (not sure about the real world). In fact, I think that if you change it to check the uses of the proper value, you won't see any changes at all in the CodeGen tests, for all targets. If you're planning to fix this, can you see if you can add a test case that shows the difference in behavior?

Thanks for spotting this issue, Craig! Sorry about the bug.