This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Improve codegen of volatile load/store of i64
ClosedPublic

Authored by vhscampos on Nov 11 2019, 3:53 AM.

Details

Summary

Instead of generating two i32 instructions for each load or store of a volatile
i64 value (two LDRs or STRs), now emit LDRD/STRD.

These improvements cover architectures implementing ARMv5TE or Thumb-2.

The code generation explicitly deviates from using the register-offset
variant of LDRD/STRD. In this variant, the register allocated to the
register-offset cannot be reused in any of the remaining operands. Such
restriction seems to be non-trivial to implement in LLVM, thus it is
left as a to-do.

Event Timeline

vhscampos created this revision.Nov 11 2019, 3:53 AM

The architecture specification provides limited guarantees here, but I guess this doesn't do any harm.

llvm/lib/Target/ARM/ARMISelLowering.cpp
9098

I'd prefer not to exclude extending loads here. Could lead to weird cases where we miss the transform.

9112

Please don't make MachineSDNodes this early; it might appear to mostly work, but other code is not expecting MachineSDNodes at this point. See ARMISelLowering.h for how to introduce a target-specific SDNode.

vhscampos updated this revision to Diff 229043.Nov 13 2019, 2:34 AM
  1. Not exclude extloads anymore.
  2. Create new ARMISD nodes specific to load/store of dual registers.
  3. Custom lower i64 volatile loads/stores to these new ARMISD nodes.
efriedma added inline comments.Nov 18 2019, 5:52 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
9156

Loads and stores should not have an "Offset" at this point; we don't form pre/post-indexed operations until after legalization. Maybe worth asserting "isUnindexed()". (Same for loads.)

efriedma added inline comments.Nov 18 2019, 5:52 PM
llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
3727

You should be able to use TableGen patterns for these, I think. AArch64Prefetch is an example of something similar to what you want.

llvm/lib/Target/ARM/ARMISelLowering.cpp
9098

You still need to *handle* extending loads somehow...

Actually, maybe we don't mess with volatile loads in DAGCombine, and you don't need to implement it. In that case, it would still be nice to have an assertion, in case someone changes it at some point.

9144

We also don't want to restrict truncating stores.

dmgreen added inline comments.Nov 19 2019, 4:16 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
9158

Likewise, this can just say if (Subtarget->hasMVEIntegerOps() && (VT == MVT::v4i1 || VT == MVT::v8i1 || VT == MVT::v16i1))

The other cases (isTruncatingStore and isUnindexed) should not come up. If they do we should noisily fail with an assert.

9353

How come this is altered, but not the LowerPredicateLoad?

vhscampos marked an inline comment as done.Nov 19 2019, 5:16 AM
vhscampos added inline comments.
llvm/lib/Target/ARM/ARMISelLowering.cpp
9353

The custom lowering of loads and stores here is triggered by the DAG Type Legalizer, since i64 is not supported.

In DAGTypeLegalizer::CustomLowerNode(), custom lowering of loads is directed to ARMTargetLowering::ReplaceNodeResults(), which then calls LowerLOAD(), created in the present patch. The lowering of stores is the one that is directed to ARMTargetLowering::LowerOperation().

In summary, the custom lowering of loads because of illegal result types does not go through here, so I believe there's no need to have it changed in this point.

vhscampos updated this revision to Diff 230276.Nov 20 2019, 8:46 AM
  1. Move the custom SD nodes to TableGen.
  2. Truncating stores not restricted anymore.
  3. Remove isUnindexed() calls since they should return true always at this point.
  4. Extend test to also check loads/stores that have an offset.
vhscampos marked an inline comment as done.Nov 20 2019, 8:49 AM
vhscampos added inline comments.
llvm/lib/Target/ARM/ARMISelLowering.cpp
9098

Your last comment on this wasn't clear to me. If after the latest change you still want me to add anything, please let me know.

efriedma added inline comments.Nov 21 2019, 4:34 PM
llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
1304

Are you sure this range computation is right? The range should be multiples of 4 from -1020 to 1020.

vhscampos updated this revision to Diff 230622.Nov 22 2019, 3:22 AM

Fixed the immediate range check to cover -1020 to 1020.

dmgreen added inline comments.Nov 27 2019, 7:59 AM
llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
1304

It may be worth adding tests for cases that are close to or just over the boundary.

vhscampos updated this revision to Diff 233337.Dec 11 2019, 5:47 AM
  1. Update summary to have a better explanation of this patch.
  2. Add a post-ISel hook to add register allocation hints to LDRD/STRD operands.
  3. In the AArch32 case, move ISel from TableGen back to the C++ side. This is needed because we must have a custom lowering whenever LDRD/STRD selection would normally yield a register offset. The ARM Load/Store Optimizer is not able to handle LDRD/STRD's register offsets in the cases where LDRD/STRD must be reverted to LDM/STM. As such, the C++ instruction selection opts for not generating instructions with a register offset.
  4. Improve test by testing several immediate boundary cases.
vhscampos edited the summary of this revision. (Show Details)Dec 11 2019, 5:48 AM
efriedma added inline comments.Dec 12 2019, 10:28 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
10859

Oh, I didn't notice this before.

There's a problem here: if we're trying to generate LDRD for the sake of the extra guarantees provided by the some targets, transforming that to LDM is wrong; it doesn't have the same guarantee, and therefore could cause unpredictable, subtle problems.

Probably we want to allocate a GPRPair, instead of allocating two separate registers and trying to tie them together with a hint. Maybe requires defining a new pseudo-instruction that takes a GPRPair instead of two GPRs.

Or I'd be okay with just restricting the optimization to Thumb2 for now, if you don't want to do the extra work right now.

vhscampos edited the summary of this revision. (Show Details)Dec 16 2019, 6:29 AM
vhscampos updated this revision to Diff 234045.Dec 16 2019, 6:30 AM

Create ARM PseudoInsts that take a register pair operand. This way we can enforce the allocation requirement.

efriedma added inline comments.Dec 17 2019, 4:33 PM
llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
3549

I think you could implement this pattern in TableGen (grep for REG_SEQUENCE in the ARM .td files). But probably not the corresponding load pattern, since the inverse opcode doesn't exist, so not a big deal either way.

llvm/lib/Target/ARM/ARMInstrInfo.td
2704

hasExtraDefRegAllocReq shouldn't be necessary; the regular ldrd/strd are weird because the register allocation constraint isn't expressed correctly by the operand types, but you don't have that problem here.

vhscampos updated this revision to Diff 234493.Dec 18 2019, 2:46 AM

Changed STOREDUAL instruction selection to use TableGen patterns.
Removed one constraint from LOADDUAL and STOREDUAL.

This revision is now accepted and ready to land.Dec 18 2019, 2:26 PM
This revision was automatically updated to reflect the committed changes.

Hi Victor,

this commit breaks "LLVM::2010-05-03-OriginDIE.ll" test on "llvm-clang-win-x-armv7l" builder.
Here is the failed build http://lab.llvm.org:8011/builders/llvm-clang-win-x-armv7l/builds/1851

Would you take a look?

Thanks,
Vlad.

I created a new patch fixing the issue: https://reviews.llvm.org/D71749

Hi Victor, we're observing a crash in Clang when compiling the Linux kernel bisected to this commit. Can you please revert?
See the trace in https://ci.linaro.org/job/tcwg_kernel-bisect-llvm-master-arm-next-allmodconfig/62/artifact/artifacts/build-bbcf1c3496ce2bd1ed87e8fb15ad896e279633ce/console.log grep for stack dump.

vhscampos reopened this revision.Dec 20 2019, 10:13 AM

I have reverted this commit.

Also, this revision was reopened so that I can append its fix here for reviewing.

This revision is now accepted and ready to land.Dec 20 2019, 10:13 AM
vhscampos updated this revision to Diff 235844.Jan 2 2020, 3:26 AM

This is the same patch as before but with fixes for the tests that regressed (as reported in the comments here).

The fix is to specify the address mode of the two pseudo instructions introduced.

nickdesaulniers accepted this revision.Jan 6 2020, 11:33 AM

Thanks @vhscampos , I tested the new patch and could no longer reproduce the observed crashes. I'm not sure if I'm looking at the interdiff correctly, but consider adding additional test cases that properly describe and cover the breakage we observed.

vhscampos updated this revision to Diff 236557.Jan 7 2020, 5:13 AM

Added one test to cover the issue related to loads/stores to a stack frame.

This revision was automatically updated to reflect the committed changes.

@vhscampos sorry, we're getting new/different warnings now seemingly with this patch: https://github.com/ClangBuiltLinux/linux/issues/838

Warning: index register overlaps transfer register

Apparently the ARM-mode LDRD is a bit more strange than I realized. From the ARM manual: if t2 == 15 || m == 15 || m == t || m == t2 then UNPREDICTABLE;. I guess we're managed to avoid running into this in the past by never generating the register form of ldrd.

It should be possible to express this constraint to the register allocator using @earlyclobber. (@earlyclobber is actually a little more conservative than we need, strictly speaking, but the difference probably doesn't matter too much.)

@nickdesaulniers @efriedma @nathanchance Apologies for missing the latest comments! Since this seems to be a blocker, I'd suggest that this change gets reverted until I am able to have a closer look at the issue of register overlapping. What do you think?

Please revert; I am happy to test a new revision to make sure there are no warnings but I don’t want this shipped in clang-10 and a revert is something that we can easily backport unless you can come up with a fix rather quickly.

Reverted in the master branch.

vhscampos reopened this revision.Feb 8 2020, 5:27 AM
This revision is now accepted and ready to land.Feb 8 2020, 5:27 AM
vhscampos planned changes to this revision.Feb 8 2020, 5:28 AM
hans added a subscriber: hans.Feb 10 2020, 2:17 AM

Reverted in the master branch.

I see it was also reverted on the release/10.x branch 7996b49053f0508717f4a081d197ddc3073f4b5f. Thanks for keeping the branch in mind, but as mentioned in http://lists.llvm.org/pipermail/llvm-dev/2020-January/138295.html please check with me before pushing directly to it.

Changes:

  1. Explictly avoid using the register-offset variant of LDRD/STRD. This variant has a limitation on register allocation: the register allocated to the register-offset cannot be reused in any of the remaining operands. I could not find an easy way to implement this in LLVM, so I left it as a to-do in the future.
  2. Instruction selection of STRD was moved from TableGen to C++ because of point (1).
  3. Updated tests to reflect these changes.
This revision is now accepted and ready to land.Mar 10 2020, 10:09 AM
vhscampos requested review of this revision.Mar 10 2020, 10:10 AM
vhscampos edited the summary of this revision. (Show Details)

Can I please have this reviewed again? I have addressed the issues reported.

nickdesaulniers accepted this revision.Mar 10 2020, 1:05 PM

Boot tested a clang built arm32 linux kernel in QEMU with this patch applied. Thanks for following up with fixes.

This revision is now accepted and ready to land.Mar 10 2020, 1:05 PM
efriedma accepted this revision.Mar 10 2020, 1:29 PM

LGTM with one minor comment.

llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
3525

The RegOffset check could use a comment explaining what it's doing, here and for STRD.

vhscampos updated this revision to Diff 249573.Mar 11 2020, 3:14 AM
vhscampos edited the summary of this revision. (Show Details)

Added comment to explain the fallback to non-register-offset variants.

This revision was automatically updated to reflect the committed changes.
vhscampos reopened this revision.May 27 2020, 7:56 AM
This revision is now accepted and ready to land.May 27 2020, 7:56 AM
vhscampos updated this revision to Diff 266545.May 27 2020, 7:56 AM

Improve the testcase which exercises loads and stores from stack. Now, wrong frame index replacements will be caught here.

vhscampos requested review of this revision.May 27 2020, 8:01 AM
This revision is now accepted and ready to land.May 27 2020, 4:10 PM
This revision was automatically updated to reflect the committed changes.