Page MenuHomePhabricator

[ARM] Improve codegen of volatile load/store of i64
Needs ReviewPublic

Authored by vhscampos on Nov 11 2019, 3:53 AM.



Instead of generating two i32 instructions for each load or store of a volatile
i64 value (two LDRs or STRs), now try to emit wide loads or stores.

The CodeGen in this patch attempts to produce LDRD/STRD instructions for
these wide memory accesses. However, this is not always possible because
these instructions require an even/odd, consecutive pair of register

After register allocation, if this requirement is not met, the ARM
Load/Store Optimizer transforms ill-formed LDRD/STRD into LDM/STM

Do note that the requirement above is only present in AArch32. In
Thumb-2, the two register operands of LDRD/STRD are not restrained to be a
consecutive pair of registers.

These improvements cover architectures implementing ARMv5TE or Thumb-2.

Event Timeline

vhscampos created this revision.Nov 11 2019, 3:53 AM

The architecture specification provides limited guarantees here, but I guess this doesn't do any harm.


I'd prefer not to exclude extending loads here. Could lead to weird cases where we miss the transform.


Please don't make MachineSDNodes this early; it might appear to mostly work, but other code is not expecting MachineSDNodes at this point. See ARMISelLowering.h for how to introduce a target-specific SDNode.

vhscampos updated this revision to Diff 229043.Nov 13 2019, 2:34 AM
  1. Not exclude extloads anymore.
  2. Create new ARMISD nodes specific to load/store of dual registers.
  3. Custom lower i64 volatile loads/stores to these new ARMISD nodes.
efriedma added inline comments.Mon, Nov 18, 5:52 PM

You should be able to use TableGen patterns for these, I think. AArch64Prefetch is an example of something similar to what you want.


You still need to *handle* extending loads somehow...

Actually, maybe we don't mess with volatile loads in DAGCombine, and you don't need to implement it. In that case, it would still be nice to have an assertion, in case someone changes it at some point.


We also don't want to restrict truncating stores.


Loads and stores should not have an "Offset" at this point; we don't form pre/post-indexed operations until after legalization. Maybe worth asserting "isUnindexed()". (Same for loads.)

dmgreen added inline comments.Tue, Nov 19, 4:16 AM

Likewise, this can just say if (Subtarget->hasMVEIntegerOps() && (VT == MVT::v4i1 || VT == MVT::v8i1 || VT == MVT::v16i1))

The other cases (isTruncatingStore and isUnindexed) should not come up. If they do we should noisily fail with an assert.


How come this is altered, but not the LowerPredicateLoad?

vhscampos marked an inline comment as done.Tue, Nov 19, 5:16 AM
vhscampos added inline comments.

The custom lowering of loads and stores here is triggered by the DAG Type Legalizer, since i64 is not supported.

In DAGTypeLegalizer::CustomLowerNode(), custom lowering of loads is directed to ARMTargetLowering::ReplaceNodeResults(), which then calls LowerLOAD(), created in the present patch. The lowering of stores is the one that is directed to ARMTargetLowering::LowerOperation().

In summary, the custom lowering of loads because of illegal result types does not go through here, so I believe there's no need to have it changed in this point.

vhscampos updated this revision to Diff 230276.Wed, Nov 20, 8:46 AM
  1. Move the custom SD nodes to TableGen.
  2. Truncating stores not restricted anymore.
  3. Remove isUnindexed() calls since they should return true always at this point.
  4. Extend test to also check loads/stores that have an offset.
vhscampos marked an inline comment as done.Wed, Nov 20, 8:49 AM
vhscampos added inline comments.

Your last comment on this wasn't clear to me. If after the latest change you still want me to add anything, please let me know.

efriedma added inline comments.Thu, Nov 21, 4:34 PM

Are you sure this range computation is right? The range should be multiples of 4 from -1020 to 1020.

vhscampos updated this revision to Diff 230622.Fri, Nov 22, 3:22 AM

Fixed the immediate range check to cover -1020 to 1020.

dmgreen added inline comments.Wed, Nov 27, 7:59 AM

It may be worth adding tests for cases that are close to or just over the boundary.

vhscampos updated this revision to Diff 233337.Wed, Dec 11, 5:47 AM
  1. Update summary to have a better explanation of this patch.
  2. Add a post-ISel hook to add register allocation hints to LDRD/STRD operands.
  3. In the AArch32 case, move ISel from TableGen back to the C++ side. This is needed because we must have a custom lowering whenever LDRD/STRD selection would normally yield a register offset. The ARM Load/Store Optimizer is not able to handle LDRD/STRD's register offsets in the cases where LDRD/STRD must be reverted to LDM/STM. As such, the C++ instruction selection opts for not generating instructions with a register offset.
  4. Improve test by testing several immediate boundary cases.
vhscampos edited the summary of this revision. (Show Details)Wed, Dec 11, 5:48 AM
efriedma added inline comments.Thu, Dec 12, 10:28 AM

Oh, I didn't notice this before.

There's a problem here: if we're trying to generate LDRD for the sake of the extra guarantees provided by the some targets, transforming that to LDM is wrong; it doesn't have the same guarantee, and therefore could cause unpredictable, subtle problems.

Probably we want to allocate a GPRPair, instead of allocating two separate registers and trying to tie them together with a hint. Maybe requires defining a new pseudo-instruction that takes a GPRPair instead of two GPRs.

Or I'd be okay with just restricting the optimization to Thumb2 for now, if you don't want to do the extra work right now.