
[ARM] Prevent spilling between ldrex/strex pairs

Authored by tmatheson on May 5 2021, 5:15 AM.



Based on the same for AArch64: 4751cadcca45984d7671e594ce95aed8fe030bf1

At -O0, the fast register allocator may insert spills between the ldrex and
strex instructions inserted by AtomicExpandPass when expanding atomicrmw
instructions in LL/SC loops. To avoid this, expand to cmpxchg loops and
therefore expand the cmpxchg pseudos after register allocation.
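The cmpxchg-based expansion described above can be modelled in C (a conceptual sketch using GCC/Clang `__atomic` builtins, not the backend's actual output; the function name is illustrative):

```c
#include <stdint.h>

/* Conceptual model of expanding `atomicrmw add` via a compare-exchange
 * loop: the read-modify-write is retried until the compare-exchange
 * succeeds. Because the whole exchange is a single cmpxchg pseudo that
 * is expanded after register allocation, the allocator can no longer
 * insert spills between the exclusive load and store. */
static uint32_t atomicrmw_add_via_cmpxchg(uint32_t *p, uint32_t v) {
    uint32_t old = __atomic_load_n(p, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(p, &old, old + v,
                                        1 /* weak */,
                                        __ATOMIC_SEQ_CST,
                                        __ATOMIC_RELAXED)) {
        /* On failure, `old` is refreshed with the current value; retry. */
    }
    return old;  /* atomicrmw yields the previous value */
}
```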

Required a tweak to ARMExpandPseudo::ExpandCMP_SWAP to use the 4-byte encoding
of UXT, since the pseudo instruction can be allocated a high register (R8-R15)
which the 2-byte encoding doesn't support.
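For context on the encoding constraint: the 16-bit Thumb encodings of UXTB/UXTH have 3-bit register fields and so can only name the low registers R0-R7, while the 32-bit (t2) encodings have 4-bit fields covering R0-R15. A simplified sketch of the check (`fits_narrow_uxt` is a hypothetical helper, not LLVM code):

```c
#include <stdbool.h>

/* The 16-bit Thumb UXTB/UXTH encodings can only encode R0-R7, so an
 * operand allocated a high register forces the 32-bit encoding. */
static bool fits_narrow_uxt(unsigned reg) {
    return reg <= 7;  /* low registers only */
}
```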

The previously committed attempt in D101164 had to be reverted due to runtime
failures in the test suites. Rather than spending time fixing that
implementation (which added another implementation of atomic operations and
more divergence between backends), I have chosen to follow the approach taken
for AArch64 instead.

Compared to D101164, this patch also fixes the problem for Thumb 2.

Depends on D101912

Diff Detail

Event Timeline

tmatheson created this revision.May 5 2021, 5:15 AM
tmatheson requested review of this revision.May 5 2021, 5:15 AM
Herald added a project: Restricted Project.May 5 2021, 5:15 AM
lenary added a comment.May 5 2021, 7:05 AM

Please can you pre-commit atomicrmw_exclusive_monitor_all.ll, so we can see the changes this commit makes to those code sequences?

LemonBoy added inline comments.May 5 2021, 7:08 AM

Nit, guard this line with if (UxtOp == t2UXTB) as already done for the ldrex/strex ops below.

tmatheson updated this revision to Diff 343055.May 5 2021, 8:20 AM
  • Add more triples to the test (Thumb1/Thumb2)
  • Remove tests for floating point atomics. They tend to be bitcast to ints by the frontend and the support for them is patchy across architecture versions.
  • Rebase on top of pre-commit test in D101912
tmatheson added inline comments.May 5 2021, 8:28 AM

All of the currently used variants (t2UXTB, UXTB, t2UXTH, UXTH) expect an immediate here, unlike the strex/ldrex instructions, of which only t2LDREX and t2STREX accept an offset. So I don't think a guard is necessary.

tmatheson updated this revision to Diff 343065.May 5 2021, 8:57 AM

Minor change to trigger rebuild

tmatheson updated this revision to Diff 343710.May 7 2021, 10:05 AM

Minimize the CHECK lines in the test

Really, using AtomicExpansionKind::LLSC at all on ARM/AArch64 is fragile; long-term, I think it still makes sense to pursue using pseudo-instructions. But this is clearly an improvement.

I'm a little surprised that ARM::CMP_SWAP_8/ARM::CMP_SWAP_16 generating broken instructions hasn't caused any issues before.


I don't think we can use ARM::t2UXTB here unconditionally. Specifically, I'm concerned about the armv8m.base target, which has ARM::t2LDREXB/ARM::t2STREXB, but not ARM::t2UXTB.

Instead, can we add a separate ARM::tCMP_SWAP_8 instruction, with an appropriate register allocation restriction?

  • Add tests for v8m.baseline
  • Add two new pseudo instructions tCMP_SWAP_8 and tCMP_SWAP_16 to limit register for UXT instruction
  • Add some asserts in ExpandCMP_SWAP
efriedma added inline comments.May 11 2021, 11:05 AM

The text of this assertion seems wrong; the point of the assertion is that ARMv8-M.baseline doesn't have t2UXTB/t2UXTH.


isThumb() includes both Thumb1 (here, v8m.base) and Thumb2. That's fine, but if you're going to do that, please remove the dead Thumb2 codepaths from ARMExpandPseudoInsts.

Remove dead code paths and fix assert string

This revision is now accepted and ready to land.May 11 2021, 12:05 PM

Thanks for the feedback!

efriedma added inline comments.Thu, May 27, 2:26 PM

Just realized I wasn't looking carefully enough when I reviewed this...

In the case where we're generating a call to the runtime library, we want to call __sync_fetch_and_sub_8, not __sync_val_compare_and_swap_8. Same for other similar atomicrmw operations. I mean, it's not semantically wrong the way it is, but using __sync_val_compare_and_swap_8 is worse for codesize.

I think the way to handle this is to switch around the logic to return AtomicExpansionKind::None instead of AtomicExpansionKind::CmpXChg in this case.
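To illustrate the codesize point: lowering `atomicrmw sub` through a compare-and-swap loop around `__sync_val_compare_and_swap` is semantically equivalent to a single `__sync_fetch_and_sub` call, but it emits a retry loop per operation. A rough C model (a sketch using a 32-bit operand for portability; the function name is illustrative):

```c
#include <stdint.h>

/* What a cmpxchg-based lowering amounts to at the libcall level:
 * a retry loop around the compare-and-swap builtin, even though a
 * single fetch-and-sub libcall would produce the same result. */
static uint32_t fetch_and_sub_via_cas(uint32_t *p, uint32_t v) {
    uint32_t old, seen;
    do {
        old = *(volatile uint32_t *)p;
        seen = __sync_val_compare_and_swap(p, old, old - v);
    } while (seen != old);  /* another thread raced us; retry */
    return old;  /* same result as __sync_fetch_and_sub(p, v) */
}
```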