Page MenuHomePhabricator

[RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A
ClosedPublic

Authored by asb on Jun 7 2018, 7:17 AM.

Details

Summary

Introduce a new RISCVExpandPseudoInsts pass to expand atomic pseudo-instructions after register allocation. This is necessary in order to ensure that register spills aren't introduced between LL and SC, thus breaking the forward progress guarantee for the operation. AArch64 does something similar for CmpXchg (though only at O0), and Mips is moving towards this approach (see D31287). See also this mailing list post from James Knight, which summarises the issues with lowering to ll/sc in IR or pre-RA.

See the accompanying RFC thread for an overview of the lowering strategy.

Diff Detail

Repository
rL LLVM

Event Timeline

asb created this revision.Jun 7 2018, 7:17 AM
asb removed a reviewer: javed.absar.Jun 7 2018, 7:18 AM

The problem with min and max is that you must not have a branch within the ll/sc loop. But, they can be done branchless with not too many instructions, so ISTM we should just implement them that way.

I think the following will work -- it computes b ^ ((a^b) & -(a < b)). The & -(a < b) either returns the LHS (if comparison is true) or returns 0 (if comparison is false), so you either get b^a^b, or b.

min(a0, a1) -> a0:

slt     a2, a0, a1
neg     a2, a2
xor     a0, a0, a1
and     a0, a0, a2
xor     a0, a1, a0

And I think just use sltu instead of slt for umin/umax, and reverse the args of the slt for max/umax.

lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
179 ↗(On Diff #150330)

I'd note that all of the code which is OUTSIDE the ll/sc loop doesn't really need to be implemented within this pseudo. I'm not sure if it's worthwhile extracting that or not, but it could theoretically have benefit, due to being able to potentially reuse e.g. the address math.

Have you considered making such a split?

lib/Target/RISCV/RISCVTargetMachine.cpp
104 ↗(On Diff #150330)

Is it okay to do this pre-scheduling? Are we guaranteed that the scheduler won't do something invalid here? (I'm not sure the set of transforms it's capable of at the moment.)

asb added a comment.Jun 8 2018, 5:29 AM

The problem with min and max is that you must not have a branch within the ll/sc loop. But, they can be done branchless with not too many instructions, so ISTM we should just implement them that way.

I think the following will work -- it computes b ^ ((a^b) & -(a < b)). The & -(a < b) either returns the LHS (if comparison is true) or returns 0 (if comparison is false), so you either get b^a^b, or b.

I did consider the branchless approach. However forward branches _are_ part of the architectural guarantee of LR/SC forward progress:

In the standard A extension, certain constrained LR/SC sequences are guaranteed to succeed
eventually. The static code for the LR/SC sequence plus the code to retry the sequence in case
of failure must comprise at most 16 integer instructions placed sequentially in memory. For the
sequence to be guaranteed to eventually succeed, the dynamic code executed between the LR and
SC instructions can only contain other instructions from the base “I” subset, excluding loads, stores,
backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM instructions. The
code to retry a failing LR/SC sequence can contain backward jumps and/or branches to repeat the
LR/SC sequence, but otherwise has the same constraints. The SC must be to the same address
and of the same data size as the latest LR executed. LR/SC sequences that do not meet these
constraints might complete on some attempts on some implementations, but there is no guarantee
of eventual success.

lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
179 ↗(On Diff #150330)

Yes, I've thought about this quite a bit. The ideal is to expand everything other than the LL+SC+branch at the IR level. I had been thinking about a new inlineasm-like 'black box' construct that would be expanded by target-dependent code only at the very last moment, when lowering to MCInst. But when I started sketching it out in more detail with the aim of incorporating it as an option in an RFC on atomics lowering I started thinking that there's not much benefit in this versus expanding pseudo-instructions in preEmitPass2. Targets can guarantee that runs after everything else that might cause problems (such as the machine outliner). I'll play with this some more.

lib/Target/RISCV/RISCVTargetMachine.cpp
104 ↗(On Diff #150330)

preEmitPass2 seems the ideal point. Although the machine outliner could be taught to leave instructions between LLC/SC alone, it's probably better to ensure that this lowering happens at the latest possible point.

In D47882#1126302, @asb wrote:

can only contain other instructions from the base “I” subset, excluding loads, stores,
backward jumps or taken backward branches,

Oh, nice! That changed since the last time I read it in 2016; it used to prohibit -all- taken branches, not just backwards ones.

lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
179 ↗(On Diff #150330)

I agree -- adding an intrinsic, say "llvm.riscv.atomicrmw.masked.llsc.loop", and then teaching AtomicExpandPass to emit that for the inner loop, seems like it should be fine without adding any new concepts to LLVM.

That's pretty easy for atomicrmw, but not so much for cmpxchg, if you want to avoid extra conditionals and jumps on the way out and also support the separate success/failure fences. But this review is for atomicrmw, so we can ignore that for now. (But: there's already work ongoing to allow for "calls" which have multiple successors, which seems like it'll help to fix that issue once it lands.)

I'd be fine with you landing this, and going back and rework it to expand the prefix/suffix in AtomicExpandPass in followup changes.

asb updated this revision to Diff 151167.Jun 13 2018, 8:05 AM
asb edited the summary of this revision. (Show Details)

Updates:

  • Expand masked AMOs to IR for address calculation and masking + an intrinsic for the core LL/SC loop. Introduce a new hook in TargetLowering to allow this
  • Support for part-word atomicrmw max/min/umax/umin
  • Use MaskedMerge rather than calculating an inverse mask
  • Expand atomicrmw sub i32 to (AMOADD_W GPR:$addr, (SUB X0, GPR:$incr)) with appropriate AQ/RL bits

If there is enthusiasm from other targets, it may be worth moving more of the logic in RISCVISelLowering::emitAtomicRMW to AtomicExpandPass. As evidenced in the follow-up patch which supports 8/16-bit bitwise AMOs with 32-bit AMOs, it is useful to retain the ability to provide custom expansion logic.

I've authored an RFC which details the strategy which I'll post as soon as I've completed uploading the patch-set.

asb edited the summary of this revision. (Show Details)Jun 13 2018, 8:46 AM
asb updated this revision to Diff 159859.Aug 8 2018, 11:46 PM

The main update in this patch revision is moving more of the IR generation for the masked atomic operation to AtomicExpandPass, reusing the createMaskInstrs helper used by expandPartWordAtomicRMW and expandPartWordCmpXchg. We now introduce AtomicExpansionKind::MaskedIntrinsic rather than AtomicExpansionKind::Custom.

No other backend is forced to move towards this lowering strategy, but it should now be possible to do so with minimal code duplication.

jyknight added inline comments.Aug 9 2018, 7:56 AM
include/llvm/CodeGen/TargetLowering.h
1561 ↗(On Diff #159859)

s/Custom/Masked/

lib/CodeGen/AtomicExpandPass.cpp
888 ↗(On Diff #159859)

For the OR, XOR, and AND operations, there's no need to implement or use a masked intrinsic at all -- the normal 32bit atomic will do the job fine given ValOperand_Shifted as the argument.

For the masked AND, you do need to pass "ValOperand_Shifted | PMV.Inv_Mask" instead, of course.

(I note that I neglected to implement that optimization for AND in the cmpxchg expansion in expandPartwordAtomicRMW.)

lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
120 ↗(On Diff #159859)

It's not a big deal, but this function and others like it make me think it'd be better to just have one instruction for LR and one for SC, with an argument for the atomic ordering. I assume there's some reason that was not doable?

204 ↗(On Diff #159859)

Nice -- the sequence in AtomicExpandPass::performMaskedAtomicOp should be updated to do the same thing, I think!

379 ↗(On Diff #159859)

Doesn't this need to sign-extend the values prior to doing a signed comparison?

asb added inline comments.Aug 16 2018, 2:17 AM
lib/CodeGen/AtomicExpandPass.cpp
888 ↗(On Diff #159859)

Yes I recognise this, and D48129 implements the three optimisations you suggest for RISC-V. I've now reworked that patch so it is target-independent and the same i8/i16 -> i32 transformation logic is used regardless of the AtomicExpansionKind. Let me know what you think,

It probably makes sense to rebase this patch so it is dependent on D48129.

asb updated this revision to Diff 160994.Aug 16 2018, 3:43 AM
asb marked 5 inline comments as done.
asb edited the summary of this revision. (Show Details)

Rebase on top of D48129 and update to address most review comments.

TODO: Look again at signed atomicrmw min/max.

lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
120 ↗(On Diff #159859)

Instruction definitions in the RISC-V backend follow the instruction encoding, meaning they are easy to inspect for correctness with respect to the ISA manual and the MC layer support is trivial. We could define a RISCV::PseudoLR that took AtomicOrdering and have logic that selected the appropriate AQ/RL bits, but we'd still be implementing something equivalent to this switch - just moving it elsewhere.

204 ↗(On Diff #159859)

I've added a TODO to performMaskedAtomicOp

One comment left unresolved, re sign-extend -- patch LG other than that!

lib/CodeGen/AtomicExpandPass.cpp
888 ↗(On Diff #159859)

Sorry, I totally forgot about that other patch! Yes, it makes more sense to structure this way around. :)

lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
120 ↗(On Diff #159859)

Nah, it's fine, not a big deal.

asb updated this revision to Diff 162175.Aug 23 2018, 7:02 AM
asb marked an inline comment as done.

Update to sign-extends the values for signed min/max, which is the last known issue with this patch.

Happy to commit?

asb added a comment.Aug 23 2018, 9:06 AM

Actually, hold-off on reviewing. I don't think the logic for sign extension is correct when the target memory location isn't 32-bit aligned. Sorry for the noise. I'll add look again and add more tests to cover this.

It's a little tempting to just fail for [u]min/max of i8/i16 given Clang never generates them, but we've got this far now...

asb updated this revision to Diff 165968.Sep 18 2018, 7:12 AM

The patch has been updated so sign-extension is performed correct for part-word signed min/max operations.

All known issues are resolved. Final review would be very welcome.

jyknight accepted this revision.Sep 18 2018, 8:58 AM

Final update looks good as far as I can tell.

Ship it! :)

This revision is now accepted and ready to land.Sep 18 2018, 8:58 AM
This revision was automatically updated to reflect the committed changes.