This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Reimplement the 1-byte compare-and-swap logic
ClosedPublic

Authored by jonpa on Feb 26 2021, 6:19 PM.

Details

Summary

Even though the implementation in emitAtomicCmpSwapW() was correct, it made Valgrind report an error.

Instead of using a RISBG on CmpVal, an LL[CH]R can be applied to OldVal, which avoids the problem.
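A minimal sketch of what this change means inside emitAtomicCmpSwapW() (only the compare sequence is shown; variable names such as OldValRot and ZExtDest are placeholders, not the exact patch):

  // Old: rotate/mask CmpVal with RISBG so that only the field bits take
  // part in the compare; Valgrind flags this because the high bits of
  // CmpVal may be uninitialized.
  // New: zero-extend the rotated OldVal instead, then compare directly.
  unsigned ZExtOpc = (BitSize == 8) ? SystemZ::LLCR : SystemZ::LLHR;
  BuildMI(MBB, DL, TII->get(ZExtOpc), ZExtDest).addReg(OldValRot);
  BuildMI(MBB, DL, TII->get(SystemZ::CR)).addReg(ZExtDest).addReg(CmpVal);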

CmpVal: this should not need an LL[HC]R, as it should already be zero-extended even in the non-constant case, should it?

Test updating has only just begun...

Diff Detail

Event Timeline

jonpa created this revision.Feb 26 2021, 6:19 PM
jonpa requested review of this revision.Feb 26 2021, 6:19 PM
Herald added a project: Restricted Project.Feb 26 2021, 6:19 PM

CmpVal: this should not need an LL[HC]R, as it should already be zero-extended even in the non-constant case, should it?

Not necessarily. Our ABI does require that "char" and "short" parameters and return values be extended, but that can be either a zero- or a sign-extension depending on the type. Also, this is implemented via the zeroext/signext type attributes on the parameters in code generated by clang; with LLVM IR generated elsewhere (like in those test cases!), we may get a plain i8 or i16 that is not extended. And of course if the i8 or i16 in question is not a function parameter but the result of some intermediate computation, it is not guaranteed to be extended anyway.

So in short, yes, the CmpVal may have to be extended. However, it is probably worthwhile to detect those (common) cases where it already *is* extended to avoid redundant effort. This is hard(er) to do at the MI level, so I think the extension is best done in SystemZTargetLowering::lowerATOMIC_CMP_SWAP at the SelectionDAG level before emitting the ATOMIC_CMP_SWAPW MI instruction.
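A sketch of what such an extension could look like in SystemZTargetLowering::lowerATOMIC_CMP_SWAP (names like NarrowVT and CmpVal are assumed from context; an AND is one way to express the zero-extension so that it can fold away when redundant):

  // Zero-extend CmpVal from the narrow i8/i16 type by masking the high
  // bits. If CmpVal is a constant, or already carries an AssertZext, the
  // AND is expected to fold away again.
  uint64_t Mask = (NarrowVT == MVT::i8) ? 0xff : 0xffff;
  CmpVal = DAG.getNode(ISD::AND, DL, MVT::i32, CmpVal,
                       DAG.getConstant(Mask, DL, MVT::i32));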

As an aside, it seems the code does now require one extra register. It might be worthwhile to avoid this by rearranging the statements a bit:

//  LoopMBB:
//   %OldVal       = phi [ %OrigOldVal, EntryBB ], [ %RetryOldVal, SetMBB ]
//   %SwapVal      = phi [ %OrigSwapVal, EntryBB ], [ %RetrySwapVal, SetMBB ]
//   %Dest         = RLL %OldVal, BitSize(%BitShift)
//                     ^^ The low BitSize bits contain the field
//                        of interest.
//   %RetrySwapVal = RISBG32 %SwapVal, %Dest, 32, 63-BitSize, 0
//                     ^^ Replace the upper 32-BitSize bits of the
//                        swap value with those that we loaded and rotated.
//   %Dest         = LL[CH]R %Dest
//   CR %Dest, %CmpVal

//  SetMBB:
//   %StoreVal     = RLL %RetrySwapVal, -BitSize(%NegBitShift)
//                      ^^ Rotate the new field to its proper position.
//   %RetryOldVal  = CS %OldVal, %StoreVal, Disp(%Base)
//   JNE LoopMBB

As an added bonus, this would make Dest already zero-extended, so it might be possible to avoid any extra zero-extension on the result (by adding an AssertZext ISD node after the ATOMIC_CMP_SWAPW).
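Something along these lines, presumably (a sketch; AtomicOp and NarrowVT are assumed names for the custom node and the original i8/i16 memory type):

  // Assert that the ATOMIC_CMP_SWAPW result is already zero-extended from
  // the narrow type, so DAGCombine can remove a redundant extension of it.
  SDValue Res = DAG.getNode(ISD::AssertZext, DL, MVT::i32,
                            AtomicOp.getValue(0), DAG.getValueType(NarrowVT));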

jonpa added a comment.Mar 1 2021, 5:17 PM

Not necessarily. Our ABI does require that "char" and "short" parameters and return values be extended, but that can be either a zero- or a sign-extension depending on the type. Also, this is implemented via the zeroext/signext type attributes on the parameters in code generated by clang; with LLVM IR generated elsewhere (like in those test cases!), we may get a plain i8 or i16 that is not extended. And of course if the i8 or i16 in question is not a function parameter but the result of some intermediate computation, it is not guaranteed to be extended anyway.

So in short, yes, the CmpVal may have to be extended. However, it is probably worthwhile to detect those (common) cases where it already *is* extended to avoid redundant effort. This is hard(er) to do at the MI level, so I think the extension is best done in SystemZTargetLowering::lowerATOMIC_CMP_SWAP at the SelectionDAG level before emitting the ATOMIC_CMP_SWAPW MI instruction.

I added an AND to zero out the high bits to perform the zero-extension from the narrow type. It seemed that if the source was a constant (e.g. '1'), the DAG.getNode(ISD::AND, ...) call folded the AND on the fly. And if the source was a parameter with the 'zeroext' attribute (or rather any result with an AssertZext node), the AND goes away during DAGCombine2. So from what I could see, there is not much extra work to do here.

As an aside, it seems the code does now require one extra register. It might be worthwhile to avoid this by rearranging the statements a bit:

As an added bonus, this would make Dest already zero-extended, so it might be possible to avoid any extra zero-extension on the result (by adding an AssertZext ISD node after the ATOMIC_CMP_SWAPW).

Ah, yes, that makes sense now.

At this point I am wondering how to treat the signedness of the input/output: if the template type of std::atomic is signed, then the result of e.g. a signed char should be sign-extended to i32, right? So CmpVal and Dest should both be either sign- or zero-extended. The patch currently always zero-extends...

I see that with signed char:

Optimized lowered selection DAG: %bb.0 '_Z3funa:entry'
SelectionDAG has 15 nodes:
          t0: ch = EntryToken
        t8: ch = lifetime.start<0 to 1> t0, TargetFrameIndex:i64<0>
      t12: ch = store<(store 1 into %ir.0, align 2)> t8, Constant:i8<1>, FrameIndex:i64<0>, undef:i64
    t14: i8,i1,ch = AtomicCmpSwapWithSuccess<(volatile load store acquire monotonic 1 on %ir.0)> t12, FrameIndex:i64<0>, Constant:i8<1>, Constant:i8<0>
  t15: i8,ch = AtomicLoad<(volatile dereferenceable load seq_cst 1 from %ir.0, align 2)> t14:2, FrameIndex:i64<0>
    t16: ch = lifetime.end<0 to 1> t15:1, TargetFrameIndex:i64<0>
    t18: i64 = sign_extend t15
  t20: ch,glue = CopyToReg t16, Register:i64 $r2d, t18
  t21: ch = SystemZISD::RET_FLAG t20, Register:i64 $r2d, t20:1


Type-legalized selection DAG: %bb.0 '_Z3funa:entry'
SelectionDAG has 17 nodes:
    t16: ch = lifetime.end<0 to 1> t27:1, TargetFrameIndex:i64<0>
      t28: i64 = any_extend t27
    t30: i64 = sign_extend_inreg t28, ValueType:ch:i8
  t20: ch,glue = CopyToReg t16, Register:i64 $r2d, t30
          t0: ch = EntryToken
        t8: ch = lifetime.start<0 to 1> t0, TargetFrameIndex:i64<0>
      t24: ch = store<(store 1 into %ir.0, align 2), trunc to i8> t8, Constant:i32<1>, FrameIndex:i64<0>, undef:i64
    t26: i32,i32,ch = AtomicCmpSwapWithSuccess<(volatile load store acquire monotonic 1 on %ir.0)> t24, FrameIndex:i64<0>, Constant:i32<1>, Constant:i32<0>
  t27: i32,ch = AtomicLoad<(volatile dereferenceable load seq_cst 1 from %ir.0, align 2)> t26:2, FrameIndex:i64<0>
  t21: ch = SystemZISD::RET_FLAG t20, Register:i64 $r2d, t20:1

For a signed type (above), there is a sign_extend of the result (and for the unsigned case a zero_extend). But after the type legalizer, the AtomicCmpSwapWithSuccess has been replaced with a new one of i32 type and the extension node is gone. The memory operand carries no information about what extension is supposed to happen, so I wonder how the result (the original value) is supposed to be properly extended?

This happens in DAGTypeLegalizer::PromoteIntRes_AtomicCmpSwap(), which uses TLI.getExtendForAtomicCmpSwapArg(). That hook also seems to ignore the original type (it takes no parameters), so I am not sure whether I am missing something.
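For reference, the promotion described here boils down to something like the following (simplified; the actual DAGTypeLegalizer code handles more cases, and PromotedVT is an assumed name for the widened type):

  // The extension kind for the compare operand comes from a parameterless
  // target hook, so the narrow type's signedness cannot influence it here.
  ISD::NodeType ExtOp = TLI.getExtendForAtomicCmpSwapArg();
  SDValue NewOp2 = DAG.getNode(ExtOp, SDLoc(N), PromotedVT, N->getOperand(2));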

It would be nice if we could extend CmpVal and Dest with the same signedness and just change the opcodes for the extensions with this new approach...

I added an AND to zero out the high bits to perform the zero-extension from the narrow type. It seemed that if the source was a constant (e.g. '1'), the DAG.getNode(ISD::AND, ...) call folded the AND on the fly. And if the source was a parameter with the 'zeroext' attribute (or rather any result with an AssertZext node), the AND goes away during DAGCombine2. So from what I could see, there is not much extra work to do here.

Ah, thanks for pointing out the TLI.getExtendForAtomicCmpSwapArg() and TLI.getExtendForAtomicOps() routines, I had overlooked those. This means you don't have to do any of the extends "by hand", you just use the proper opcode in those routines. The TLI.getExtendForAtomicCmpSwapArg() routine specifies how common code is supposed to extend the compare value before passing it to compare-and-swap, while the TLI.getExtendForAtomicOps() routine specifies what extension platform-specific code will have already performed on the result so common code may rely on it.

These are currently both set to ANY_EXTEND on SystemZ; I think for the new algorithm you can just set them both to ZERO_EXTEND and everything should just work.
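In SystemZISelLowering.h, that would presumably look like the following (a sketch of the two overrides only):

  // Tell common code that the compare value must be zero-extended before
  // ATOMIC_CMP_SWAP, and that atomic results are produced zero-extended.
  ISD::NodeType getExtendForAtomicCmpSwapArg() const override {
    return ISD::ZERO_EXTEND;
  }
  ISD::NodeType getExtendForAtomicOps() const override {
    return ISD::ZERO_EXTEND;
  }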

At this point I am wondering how to treat the signedness of the input/output: if the template type of std::atomic is signed, then the result of e.g. a signed char should be sign-extended to i32, right? So CmpVal and Dest should both be either sign- or zero-extended. The patch currently always zero-extends...

Well, I guess there could be multiple variants (sign/zero-extend to i32/i64) that we might support. This may need improvements to the common-code TLI.getExtendForAtomic... logic (or could possibly be done via DAGCombine). In any case, I think for now we should just do the ZERO_EXTEND; any improvement can be done as a follow-up.

jonpa updated this revision to Diff 327560.Mar 2 2021, 1:27 PM

Updated per review.

Tests updated and passing.

The change in lowerATOMIC_CMP_SWAP should be removed now. Otherwise this LGTM.

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
4060

This is not needed any more -- it is already done by common code now that you set getExtendForAtomicOps() to ZERO_EXTEND.

jonpa added a comment.Mar 3 2021, 10:36 AM

This is not needed any more -- it is already done by common code now that you set getExtendForAtomicOps() to ZERO_EXTEND.

I also thought so, but I found that it did make a difference on this test case:

Not sure exactly why, but I thought we might as well have it there... Or is this a bug in common code that we should fix?

Or is it perhaps even good without the AssertZext - the difference in this case is an LLGFR instead of an LLGCR. I thought maybe that could make a difference in other programs...

Before isel:

Optimized legalized selection DAG: %bb.0 '_Z3funh:entry'                        Optimized legalized selection DAG: %bb.0 '_Z3funh:entry'
SelectionDAG has 28 nodes:                                                |     SelectionDAG has 27 nodes:
  t0: ch = EntryToken                                                             t0: ch = EntryToken
    t19: ch = lifetime.end<0 to 1> t45:2, TargetFrameIndex:i64<0>                   t19: ch = lifetime.end<0 to 1> t45:2, TargetFrameIndex:i64<0>
      t34: i64 = any_extend t45                                           |           t49: i32 = AssertZext t45, ValueType:ch:i8
    t36: i64 = and t34, Constant:i64<255>                                 |         t59: i64 = zero_extend t49
  t22: ch,glue = CopyToReg t19, Register:i64 $r2d, t36                    |       t22: ch,glue = CopyToReg t19, Register:i64 $r2d, t59
      t11: ch = lifetime.start<0 to 1> t0, TargetFrameIndex:i64<0>                    t11: ch = lifetime.start<0 to 1> t0, TargetFrameIndex:i64<0>
    t30: ch = store<(store 1 into %ir.0, align 2), trunc to i8> t11, C              t30: ch = store<(store 1 into %ir.0, align 2), trunc to i8> t11, C
    t39: i64 = and FrameIndex:i64<0>, Constant:i64<-4>                              t39: i64 = and FrameIndex:i64<0>, Constant:i64<-4>
        t2: i64,ch = CopyFromReg t0, Register:i64 %0                                    t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t24: i64 = AssertZext t2, ValueType:ch:i8                                       t24: i64 = AssertZext t2, ValueType:ch:i8
    t29: i32 = truncate t24                                                         t29: i32 = truncate t24
    t43: i32 = sub Constant:i32<0>, t57                                   |         t43: i32 = sub Constant:i32<0>, t58
  t45: i32,i32,ch = SystemZISD::ATOMIC_CMP_SWAPW<(volatile load store     |       t45: i32,i32,ch = SystemZISD::ATOMIC_CMP_SWAPW<(volatile load store 
    t56: i32 = truncate FrameIndex:i64<0>                                 |         t57: i32 = truncate FrameIndex:i64<0>
  t57: i32 = shl t56, Constant:i32<3>                                     |       t58: i32 = shl t57, Constant:i32<3>
  t23: ch = SystemZISD::RET_FLAG t22, Register:i64 $r2d, t22:1                    t23: ch = SystemZISD::RET_FLAG t22, Register:i64 $r2d, t22:1

Output:

        .text                                                                           .text
        .file   "boolean_cmpxchg.cpp"                                                   .file   "boolean_cmpxchg.cpp"
        .globl  _Z3funh                         # -- Begin function _Z                  .globl  _Z3funh                         # -- Begin function _Z
        .p2align        4                                                               .p2align        4
        .type   _Z3funh,@function                                                       .type   _Z3funh,@function
_Z3funh:                                # @_Z3funh                              _Z3funh:                                # @_Z3funh
        .cfi_startproc                                                                  .cfi_startproc
# %bb.0:                                # %entry                                # %bb.0:                                # %entry
        stmg    %r13, %r15, 104(%r15)                                                   stmg    %r13, %r15, 104(%r15)
        .cfi_offset %r13, -56                                                           .cfi_offset %r13, -56
        .cfi_offset %r14, -48                                                           .cfi_offset %r14, -48
        .cfi_offset %r15, -40                                                           .cfi_offset %r15, -40
        aghi    %r15, -168                                                              aghi    %r15, -168
        .cfi_def_cfa_offset 328                                                         .cfi_def_cfa_offset 328
        la      %r3, 166(%r15)                                                          la      %r3, 166(%r15)
        mvi     166(%r15), 1                                                            mvi     166(%r15), 1
        risbgn  %r1, %r3, 0, 189, 0                                                     risbgn  %r1, %r3, 0, 189, 0
        l       %r5, 0(%r1)                                                             l       %r5, 0(%r1)
        sll     %r3, 3                                                                  sll     %r3, 3
        lcr     %r4, %r3                                                                lcr     %r4, %r3
        lhi     %r0, 0                                                                  lhi     %r0, 0
.LBB0_1:                                # %entry                                .LBB0_1:                                # %entry
                                        # =>This Inner Loop Header: De                                                  # =>This Inner Loop Header: De
        rll     %r14, %r5, 8(%r3)                                                       rll     %r14, %r5, 8(%r3)
        risbg   %r0, %r14, 32, 55, 0                                                    risbg   %r0, %r14, 32, 55, 0
        llcr    %r14, %r14                                                              llcr    %r14, %r14
        crjlh   %r14, %r2, .LBB0_3                                                      crjlh   %r14, %r2, .LBB0_3
# %bb.2:                                # %entry                                # %bb.2:                                # %entry
                                        #   in Loop: Header=BB0_1 Dept                                                  #   in Loop: Header=BB0_1 Dept
        rll     %r13, %r0, -8(%r4)                                                      rll     %r13, %r0, -8(%r4)
        cs      %r5, %r13, 0(%r1)                                                       cs      %r5, %r13, 0(%r1)
        jl      .LBB0_1                                                                 jl      .LBB0_1
.LBB0_3:                                # %entry                                .LBB0_3:                                # %entry
        llgcr   %r2, %r14                                                 |             llgfr   %r2, %r14
        lmg     %r13, %r15, 272(%r15)                                                   lmg     %r13, %r15, 272(%r15)
        br      %r14                                                                    br      %r14
.Lfunc_end0:                                                                    .Lfunc_end0:
uweigand accepted this revision.Mar 3 2021, 11:05 AM

Ahh, you're right. It's done in common code by the default ATOMIC_CMP_SWAP_WITH_SUCCESS expander -- but we're not using that, since we use our own custom expander! So it indeed has to be done there in the back end.

Patch LGTM. Thanks!

This revision is now accepted and ready to land.Mar 3 2021, 11:05 AM
This revision was landed with ongoing or failed builds.Mar 3 2021, 12:06 PM
This revision was automatically updated to reflect the committed changes.