This is an archive of the discontinued LLVM Phabricator instance.

Optionally simplify basic blocks introduced by AtomicExpandPass
AbandonedPublic

Authored by kparzysz on May 3 2018, 10:27 AM.

Details

Summary

The atomic expansion pass typically runs late, after most IR optimizations have already run. The test-and-branch code it generates can often be simplified, but no such simplification normally occurred. This patch adds an optional parameter to createAtomicExpandPass with CFG simplification options. The presence of this parameter will trigger running simplifyCFG
on basic blocks created in expandPartwordCmpXchg, insertRMWLLSCLoop, and expandAtomicCmpXchg.

Diff Detail

Repository
rL LLVM

Event Timeline

kparzysz created this revision.May 3 2018, 10:27 AM
jfb requested changes to this revision.May 4 2018, 10:09 AM

Is hexagon the only backend that triggers this? Seems like we want tests for at least ARM64, no?

This revision now requires changes to proceed.May 4 2018, 10:09 AM

I don't know what AArch64 needs. I can enable it there too.

kparzysz updated this revision to Diff 145234.May 4 2018, 11:24 AM
  1. Removed incorrect (leftover from initial incorrect diff) Hexagon tests.
  2. Enable simplification on AArch64.
  3. Add testcases for AArch64 and Hexagon.
ab added inline comments.
lib/Target/AArch64/AArch64TargetMachine.cpp
376

This already does the simplification, no? How about adding the pass for Hexagon as well? I suppose this patch limits it to a smaller set of blocks, but on the other hand it's nice not to have to deal with it in the pass.

kparzysz added inline comments.May 4 2018, 12:02 PM
lib/Target/AArch64/AArch64TargetMachine.cpp
376

Evidently it doesn't a good enough job---check the testcase. On Hexagon we don't want to run the full simplify-cfg. I've tried that and it broke over 100 lit tests and I don't even know what impact it would have on performance. This would be just a way too big of a hammer.

kparzysz added inline comments.May 4 2018, 12:50 PM
lib/Target/AArch64/AArch64TargetMachine.cpp
376

Let me attach the testcase outputs (I somehow thought it was more evident in the testcase itself).

Without this patch

f0:                                     // @f0
        .cfi_startproc
// %bb.0:                               // %b0
        ldr     w8, [x0]
        add     w9, w8, #1              // =1
        cmp     w9, #17                 // =17
        csinc   w9, wzr, w8, eq
.LBB0_1:                                // %cmpxchg.start
                                        // =>This Inner Loop Header: Depth=1
        ldaxr   w10, [x0]
        cmp     w10, w8
        b.ne    .LBB0_4
// %bb.2:                               // %cmpxchg.trystore
                                        //   in Loop: Header=BB0_1 Depth=1
        stlxr   w10, w9, [x0]
        cbnz    w10, .LBB0_1
// %bb.3:
        orr     w8, wzr, #0x1
        b       .LBB0_5
.LBB0_4:                                // %cmpxchg.nostore
        clrex
        mov     w8, wzr
.LBB0_5:                                // %cmpxchg.end
        cmp     w8, #0                  // =0
        mov     w8, #123
        mov     w9, #321
        csel    w0, w9, w8, ne          // *** The patch eliminates this
                                        // *** select (and the setup code).
        ret
.Lfunc_end0:
        .size   f0, .Lfunc_end0-f0
        .cfi_endproc
                                        // -- End function

With the patch:

f0:                                     // @f0
        .cfi_startproc
// %bb.0:                               // %b0
        ldr     w8, [x0]
        add     w9, w8, #1              // =1
        cmp     w9, #17                 // =17
        csinc   w9, wzr, w8, eq
.LBB0_1:                                // %cmpxchg.start
                                        // =>This Inner Loop Header: Depth=1
        ldaxr   w10, [x0]
        cmp     w10, w8
        b.ne    .LBB0_4
// %bb.2:                               // %cmpxchg.fencedstore
                                        //   in Loop: Header=BB0_1 Depth=1
        stlxr   w10, w9, [x0]
        cbnz    w10, .LBB0_1
// %bb.3:
        mov     w0, #321
        ret
.LBB0_4:                                // %cmpxchg.nostore
        clrex
        mov     w0, #123
        ret
.Lfunc_end0:
        .size   f0, .Lfunc_end0-f0
        .cfi_endproc
                                        // -- End function
jfb added a comment.May 14 2018, 11:36 AM

I'd like to hear from @t.p.northover on the ARM bits. He's been paying much closer attention that I have.

asb added a subscriber: asb.May 30 2018, 12:53 PM

While this sort of thing seems like it _should_ be okay (and possibly it even is, after all we're already running simplifycfg on ARM), the way LLVM expands operations into separate LL and SC calls in IR level is basically an invalid thing to do, and therefore fragile. And that makes changes like this which trigger more/different transformations seem somewhat scary.

I'd certainly be more comfortable making such a change once we stop emitting fragile LLSC loops at the IR level.

(The idea would be to have AtomicExpandPass emit an intrinsic for the inner ll/sc loop, which would get lowered very late in the MI pipeline, in order to guarantee that it's emitted without interference. The discussion of such has been restarted recently due to some of the RISCV atomic work, since RISCV is more precise in terms of its architectural requirements than most. I hope the result will be new functionality added to AtomicExpandPass which all targets can then be converted over to use.)

kparzysz abandoned this revision.Aug 2 2018, 3:23 PM

Replaced with running SimplifyCFG after expanding atomics into ll/sc.