The atomic expansion pass typically runs late, after most IR optimizations have already run. The test-and-branch code it generates can often be simplified, but until now no such simplification took place. This patch adds an optional parameter to createAtomicExpandPass carrying CFG-simplification options. When the parameter is present, simplifyCFG is run on the basic blocks created in expandPartwordCmpXchg, insertRMWLLSCLoop, and expandAtomicCmpXchg.
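For reference, a minimal sketch of what the new entry point might look like, inferred from the summary above; the parameter type, its default, and the include path are assumptions, not quotes of the actual diff:

```cpp
// Sketch only -- inferred from the patch summary; the real diff may differ.
#include "llvm/CodeGen/Passes.h"
#include "llvm/Transforms/Utils/Local.h"  // SimplifyCFGOptions (assumed location)

namespace llvm {
// New optional parameter: when non-null, AtomicExpand runs simplifyCFG on
// the blocks it created in expandPartwordCmpXchg, insertRMWLLSCLoop, and
// expandAtomicCmpXchg; when null, behavior is unchanged.
FunctionPass *createAtomicExpandPass(const SimplifyCFGOptions *Opts = nullptr);
} // namespace llvm
```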
Details
- Reviewers: jyknight, ab, jfb, javed.absar, t.p.northover
Diff Detail
- Repository: rL LLVM
Event Timeline
Is Hexagon the only backend that triggers this? Seems like we want tests for at least ARM64, no?
- Removed Hexagon tests left over from the initial, incorrect diff.
- Enabled simplification on AArch64.
- Added test cases for AArch64 and Hexagon.
lib/Target/AArch64/AArch64TargetMachine.cpp:376

This already does the simplification, no? How about adding the pass for Hexagon as well? I suppose this patch limits it to a smaller set of blocks, but on the other hand it's nice not to have to deal with it in the pass.
lib/Target/AArch64/AArch64TargetMachine.cpp:376

Evidently it doesn't do a good enough job; check the test case. On Hexagon we don't want to run the full simplify-cfg: I've tried that, it broke over 100 lit tests, and I don't even know what impact it would have on performance. It would be far too big a hammer.
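To make the trade-off concrete, here is a rough sketch of the two alternatives being discussed; the option values and pipeline details are placeholders, not either backend's actual code:

```cpp
// (a) Whole-function cleanup, roughly what AArch64's addIRPasses does today:
//     every block in the function is subject to simplify-cfg.
addPass(createAtomicExpandPass());
addPass(createCFGSimplificationPass());

// (b) What this patch enables: simplification limited to the blocks that
//     the atomic expansion itself introduced, leaving the rest of the CFG
//     untouched (important for Hexagon, per the comment above).
SimplifyCFGOptions Opts;                 // placeholder defaults
addPass(createAtomicExpandPass(&Opts));
```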
lib/Target/AArch64/AArch64TargetMachine.cpp:376

Let me attach the test case outputs (I somehow thought it was more evident in the test case itself).

Without this patch:

```
f0:                                     // @f0
        .cfi_startproc
// %bb.0:                               // %b0
        ldr     w8, [x0]
        add     w9, w8, #1              // =1
        cmp     w9, #17                 // =17
        csinc   w9, wzr, w8, eq
.LBB0_1:                                // %cmpxchg.start
                                        // =>This Inner Loop Header: Depth=1
        ldaxr   w10, [x0]
        cmp     w10, w8
        b.ne    .LBB0_4
// %bb.2:                               // %cmpxchg.trystore
                                        //   in Loop: Header=BB0_1 Depth=1
        stlxr   w10, w9, [x0]
        cbnz    w10, .LBB0_1
// %bb.3:
        orr     w8, wzr, #0x1
        b       .LBB0_5
.LBB0_4:                                // %cmpxchg.nostore
        clrex
        mov     w8, wzr
.LBB0_5:                                // %cmpxchg.end
        cmp     w8, #0                  // =0
        mov     w8, #123
        mov     w9, #321
        csel    w0, w9, w8, ne          // *** The patch eliminates this
                                        // *** select (and the setup code).
        ret
.Lfunc_end0:
        .size   f0, .Lfunc_end0-f0
        .cfi_endproc
                                        // -- End function
```

With the patch:

```
f0:                                     // @f0
        .cfi_startproc
// %bb.0:                               // %b0
        ldr     w8, [x0]
        add     w9, w8, #1              // =1
        cmp     w9, #17                 // =17
        csinc   w9, wzr, w8, eq
.LBB0_1:                                // %cmpxchg.start
                                        // =>This Inner Loop Header: Depth=1
        ldaxr   w10, [x0]
        cmp     w10, w8
        b.ne    .LBB0_4
// %bb.2:                               // %cmpxchg.fencedstore
                                        //   in Loop: Header=BB0_1 Depth=1
        stlxr   w10, w9, [x0]
        cbnz    w10, .LBB0_1
// %bb.3:
        mov     w0, #321
        ret
.LBB0_4:                                // %cmpxchg.nostore
        clrex
        mov     w0, #123
        ret
.Lfunc_end0:
        .size   f0, .Lfunc_end0-f0
        .cfi_endproc
                                        // -- End function
```
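For readers following along, a hypothetical C++ equivalent of the test, reconstructed from the assembly above; the actual lit test is LLVM IR, and the memory orderings and constants here are inferred, not copied from it:

```cpp
// Hypothetical reconstruction of the f0 test case, inferred from the
// assembly listings above.
int f0(int *p) {
  int expected = *p;
  int desired = (expected + 1 == 17) ? 0 : expected + 1;
  bool ok = __atomic_compare_exchange_n(p, &expected, desired, /*weak=*/false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  // Without the patch, this select survives codegen as the csel (plus its
  // flag setup); with it, each cmpxchg exit block returns its constant.
  return ok ? 321 : 123;
}
```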
I'd like to hear from @t.p.northover on the ARM bits. He's been paying much closer attention than I have.
While this sort of thing seems like it _should_ be okay (and possibly it even is; after all, we're already running simplifycfg on ARM), the way LLVM expands operations into separate LL and SC calls at the IR level is basically an invalid thing to do, and therefore fragile. That makes changes like this, which trigger more/different transformations, seem somewhat scary.
I'd certainly be more comfortable making such a change once we stop emitting fragile LLSC loops at the IR level.
(The idea would be to have AtomicExpandPass emit an intrinsic for the inner ll/sc loop, which would get lowered very late in the MI pipeline, in order to guarantee that it's emitted without interference. The discussion of such has been restarted recently due to some of the RISCV atomic work, since RISCV is more precise in terms of its architectural requirements than most. I hope the result will be new functionality added to AtomicExpandPass which all targets can then be converted over to use.)
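A purely illustrative sketch of that idea; every name and signature below is hypothetical, since no such intrinsic exists in tree yet:

```cpp
// Illustrative only: nothing below is in-tree code.  The point is that the
// whole LL/SC loop becomes a single opaque call that mid-level passes cannot
// rearrange, and a late MI pass expands it into the real exclusive loop.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"
using namespace llvm;

static Value *emitOpaqueLLSCLoop(IRBuilder<> &Builder, Value *Addr,
                                 Value *NewVal) {
  Module *M = Builder.GetInsertBlock()->getModule();
  // A real implementation would declare a proper target intrinsic; a plain
  // named declaration stands in for it in this sketch.
  FunctionCallee Loop = M->getOrInsertFunction(
      "hypothetical.atomicrmw.llsc.loop", NewVal->getType(), Addr->getType(),
      NewVal->getType());
  // Mid-level passes can't peer inside the call, so they can't break the
  // exclusive-monitor invariants the way they can with open-coded LL/SC IR.
  return Builder.CreateCall(Loop, {Addr, NewVal});
}
```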