This is an archive of the discontinued LLVM Phabricator instance.

ARMv8.1 support for LLVM AArch64
ClosedPublic

Authored by ajasty-cavium on May 25 2017, 11:19 PM.

Details

Summary

Proposed patch for ARMv8.1 Large System Extensions support.

Currently missing support for CASP, NAND (not supported by LSE instructions), and subword SUB/AND.

Requesting comments; I'm still looking into fixing subword SUB/AND and writing tests (currently tested by atomic-ops.ll with -mcpu=thunderx2t99). Also, ATOMIC_LOAD_CLR was added as a generic ISD opcode; I suspect I should move it to AArch64ISD. Feedback requested.

ATOMIC_LOAD_ADD with a discarded return value should probably be performed via a separate opcode (ATOMIC_ADD, etc.). This is straightforward to implement and should improve performance; RFC on that too.
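
For illustration, a minimal IR sketch of the discarded-return case (the function and global names here are hypothetical):

@counter = global i64 0

; The RMW result is unused, so only the memory side effect matters; an
; ATOMIC_ADD-style node could avoid materialising the loaded value.
define void @bump_counter(i64 %delta) {
  %unused = atomicrmw add i64* @counter, i64 %delta seq_cst
  ret void
}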

Finally, weaker memory orderings should be implemented, but even the LDX/STX path seems to use full acquire/release semantics when __ATOMIC_RELAXED is specified. This will likely be looked into in a later patch.
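
As a concrete example of the weaker orderings in question, a relaxed increment in IR (names illustrative); monotonic corresponds to __ATOMIC_RELAXED and ideally would not require the full-barrier form of the instruction:

define i32 @fetch_add_relaxed(i32* %p, i32 %v) {
  ; monotonic ordering: no acquire/release semantics are required.
  %old = atomicrmw add i32* %p, i32 %v monotonic
  ret i32 %old
}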

Diff Detail

Repository
rL LLVM

Event Timeline

ajasty-cavium created this revision.May 25 2017, 11:19 PM
ajasty-cavium edited the summary of this revision.May 25 2017, 11:21 PM
ajasty-cavium edited the summary of this revision.May 25 2017, 11:29 PM

Hi, can you please re-share the patch with more context?

git diff -U9999

would do.

Thanks!

lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2664

Wouldn't this be better inside SelectCMP_SWAP?

t.p.northover edited edge metadata.May 26 2017, 6:57 AM

One other thing you have to be careful of: if an instruction loads into xzr or wzr then it's actually an "STwhatever" operation and defined not to have acquire semantics regardless of its mnemonic. So you probably have to modify AArch64DeadRegisterDefinitionsPass.cpp to ignore these instructions.

Also, you should add regression tests under tests/CodeGen/AArch64 to make sure the instructions you're expecting get emitted.
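
A rough sketch of what such a test could look like, assuming -mattr=+lse enables the new instructions and a seq_cst add selects LDADDAL (RUN line and CHECK patterns are illustrative, not taken from the actual patch):

; RUN: llc -mtriple=aarch64-none-linux-gnu -mattr=+lse -verify-machineinstrs < %s | FileCheck %s

@var32 = global i32 0

define i32 @test_atomic_load_add_i32(i32 %offset) nounwind {
; CHECK-LABEL: test_atomic_load_add_i32:
; CHECK: ldaddal w0, w{{[0-9]+}}, [x{{[0-9]+}}]
  %old = atomicrmw add i32* @var32, i32 %offset seq_cst
  ret i32 %old
}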

include/llvm/CodeGen/ISDOpcodes.h
763 ↗(On Diff #100368)

This doesn't need to be a global ISD opcode since only AArch64 ever uses it. It can be added to the list in AArch64ISelLowering.h instead.

Actually, you don't even need any special ISD handling for SUB or CLR. You can make an SDNodeXForm that inserts the appropriate - or ~ during selection. Something like

def i32not : Operand<i64>, SDNodeXForm<imm, [{
  return CurDAG->getTargetConstant(~N->getZExtValue(), SDLoc(N), MVT::i32);
}]>;

If you use that in the destination part of the Pat it'll all work automatically.

lib/Target/AArch64/AArch64ISelLowering.cpp
453

Indentation and commented code here.

10620–10621

How does this work with your special ISD handling above? It seems like it means it would never trigger for bytes and halfwords, but there are valid instructions there too.

lib/Target/AArch64/AArch64InstrAtomics.td
410

It looks like this only ever emits the operations with the full barrier; it would be good to support the more relaxed access modes (acquire/release/relaxed) too.

Multiclasses ought to be able to reduce a lot of the repetition, but what you'd fundamentally want to do is use PatFrags to cast the node to an AtomicSDNode and check its ordering. Maybe add an extra level in TargetSelectionDAG.td:

multiclass binary_atomic_op_ord {
  def #NAME#_monotonic : PatFrag<(ops node:$ptr, node:$val),
        (!cast<SDNode>(#NAME) node:$ptr, node:$val), [{
      return cast<AtomicSDNode>(N)->getOrdering() == AtomicOrdering::Monotonic;
    }]>;
   ...
}

then add a defm NAME#_8 : binary_atomic_op_ord; (and other sizes) to the binary_atomic_op multiclass.

After that you'd be able to check atomic_load_add_8_acquire and so on in these patterns.
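
For reference, the kind of IR such an ordering-specific fragment would distinguish is simply an atomicrmw with the corresponding ordering, e.g. an 8-bit acquire add (sketch, names illustrative):

define i8 @fetch_add_i8_acquire(i8* %p, i8 %v) {
  ; Matched by an "acquire" PatFrag; a seq_cst version of the same
  ; operation would be matched by the seq_cst fragment instead.
  %old = atomicrmw add i8* %p, i8 %v acquire
  ret i8 %old
}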

Updated the diff with full context.

One other thing you have to be careful of: if an instruction loads into xzr or wzr then it's actually an "STwhatever" operation and defined not to have acquire semantics regardless of its mnemonic. So you probably have to modify AArch64DeadRegisterDefinitionsPass.cpp to ignore these instructions.

Yeah, this was a major concern of mine; technically it's an invalid mnemonic with ldaddal x0, xzr, p. One solution was to custom-lower all the atomic_load_X where operand 0 wasn't used and create new AArch64ISD nodes for them. I wasn't thrilled about this for the first pass (because it seems to work anyway), but for the final version it's something to consider.

Also, you should add regression tests under tests/CodeGen/AArch64 to make sure the instructions you're expecting get emitted.

Adapting CodeGen/AArch64/atomic-ops.ll; it's partly in place and will be added to the updated patch. Any other tests to look at for atomics? Also, I'm using -mcpu=thunderx2t99 for LSE since -mcpu=generic+lse doesn't work; is there another preferred mechanism?

Thank you for all the great feedback btw!

lib/Target/AArch64/AArch64ISelLowering.cpp
10620–10621

Currently bytes and halfwords cause issues because I didn't get the bitcast right; for the moment everything that isn't cleanly covered falls back to LL/SC.

lib/Target/AArch64/AArch64InstrAtomics.td
410

Agreed completely on relaxed accesses; let me experiment with your PatFrag suggestion.

christof edited edge metadata.May 31 2017, 8:35 AM

One other thing you have to be careful of: if an instruction loads into xzr or wzr then it's actually an "STwhatever" operation and defined not to have acquire semantics regardless of its mnemonic. So you probably have to modify AArch64DeadRegisterDefinitionsPass.cpp to ignore these instructions.

Yeah, this was a major concern of mine; technically it's an invalid mnemonic with ldaddal x0, xzr, p. One solution was to custom-lower all the atomic_load_X where operand 0 wasn't used and create new AArch64ISD nodes for them. I wasn't thrilled about this for the first pass (because it seems to work anyway), but for the final version it's something to consider.

I don't think you can say that it works that easily. The instruction will not implement the acquire semantics, so the memory model gets broken. Not an easy thing to track down once that happens. I think ignoring these instructions in AArch64DeadRegisterDefinitionsPass.cpp as suggested is a good approach.

Also, I'm using -mcpu=thunderx2t99 for LSE since -mcpu=generic+lse doesn't work; is there another preferred mechanism?

The target -march=armv8.1-a should be enough. I gave your patch a spin: with -march=armv8.1-a -O1 I can see CASAL being used, but it is not used when passing -O0. I think this might be related to that comment in shouldExpandAtomicCmpXchgInIR().

bmakam added a subscriber: bmakam.May 31 2017, 12:32 PM

One other thing you have to be careful of: if an instruction loads into xzr or wzr then it's actually an "STwhatever" operation and defined not to have acquire semantics regardless of its mnemonic. So you probably have to modify AArch64DeadRegisterDefinitionsPass.cpp to ignore these instructions.

Yeah, this was a major concern of mine; technically it's an invalid mnemonic with ldaddal x0, xzr, p. One solution was to custom-lower all the atomic_load_X where operand 0 wasn't used and create new AArch64ISD nodes for them. I wasn't thrilled about this for the first pass (because it seems to work anyway), but for the final version it's something to consider.

I don't think you can say that it works that easily. The instruction will not implement the acquire semantics, so the memory model gets broken. Not an easy thing to track down once that happens. I think ignoring these instructions in AArch64DeadRegisterDefinitionsPass.cpp as suggested is a good approach.

Understood, this gives us correctness; we will also have to ensure the dead-register pass checks the ordering and still allows WZR for release or weaker.

Also, I'm using -mcpu=thunderx2t99 for LSE since -mcpu=generic+lse doesn't work; is there another preferred mechanism?

The target -march=armv8.1-a should be enough. I gave your patch a spin: with -march=armv8.1-a -O1 I can see CASAL being used, but it is not used when passing -O0. I think this might be related to that comment in shouldExpandAtomicCmpXchgInIR().

I need to debug this; for -O0 there shouldn't be an issue with the split. The reason for not splitting under the fast register allocator is to prevent a spill between the LDX/STX.

In my testing I see CASAL with '-O0 -march=armv8.1-a'; can you send me your -emit-llvm output?

include/llvm/CodeGen/ISDOpcodes.h
763 ↗(On Diff #100368)

One thing about moving this to the AArch64ISD namespace: I'm writing my own getNode support for ATOMIC_LOAD_CLR specifically. Is there a less intrusive mechanism you can think of?

Understood, this gives us correctness; we will also have to ensure the dead-register pass checks the ordering and still allows WZR for release or weaker.

They're different instructions so that shouldn't be too difficult. FWIW I don't think any kind of AArch64ISD node is necessary or useful for this problem.

include/llvm/CodeGen/ISDOpcodes.h
763 ↗(On Diff #100368)

It doesn't seem like a terrible idea to relax the assert in SelectionDAG::getAtomic so that it allows target-specific nodes (by checking Opcode against FIRST_TARGET_MEMORY_OPCODE).

Understood, this gives us correctness; we will also have to ensure the dead-register pass checks the ordering and still allows WZR for release or weaker.

After further digging, I believe that WZR can't be used for any of the LD<OP> instructions. If you do, you are actually encoding ST<OP>. The issue is that a memory barrier that synchronises on loads will synchronise on LD<OP>, but not on ST<OP>. Hence, this transformation might break the memory ordering specified by the program.

After further digging, I believe that WZR can't be used for any of the LD<OP> instructions. If you do, you are actually encoding ST<OP>.

That's pretty much true. "ST<OP> ..." is an alias for "LD<OP> [wx]zr" so you're allowed to write both and they mean the same thing, with one caveat: if the instruction has acquire semantics there is no ST<OP> alias for it. You can only write "ldaddal wzr, w1, [x0]" that way, not as "staddal w1, [x1]" for example.

The issue is that a memory barrier that synchronises on loads will synchronise on LD<OP>, but not on ST<OP>. Hence, this transformation might break the memory ordering specified by the program.

Right. And the dead definitions pass is the only thing likely to insert an XZR behind your back since the register is reserved. (Obviously an explicit pattern could do it too, but you'd only write that where it's valid).

After further digging, I believe that WZR can't be used for any of the LD<OP> instructions. If you do, you are actually encoding ST<OP>.

That's pretty much true. "ST<OP> ..." is an alias for "LD<OP> [wx]zr" so you're allowed to write both and they mean the same thing, with one caveat: if the instruction has acquire semantics there is no ST<OP> alias for it. You can only write "ldaddal wzr, w1, [x0]" that way, not as "staddal w1, [x1]" for example.

I expect you mean ldaddal w1, wzr, [x0]. Those instructions have their target register as the second operand. I dislike that instruction: there are no acquire semantics if the target register is WZR, even though the mnemonic makes it look like there are. It might be better if we diagnosed such instructions instead of using them. Also, I would not be surprised if that instruction behaves exactly the same as staddl w1, [x0].

The issue is that a memory barrier that synchronises on loads will synchronise on LD<OP>, but not on ST<OP>. Hence, this transformation might break the memory ordering specified by the program.

Right. And the dead definitions pass is the only thing likely to insert an XZR behind your back since the register is reserved. (Obviously an explicit pattern could do it too, but you'd only write that where it's valid).

That is certainly what it looks like to me. I suggest erring on the safe side and not using WZR/XZR as the target for any of these LD<OP> instructions.

That is certainly what it looks like to me. I suggest erring on the safe side and not using WZR/XZR as the target for any of these LD<OP> instructions.

I think that's a bit strong; we know exactly in which circumstances ZR can be used safely, and about the only way we can ban it in the dead-register pass is with an explicit list of instructions anyway, so just leave the ones where it's valid out of that list.

Oh, and remember this same issue applies to SWP (but not CAS).

Updated with dead-register checking, added test.

Temporarily removed SUB/AND, will commit as separate patch.

Updated with full context.

Seems like a good start for 8.1-A atomics to me. There are some things that might be improved upon, like supporting weaker orderings. Do you plan on doing that work as well?
Just a few questions inline.

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
90

I think this will also catch a load-exclusive. As far as I know those are OK with XZR/WZR as the target register. The only problematic ones are the atomicrmw operations. These are SWP and the LD<OP> where <OP> is ADD, CLR, EOR, SET, SMAX, SMIN, UMAX, and UMIN.

But then again, it is at least safe to overestimate.

lib/Target/AArch64/AArch64ISelLowering.cpp
10586

Any reason for expanding cmpxchg again? It was not mentioned in your comment and the test you added will probably fail on it.

test/CodeGen/AArch64/atomic-ops-lse.ll
2–3

Why not -mattr=+v8.1a instead of -mcpu=thunderx2t99?

ajasty-cavium marked an inline comment as done.Jun 7 2017, 11:53 AM

Seems like a good start for 8.1-A atomics to me. There are some things that might be improved upon, like supporting weaker orderings. Do you plan on doing that work as well?

Absolutely, along with SUB/AND support. Also, for atomic orderings weaker than acquire/seq_cst I'll re-enable the dead-register optimization and transform to ST(OP).

Would prefer to land this patch before adding the weaker ordering, but ST(OP) is dependent on weaker ordering. In any case this should perform equal to, or better than current LDX/STX on hardware that supports it.

In any case this should perform equal to, or better than current LDX/STX on hardware that supports it.

For seq_cst and acq_rel accesses, yes. Switching from ldxr/stxr to instructions with full barriers is rather more dubious, though. As long as you're planning to support them soon, I've got no real objections.

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
90

It'll also catch plain "load atomic" operations. I think a blacklist of instructions is probably best.

Would prefer to land this patch before adding the weaker ordering, but ST(OP) is dependent on weaker ordering. In any case this should perform equal to, or better than current LDX/STX on hardware that supports it.

For that to happen, you need to get test/CodeGen/AArch64/atomic-ops-lse.ll and shouldExpandAtomicCmpXchgInIR() to agree with each other. As it stands, the test will fail because it expects CAS while the code is now expanding it. See my previous comment.

Also, I think it is best to follow Tim's suggestion and make AArch64DeadRegisterDefinitionsPass.cpp match only the AArch64::LD<OP>[A|L|AL] instructions instead of the blanket isAtomic(). The danger here is that somebody will see a regression when this is committed. Typically this means that the patch that causes it is reverted if not fixed within the day. Better not to introduce that regression.

ajasty-cavium marked an inline comment as done.

Fixed dead-register blacklist, and disabled expandInIR for cmp-swap. Passes make check and individual test. Changed test invocation to -mattr=+lse.

christof accepted this revision.Jun 15 2017, 2:37 AM

Looks good to me.
If you could still add a comment in the source on why you blacklist these instructions in the DeadRegisterDefinitionsPass, that would be appreciated.
Thanks for the nice patch. Looking forward to the patterns for the more relaxed memory models.

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
89

If you can leave a comment here explaining the reason for this blacklisting, that would be great for future eyes. Something along the lines of:
// The atomic instructions that are part of LSE have different memory ordering semantics when targeting the zero register.

This revision is now accepted and ready to land.Jun 15 2017, 2:37 AM
christof requested changes to this revision.Jun 15 2017, 3:45 AM

Sorry, I was too quick. That blacklist is not working properly. Also, it would be good to add some tests that show the blacklisting in operation. I just tried an example:

define void @test_atomic_load_add_i32(i32 %offset) nounwind {
  %old = atomicrmw add i32* @var32, i32 %offset seq_cst
  ret void
}

This should not use WZR, but it still does with the latest patch.
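
A sketch of how a regression test for the blacklist could look (RUN line and CHECK pattern are illustrative): the RMW result is dead, so without the blacklist the dead-register pass would rewrite the destination to WZR and silently lose the acquire half of the seq_cst semantics.

; RUN: llc -mtriple=aarch64-none-linux-gnu -mattr=+lse < %s | FileCheck %s

@var32 = global i32 0

define void @test_atomic_load_add_i32_noret(i32 %offset) nounwind {
; CHECK-LABEL: test_atomic_load_add_i32_noret:
; The destination must stay a real register: a WZR destination would not
; match the CHECK line below.
; CHECK: ldaddal w0, w{{[0-9]+}}, [x{{[0-9]+}}]
  %old = atomicrmw add i32* @var32, i32 %offset seq_cst
  ret void
}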

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
99

Hold on. These are SelectionDAG opcodes, but this pass actually runs just before register allocation, after all DAG lowering is done, does it not?
So you'll need to use AArch64::LDADDAL and the like here, or you will never match these instructions.

This revision now requires changes to proceed.Jun 15 2017, 3:45 AM

Found something else.

lib/Target/AArch64/AArch64InstrAtomics.td
445

The CAS and CASP instructions have a different operand order. The correct order is:
CASALb $Rs, $Rt, $Rn
This also applies to the patterns below.

test/CodeGen/AArch64/atomic-ops-lse.ll
472

I was wondering if you could check the operand order here. I think that since OLD and NEW come directly from the function parameters, you should be able to match on w0 and w1 directly, meaning this could be:

casal w0, w1, [x[[ADDR]]]

That catches any unintended swapping of operands. This holds for all the cmpxchg tests.
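
A sketch of what that could look like for one of the 32-bit cmpxchg tests, assuming the compare and new values arrive in w0/w1 and a seq_cst cmpxchg selects CASAL (global name and CHECK pattern are illustrative):

@var32 = global i32 0

define i32 @test_cmpxchg_i32(i32 %old, i32 %new) nounwind {
; CHECK-LABEL: test_cmpxchg_i32:
; CASAL takes the compare value first and the new value second.
; CHECK: casal w0, w1, [x[[ADDR:[0-9]+]]]
  %pair = cmpxchg i32* @var32, i32 %old, i32 %new seq_cst seq_cst
  %val = extractvalue { i32, i1 } %pair, 0
  ret i32 %val
}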

ajasty-cavium edited edge metadata.
ajasty-cavium marked 2 inline comments as done.

Fixed CAS, renamed operands to "old/new" for clarity. Also updated test to use x0/x1 to confirm proper operand order.

Also reimplemented the blacklist using the actual AArch64 machine opcodes.

Fixed CAS, renamed operands to "old/new" for clarity. Also updated test to use x0/x1 to confirm proper operand order.

Great!

Also reimplemented the blacklist using the actual AArch64 machine opcodes.

Did you test that this blacklist now indeed works as expected? If you have such a test, can you add it? Thanks.

ajasty-cavium marked 4 inline comments as done.Jun 19 2017, 2:33 PM

Also reimplemented the blacklist using the actual AArch64 machine opcodes.

Did you test that this blacklist now indeed works as expected? If you have such a test, can you add it? Thanks.

Tested using my development code scrap; I will have to write many more test cases when ST(OP)L and barriers are enabled:

78:   f8e8012a        ldaddal x8, x10, [x9]
7c:   f8e82128        ldeoral x8, x8, [x9]

__atomic_fetch_add(&i, l, __ATOMIC_RELAXED);
j = __sync_fetch_and_xor(&i, l);
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
90

I did not think that LDX would be issued with an XZR, though now I can see it happening for a seq_cst atomic store. Good catch.

90

Point taken, will look at the blacklist.

lib/Target/AArch64/AArch64ISelLowering.cpp
10586

Reversed this.

lib/Target/AArch64/AArch64InstrAtomics.td
445

My first development test case used __atomic_compare_exchange_n without checking the behavior (it could see false positives from the builtin, since LLVM compares expected and desired). I reworked my test to actually run the CAS in 4 cases and confirm the returned value (not just the success bool) as appropriate; it seems to handle the cases properly.

Also renamed operands for clarity.

test/CodeGen/AArch64/atomic-ops-lse.ll
2–3

Actually +v8.1a does not give you LSE, but I found that -mattr=+lse does. I will add it to the update.

ajasty-cavium marked 2 inline comments as done.

Added additional test cases for dead-register.

christof accepted this revision.Jun 21 2017, 1:43 AM

OK, it looks good to me now. Thanks.

This revision is now accepted and ready to land.Jun 21 2017, 1:43 AM
ajasty-cavium accepted this revision.Jun 21 2017, 2:33 AM

Thank you Chris, I do not have commit rights, should I forward to the mailing list?

christof added a comment.EditedJun 21 2017, 2:56 AM

Thank you Chris, I do not have commit rights, should I forward to the mailing list?

I can commit it on your behalf if you want me to.

Thank you Chris, I do not have commit rights, should I forward to the mailing list?

I can commit it on your behalf if you want me to.

Would greatly appreciate it!

I can commit it on your behalf if you want me to.

Would greatly appreciate it!

You want me to use the summary as commit message?

I can commit it on your behalf if you want me to.

Would greatly appreciate it!

You want me to use the summary as commit message?

commit fca5da5b3fa75a7aff1ac622d3a1af44deb0d369
Author: Ananth Jasty <ajasty@cavium.com>
Date: Wed Jun 21 01:55:06 2017 -0700

[AARCH64][LSE] Preliminary support for ARMv8.1 LSE Atomics.

Implemented support to AArch64 codegen for ARMv8.1 Large System Extensions atomic instructions. Where supported, these instructions can provide atomic operations with higher performance.

Currently supported operations include: fetch_add, fetch_or, fetch_xor, fetch_smin, fetch_min/max (signed and unsigned), swap, and compare_exchange.

This implementation implies sequential-consistency ordering, more relaxed ordering is under development.

Subtarget->hasLSE is currently supported for Cavium ThunderX2T99.

Differential Revision: https://reviews.llvm.org/D33586
This revision was automatically updated to reflect the committed changes.

Is it possible that the tests were excluded from the commit?

Is it possible that the tests were excluded from the commit?

Ugh, yes. It seems like I missed adding the new file after applying this patch. I'll commit it now. Thanks for flagging it.