Page MenuHomePhabricator

LSE Atomics reorg - Part I
ClosedPublic

Authored by steleman on Jul 12 2017, 12:16 PM.

Details

Summary

This is the first part of adding memory synchronization semantics to LSE Atomics.

This patch does not change the functionality of the existing LSE Atomics.
It is just the first step necessary for adding memory synchronization semantics.

The memory semantics feature will be added in a subsequent patch.

In this patch, several corrections were added to the existing LSE Atomics implementation, based on the ARM Errata D11904 from 05/12/2017.

Diff Detail

Repository
rL LLVM

Event Timeline

steleman created this revision.Jul 12 2017, 12:16 PM

If at all possible, I would like to see D35309 committed before/with this patch.

If at all possible, I would like to see D35309 committed before/with this patch.

One of the things I am doing in this - D35319 - changeset is correcting the naming convention of the LSE instructions mnemonics, which currently does not observe the
established AArch64 naming convention for instruction mnemonics.

Namely: 'B', 'H', 'W' and 'X' suffixes, and not 'b', 'h', 's' and 'd'.

t.p.northover edited edge metadata.Jul 13 2017, 7:17 AM

This diff covers lots of different areas:

  1. DeadRegisterDefinitions: seems like a reasonable improvement,
  2. The instruction definitions: horrible on the surface, but a massive bug might justify them. It's completely unclear why they're necessary (especially as this patch contains no tests).
  3. The scheduling: whatever, assuming you have inside knowledge of ThunderX.
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

This loop seems too generic: we know what instructions we're looking at, and we know that the entire purpose is to replace a register with XZR/WZR -- the surrounding code isn't going to do that if it's not in that class.

lib/Target/AArch64/AArch64InstrFormats.td
9501–9503 ↗(On Diff #106273)

Why are you replacing an InstAlias with a real instruction? That's architecturally incorrect and only really done in LLVM when there's no other choice (weird encoding issues and so on in ARM).

9581–9582 ↗(On Diff #106273)

What does this have to do with atomics?

steleman added inline comments.Jul 13 2017, 7:25 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

I do not understand what you are saying here.

The whole purpose of this function is to prevent the use of WZR/XZR, and not replace some other register with XZR/WZR, which would be wrong, as per ARM Errata.

This diff covers lots of different areas:

  1. The scheduling: whatever, assuming you have inside knowledge of ThunderX.

What does whatever mean, in this context?

What does whatever mean, in this context?

The code looks vaguely plausible for defining how these are scheduled but I have no idea whether it's actually a performance improvement because I don't have access to any ThunderX internal documentation that describes its microarchitecture.

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

Yes, but all of these instructions do have some definition that would otherwise be eligible. And if they didn't then the function wouldn't be necessary in the first place because the main loop checks.

This diff covers lots of different areas:

  1. The scheduling: whatever, assuming you have inside knowledge of ThunderX.

What does whatever mean, in this context?

The scheduling changes for ThunderX can't really be reviewed by the community (as we don't have the domain specific knowledge), so Tim is basically saying do whatever you see fit.

christof edited edge metadata.Jul 13 2017, 7:49 AM

Few inline comments on the dead register pass. I've not looked in detail to the other changes done in this patch.

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
124 ↗(On Diff #106273)

I don't think CAS and CASP lose acquire semantics when they target the zero register. Am I wrong?

141 ↗(On Diff #106273)

Any particular reason why not to look only at the target (Wt) operand? Only the zero register as target operand makes the instructions ignore the acquire behaviour.

steleman added inline comments.Jul 13 2017, 7:52 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

The main loop checks after the ShouldSkip function is called, and not before it.

Therefore, the ineligible instructions have not already been excluded by the main loop, because the main loop hasn't executed yet.

t.p.northover added inline comments.Jul 13 2017, 7:53 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
124 ↗(On Diff #106273)

I think christof is right too. LD* and SWP are the ones affected.

t.p.northover added inline comments.Jul 13 2017, 8:00 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

Why does when it's checked matter? Also, even without that the loop is still redundant on the grounds that all of these instructions *do* have a dodgy class.

178 ↗(On Diff #106273)

No need to mention that, especially not with a wrong-endian date. It's part of the ARMv8.1a spec.

Hi Stefan,

I went ahead and committed D35309, so you will likely need to rebase the changes to AArch64InstrAtomics.td and AArch64DeadRegisterDefinitionsPass.cpp. I would also recommend separating out the ThunderX scheduling changes into their own patch, at least when committing.

steleman added inline comments.Jul 13 2017, 8:29 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

Also, even without that the loop is still redundant on the grounds that all of these instructions *do* have a dodgy class.

I don't understand what this means.

ShouldSkip needs to iterate over all the definitions of the MCInstrDesc, as to obtain the TargetRegisterClass:

const MCInstrDesc &Desc = MI.getDesc();
unsigned ND = Desc.getNumDefs();

// [ ... ]

const TargetRegisterClass *RC = TII->getRegClass(Desc, I, TRI, MF);
if (RC == nullptr)
  continue;

if (RC->contains(AArch64::WZR) || RC->contains(AArch64::XZR))
  return true;

// [ ... ]

Does it not?

steleman added inline comments.Jul 13 2017, 8:38 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141 ↗(On Diff #106273)

No reason - just that it was easier to write this way.

But looking only at the Wt operand is a very good suggestion, so I will change it thusly.

Hi Stefan,

I went ahead and committed D35309, so you will likely need to rebase the changes to AArch64InstrAtomics.td and AArch64DeadRegisterDefinitionsPass.cpp. I would also recommend separating out the ThunderX scheduling changes into their own patch, at least when committing.

OK, that's fine. I'll rebase.

t.p.northover added inline comments.Jul 13 2017, 8:43 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

I don't understand what this means.

There is no instance in which this loop (once reached) will not return true, and that is guaranteed not just by dynamic register classes but by the structure of the instructions being considered (i.e. the MCInstrDescs are generated at compile time).

And just to keep banging on about the other point so it's not forgotten: even if it did return false because no such operand was present, the later code wouldn't try to replace the register anyway because it also checks whether XZR/WZR is valid.

The loop does nothing useful, and if it did that should be tested (since it's an addition rather than pure refactoring).

steleman added inline comments.Jul 13 2017, 8:44 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
124 ↗(On Diff #106273)

Yes, after re-reading the Errata three times, CAS/CASP shouldn't be excluded.

steleman added inline comments.Jul 13 2017, 8:59 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

There is no instance in which this loop (once reached) will not return true

OK, now I am really lost.

Are you saying that this (inner) loop:

for (unsigned I = 0; I < ND; ++I) {
// [ ... ]
}

will always return true?

christof added inline comments.Jul 13 2017, 9:29 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

Yes. If you run through all the operands of the instructions listed earlier, at some point it will hit the operand for Wt. That operand has a known register class which includes the zero register.

steleman added inline comments.Jul 13 2017, 10:19 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

Ahh, I just realized my mistake. So the TargetRegisterClass will always contain WZR/XZR because it contains all the registers used by the AArch64 Target, and not just the registers being used by this particular instruction.

What I am really looking for is the set of registers being used by this particular instruction.

christof added inline comments.Jul 13 2017, 10:44 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

The TargetRegisterClass of an instruction operand lists all permissible registers for that operand.

Why do you want to know the register that is currently in use? This pass is trying to change the target operand register into WZR/XZR. The blacklist you create here should not care which registers are currently in use.

It might be worth to note that these 8.1 instructions will never have a target operand of WZR/XZR as the compiler will not generate such instruction. Maybe you planned to allow them in these pass? I think that is a bit premature and complicates this black-list.

steleman added inline comments.Jul 13 2017, 11:12 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

Maybe you planned to allow them in these pass? I think that is a bit premature and complicates this black-list.

I was trying to expand the blacklist to include all the <I>A and <I>AL instructions, and ended up over-complicating it unnecessarily.

So, I will re-submit and it will be much simpler.

It might be worth to note that these 8.1 instructions will never have a target operand of WZR/XZR as the compiler will not generate such instruction.

So, then, is this blacklist even necessary?

t.p.northover added inline comments.Jul 13 2017, 11:27 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
141–142 ↗(On Diff #106273)

So, then, is this blacklist even necessary?

I think he meant intentionally from a TableGen pattern or something. This blacklist is what prevents the compiler from generating them unintentionally. Without it when the compiler sees something like

%vreg0<def, dead> = LDADDAL %vreg1, %vreg2

it'll notice that vreg0 is dead, and that LDADDAL can potentially use XZR there and make the switch.

steleman updated this revision to Diff 106484.Jul 13 2017, 12:01 PM

Updated and corrected AArch64DeadRegisterDefinitions::ShouldSkip.

OK, that pass looks fine to me now. And we're still going to trust you know what you're doing with the scheduling. Which leaves the churn in the TableGen files...

As far as I can see the only actual change is to the internal names of these instructions, which could be accomplished in a much less intrusive way. You're also creating real ST* instructions, but that seems unnecessary to me (and actively bad style).

steleman added a comment.EditedJul 13 2017, 12:30 PM

This diff covers lots of different areas:

  1. The instruction definitions: horrible on the surface, but a massive bug might justify them. It's completely unclear why they're necessary (especially as this patch contains no tests).

Now on to why the changes to the instruction definitions:

I implemented the memory ordering semantics for all the LSE Atomics with Intrinsics. As in:

{..}include/llvm/Intrinsics/IntrinsicsAArch64.td:

// Atomic LD<OP> Intrinsics.
def int_aarch64_ldadd_32 :
  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_anyptr_ty]>;
def int_aarch64_ldadd_64 :
  Intrinsic<[llvm_i64_ty], [llvm_i64_ty, llvm_anyptr_ty]>;
def int_aarch64_ldadda_32 :
  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_anyptr_ty]>;
def int_aarch64_ldadda_64 :
  Intrinsic<[llvm_i64_ty], [llvm_i64_ty, llvm_anyptr_ty]>;
def int_aarch64_ldaddl_32 :
  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_anyptr_ty]>;
def int_aarch64_ldaddl_64 :
  Intrinsic<[llvm_i64_ty], [llvm_i64_ty, llvm_anyptr_ty]>;
def int_aarch64_ldaddal_32 :
  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_anyptr_ty]>;
def int_aarch64_ldaddal_64 :
  Intrinsic<[llvm_i64_ty], [llvm_i64_ty, llvm_anyptr_ty]>;

[ etc etc etc ]

The instruction definition themselves (in AArch64InstrInfo.td) are changed to:

let AddedComplexity = 5, Predicates = [HasLSE] in {
 def LDADDB    : BaseLDOP<0b00, 0, 0, 0b000, "add", "", "b",
                          int_aarch64_ldadd_32, GPR32>;
 def LDADDH    : BaseLDOP<0b01, 0, 0, 0b000, "add", "", "h",
                          int_aarch64_ldadd_32, GPR32>;
 def LDADDW    : BaseLDOP<0b10, 0, 0, 0b000, "add", "", "",
                          int_aarch64_ldadd_32, GPR32>;
 def LDADDX    : BaseLDOP<0b11, 0, 0, 0b000, "add", "", "",
                          int_aarch64_ldadd_64, GPR64>;

[ etc etc etc ]

And in AArch64InstrFormats.td:

class BaseLDOP<bits<2> sz, bits<1> acq, bits<1> rel, bits<3> opc,
               string op, string order, string size,
               Intrinsic OpNode, RegisterClass RC>
  : BaseLDOPEncoding<(outs RC:$Rt),
                     (ins RC:$Rs, GPR64sp:$Rn),
                     "ld" # op # order # size,
                     "\t$Rs, $Rt, [$Rn]", "",
                     []>,
                     Sched<[WriteAtomic, WriteLD, WriteST]> {
  let Sz = sz;
  let Acq = acq;
  let Rel = rel;
  let Opc = opc;
}

In AArch64ISelLowering.cpp, each instruction lowering function will discover the correct
memory ordering model from the Intrinsic Opcode and the AtomicOrdering provided by the AtomicSDNode.

Some ISD NodeTypes can be lowered by the same function -- in this case the lowering function acts as a pure pass-through to a specific LSE Opcode. Some others need special treatment (for example ISD::ATOMIC_LOAD_SUB becomes ISD::ATOMIC_LOAD_ADD).

In AArch64ISelDAGToDAG.cpp, the correct instruction selection - with the correct memory ordering and register size - will be done by each instruction selection function, based on the corresponding Intrinsics Opcode. Just like in the instruction lowering case, several different instructions can be handled by the same instruction selection function.

This is the reason why the instruction definitions were expanded to explicitly describe each and every single instruction - as opposed to using the original multiclass design: a different Intrinsic need to be passed to the Instruction Definition depending on register size, and memory ordering. I do not think it is possible to accomplish this particular design with a multiclass.

I implemented the memory ordering semantics for all the LSE Atomics with Intrinsics.

It's unnecessary to add intrinsics to handle this, let alone get C++ code involved. We've got exactly the ISD::AtomicRMW node we want and it already has all the information we need attached.

If you add a fragment like this to include/llvm/Target/TargetSelectionDAG.td:

multiclass binary_atomic_op_ord {
  def #NAME#_monotonic : PatFrag<(ops node:$ptr, node:$val),
        (!cast<SDNode>(#NAME) node:$ptr, node:$val), [{
      return cast<AtomicSDNode>(N)->getOrdering() == AtomicOrdering::Monotonic;
    }]>;
   [...]
}

and adapt the definition of binary_atomic_op to use it you can refer directly to things like atomic_load_add_8_acquire from TableGen. A few more multiclasses to reduce copy/paste in AArch64InstrAtomics.td along the lines of:

multiclass LDOPregister_patterns_ord_dag<string inst, string suffix, string op,
                                     string size, dag SrcRHS, dag DstRHS> {
  def : Pat<(!cast<SDNode>(op#"_"#size#"_monotonic") GPR64sp:$Rn, SrcRHS),
            (!cast<Instruction>(inst#suffix) DstRHS, GPR64sp:$Rn)>;
  [...]
}

multiclass LDOPregister_patterns<string inst, string op> {
  defm : LDOPregister_patterns_ord<inst, "d", op, "64", (i64 GPR64:$Rm)>; // 64-bit
  [...]
}

defm : LDOPregister_patterns<"LDADD", "atomic_load_add">;
[...]

and the job's a goodun.

This is the reason why the instruction definitions were expanded to explicitly describe each and every single instruction - as opposed to using the original multiclass design: a different Intrinsic need to be passed to the Instruction Definition depending on register size, and memory ordering. I do not think it is possible to accomplish this particular design with a multiclass.

Even if we wanted the intrinsic-based design (I definitely don't) it would be better to put separate patterns elsewhere than to contort the existing instruction definitions to accommodate it. And splitting the ST* instructions is still a bad idea either way.

It's unnecessary to add intrinsics to handle this, let alone get C++ code involved. We've got exactly the ISD::AtomicRMW node we want and it already has all the information we need attached.

OK I will look at your suggestions and see what I come up with.

steleman updated this revision to Diff 108326.Jul 26 2017, 10:50 AM

Progress update: updated with the latest changes.

I still need to add tests for all the new atomic ops.

Thanks very much for updating the patch. It's starting to look a lot more like how I'd expect these to be implemented.

I think you've slightly missed the benefit I was suggesting multiclasses would give so there's still more duplication than is needed. Hopefully my explanations below make sense but I can try to clarify if not.

include/llvm/Target/TargetSelectionDAG.td
1118 ↗(On Diff #108326)

You can put these into the "multiclass binary_atomic_op"

multiclass binary_atomic_op<...> {
  defm NAME#_8 : binary_atomic_op_ord;
  [...]
}

That'll automatically instantiate the _8 ordered variants when the plain _8 is created. Obviously you need similar _16, _32, ...

1168–1171 ↗(On Diff #108326)

Here onwards is unnecessary I believe. You're actually creating duplicate (and unused) nodes like atomic_load_add_monotonic_seq_cst, ...

The whole point of the multiclass is that you instantiate multiple variants at the same time. The multiclass takes the base name (e.g.atomic_load_add_8) and tacks on its named suffix (e.g. _monotonic) and hopefully implements the correct checks to make sure it matches the right node.

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
59 ↗(On Diff #108326)

Tiny nit: functions should start with a lower-case letter.

lib/Target/AArch64/AArch64InstrAtomics.td
410 ↗(On Diff #108326)

You can eliminate lots of copy/paste with multiclasses here too.

// Differing SrcRHS and DstRHS allow you to cover CLR & SUB by giving a more
// complex DAG for DstRHS.
let Predicates = [HasLSE] in
multiclass LDOPregister_patterns_ord_dag<string inst, string suffix, string op,
                                         string size, dag SrcRHS, dag DstRHS> {
  def : Pat<(!cast<SDNode>(op#"_"#size#"_monotonic") GPR64sp:$Rn, SrcRHS),
            (!cast<Instruction>(inst#suffix) DstRHS, GPR64sp:$Rn)>;
  [... variants for acquire, release, acq_rel and seq_cst ...]
}

// Simple case for non-CLR, non-SUB instructions where it's just one result inst.
multiclass LDOPregister_patterns_ord<string inst, string suffix, string op,
                                         string size, dag RHS> {
  defm : LDOPregister_patterns_ord_dag<inst, suffix, op, size, RHS, RHS>;
}

multiclass LDOPregister_patterns<string inst, string op> {
  defm : LDOPregister_patterns_ord<inst, "d", op, "64", (i64 GPR64:$Rm)>;
  defm : LDOPregister_patterns_ord<inst, "s", op, "32", (i32 GPR32:$Rm)>;
  defm : LDOPregister_patterns_ord<inst, "h", op, "16", (i32 GPR32:$Rm)>;
  defm : LDOPregister_patterns_ord<inst, "b", op, "8",  (i32 GPR32:$Rm)>;
}

defm : LDOPregister_patterns<"LDADD", "atomic_load_add">;
[... all other ops except SUB/CLR ...]

// Then slightly more complex version of LDOPregister_patterns to handle CLR/SUB and a pair
// of defms for them. About 8 lines of code.
lib/Target/AArch64/AArch64InstrFormats.td
9401 ↗(On Diff #108326)

I'm fine with this change, but it should probably be committed before as a separate "NFC" rename. In fact, feel free to do that any time you like.

9517 ↗(On Diff #108326)

These STOP patterns appear to be unused.

lib/Target/AArch64/AArch64SchedThunderX2T99.td
318–319 ↗(On Diff #108326)

Separate commit for the scheduling changes please. But as far as I'm concerned you can go ahead whenever you want, as with the renaming.

steleman added inline comments.Jul 26 2017, 11:22 AM
lib/Target/AArch64/AArch64InstrFormats.td
9401 ↗(On Diff #108326)

Won't this complicate things, though?

If I commit this change as a separate - and "before" change, I'll still have to change all the instruction mnemonics in the existing Pats, and in the AArch64DeadRegister blacklist.

It seems like a lot of code churn.

9517 ↗(On Diff #108326)

Yes, they are unused for now. But I plan on adding the ST<OP> instructions as well.

They need ISD NodeTypes, though, as ISD::ATOMIC_STORE_ADD, ISD::ATOMIC_STORE_SUB,
etc, aren't defined in include/llvm/CodeGen/ISDOpcodes.h. Which is why I'm saving
this change for a subsequent changeset.

t.p.northover added inline comments.Jul 26 2017, 11:26 AM
lib/Target/AArch64/AArch64InstrFormats.td
9401 ↗(On Diff #108326)

OK, I won't insist.

9517 ↗(On Diff #108326)

Then that would be a separate patch.

Fair warning: I strongly suspect it's unnecessary and should be automatically handled by the DeadDefinitions pass. ATOMIC_STORE_ADD is nothing but a normal ATOMIC_LOAD_ADD where the definition is dead, which is exactly what that pass is supposed to handle.

It does not appear that the multiclass design you are advocating here does what we'd expect it to do.

For one of the simple LD<OP> atomics - namely LDADD - I have placed a test program with corresponding output here:

  1. lseadd.c - https://drive.google.com/open?id=0B9ulXWgBcdxxcDNmNU1yVnQ5aXM
  2. lseadd.sh - https://drive.google.com/open?id=0B9ulXWgBcdxxTHI2MXd6dTZ0aVk
  3. lseadd-clang-O2.S - https://drive.google.com/open?id=0B9ulXWgBcdxxUW0yVEJCeE9mUTQ
  4. lseadd-clang-O2.ll - https://drive.google.com/open?id=0B9ulXWgBcdxxOTZRNjl6b1drVGM

The contents of these files is pretty self-explanatory.

Description of the problem: regardless of the atomic ordering model chosen in the test program -- in this case it's __ATOMIC_SEQ_CST -- TableGen will always match to ldadd, which is incorrect, instead of ldaddal, which is what would be expected in this case. This is shown in the lseadd-clang-O2.S assembler file indicated above.

However, the LLVM IR output matches the atomic ordering correctly.

The problem disappears and the matching is performed correctly if the Pats are explicitly written in the AArch64InstrAtomics.td file.

I have uploaded the LLVM patch that exhibits this problem here:

https://drive.google.com/open?id=0B9ulXWgBcdxxYnN3LXNHZndaWHM

steleman updated this revision to Diff 109592.Aug 3 2017, 10:40 AM

Updated changeset to use LDOPregister_patterns.
Implements all the LD<OP> LSE Atomics with the exception of NAND.
Updated test in atomic-ops.lse to cover all memory ordering models.

t.p.northover accepted this revision.Aug 4 2017, 7:40 AM

Thanks for going through so many revisions on this. I think it looks good now.

This revision is now accepted and ready to land.Aug 4 2017, 7:40 AM

Thanks for the hard work. It looks good. Quite an exhaustive list of tests as well.

Thanks for the hard work. It looks good. Quite an exhaustive list of tests as well.

Thank you.

I wanted to make certain all the memory models work correctly, so that's the reason
for the very large number of tests. :-)

steleman updated this revision to Diff 109833.Aug 4 2017, 4:05 PM

Restored definition for defm atomic_load_nand in TargetSelectionDAG.td,
which I had removed by mistake.

This revision was automatically updated to reflect the committed changes.