
x86 atomic: optimize a.store(reg op a.load(acquire), release)
Closed, Public

Authored by jfb on Jul 20 2015, 10:46 PM.

Details

Summary

PR24191 finds that the expected memory-register operations aren't generated when a relaxed { load ; modify ; store } sequence is used. This is similar to PR17281, which was addressed in D4796, but only for memory-immediate operations (and only for memory orderings up to acquire and release). This patch also handles some floating-point operations.

Diff Detail

Event Timeline

jfb updated this revision to Diff 30230.Jul 20 2015, 10:46 PM
jfb retitled this revision from to x86 atomic: optimize a.store(reg op a.load(acquire), release).
jfb updated this object.
jfb added reviewers: reames, kcc, dvyukov, nadav.
jfb added a subscriber: llvm-commits.
dvyukov edited edge metadata.Jul 21 2015, 12:00 AM

Will this optimization transform:

int foo() {
   int r = atomic_load_n(&x, __ATOMIC_RELAXED);
   atomic_store_n(&x, r+1, __ATOMIC_RELAXED);
   return r;
}

? If yes, how?

jfb updated this revision to Diff 30266.Jul 21 2015, 9:53 AM
jfb edited edge metadata.
  • Address dvyukov's comment: test that self-add doesn't fold.
jfb added a comment.Jul 21 2015, 9:53 AM

> Will this optimization transform:
>
> int foo() {
>    int r = atomic_load_n(&x, __ATOMIC_RELAXED);
>    atomic_store_n(&x, r+1, __ATOMIC_RELAXED);
>    return r;
> }
>
> ? If yes, how?

Good point, I added test add_32r_self to ensure that this doesn't happen, and that the pattern matching figures out dependencies properly.

dvyukov added a comment.

In D11382#209066, @jfb wrote:

>> Will this optimization transform:
>>
>> int foo() {
>>    int r = atomic_load_n(&x, __ATOMIC_RELAXED);
>>    atomic_store_n(&x, r+1, __ATOMIC_RELAXED);
>>    return r;
>> }
>>
>> ? If yes, how?
>
> Good point, I added test add_32r_self to ensure that this doesn't happen, and that the pattern matching figures out dependencies properly.

I am glad that the comment was useful, but I actually asked a different thing :)
My example does not contain a self-add. It contains two uses of a load result, and one of those uses could potentially be folded. My concern is that the code could be compiled as:

MOV [addr], r
ADD [addr], 1
MOV r, rax
RET

or to:

ADD [addr], 1
MOV [addr], rax
RET

Both of which would be incorrect transformations -- two loads instead of one.
I guess this transformation should require that the folded store is the only use of the load result.

jfb updated this revision to Diff 30373.Jul 22 2015, 11:21 AM
  • Add test suggested by dvyukov making sure that the load isn't duplicated.
jfb added a comment.Jul 22 2015, 11:23 AM
> In D11382#209066, @jfb wrote:
>
>>> Will this optimization transform:
>>>
>>> int foo() {
>>>    int r = atomic_load_n(&x, __ATOMIC_RELAXED);
>>>    atomic_store_n(&x, r+1, __ATOMIC_RELAXED);
>>>    return r;
>>> }
>>>
>>> ? If yes, how?
>>
>> Good point, I added test add_32r_self to ensure that this doesn't happen, and that the pattern matching figures out dependencies properly.
>
> I am glad that the comment was useful, but I actually asked a different thing :)
> My example does not contain a self-add. It contains two uses of a load result, and one of those uses could potentially be folded. My concern is that the code could be compiled as:
>
> MOV [addr], r
> ADD [addr], 1
> MOV r, rax
> RET
>
> or to:
>
> ADD [addr], 1
> MOV [addr], rax
> RET
>
> Both of which would be incorrect transformations -- two loads instead of one.
> I guess this transformation should require that the folded store is the only use of the load result.

Oh sorry, I totally misunderstood you! I added a test for this. IIUC it can't happen, because the entire matched pattern is replaced with a pseudo-instruction, so an escaping intermediate result wouldn't have a def anymore.

jfb updated this revision to Diff 30374.Jul 22 2015, 11:25 AM
  • Also handle floating-point memory-register addition.
jfb updated this revision to Diff 30375.Jul 22 2015, 11:30 AM
  • Add one more FIXME.

LGTM but wait for somebody else

jfb updated this revision to Diff 30386.Jul 22 2015, 12:45 PM
  • atomic_mi.ll: add a colon after FileCheck LABEL checks, to prevent partial string matches.
jfb updated this revision to Diff 30387.Jul 22 2015, 12:50 PM
  • [NFC] name pseudo-instructions more consistently.
jfb updated this object.Jul 22 2015, 12:51 PM
morisset edited edge metadata.Jul 23 2015, 6:46 AM

LGTM for the integer part.

I am not really convinced that the floating-point change belongs in the same patch: it is conceptually different, and does not seem to share any code with the rest of the patch. It is also not obvious to me why it is only tested on X86_64; a comment would be appreciated.

test/CodeGen/X86/atomic_mi.ll
Line 820: Why?
Line 837: Why?

jfb updated this revision to Diff 30498.Jul 23 2015, 9:49 AM
jfb marked 2 inline comments as done.
jfb edited edge metadata.
  • Comment on x86-32 testing of atomics and SSE.
jfb added a comment.Jul 23 2015, 9:54 AM

> LGTM for the integer part.
>
> I am not really convinced that the floating-point change belongs in the same patch: it is conceptually different, and does not seem to share any code with the rest of the patch. It is also not obvious to me why it is only tested on X86_64; a comment would be appreciated.

I improved the comment for x86-32. LLVM's code generation was silly when I was testing it out: the instructions I use are in SSE, and even with -mattr=+sse (and +sse2) it wasn't particularly good code, even without using any atomics. I figure x86-32 optimization of floating-point atomics isn't particularly useful compared to x86-64 at this point in time.

jfb updated this revision to Diff 30499.Jul 23 2015, 10:06 AM
  • *sd instructions are in SSE2.
jfb added a comment.Jul 23 2015, 10:08 AM

Actually, *sd instructions are SSE2, so I fixed the pattern matcher :)

To be specific, the x86-32 code with -mattr=+sse generates the following:

fadd_32r:                               # @fadd_32r
	# ...
	movl	12(%esp), %eax
	movl	(%eax), %ecx
	movl	%ecx, (%esp)
	movss	(%esp), %xmm0           # xmm0 = mem[0],zero,zero,zero
	addss	16(%esp), %xmm0
	movss	%xmm0, 4(%esp)
	movl	4(%esp), %ecx
	movl	%ecx, (%eax)
	addl	$8, %esp
	retl
	# ...
fadd_64r:                               # @fadd_64r
	# ...
	movl	32(%esp), %esi
	xorl	%eax, %eax
	xorl	%edx, %edx
	xorl	%ebx, %ebx
	xorl	%ecx, %ecx
	lock		cmpxchg8b	(%esi)
	movl	%edx, 12(%esp)
	movl	%eax, 8(%esp)
	fldl	8(%esp)
	faddl	36(%esp)
	fstpl	(%esp)
	movl	(%esp), %ebx
	movl	4(%esp), %ecx
	movl	(%esi), %eax
	movl	4(%esi), %edx
	# ...

Part of the problem is calling convention on x86-32, and part of the problem is x87 for 64-bit.

With -mattr=+sse,+sse2 the generated code becomes:

fadd_32r:                               # @fadd_32r
	# ...
	movss	8(%esp), %xmm0          # xmm0 = mem[0],zero,zero,zero
	movl	4(%esp), %eax
	addss	(%eax), %xmm0
	movss	%xmm0, (%eax)
	retl
	# ...
fadd_64r:                               # @fadd_64r
	# ...
	movl	32(%esp), %esi
	xorl	%eax, %eax
	xorl	%edx, %edx
	xorl	%ebx, %ebx
	xorl	%ecx, %ecx
	lock		cmpxchg8b	(%esi)
	movl	%edx, 12(%esp)
	movl	%eax, 8(%esp)
	movsd	8(%esp), %xmm0          # xmm0 = mem[0],zero
	addsd	36(%esp), %xmm0
	movsd	%xmm0, (%esp)
	movl	(%esp), %ebx
	movl	4(%esp), %ecx
	movl	(%esi), %eax
	movl	4(%esi), %edx

I could do something with this, but I'm a bit wary of how the calling convention works out (or rather, that the code generation might change slightly and throw things off).

jfb added a subscriber: echristo.

@echristo suggested I add @chandlerc because this patch says "atomic". Note @dvyukov's LGTM above.

@chandlerc suggested I add @pete and @t.p.northover as reviewers.

pete accepted this revision.Aug 5 2015, 10:17 AM
pete edited edge metadata.

I would have preferred to see the renaming in another patch, but at this point I don't think it's worth the effort to split it out. It's a mechanical change, not a change in behavior, and TableGen would have made it clear if it didn't like the change.

So, I agree with @dvyukov, LGTM.

Cheers,
Pete

lib/Target/X86/X86InstrCompiler.td
765

This was there prior to your change, but I wonder if we should have a later patch (by you or me or anyone else) to consider removing this cast. We can do so by taking 'SDNode op' instead of a string. For example:

diff --git a/lib/Target/X86/X86InstrCompiler.td b/lib/Target/X86/X86InstrCompiler.td
index 49dc318..9168713 100644
--- a/lib/Target/X86/X86InstrCompiler.td
+++ b/lib/Target/X86/X86InstrCompiler.td
@@ -757,26 +757,26 @@ defm LXADD : ATOMIC_LOAD_BINOP<0xc0, 0xc1, "xadd", "atomic_load_add",
    extremely late to prevent them from being accidentally reordered in the backend
    (see below the RELEASE_MOV* / ACQUIRE_MOV* pseudo-instructions) */
-multiclass RELEASE_BINOP_MI<string op> {
+multiclass RELEASE_BINOP_MI<SDNode op> {
   def NAME#8mi : I<0, Pseudo, (outs), (ins i8mem:$dst, i8imm:$src),
       "#RELEASE_BINOP PSEUDO!",
-      [(atomic_store_8 addr:$dst, (!cast<PatFrag>(op)
+      [(atomic_store_8 addr:$dst, (op
           (atomic_load_8 addr:$dst), (i8 imm:$src)))]>;
   // NAME#16 is not generated as 16-bit arithmetic instructions are considered
   // costly and avoided as far as possible by this backend anyway
   def NAME#32mi : I<0, Pseudo, (outs), (ins i32mem:$dst, i32imm:$src),
       "#RELEASE_BINOP PSEUDO!",
-      [(atomic_store_32 addr:$dst, (!cast<PatFrag>(op)
+      [(atomic_store_32 addr:$dst, (op
           (atomic_load_32 addr:$dst), (i32 imm:$src)))]>;
   def NAME#64mi32 : I<0, Pseudo, (outs), (ins i64mem:$dst, i64i32imm:$src),
       "#RELEASE_BINOP PSEUDO!",
-      [(atomic_store_64 addr:$dst, (!cast<PatFrag>(op)
+      [(atomic_store_64 addr:$dst, (op
           (atomic_load_64 addr:$dst), (i64immSExt32:$src)))]>;
 }
-defm RELEASE_ADD : RELEASE_BINOP_MI<"add">;
-defm RELEASE_AND : RELEASE_BINOP_MI<"and">;
-defm RELEASE_OR : RELEASE_BINOP_MI<"or">;
-defm RELEASE_XOR : RELEASE_BINOP_MI<"xor">;
+defm RELEASE_ADD : RELEASE_BINOP_MI<add>;
+defm RELEASE_AND : RELEASE_BINOP_MI<and>;
+defm RELEASE_OR : RELEASE_BINOP_MI<or>;
+defm RELEASE_XOR : RELEASE_BINOP_MI<xor>;

This revision is now accepted and ready to land.Aug 5 2015, 10:17 AM
This revision was automatically updated to reflect the committed changes.
jfb added inline comments.Aug 5 2015, 2:06 PM
lib/Target/X86/X86InstrCompiler.td
765

I can send a follow-up.

jfb marked an inline comment as done.Aug 5 2015, 4:16 PM
jfb added inline comments.
lib/Target/X86/X86InstrCompiler.td
765

Done in D11788.