PR24191 finds that the expected memory-register operations aren't generated when a relaxed { load ; modify ; store } sequence is used. This is similar to PR17281, which was addressed in D4796, but only for memory-immediate operations (and only for memory orderings up to acquire and release). This patch adds the memory-register forms and also handles some floating-point operations.
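For concreteness, the kind of sequence involved looks roughly like the sketch below (an illustrative example, not copied from the patch's tests; the function name and the exact instruction mentioned in the comments are assumptions). The goal is a single memory-register instruction rather than a separate load, register add, and store.

```llvm
; Relaxed ("monotonic") load / add / store on the same address. Since the load
; and the store are separate atomic operations rather than an atomicrmw, the
; read-modify-write as a whole was never required to be atomic, so the backend
; may emit a single memory-register instruction (e.g. "addl %esi, (%rdi)" on
; x86-64), whose load and store remain individually atomic.
define void @relaxed_add_sketch(i32* %p, i32 %v) {
  %old = load atomic i32, i32* %p monotonic, align 4
  %new = add i32 %old, %v
  store atomic i32 %new, i32* %p monotonic, align 4
  ret void
}
```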
Event Timeline
Will this optimization transform:

    int foo() {
      int r = atomic_load_n(&x, __ATOMIC_RELAXED);
      atomic_store_n(&x, r+1, __ATOMIC_RELAXED);
      return r;
    }

? If yes, how?
Good point, I added test add_32r_self to ensure that this doesn't happen, and that the pattern matching figures out dependencies properly.
I am glad that the comment was useful, but I actually asked a different thing :)
My example does not contain self-add. It contains two usages of a load result, and one of the usages can be potentially folded. My concern was that the code can be compiled as:
    MOV [addr], r
    ADD [addr], 1
    MOV r, rax
    RET
or to:
    ADD [addr], 1
    MOV [addr], rax
    RET
Both of which would be incorrect transformations -- two loads instead of one.
I guess this transformation should require that the folded store is the only usage of the load result.
Oh sorry, I totally misunderstood you! I added a test for this. IIUC it can't happen because the entire pattern that's matched is replaced with a pseudo instruction, so an escaping intermediate result wouldn't have a def anymore.
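For reference, the escaping-use case discussed above looks roughly like this in IR (an illustrative sketch with made-up names, not the actual test added to atomic_mi.ll):

```llvm
@x = global i32 0

define i32 @load_add_store_and_return() {
  ; %r has two uses: the add that feeds the store, and the return value.
  ; A memory-register add on x would re-read the value from memory (a second
  ; load), so this sequence must keep the plain load, register add, and store.
  %r = load atomic i32, i32* @x monotonic, align 4
  %s = add i32 %r, 1
  store atomic i32 %s, i32* @x monotonic, align 4
  ret i32 %r
}
```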
- atomic_mi.ll: add a colon after the FileCheck LABEL checks to prevent partial string matches.
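To illustrate the nit (the check prefix and symbol names below are placeholders, not necessarily the ones used in atomic_mi.ll): without a trailing colon, a LABEL pattern can also match a longer symbol that merely shares the prefix, e.g. add_32r versus add_32r_self.

```llvm
; Too permissive: also matches the "add_32r_self:" label line.
; X64-LABEL: add_32r
; Exact: only matches the "add_32r:" label line.
; X64-LABEL: add_32r:
```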
LGTM for the integer part.
I am not really convinced that the floating point change belongs in the same patch: it is conceptually different, and does not seem to share any code with the rest of the patch. It is also not obvious to me why it is only tested on X86_64; a comment would be appreciated.
| test/CodeGen/X86/atomic_mi.ll | | |
|---|---|---|
| 847 ↗ | (On Diff #30387) | Why? |
| 864 ↗ | (On Diff #30387) | Why? |
I improved the comment for x86-32. LLVM's code generation was silly when I was testing it out. The instructions I use are in SSE, and even with -mattr=+sse (and +sse2) it wasn't particularly good code, even without using any atomics. I figure x86-32 optimization of floating-point atomics isn't particularly useful compared to x86-64 at this point in time.
Actually, *sd instructions are SSE2, so I fixed the pattern matcher :)
To be specific, for x86-32 with -mattr=+sse the following code is generated:
    fadd_32r:                       # @fadd_32r
    # ...
        movl    12(%esp), %eax
        movl    (%eax), %ecx
        movl    %ecx, (%esp)
        movss   (%esp), %xmm0       # xmm0 = mem[0],zero,zero,zero
        addss   16(%esp), %xmm0
        movss   %xmm0, 4(%esp)
        movl    4(%esp), %ecx
        movl    %ecx, (%eax)
        addl    $8, %esp
        retl
    # ...
    fadd_64r:                       # @fadd_64r
    # ...
        movl    32(%esp), %esi
        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ebx, %ebx
        xorl    %ecx, %ecx
        lock    cmpxchg8b (%esi)
        movl    %edx, 12(%esp)
        movl    %eax, 8(%esp)
        fldl    8(%esp)
        faddl   36(%esp)
        fstpl   (%esp)
        movl    (%esp), %ebx
        movl    4(%esp), %ecx
        movl    (%esi), %eax
        movl    4(%esi), %edx
    # ...
Part of the problem is the x86-32 calling convention, and part of the problem is that 64-bit floating point goes through x87.
With -mattr=+sse,+sse2 the generated code becomes:
    fadd_32r:                       # @fadd_32r
    # ...
        movss   8(%esp), %xmm0      # xmm0 = mem[0],zero,zero,zero
        movl    4(%esp), %eax
        addss   (%eax), %xmm0
        movss   %xmm0, (%eax)
        retl
    # ...
    fadd_64r:                       # @fadd_64r
    # ...
        movl    32(%esp), %esi
        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ebx, %ebx
        xorl    %ecx, %ecx
        lock    cmpxchg8b (%esi)
        movl    %edx, 12(%esp)
        movl    %eax, 8(%esp)
        movsd   8(%esp), %xmm0      # xmm0 = mem[0],zero
        addsd   36(%esp), %xmm0
        movsd   %xmm0, (%esp)
        movl    (%esp), %ebx
        movl    4(%esp), %ecx
        movl    (%esi), %eax
        movl    4(%esi), %edx
I could do something with this, but I'm a bit wary of how the calling convention works out (or rather, worried that the code generation might change slightly and throw things off).
@echristo suggested I add @chandlerc because this patch says "atomic". Note @dvyukov's LGTM above.
I would have preferred to see the renaming in another patch, but at this point I don't think it's worth the effort to split it out. It's a mechanical change, not a change in behavior, and tablegen would have made it clear if it didn't like the change.
So, I agree with @dvyukov, LGTM.
Cheers,
Pete
| lib/Target/X86/X86InstrCompiler.td | | |
|---|---|---|
| 765 ↗ | (On Diff #30499) | This was there prior to your change, but I wonder if we should have a later patch (by you or me or anyone else) to consider removing this cast. We can do so by taking 'SDNode op' instead of a string. For example: |

    diff --git a/lib/Target/X86/X86InstrCompiler.td b/lib/Target/X86/X86InstrCompiler.td
    +++ b/lib/Target/X86/X86InstrCompiler.td
    -multiclass RELEASE_BINOP_MI<string op> {
    +multiclass RELEASE_BINOP_MI<SDNode op> {
       def NAME#8mi : I<0, Pseudo, (outs), (ins i8mem:$dst, i8imm:$src),
                        "#RELEASE_BINOP PSEUDO!",
    +                   [(atomic_store_8 addr:$dst,
    +                     (op (atomic_load_8 addr:$dst), (i8 imm:$src)))]>;
       // NAME#16 is not generated as 16-bit arithmetic instructions are considered
       // costly and avoided as far as possible by this backend anyway
       def NAME#32mi : I<0, Pseudo, (outs), (ins i32mem:$dst, i32imm:$src),
                         "#RELEASE_BINOP PSEUDO!",
    +                    [(atomic_store_32 addr:$dst,
    +                      (op (atomic_load_32 addr:$dst), (i32 imm:$src)))]>;
       def NAME#64mi32 : I<0, Pseudo, (outs), (ins i64mem:$dst, i64i32imm:$src),
                          "#RELEASE_BINOP PSEUDO!",
    +                     [(atomic_store_64 addr:$dst,
    +                       (op (atomic_load_64 addr:$dst), (i64immSExt32:$src)))]>;
     }
| lib/Target/X86/X86InstrCompiler.td | | |
|---|---|---|
| 765 ↗ | (On Diff #30499) | I can send a follow-up. |