This is an archive of the discontinued LLVM Phabricator instance.

[TargetInfo] Adjust x86-32 atomic support to the CPU used
Needs ReviewPublic

Authored by mgorny on Feb 4 2017, 10:24 AM.

Details

Summary

Set the maximum width of atomic operations on x86-32 based on the target
CPU. The 64-bit inline atomics require cmpxchg8b which is an i586
instruction. Other inline atomics require cmpxchg which is an i486
instruction.

This fixes the incorrect value of GCC_ATOMIC_LLONG_LOCK_FREE
and
atomic_always_lock_free() on FreeBSD where clang defaults to i486
CPU (PR#31864).

For CUDA device builds, assume i586+. This matches the default CPUs for
all x86-32 targets on systems supporting CUDA.

Diff Detail

Event Timeline

mgorny created this revision.Feb 4 2017, 10:24 AM
jlebar edited edge metadata.Feb 4 2017, 10:49 AM

Could someone help me figure out what is the cause and correct solution to that failure? @jlebar?

The test is checking that the macros have the same value when compiling for CUDA host and device. That is, if we're compiling for an x86 CPU and an NVPTX GPU, we invoke cc1 twice, and the macros should have the same values both times. Which, I know, is a lie. But because when we're compiling for NVPTX we still parse all of the CPU code, macros generally need to have the same values otherwise we get into Big Trouble. NVPTX atomics are controlled separately.

You can see in NVPTXTargetInfo that we read properties from the host targetinfo so that we export the same macros. The problem here seems to be that we're mutating the x86 targetinfo after the nvptx targetinfo reads its properties.

Does that give you enough context to fix the problem?

dim added inline comments.Feb 4 2017, 11:32 AM
lib/Basic/Targets.cpp
4251

Are you purposefully not setting MaxAtomicInlineWidth here? (It seems from TargetInfo that the default value is zero.)

4273

As far as I can see, in the constructor this call is _always_ made with CPU set to CK_Generic, i.e. zero. Therefore, the "allow locked atomics up to 4 bytes" path in setAtomic is always chosen. Maybe it is clearer to just initialize MaxAtomicPromoteWidth to 32 directly here, instead?

Could someone help me figure out what is the cause and correct solution to that failure? @jlebar?

You can see in NVPTXTargetInfo that we read properties from the host targetinfo so that we export the same macros. The problem here seems to be that we're mutating the x86 targetinfo after the nvptx targetinfo reads its properties.

Does that give you enough context to fix the problem?

Thanks. I'll try to find a reasonably sane solution ;-).

lib/Basic/Targets.cpp
4251

Yes. I've based this on what's done for ARM. Unless I misunderstood something, this means that on 'plain' i386 there is no inline atomics support but we still want to do atomics-via-locking up to 32-bit types. I'm not sure about 32/64 here to match i486.

4273

Well, I just copied the idea from ARM. I thought of it more like 'make sure it is initialized to some value, possibly update it later when setting CPU'. I'm fine either way.

mgorny updated this revision to Diff 87134.Feb 5 2017, 1:42 AM
mgorny edited the summary of this revision. (Show Details)

Ok, this CUDA fix should be reasonable, i think. It simply assumes i586+ (i.e. all inline atomics enabled) for CUDA target builds. I seriously doubt it's technically possible that anyone will ever use CUDA on <i586 ;-).

jlebar added inline comments.Feb 5 2017, 2:07 PM
lib/Basic/Targets.cpp
1784

Someone else is calling setCPU on the HostTarget. If they're calling it after we call it here, this is obviously not going to work. Since it does work, I presume they are calling it before we get here. In which case, can we not just set MaxAtomicPromoteWidth=HostTarget->MaxAtomicPromoteWidth right after we set MaxAtomicInlineWidth = HostTarget->getMaxAtomicInlineWidth(); below?

(Note that as written definitely isn't right because it assumes that HostTarget is non-null.)

mgorny updated this revision to Diff 87170.Feb 5 2017, 3:12 PM

CUDA: added the MaxAtomicPromoteWidth setting, and moved the CPU setting a little lower to ensure that it doesn't get called with null HostTarget.

jlebar added inline comments.Feb 5 2017, 4:59 PM
lib/Basic/Targets.cpp
1808

Okay, is this still needed now?

mgorny added inline comments.Feb 6 2017, 12:19 AM
lib/Basic/Targets.cpp
1808

Yes. I've specifically tested with it commented out, and the CPU gets initiated to generic (=no inline atomics) then.

jlebar added inline comments.Feb 6 2017, 8:37 AM
lib/Basic/Targets.cpp
1808

Yes, but is that a bug? Does that break the test?

I thought the problem we were trying to solve here was that CUDA host and device builds did not define the same macros. And I thought that setCPU modified the values for MaxAtomicInlineWidth and MaxAtomicPromoteWidth. Moreover I thought that we called HostTarget->setCPU before calling this function.

If all of those things are true, I don't see what problem we're solving by calling HostTarget->setCPU("i586") here.

joerg added a subscriber: joerg.Feb 6 2017, 8:55 AM

At this point, I don't think there is any use on pretending that i386-as-default makes sense. So I would request that the i386 case should be made explicit or just dropped, with a preference for the latter.

mgorny added inline comments.Feb 6 2017, 9:32 AM
lib/Basic/Targets.cpp
1808

Well, the thing is, we don't call HostTarget->setCPU() before this function. We just call AllocateTarget(), and it does not set the CPU.

Normally the CPU is set in Driver, based on -march etc. if provided, with fallback to platform-specific defaults. In the case of host-side CUDA build, the Driver sets x86-specific CPU. While the defaults differ per platform, for all platforms supporting CUDA it's i586+.

Now, for the target-side, the Driver creates NVPTX target, and sets NVPTX-specific CPU. The HostTarget instance is only created within NVPTXTargetInfo, and so we need to setCPU() explicitly. Since we can reliably assume that the host-side will be i586+, we use i586 here.

So far this didn't matter since all atomic properties were defined statically. However, this patch changes them to adjust to the CPU used, and so if the X8632TargetInfo instance is allocated without an explicit setCPU() call, it defaults to generic x86 (= no inline atomics available) which is different from the host platform default. As a result, different macros are defined and the test fails.

mgorny added a comment.Feb 6 2017, 9:43 AM

At this point, I don't think there is any use on pretending that i386-as-default makes sense. So I would request that the i386 case should be made explicit or just dropped, with a preference for the latter.

By the former, do you mean making CK_Generic imply i486+ or i586+? What about line ~3947 where the same conditions are used to control other definitions? Should they be changed too?

joerg added a comment.Feb 6 2017, 9:55 AM

Generic should imply i486+. I don't think any general purpose system supports i386 at this point, simply because it has an annoying number of bugs in critical components. The i486 (esp. the non-crippled ones) are reasonable easy to support and there are still people around with hardware, esp. clones.

ahatanak added inline comments.
lib/Basic/Targets.cpp
4251

If there isn't a test case for plain i386, is it possible to add one (perhaps in test/Sema/atomic-ops.c)?

mgorny added inline comments.Feb 6 2017, 10:11 AM
lib/Basic/Targets.cpp
4251

I could do that. However, @joerg suggested dropping i386 branch entirely, and assuming i486+.

ahatanak added inline comments.Feb 6 2017, 10:16 AM
lib/Basic/Targets.cpp
4251

OK, thanks. In that case, we don't need the test.

Well, the thing is, we don't call HostTarget->setCPU() before this function. We just call AllocateTarget(), and it does not set the CPU.

Ah, got it. LGTM for the nvptx changes.

dim edited edge metadata.Feb 6 2017, 11:51 AM

There's still something strange here. If I compile the following on i386-freebsd12, with clang -march=i486 -O2 -S:

_Atomic(long long) ll;

void f(void)
{
  ++ll;
}

the result is:

	.globl	f
	.p2align	4, 0x90
	.type	f,@function
f:                                      # @f
# BB#0:                                 # %entry
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	movl	ll+4, %edx
	movl	ll, %eax
	.p2align	4, 0x90
.LBB0_1:                                # %atomicrmw.start
                                        # =>This Inner Loop Header: Depth=1
	movl	%eax, %ebx
	addl	$1, %ebx
	movl	%edx, %ecx
	adcl	$0, %ecx
	lock		cmpxchg8b	ll
	jne	.LBB0_1
# BB#2:                                 # %atomicrmw.end
	popl	%ebx
	popl	%ebp
	retl
.Lfunc_end0:
	.size	f, .Lfunc_end0-f

	.type	ll,@object              # @ll
	.comm	ll,8,4

So what gives? It's still inserting a cmpxchg8b! AFAIK it should now insert a call to __atomic_fetch_add_8 instead.

Note that this changes if you use C++ atomics, e.g.:

#include <atomic>

void f(std::atomic<long long>& x)
{
  ++x;
}

compiles to:

	.globl	_Z1fRNSt3__16atomicIxEE
	.p2align	4, 0x90
	.type	_Z1fRNSt3__16atomicIxEE,@function
_Z1fRNSt3__16atomicIxEE:                # @_Z1fRNSt3__16atomicIxEE
.Lfunc_begin0:
	.cfi_sections .debug_frame
	.cfi_startproc
	.cfi_personality 0, __gxx_personality_v0
	.cfi_lsda 0, .Lexception0
# BB#0:                                 # %entry
	pushl	%ebp
	movl	%esp, %ebp
	subl	$16, %esp
	movl	8(%ebp), %eax
.Ltmp0:
	movl	%eax, (%esp)
	movl	$5, 12(%esp)
	movl	$0, 8(%esp)
	movl	$1, 4(%esp)
	calll	__atomic_fetch_add_8
.Ltmp1:
# BB#1:                                 # %_ZNSt3__113__atomic_baseIxLb1EEppEv.exit
	addl	$16, %esp
	popl	%ebp
	retl
.LBB0_2:                                # %lpad.i.i
.Ltmp2:
	movl	%eax, (%esp)
	calll	__cxa_call_unexpected
	subl	$4, %esp
.Lfunc_end0:
	.size	_Z1fRNSt3__16atomicIxEE, .Lfunc_end0-_Z1fRNSt3__16atomicIxEE
	.cfi_endproc
mgorny updated this revision to Diff 87311.Feb 6 2017, 2:52 PM

Removed the i386 branch. Now the i486+ are used unconditionally.

Le gentle ping.

hans added a subscriber: hans.

+jyknight, who had a similar patch in http://reviews.llvm.org/D17933 (see also r291477 and PR31864)

test/CodeGen/atomic-ops.c
1

Naive question: why is the i686- part of the triple not sufficient here; why is -target-cpu needed?

mgorny added inline comments.Feb 23 2017, 9:32 PM
test/CodeGen/atomic-ops.c
1

It's because triple is not really meaningful on most of the systems (e.g. many Linux distros use i386, *BSD use i486), and the default CPU logic is applied in the Driver, while cc1 is called directly here.

mgorny marked 5 inline comments as done.Mar 10 2017, 11:40 AM

A gentle ping.

Maybe this change is obsolete now that D59566 is merged?

Herald added a project: Restricted Project. · View Herald TranscriptMay 23 2022, 6:51 PM