This is an archive of the discontinued LLVM Phabricator instance.

Differential D71128

[NVPTX][FIX] Expand atomics we cannot handle natively in the ISA
Needs Review · Public

Authored by jdoerfert on Dec 6 2019, 9:47 AM.

Details

Reviewers

tra
__simt__
arsenm

Summary

NOTE: This is lacking a test and more of a request for feedback (I'm not an NVPTX person).

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	30 ms	LLVM.CodeGen/NVPTX::Unknown Unit Message ("")
	50 ms	LLVM.CodeGen/NVPTX::Unknown Unit Message ("")
	30 ms	LLVM.CodeGen/NVPTX::Unknown Unit Message ("")

Event Timeline

jdoerfert created this revision. · Dec 6 2019, 9:47 AM

Herald added a project: Restricted Project. · Dec 6 2019, 9:47 AM

Herald added subscribers: jfb, bollu, hiraditya and 2 others.

jfb added a reviewer: __simt__. · Dec 6 2019, 9:54 AM

Needs tests. The AMDGPU ones in test/Transforms/AtomicExpand can probably be copied as-is (plus another codegen one to make sure AtomicExpand is actually running)

Build result: pass - 60562 tests passed, 0 failed and 726 were skipped.

Log files: console-log.txt, CMakeCache.txt

Harbormaster completed remote builds in B42022: Diff 232596. · Dec 6 2019, 10:05 AM

+1 for tests.

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
360	Typo: it's 44219: https://bugs.llvm.org/show_bug.cgi?id=44219

Add tests and run pass in the pipeline

Fixed the typo, copied test/Transforms/AtomicExpand/AMDGPU into NVPTX and changed the run lines accordingly. Then I ran update_test_checks. The result differs from before (some expansion happens) and is close to the AMDGPU result, but I haven't verified everything.

I also added the pass explicitly to the NVPTX required passes. Are there existing tests to check the target specific pipeline?

Build result: fail - 60568 tests passed, 3 failed and 726 were skipped.

failed: LLVM.CodeGen/NVPTX/atomics-sm60.ll
failed: LLVM.CodeGen/NVPTX/atomics.ll
failed: LLVM.CodeGen/NVPTX/load-store.ll

Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B42034: Diff 232633! · Dec 6 2019, 1:12 PM

> Command Output (stderr):
> --
> /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/llvm/test/CodeGen/NVPTX/atomics-sm60.ll:6:10: error: CHECK: expected string not found in input
> ; CHECK: atom.add.f64
>          ^
> <stdin>:1:1: note: scanning from here
> //
> ^
> <stdin>:34:2: note: possible intended match here
>  atom.cas.b64 %rd3, [%r1], %rd2, %rd1;
>  ^

This appears to be a regression. We do have fp32/fp64 atomic adds in NVPTX. Replacing them with add+CAS is suboptimal.
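To make the "add+CAS" cost concrete, here is a minimal host-side C++ sketch of the shape that the expansion gives to a floating-point atomic add: load the word, add in floating point, then retry a compare-and-swap until no other thread has intervened. It is an illustration only, not NVPTX code; the names `atomicFAddViaCAS`, `floatToBits` and `bitsToFloat` are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit-cast helpers (memcpy keeps this valid before C++20's std::bit_cast).
static std::uint32_t floatToBits(float F) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

static float bitsToFloat(std::uint32_t Bits) {
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Emulates `atomicrmw fadd` on a 32-bit word with a compare-and-swap loop.
// Returns the value stored before the addition, as atomicrmw does.
float atomicFAddViaCAS(std::atomic<std::uint32_t> &Word, float Value) {
  std::uint32_t Old = Word.load();
  for (;;) {
    std::uint32_t New = floatToBits(bitsToFloat(Old) + Value);
    // On failure, compare_exchange_weak refreshes Old with the current word,
    // so the next iteration recomputes the sum from up-to-date contents.
    if (Word.compare_exchange_weak(Old, New))
      return bitsToFloat(Old);
  }
}
```

Under contention every failed compare-and-swap repeats the load, add and swap, whereas a native instruction such as `atom.add.f32` performs the update in a single operation.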

In D71128#1773447, @tra wrote:

> This appears to be a regression. We do have fp32/fp64 atomic adds in NVPTX. Replacing them with add+CAS is suboptimal.

I'm already working on it.

Fix test cases by exposing more TLI hooks

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B42045: Diff 232662! · Dec 6 2019, 3:39 PM

In D71128#1773635, @merge_guards_bot wrote:

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

I have the feeling that wasn't my fault.

tra added inline comments. · Dec 9 2019, 11:57 AM

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll
17	Don't we want to preserve `atomicrmw fadd` in this case and lower it to `atom.add.f32` ? Why do we want to expand here?
132	Ditto here and below. We do have `atom.add.f64`
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
6	Function name `fadd` does not seem to match the instruction `fsub`.
16	I must be missing something -- I would think that we do not want to expand atomicrmw variants which we can lower to an existing instruction, but a lot of the tests show the opposite and expand atomics that have direct support in hardware. The patch subject seems to agree with my assumptions, but the tests appear to contradict it. Is that intentional? If so, what is it that I'm missing?
llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll
3	Nit: no need for `-check-prefix` as you are only using `CHECK` in the test.

I will try to look into the problematic lowerings @tra pointed out (thanks btw!). Any hints to why they are expanded are appreciated :)

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll
17	Same as below.
132	To be honest, I don't even know why we do not match it. All I tried to do was add the limits on size and alignment. Somehow that had more effect than I wanted. The new hooks already remove some of the weirdness we saw, but it seems something is missing here (maybe during the instruction "registration").
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
6	Good catch, copy & paste from the AMD tests ;) (@arsenm)
16	It is not intentional to pessimize anything, as mentioned above. The problem is that I am neither a backend nor an NVPTX person, and my changes do seem to have unwanted effects I cannot even categorize.
llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll
3	Fair, I think I copied this ;)

arsenm added inline comments. · Dec 10 2019, 8:20 AM

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
16	For the purpose of this change, it doesn't matter that this isn't optimal. These aren't implemented correctly, but fixing that is a separate change, and its effects will show up in these same tests.

tra added inline comments. · Dec 10 2019, 9:05 AM

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
16	OK. Looks like `atomicrmw fsub` currently fails to lower on NVPTX, so expanding it is an improvement. However, expanding `atomicrmw fadd` is a substantial regression and is likely to be a showstopper. Atomic FP32 addition is a commonly used instruction in various reduction kernels, so anything that prevents mapping it to the `atom.add.f32` instruction will be very noticeable. I realize that there are many moving parts involved in getting this to work properly. If a proper fix needs multiple patches, please try to commit them atomically to avoid a performance regression in between those changes. Also, if there are dependent patches, it would be great to arrange all of them as such in Phabricator, so it's easier to see the big picture.

arsenm added inline comments. · Jan 9 2020, 7:15 AM

llvm/include/llvm/CodeGen/TargetLowering.h
1856–1871 ↗	(On Diff #232662)	Do we really need 4 of these when just the one for Instruction will work?

jdoerfert marked an inline comment as done. · Jan 9 2020, 8:32 AM

jdoerfert added inline comments.

llvm/include/llvm/CodeGen/TargetLowering.h
1856–1871 ↗	(On Diff #232662)	Alternatively we can overload the instruction one and check for the kind to decide what to do. I don't remember how I ended up like this, I'll address this once I get around to this patch again...

arsenm resigned from this revision. · Feb 13 2020, 4:44 PM

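Revision Contents

Path	Size
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp	5 lines
llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp	1 line
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i16.ll	183 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i8.ll	183 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll	186 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll	162 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-nand.ll	30 lines
llvm/test/Transforms/AtomicExpand/NVPTX/lit.local.cfg	2 lines
llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll	34 lines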

Diff 232633

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,

  setBooleanContents(ZeroOrNegativeOneBooleanContent);
  setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);

  // Jump is Expensive. Don't create extra control flow for 'and', 'or'
  // condition branches.
  setJumpIsExpensive(true);

+ // Force atomics to be expanded if the ISA doesn't support them: PR44219
+ setMinCmpXchgSizeInBits(32);
+ setMaxAtomicSizeInBitsSupported(64);
+ setSupportsUnalignedAtomics(false);

  // Wide divides are _very_ slow. Try to reduce the width of the divide if
  // possible.
  addBypassSlowDiv(64, 32);

  // By default, use the Source scheduling
  if (sched4reg)
    setSchedulingPreference(Sched::RegPressure);
  else
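As a rough illustration of what the three limits added in this hunk are meant to express, the following standalone C++ sketch classifies a compare-and-swap by size and alignment. It is a simplified model written for this discussion, not the logic of the AtomicExpand pass; `AtomicLimits`, `AtomicAction` and `classifyCmpXchg` are hypothetical names, and the exact rules the pass applies may differ.

```cpp
#include <cassert>

// Hypothetical mirror of the three limits set in the constructor
// (values copied from the patch).
struct AtomicLimits {
  unsigned MinCmpXchgSizeInBits = 32;
  unsigned MaxAtomicSizeInBitsSupported = 64;
  bool SupportsUnalignedAtomics = false;
};

enum class AtomicAction { KeepNative, WidenPartWord, ExpandToLibCall };

// Simplified model (an assumption, not the pass itself): decide what to do
// with a cmpxchg of SizeInBits bits whose pointer is aligned to AlignInBytes.
AtomicAction classifyCmpXchg(const AtomicLimits &Limits, unsigned SizeInBits,
                             unsigned AlignInBytes) {
  bool NaturallyAligned = AlignInBytes * 8 >= SizeInBits;
  // Too wide, or under-aligned without unaligned support: no native form.
  if (SizeInBits > Limits.MaxAtomicSizeInBitsSupported ||
      (!NaturallyAligned && !Limits.SupportsUnalignedAtomics))
    return AtomicAction::ExpandToLibCall;
  // Narrower than the smallest native cmpxchg: operate on the containing word.
  if (SizeInBits < Limits.MinCmpXchgSizeInBits)
    return AtomicAction::WidenPartWord;
  return AtomicAction::KeepNative;
}
```

With the values from the patch, a naturally aligned 32-bit or 64-bit cmpxchg stays native, an 8-bit or 16-bit one is widened to 32 bits, and anything wider than 64 bits or under-aligned falls outside what the target handles directly.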

llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp

void NVPTXPassConfig::addIRPasses() {
  // call addEarlyAsPossiblePasses.
  const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();
  addPass(createNVVMReflectPass(ST.getSmVersion()));

  if (getOptLevel() != CodeGenOpt::None)
    addPass(createNVPTXImageOptimizerPass());
  addPass(createNVPTXAssignValidGlobalNamesPass());
  addPass(createGenericToNVVMPass());
+ addPass(createAtomicExpandPass());

  // NVPTXLowerArgs is required for correctness and should be run right
  // before the address space inference passes.
  addPass(createNVPTXLowerArgsPass(&getNVPTXTargetMachine()));
  if (getOptLevel() != CodeGenOpt::None) {
    addAddressSpaceInferencePasses();
    if (!DisableLoadStoreVectorizer)
      addPass(createLoadStoreVectorizerPass());

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i16.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -mtriple=nvptx-unknown-unknown -S -atomic-expand %s | FileCheck %s
; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s

define i16 @test_atomicrmw_xchg_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_xchg_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw xchg i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw xchg i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_add_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_add_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw add i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw add i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_sub_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_sub_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw sub i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw sub i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_and_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_and_i16_global(
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[PTR:%.*]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[VALUE:%.*]] to i32
; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[ANDOPERAND:%.*]] = or i32 [[INV_MASK]], [[VALOPERAND_SHIFTED]]
; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw and i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[ANDOPERAND]] seq_cst
; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
; CHECK-NEXT: ret i16 [[TMP7]]
;
  %res = atomicrmw and i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_nand_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_nand_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw nand i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_or_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_or_i16_global(
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[PTR:%.*]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[VALUE:%.*]] to i32
; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw or i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
; CHECK-NEXT: ret i16 [[TMP7]]
;
  %res = atomicrmw or i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_xor_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_xor_i16_global(
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[PTR:%.*]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[VALUE:%.*]] to i32
; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw xor i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
; CHECK-NEXT: ret i16 [[TMP7]]
;
  %res = atomicrmw xor i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_max_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_max_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw max i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw max i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_min_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_min_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw min i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw min i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_umax_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_umax_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw umax i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw umax i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_atomicrmw_umin_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
; CHECK-LABEL: @test_atomicrmw_umin_i16_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw umin i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i16 [[RES]]
;
  %res = atomicrmw umin i16 addrspace(1)* %ptr, i16 %value seq_cst
  ret i16 %res
}

define i16 @test_cmpxchg_i16_global(i16 addrspace(1)* %out, i16 %in, i16 %old) {
; CHECK-LABEL: @test_cmpxchg_i16_global(
; CHECK-NEXT: [[GEP:%.*]] = getelementptr i16, i16 addrspace(1)* [[OUT:%.*]], i64 4
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[GEP]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[IN:%.*]] to i32
; CHECK-NEXT: [[TMP5:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP6:%.*]] = zext i16 [[OLD:%.*]] to i32
; CHECK-NEXT: [[TMP7:%.*]] = shl i32 [[TMP6]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32 addrspace(1)* [[ALIGNEDADDR]]
; CHECK-NEXT: [[TMP9:%.*]] = and i32 [[TMP8]], [[INV_MASK]]
; CHECK-NEXT: br label [[PARTWORD_CMPXCHG_LOOP:%.*]]
; CHECK: partword.cmpxchg.loop:
; CHECK-NEXT: [[TMP10:%.*]] = phi i32 [ [[TMP9]], [[TMP0:%.*]] ], [ [[TMP16:%.*]], [[PARTWORD_CMPXCHG_FAILURE:%.*]] ]
; CHECK-NEXT: [[TMP11:%.*]] = or i32 [[TMP10]], [[TMP5]]
; CHECK-NEXT: [[TMP12:%.*]] = or i32 [[TMP10]], [[TMP7]]
; CHECK-NEXT: [[TMP13:%.*]] = cmpxchg i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[TMP12]], i32 [[TMP11]] seq_cst seq_cst
; CHECK-NEXT: [[TMP14:%.*]] = extractvalue { i32, i1 } [[TMP13]], 0
; CHECK-NEXT: [[TMP15:%.*]] = extractvalue { i32, i1 } [[TMP13]], 1
; CHECK-NEXT: br i1 [[TMP15]], label [[PARTWORD_CMPXCHG_END:%.*]], label [[PARTWORD_CMPXCHG_FAILURE]]
; CHECK: partword.cmpxchg.failure:
; CHECK-NEXT: [[TMP16]] = and i32 [[TMP14]], [[INV_MASK]]
; CHECK-NEXT: [[TMP17:%.*]] = icmp ne i32 [[TMP10]], [[TMP16]]
; CHECK-NEXT: br i1 [[TMP17]], label [[PARTWORD_CMPXCHG_LOOP]], label [[PARTWORD_CMPXCHG_END]]
; CHECK: partword.cmpxchg.end:
; CHECK-NEXT: [[TMP18:%.*]] = lshr i32 [[TMP14]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP19:%.*]] = trunc i32 [[TMP18]] to i16
; CHECK-NEXT: [[TMP20:%.*]] = insertvalue { i16, i1 } undef, i16 [[TMP19]], 0
; CHECK-NEXT: [[TMP21:%.*]] = insertvalue { i16, i1 } [[TMP20]], i1 [[TMP15]], 1
; CHECK-NEXT: [[EXTRACT:%.*]] = extractvalue { i16, i1 } [[TMP21]], 0
; CHECK-NEXT: ret i16 [[EXTRACT]]
;
  %gep = getelementptr i16, i16 addrspace(1)* %out, i64 4
  %res = cmpxchg i16 addrspace(1)* %gep, i16 %old, i16 %in seq_cst seq_cst
  %extract = extractvalue {i16, i1} %res, 0
  ret i16 %extract
}
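
The `and` and `or` expansions checked in this file widen a 16-bit operation to the aligned 32-bit word that contains it: shift the operand into its lane, pad the remaining bits with the operation's identity value, and shift the old word back down. A minimal host-side C++ sketch of that arithmetic follows, assuming a little-endian layout; `atomicAnd16`, `atomicOr16` and the `ByteOffset` parameter (standing in for `PTRLSB`) are names invented for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// ByteOffset is the position of the 16-bit value inside its aligned 32-bit
// word (0 or 2), i.e. the role PTRLSB plays in the CHECK lines.
std::uint16_t atomicAnd16(std::atomic<std::uint32_t> &Word,
                          unsigned ByteOffset, std::uint16_t Value) {
  unsigned ShiftAmt = ByteOffset * 8;
  std::uint32_t Mask = std::uint32_t{0xFFFF} << ShiftAmt;
  std::uint32_t Shifted = std::uint32_t{Value} << ShiftAmt;
  // Bits outside the lane are all ones, so the AND leaves them unchanged.
  std::uint32_t Old = Word.fetch_and(~Mask | Shifted);
  return static_cast<std::uint16_t>(Old >> ShiftAmt);
}

std::uint16_t atomicOr16(std::atomic<std::uint32_t> &Word,
                         unsigned ByteOffset, std::uint16_t Value) {
  unsigned ShiftAmt = ByteOffset * 8;
  // Bits outside the lane are zero, so the OR leaves them unchanged.
  std::uint32_t Old = Word.fetch_or(std::uint32_t{Value} << ShiftAmt);
  return static_cast<std::uint16_t>(Old >> ShiftAmt);
}
```

The same padding trick works for `xor`, which also has zero as its identity. It does not carry over to `add`, `sub`, `min` or `max`, where a carry or comparison would involve the neighbouring lane; in the checks above those operations are left as 16-bit `atomicrmw` instructions.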

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i8.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -mtriple=nvptx-unknown-unknown -S -atomic-expand %s | FileCheck %s
; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s

define i8 @test_atomicrmw_xchg_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_xchg_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw xchg i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw xchg i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_add_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_add_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw add i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw add i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_sub_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_sub_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw sub i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw sub i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_and_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_and_i8_global(
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[PTR:%.*]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[VALUE:%.*]] to i32
; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[ANDOPERAND:%.*]] = or i32 [[INV_MASK]], [[VALOPERAND_SHIFTED]]
; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw and i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[ANDOPERAND]] seq_cst
; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i8
; CHECK-NEXT: ret i8 [[TMP7]]
;
  %res = atomicrmw and i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_nand_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_nand_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw nand i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_or_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_or_i8_global(
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[PTR:%.*]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[VALUE:%.*]] to i32
; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw or i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i8
; CHECK-NEXT: ret i8 [[TMP7]]
;
  %res = atomicrmw or i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_xor_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_xor_i8_global(
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[PTR:%.*]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[VALUE:%.*]] to i32
; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw xor i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i8
; CHECK-NEXT: ret i8 [[TMP7]]
;
  %res = atomicrmw xor i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_max_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_max_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw max i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw max i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_min_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_min_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw min i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw min i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_umax_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_umax_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw umax i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw umax i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_atomicrmw_umin_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
; CHECK-LABEL: @test_atomicrmw_umin_i8_global(
; CHECK-NEXT: [[RES:%.*]] = atomicrmw umin i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
; CHECK-NEXT: ret i8 [[RES]]
;
  %res = atomicrmw umin i8 addrspace(1)* %ptr, i8 %value seq_cst
  ret i8 %res
}

define i8 @test_cmpxchg_i8_global(i8 addrspace(1)* %out, i8 %in, i8 %old) {
; CHECK-LABEL: @test_cmpxchg_i8_global(
; CHECK-NEXT: [[GEP:%.*]] = getelementptr i8, i8 addrspace(1)* [[OUT:%.*]], i64 4
; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[GEP]] to i64
; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[IN:%.*]] to i32
; CHECK-NEXT: [[TMP5:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP6:%.*]] = zext i8 [[OLD:%.*]] to i32
; CHECK-NEXT: [[TMP7:%.*]] = shl i32 [[TMP6]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32 addrspace(1)* [[ALIGNEDADDR]]
; CHECK-NEXT: [[TMP9:%.*]] = and i32 [[TMP8]], [[INV_MASK]]
; CHECK-NEXT: br label [[PARTWORD_CMPXCHG_LOOP:%.*]]
; CHECK: partword.cmpxchg.loop:
; CHECK-NEXT: [[TMP10:%.*]] = phi i32 [ [[TMP9]], [[TMP0:%.*]] ], [ [[TMP16:%.*]], [[PARTWORD_CMPXCHG_FAILURE:%.*]] ]
; CHECK-NEXT: [[TMP11:%.*]] = or i32 [[TMP10]], [[TMP5]]
; CHECK-NEXT: [[TMP12:%.*]] = or i32 [[TMP10]], [[TMP7]]
; CHECK-NEXT: [[TMP13:%.*]] = cmpxchg i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[TMP12]], i32 [[TMP11]] seq_cst seq_cst
; CHECK-NEXT: [[TMP14:%.*]] = extractvalue { i32, i1 } [[TMP13]], 0
; CHECK-NEXT: [[TMP15:%.*]] = extractvalue { i32, i1 } [[TMP13]], 1
; CHECK-NEXT: br i1 [[TMP15]], label [[PARTWORD_CMPXCHG_END:%.*]], label [[PARTWORD_CMPXCHG_FAILURE]]
; CHECK: partword.cmpxchg.failure:
; CHECK-NEXT: [[TMP16]] = and i32 [[TMP14]], [[INV_MASK]]
; CHECK-NEXT: [[TMP17:%.*]] = icmp ne i32 [[TMP10]], [[TMP16]]
; CHECK-NEXT: br i1 [[TMP17]], label [[PARTWORD_CMPXCHG_LOOP]], label [[PARTWORD_CMPXCHG_END]]
; CHECK: partword.cmpxchg.end:
; CHECK-NEXT: [[TMP18:%.*]] = lshr i32 [[TMP14]], [[SHIFTAMT]]
; CHECK-NEXT: [[TMP19:%.*]] = trunc i32 [[TMP18]] to i8
; CHECK-NEXT: [[TMP20:%.*]] = insertvalue { i8, i1 } undef, i8 [[TMP19]], 0
; CHECK-NEXT: [[TMP21:%.*]] = insertvalue { i8, i1 } [[TMP20]], i1 [[TMP15]], 1
; CHECK-NEXT: [[EXTRACT:%.*]] = extractvalue { i8, i1 } [[TMP21]], 0
; CHECK-NEXT: ret i8 [[EXTRACT]]
;
  %gep = getelementptr i8, i8 addrspace(1)* %out, i64 4
  %res = cmpxchg i8 addrspace(1)* %gep, i8 %old, i8 %in seq_cst seq_cst
  %extract = extractvalue {i8, i1} %res, 0
  ret i8 %extract
}
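
The `partword.cmpxchg.loop` / `partword.cmpxchg.failure` blocks checked above implement an 8-bit compare-and-swap on top of a 32-bit one: a failure is only reported when the target byte itself mismatches, and the loop retries when only the neighbouring bytes changed. A minimal host-side C++ sketch of that control flow follows, assuming a little-endian layout; `cmpxchg8`, `CmpXchg8Result` and `ByteOffset` are names invented for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct CmpXchg8Result {
  std::uint8_t Old; // byte observed in memory
  bool Success;     // whether Desired was stored
};

// ByteOffset (0..3) selects the byte inside the aligned 32-bit word.
CmpXchg8Result cmpxchg8(std::atomic<std::uint32_t> &Word, unsigned ByteOffset,
                        std::uint8_t Expected, std::uint8_t Desired) {
  unsigned ShiftAmt = ByteOffset * 8;
  std::uint32_t InvMask = ~(std::uint32_t{0xFF} << ShiftAmt);
  std::uint32_t ExpectedShifted = std::uint32_t{Expected} << ShiftAmt;
  std::uint32_t DesiredShifted = std::uint32_t{Desired} << ShiftAmt;
  // Current contents of the other three bytes (the loop's phi value).
  std::uint32_t Others = Word.load() & InvMask;
  for (;;) {
    std::uint32_t Observed = Others | ExpectedShifted;
    bool Success =
        Word.compare_exchange_strong(Observed, Others | DesiredShifted);
    std::uint8_t OldByte = static_cast<std::uint8_t>(Observed >> ShiftAmt);
    if (Success)
      return {OldByte, true};
    // Failure block: retry only if the neighbouring bytes changed.
    std::uint32_t NewOthers = Observed & InvMask;
    if (NewOthers == Others)
      return {OldByte, false}; // the target byte itself did not match
    Others = NewOthers;
  }
}
```

The retry check is what keeps an unrelated write to a neighbouring byte from surfacing as a spurious failure of the narrow compare-and-swap.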

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=nvptx-unknown-unknown -mcpu=sm_30 -atomic-expand %s \| FileCheck %s
				; RUN: opt -S -mtriple=nvptx-unknown-unknown -mcpu=sm_60 -atomic-expand %s \| FileCheck %s
				; RUN: opt -S -mtriple=nvptx-unknown-unknown -mcpu=sm_75 -atomic-expand %s \| FileCheck %s

				define float @test_atomicrmw_fadd_f32_flat(float* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_flat(
				; CHECK-NEXT: [[TMP1:%.]] = load float, float [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.]] = phi float [ [[TMP1]], [[TMP0:%.]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.]] = fadd float [[LOADED]], [[VALUE:%.]]
				; CHECK-NEXT: [[TMP2:%.]] = bitcast float [[PTR]] to i32*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.]] = cmpxchg i32 [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				traUnsubmitted Not Done Reply Inline Actions Don't we want to preserve `atomicrmw fadd` in this case and lower it to `atom.add.f32` ? Why do we want to expand here? tra: Don't we want to preserve `atomicrmw fadd` in this case and lower it to `atom.add.f32` ? Why do…
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions Same as below. jdoerfert: Same as below.
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float* %ptr, float %value seq_cst
				ret float %res
				}

				define float @test_atomicrmw_fadd_f32_global(float addrspace(1)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(1)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(1)* [[PTR]] to i32 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(1)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float addrspace(1)* %ptr, float %value seq_cst
				ret float %res
				}

				define void @test_atomicrmw_fadd_f32_global_no_use(float addrspace(1)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_global_no_use(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(1)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(1)* [[PTR]] to i32 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(1)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret void
				;
				%res = atomicrmw fadd float addrspace(1)* %ptr, float %value seq_cst
				ret void
				}

				define float @test_atomicrmw_fadd_f32_local(float addrspace(3)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(3)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(3)* [[PTR]] to i32 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(3)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float addrspace(3)* %ptr, float %value seq_cst
				ret float %res
				}

				define half @test_atomicrmw_fadd_f16_flat(half* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f16_flat(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fadd half* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fadd half* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fadd_f16_global(half addrspace(1)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fadd half addrspace(1)* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fadd_f16_local(half addrspace(3)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f16_local(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(3)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fadd half addrspace(3)* %ptr, half %value seq_cst
				ret half %res
				}

				define double @test_atomicrmw_fadd_f64_flat(double* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f64_flat(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[PTR]] to i64*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				tra: Ditto here and below. We do have `atom.add.f64`.
				jdoerfert: To be honest, I don't even know why we do not match it. All I tried to do is add the limits w.r.t. size and alignment. Somehow that had more effect than I wanted. The new hooks already remove some of the weirdness we saw, but it seems something is missing here (maybe during the instruction "registration").
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fadd double* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fadd_f64_global(double addrspace(1)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f64_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(1)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(1)* [[PTR]] to i64 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(1)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fadd double addrspace(1)* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fadd_f64_local(double addrspace(3)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f64_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(3)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(3)* [[PTR]] to i64 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(3)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fadd double addrspace(3)* %ptr, double %value seq_cst
				ret double %res
				}
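
The expanded form checked throughout expand-atomic-rmw-fadd.ll is a compare-exchange retry loop over the integer bit pattern of the float. As a rough illustration only, the same shape can be written in portable C++. The helper names `float_to_bits`, `bits_to_float`, and `atomic_fadd_via_cas` are hypothetical and not part of the patch, and the initial load here is an atomic load, whereas the IR above uses a plain `load`.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Bit-level reinterpretation, the C++ counterpart of the IR `bitcast`.
static std::uint32_t float_to_bits(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}

static float bits_to_float(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Hypothetical helper: emulates `atomicrmw fadd float ... seq_cst` with a
// 32-bit compare-exchange loop and returns the value seen before the update.
float atomic_fadd_via_cas(std::atomic<std::uint32_t> &cell, float value) {
  std::uint32_t loaded = cell.load();               // initial load
  for (;;) {                                        // atomicrmw.start
    float new_val = bits_to_float(loaded) + value;  // fadd
    // On failure, compare_exchange_strong stores the current contents of
    // `cell` into `loaded`, so the next iteration retries with fresh data.
    if (cell.compare_exchange_strong(loaded, float_to_bits(new_val)))
      return bits_to_float(loaded);                 // atomicrmw.end
  }
}
```

The comparison is done on the integer bit pattern rather than on the float value, which is why the IR bitcasts both the expected and the new value to `i32` before the `cmpxchg`.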

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=nvptx64-unknown-unknown -mcpu=sm_30 -atomic-expand %s | FileCheck %s
				; RUN: opt -S -mtriple=nvptx64-unknown-unknown -mcpu=sm_75 -atomic-expand %s | FileCheck %s

				define float @test_atomicrmw_fadd_f32_flat(float* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_flat(
				tra: Function name `fadd` does not seem to match the instruction `fsub`.
				jdoerfert: Good catch, copy & paste from the AMD tests ;) (@arsenm
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float* [[PTR]] to i32*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				tra: I must be missing something -- I would think that we do not want to expand atomicrmw variants which we can lower to an existing instruction, but a lot of the tests show the opposite and expand atomics that have direct support in hardware. The patch subject seems to agree with my assumptions, but the tests appear to contradict it. Is that intentional? If so, what is it that I'm missing?
				jdoerfert: It is not intentional to pessimize anything, as mentioned above. The problem is I am neither a backend nor NVPTX person and my changes do seem to have unwanted effects I cannot even categorize.
				arsenm: For the purpose of this change, that this isn't optimal doesn't matter. These aren't implemented correctly, but doing so is a separate change and those changes will show up in the same tests here.
				tra: OK. Looks like `atomicrmw fsub` currently fails to lower on NVPTX, so expanding it is an improvement. However, expanding `atomicrmw fadd` is a substantial regression and is likely to be a showstopper. Atomic FP32 addition is a commonly used instruction in various reduction kernels, so anything that prevents mapping it to the `atom.add.f32` instruction will be very noticeable. I realize that there are many moving parts involved in getting this to work properly. If a proper fix needs multiple patches, please try to commit them atomically to avoid a performance regression in between those changes. Also, if there are dependent patches, it would be great to arrange all of them as such in Phabricator, so it's easier to see the big picture.
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fsub float* %ptr, float %value seq_cst
				ret float %res
				}

				define float @test_atomicrmw_fsub_f32_global(float addrspace(1)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f32_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(1)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(1)* [[PTR]] to i32 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(1)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fsub float addrspace(1)* %ptr, float %value seq_cst
				ret float %res
				}

				define float @test_atomicrmw_fsub_f32_local(float addrspace(3)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f32_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(3)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(3)* [[PTR]] to i32 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(3)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fsub float addrspace(3)* %ptr, float %value seq_cst
				ret float %res
				}

				define half @test_atomicrmw_fsub_f16_flat(half* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f16_flat(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fsub half* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fsub half* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fsub_f16_global(half addrspace(1)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fsub half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fsub half addrspace(1)* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fsub_f16_local(half addrspace(3)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f16_local(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fsub half addrspace(3)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fsub half addrspace(3)* %ptr, half %value seq_cst
				ret half %res
				}

				define double @test_atomicrmw_fsub_f64_flat(double* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f64_flat(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[PTR]] to i64*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fsub double* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fsub_f64_global(double addrspace(1)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f64_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(1)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(1)* [[PTR]] to i64 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(1)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fsub double addrspace(1)* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fsub_f64_local(double addrspace(3)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f64_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(3)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(3)* [[PTR]] to i64 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(3)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fsub double addrspace(3)* %ptr, double %value seq_cst
				ret double %res
				}
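
The review thread on expand-atomic-rmw-fsub.ll asks for a split: keep `atomicrmw fadd` where a native `atom.add` exists, and expand only what cannot be lowered, such as `fsub`. A minimal sketch of that decision follows. It is a standalone model, not the NVPTX backend hook: `FPOp`, `Lowering`, and `choose_lowering` are hypothetical names, only 32- and 64-bit operands are modeled, and the `sm_60` threshold for `atom.add.f64` is an assumption taken from the PTX documentation rather than from this patch.

```cpp
// Floating-point atomicrmw operations discussed in the review.
enum class FPOp { FAdd, FSub };

// How a given operation would be handled under the reviewers' proposal.
enum class Lowering { Native, CmpXchgLoop };

// Hypothetical decision function for 32- and 64-bit operands only.
// `sm_version` is the numeric part of -mcpu, e.g. 30 for sm_30.
Lowering choose_lowering(FPOp op, unsigned bits, unsigned sm_version) {
  if (op == FPOp::FAdd) {
    if (bits == 32)
      return Lowering::Native;  // atom.add.f32
    if (bits == 64 && sm_version >= 60)
      return Lowering::Native;  // atom.add.f64 (assumed to need sm_60+)
  }
  // fsub has no direct instruction, so it falls back to a cmpxchg loop,
  // as does f64 fadd on targets assumed to lack atom.add.f64.
  return Lowering::CmpXchgLoop;
}
```

Under this model the `fadd` tests would keep the `atomicrmw` for f32 on every tested `-mcpu`, while every `fsub` test would still show the expanded loop.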

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-nand.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s
				; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s

				define i32 @test_atomicrmw_nand_i32_flat(i32* %ptr, i32 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i32_flat(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i32* [[PTR:%.*]], i32 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i32 [[RES]]
				;
				%res = atomicrmw nand i32* %ptr, i32 %value seq_cst
				ret i32 %res
				}

				define i32 @test_atomicrmw_nand_i32_global(i32 addrspace(1)* %ptr, i32 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i32_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i32 addrspace(1)* [[PTR:%.*]], i32 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i32 [[RES]]
				;
				%res = atomicrmw nand i32 addrspace(1)* %ptr, i32 %value seq_cst
				ret i32 %res
				}

				define i32 @test_atomicrmw_nand_i32_local(i32 addrspace(3)* %ptr, i32 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i32_local(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i32 addrspace(3)* [[PTR:%.*]], i32 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i32 [[RES]]
				;
				%res = atomicrmw nand i32 addrspace(3)* %ptr, i32 %value seq_cst
				ret i32 %res
				}

llvm/test/Transforms/AtomicExpand/NVPTX/lit.local.cfg

This file was added.

				if not 'NVPTX' in config.root.targets:
				    config.unsupported = True

llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=nvptx64-unknown-unknown -atomic-expand %s | FileCheck -check-prefix=CHECK %s

				tra: Nit: no need for `-check-prefix` as you are only using `CHECK` in the test.
				jdoerfert: Fair, I think I copied this ;)
				define i32 @atomic_load_global_align1(i32 addrspace(1)* %ptr) {
				; CHECK-LABEL: @atomic_load_global_align1(
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32 addrspace(1)* [[PTR:%.*]] to i8 addrspace(1)*
				; CHECK-NEXT: [[TMP2:%.*]] = addrspacecast i8 addrspace(1)* [[TMP1]] to i8*
				; CHECK-NEXT: [[TMP3:%.*]] = alloca i32, align 4
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to i8*
				; CHECK-NEXT: call void @llvm.lifetime.start.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: call void @__atomic_load(i64 4, i8* [[TMP2]], i8* [[TMP4]], i32 5)
				; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[TMP3]], align 4
				; CHECK-NEXT: call void @llvm.lifetime.end.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: ret i32 [[TMP5]]
				;
				%val = load atomic i32, i32 addrspace(1)* %ptr seq_cst, align 1
				ret i32 %val
				}

				define void @atomic_store_global_align1(i32 addrspace(1)* %ptr, i32 %val) {
				; CHECK-LABEL: @atomic_store_global_align1(
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32 addrspace(1)* [[PTR:%.*]] to i8 addrspace(1)*
				; CHECK-NEXT: [[TMP2:%.*]] = addrspacecast i8 addrspace(1)* [[TMP1]] to i8*
				; CHECK-NEXT: [[TMP3:%.*]] = alloca i32, align 4
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to i8*
				; CHECK-NEXT: call void @llvm.lifetime.start.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: store i32 [[VAL:%.*]], i32* [[TMP3]], align 4
				; CHECK-NEXT: call void @__atomic_store(i64 4, i8* [[TMP2]], i8* [[TMP4]], i32 0)
				; CHECK-NEXT: call void @llvm.lifetime.end.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: ret void
				;
				store atomic i32 %val, i32 addrspace(1)* %ptr monotonic, align 1
				ret void
				}