This is an archive of the discontinued LLVM Phabricator instance.

Differential D71128

[NVPTX][FIX] Expand atomics we cannot handle natively in the ISA
Needs Review · Public

Authored by jdoerfert on Dec 6 2019, 9:47 AM.

Details

Reviewers

tra
__simt__
arsenm

Summary

NOTE: This is lacking a test and more of a request for feedback (I'm not an NVPTX person).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jdoerfert created this revision. Dec 6 2019, 9:47 AM

Herald added a project: Restricted Project. Dec 6 2019, 9:47 AM

Herald added subscribers: jfb, bollu, hiraditya and 2 others.

jfb added a reviewer: __simt__. Dec 6 2019, 9:54 AM

Needs tests. The AMDGPU ones in test/Transforms/AtomicExpand can probably be copied as-is (plus another codegen one to make sure AtomicExpand is actually running)

Build result: pass - 60562 tests passed, 0 failed and 726 were skipped.

Log files: console-log.txt, CMakeCache.txt

Harbormaster completed remote builds in B42022: Diff 232596. Dec 6 2019, 10:05 AM

+1 for tests.

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
360	Typo: it's 44219: https://bugs.llvm.org/show_bug.cgi?id=44219

Add tests and run pass in the pipeline

Fixed the typo, copied test/Transforms/AtomicExpand/AMDGPU into NVPTX, and changed the run lines accordingly. Then I ran update_test_checks. The result is different than before (some expansion happens) and close to the AMDGPU result, but I haven't verified everything.

I also added the pass explicitly to the NVPTX required passes. Are there existing tests that check the target-specific pipeline?

Build result: fail - 60568 tests passed, 3 failed and 726 were skipped.

failed: LLVM.CodeGen/NVPTX/atomics-sm60.ll
failed: LLVM.CodeGen/NVPTX/atomics.ll
failed: LLVM.CodeGen/NVPTX/load-store.ll

Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B42034: Diff 232633! Dec 6 2019, 1:12 PM

> Command Output (stderr):
> --
> /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/llvm/test/CodeGen/NVPTX/atomics-sm60.ll:6:10: error: CHECK: expected string not found in input
> ; CHECK: atom.add.f64
>          ^
> <stdin>:1:1: note: scanning from here
> //
> ^
> <stdin>:34:2: note: possible intended match here
>  atom.cas.b64 %rd3, [%r1], %rd2, %rd1;
>  ^

This appears to be a regression. We do have fp32/fp64 atomic adds in NVPTX. Replacing them with add+CAS is suboptimal.
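For readers unfamiliar with the "add+CAS" shape mentioned above, the following is a minimal host-side C++ sketch of a floating-point add done through a compare-exchange loop on the integer bit pattern. It is an illustration only, not the code AtomicExpand emits; the names `floatToBits`, `bitsToFloat`, and `casLoopFAdd` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: reinterpret a float as its 32-bit pattern and back.
inline std::uint32_t floatToBits(float F) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

inline float bitsToFloat(std::uint32_t Bits) {
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Sketch of an fadd performed with a compare-exchange loop instead of a
// single native atomic add. Returns the value seen before the update.
inline float casLoopFAdd(std::atomic<std::uint32_t> &Word, float Operand) {
  std::uint32_t Old = Word.load(std::memory_order_relaxed);
  for (;;) {
    std::uint32_t New = floatToBits(bitsToFloat(Old) + Operand);
    // On failure, compare_exchange_weak stores the current value into Old,
    // so the next iteration recomputes New from fresh data.
    if (Word.compare_exchange_weak(Old, New, std::memory_order_seq_cst,
                                   std::memory_order_relaxed))
      return bitsToFloat(Old);
  }
}
```

Each retry costs a full round trip to memory, which is why a single native atomic add is preferable wherever the hardware provides one.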

In D71128#1773447, @tra wrote:

> Command Output (stderr):
> --
> /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/llvm/test/CodeGen/NVPTX/atomics-sm60.ll:6:10: error: CHECK: expected string not found in input
> ; CHECK: atom.add.f64
>          ^
> <stdin>:1:1: note: scanning from here
> //
> ^
> <stdin>:34:2: note: possible intended match here
>  atom.cas.b64 %rd3, [%r1], %rd2, %rd1;
>  ^

> This appears to be a regression. We do have fp32/fp64 atomic adds in NVPTX. Replacing them with add+CAS is suboptimal.

I'm already working on it.

Fix test cases by exposing more TLI hooks

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B42045: Diff 232662! Dec 6 2019, 3:39 PM

In D71128#1773635, @merge_guards_bot wrote:

> Build result: FAILURE -
> Log files: console-log.txt, CMakeCache.txt

I have the feeling that wasn't my fault.

tra added inline comments. Dec 9 2019, 11:57 AM

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll
16	Don't we want to preserve `atomicrmw fadd` in this case and lower it to `atom.add.f32` ? Why do we want to expand here?
131	Ditto here and below. We do have `atom.add.f64`
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
5	Function name `fadd` does not seem to match the instruction `fsub`.
15	I must be missing something -- I would think that we do not want to expand atomicrmw variants which we can lower to an existing instruction, but a lot of the tests show the opposite and expand atomics that have direct support in hardware. The patch subject seems to agree with my assumptions, but the tests appear to contradict it. Is that intentional? If so, what is it that I'm missing?
llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll
2	Nit: no need for `-check-prefix` as you are only using `CHECK` in the test.

I will try to look into the problematic lowerings @tra pointed out (thanks btw!). Any hints as to why they are expanded are appreciated :)

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll
16	Same as below.
131	To be honest, I don't even know why we do not match it. All I tried to do was add the limits with respect to size and alignment. Somehow that had more effect than I wanted. The new hooks already remove some of the weirdness we saw, but it seems something is missing here (maybe during the instruction "registration").
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
5	Good catch, copy & paste from the AMD tests ;) (@arsenm)
15	It is not intentional to pessimize anything, as mentioned above. The problem is that I am neither a backend nor an NVPTX person, and my changes do seem to have unwanted effects I cannot even categorize.
llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll
2	Fair, I think I copied this ;)

arsenm added inline comments. Dec 10 2019, 8:20 AM

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
15	For the purpose of this change, it doesn't matter that this isn't optimal. These aren't implemented correctly, but fixing that is a separate change, and those changes will show up in the same tests here.

tra added inline comments. Dec 10 2019, 9:05 AM

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
15	OK. Looks like `atomicrmw fsub` currently fails to lower on NVPTX, so expanding it is an improvement. However, expanding `atomicrmw fadd` is a substantial regression and is likely to be a showstopper. Atomic FP32 addition is a commonly used instruction in various reduction kernels, so anything that prevents mapping it to the `atom.add.f32` instruction will be very noticeable. I realize that there are many moving parts involved in getting this to work properly. If a proper fix needs multiple patches, please try to commit them atomically to avoid the performance regression in between those changes. Also, if there are dependent patches, it would be great to arrange all of them as such in Phabricator, so it's easier to see the big picture.

arsenm added inline comments. Jan 9 2020, 7:15 AM

llvm/include/llvm/CodeGen/TargetLowering.h
1856–1871	Do we really need 4 of these when just the one for Instruction will work?

jdoerfert marked an inline comment as done. Jan 9 2020, 8:32 AM

jdoerfert added inline comments.

llvm/include/llvm/CodeGen/TargetLowering.h
1856–1871	Alternatively we can overload the instruction one and check for the kind to decide what to do. I don't remember how I ended up like this, I'll address this once I get around to this patch again...
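The single-hook alternative discussed here could look roughly like the sketch below. It uses toy stand-in types rather than the real LLVM classes so that it compiles on its own; `ToyOpcode`, `ToyInstruction`, `ToyLoweringBase`, and `ToyTargetLowering` are all hypothetical, and the per-kind policy in the override is made up purely for illustration.

```cpp
#include <cassert>

// Toy stand-ins for llvm::Instruction and its kinds (hypothetical).
enum class ToyOpcode { Load, Store, CmpXchg, Other };

struct ToyInstruction {
  ToyOpcode Opcode;
};

// One generic hook on the base class, defaulting to "transform".
struct ToyLoweringBase {
  virtual ~ToyLoweringBase() = default;
  virtual bool shouldTransformToIntegerOperation(const ToyInstruction &) const {
    return true;
  }
};

// A target overrides the single hook and inspects the instruction kind,
// instead of the base class exposing one overload per instruction class.
struct ToyTargetLowering : ToyLoweringBase {
  bool shouldTransformToIntegerOperation(const ToyInstruction &I) const override {
    switch (I.Opcode) {
    case ToyOpcode::Load:
    case ToyOpcode::Store:
      return false; // illustrative policy: keep loads and stores as they are
    case ToyOpcode::CmpXchg:
    case ToyOpcode::Other:
      return true;  // illustrative policy: convert everything else
    }
    return true;
  }
};
```

With this shape, callers always pass the instruction through the one virtual entry point, and any per-kind decision stays inside the target's override.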

arsenm resigned from this revision. Feb 13 2020, 4:44 PM

Revision Contents

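Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	24 lines
llvm/lib/CodeGen/AtomicExpandPass.cpp	15 lines
llvm/lib/Target/NVPTX/NVPTXISelLowering.h	9 lines
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp	5 lines
llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp	1 line
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i16.ll	183 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i8.ll	183 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll	186 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll	162 lines
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-nand.ll	30 lines
llvm/test/Transforms/AtomicExpand/NVPTX/lit.local.cfg	2 lines
llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll	34 lines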

Diff 232662

llvm/include/llvm/CodeGen/TargetLowering.h

@@ ... @@ public:

   /// Returns how the IR-level AtomicExpand pass should expand the given
   /// AtomicRMW, if at all. Default is to never expand.
   virtual AtomicExpansionKind shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
     return RMW->isFloatingPointOperation() ?
       AtomicExpansionKind::CmpXChg : AtomicExpansionKind::None;
   }

+  /// Returns true if the floating or pointer operation \p I should be
+  /// transformed to an integer operation or not.
+  ///
+  /// The hook is fairly generic but only used in AtomicExpand so far.
+  virtual bool shouldTransformToIntegerOperation(Instruction *I) const {
+    return true;
+  }
+
+  /// See shouldTransformToIntegerOperation(Instruction *)
+  virtual bool shouldTransformToIntegerOperation(LoadInst *LI) const {
+    return shouldTransformToIntegerOperation(cast<Instruction>(LI));
+  }
+
+  /// See shouldTransformToIntegerOperation(Instruction *)
+  virtual bool shouldTransformToIntegerOperation(StoreInst *SI) const {
+    return shouldTransformToIntegerOperation(cast<Instruction>(SI));
+  }
+
+  /// See shouldTransformToIntegerOperation(Instruction *)
+  virtual bool
+  shouldTransformToIntegerOperation(AtomicCmpXchgInst *CASI) const {
+    return shouldTransformToIntegerOperation(cast<Instruction>(CASI));
+  }

Inline comment (arsenm, Not Done): Do we really need 4 of these when just the one for Instruction will work?
Inline comment (jdoerfert, author, Done): Alternatively we can overload the instruction one and check for the kind to decide what to do. I don't remember how I ended up like this, I'll address this once I get around to this patch again...

   /// On some platforms, an AtomicRMW that never actually modifies the value
   /// (such as fetch_add of 0) can be turned into a fence followed by an
   /// atomic load. This may sound useless, but it makes it possible for the
   /// processor to keep the cacheline shared, dramatically improving
   /// performance. And such idempotent RMWs are useful for implementing some
   /// kinds of locks, see for example (justification + benchmarks):
   /// http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf
   /// This method tries doing that transformation, returning the atomic load if

llvm/lib/CodeGen/AtomicExpandPass.cpp

@@ ... @@ if (TLI->shouldInsertFencesForAtomic(I)) {
       }

       if (FenceOrdering != AtomicOrdering::Monotonic) {
         MadeChange |= bracketInstWithFences(I, FenceOrdering);
       }
     }

     if (LI) {
-      if (LI->getType()->isFloatingPointTy()) {
-        // TODO: add a TLI hook to control this so that each target can
-        // convert to lowering the original type one at a time.
+      if (LI->getType()->isFloatingPointTy() &&
+          TLI->shouldTransformToIntegerOperation(LI)) {
         LI = convertAtomicLoadToIntegerType(LI);
         assert(LI->getType()->isIntegerTy() && "invariant broken");
         MadeChange = true;
       }

       MadeChange |= tryExpandAtomicLoad(LI);
     } else if (SI) {
-      if (SI->getValueOperand()->getType()->isFloatingPointTy()) {
-        // TODO: add a TLI hook to control this so that each target can
-        // convert to lowering the original type one at a time.
+      if (SI->getValueOperand()->getType()->isFloatingPointTy() &&
+          TLI->shouldTransformToIntegerOperation(SI)) {
         SI = convertAtomicStoreToIntegerType(SI);
         assert(SI->getValueOperand()->getType()->isIntegerTy() &&
                "invariant broken");
         MadeChange = true;
       }

       if (TLI->shouldExpandAtomicStoreInIR(SI))
         MadeChange |= expandAtomicStore(SI);
@@ ... @@ if (LI) {

       MadeChange |= tryExpandAtomicRMW(RMWI);
     }
     } else if (CASI) {
       // TODO: when we're ready to make the change at the IR level, we can
       // extend convertCmpXchgToInteger for floating point too.
       assert(!CASI->getCompareOperand()->getType()->isFloatingPointTy() &&
              "unimplemented - floating point not legal at IR level");
-      if (CASI->getCompareOperand()->getType()->isPointerTy() ) {
-        // TODO: add a TLI hook to control this so that each target can
-        // convert to lowering the original type one at a time.
+      if (CASI->getCompareOperand()->getType()->isPointerTy() &&
+          TLI->shouldTransformToIntegerOperation(CASI)) {
         CASI = convertCmpXchgToIntegerType(CASI);
         assert(CASI->getCompareOperand()->getType()->isIntegerTy() &&
                "invariant broken");
         MadeChange = true;
       }

       MadeChange |= tryExpandAtomicCmpXchg(CASI);
     }

llvm/lib/Target/NVPTX/NVPTXISelLowering.h

@@ ... @@ public:
   // Get whether we should use a precise or approximate 32-bit floating point
   // sqrt instruction.
   bool usePrecSqrtF32() const;

   // Get whether we should use instructions that flush floating-point denormals
   // to sign-preserving zero.
   bool useF32FTZ(const MachineFunction &MF) const;

+  AtomicExpansionKind
+  shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const override {
+    return AtomicExpansionKind::None;
+  }
+
+  bool shouldTransformToIntegerOperation(Instruction *I) const override {
+    return false;
+  }
+
   SDValue getSqrtEstimate(SDValue Operand, SelectionDAG &DAG, int Enabled,
                           int &ExtraSteps, bool &UseOneConst,
                           bool Reciprocal) const override;

   unsigned combineRepeatedFPDivisors() const override { return 2; }

   bool allowFMA(MachineFunction &MF, CodeGenOpt::Level OptLevel) const;
   bool allowUnsafeFPMath(MachineFunction &MF) const;

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

@@ ... @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,

   setBooleanContents(ZeroOrNegativeOneBooleanContent);
   setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);

   // Jump is Expensive. Don't create extra control flow for 'and', 'or'
   // condition branches.
   setJumpIsExpensive(true);

+  // Force atomics to be expanded if the ISA doesn't support them: PR44219
+  setMinCmpXchgSizeInBits(32);
+  setMaxAtomicSizeInBitsSupported(64);
+  setSupportsUnalignedAtomics(false);
+

Inline comment (tra, Done): Typo: it's 44219: https://bugs.llvm.org/show_bug.cgi?id=44219

   // Wide divides are _very_ slow. Try to reduce the width of the divide if
   // possible.
   addBypassSlowDiv(64, 32);

   // By default, use the Source scheduling
   if (sched4reg)
     setSchedulingPreference(Sched::RegPressure);
   else

llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp

@@ ... @@ void NVPTXPassConfig::addIRPasses() {
   // call addEarlyAsPossiblePasses.
   const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();
   addPass(createNVVMReflectPass(ST.getSmVersion()));

   if (getOptLevel() != CodeGenOpt::None)
     addPass(createNVPTXImageOptimizerPass());
   addPass(createNVPTXAssignValidGlobalNamesPass());
   addPass(createGenericToNVVMPass());
+  addPass(createAtomicExpandPass());

   // NVPTXLowerArgs is required for correctness and should be run right
   // before the address space inference passes.
   addPass(createNVPTXLowerArgsPass(&getNVPTXTargetMachine()));
   if (getOptLevel() != CodeGenOpt::None) {
     addAddressSpaceInferencePasses();
     if (!DisableLoadStoreVectorizer)
       addPass(createLoadStoreVectorizerPass());

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i16.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -mtriple=nvptx-unknown-unknown -S -atomic-expand %s | FileCheck %s
				; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s

				define i16 @test_atomicrmw_xchg_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_xchg_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw xchg i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw xchg i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_add_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_add_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw add i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw add i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_sub_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_sub_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw sub i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw sub i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_and_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_and_i16_global(
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[PTR:%.*]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[VALUE:%.*]] to i32
				; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[ANDOPERAND:%.*]] = or i32 [[INV_MASK]], [[VALOPERAND_SHIFTED]]
				; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw and i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[ANDOPERAND]] seq_cst
				; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
				; CHECK-NEXT: ret i16 [[TMP7]]
				;
				%res = atomicrmw and i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_nand_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw nand i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_or_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_or_i16_global(
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[PTR:%.*]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[VALUE:%.*]] to i32
				; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw or i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
				; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
				; CHECK-NEXT: ret i16 [[TMP7]]
				;
				%res = atomicrmw or i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_xor_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_xor_i16_global(
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[PTR:%.*]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[VALUE:%.*]] to i32
				; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw xor i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
				; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
				; CHECK-NEXT: ret i16 [[TMP7]]
				;
				%res = atomicrmw xor i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_max_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_max_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw max i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw max i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_min_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_min_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw min i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw min i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_umax_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_umax_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw umax i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw umax i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_atomicrmw_umin_i16_global(i16 addrspace(1)* %ptr, i16 %value) {
				; CHECK-LABEL: @test_atomicrmw_umin_i16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw umin i16 addrspace(1)* [[PTR:%.*]], i16 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i16 [[RES]]
				;
				%res = atomicrmw umin i16 addrspace(1)* %ptr, i16 %value seq_cst
				ret i16 %res
				}

				define i16 @test_cmpxchg_i16_global(i16 addrspace(1)* %out, i16 %in, i16 %old) {
				; CHECK-LABEL: @test_cmpxchg_i16_global(
				; CHECK-NEXT: [[GEP:%.*]] = getelementptr i16, i16 addrspace(1)* [[OUT:%.*]], i64 4
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i16 addrspace(1)* [[GEP]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 65535, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i16 [[IN:%.*]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP6:%.*]] = zext i16 [[OLD:%.*]] to i32
				; CHECK-NEXT: [[TMP7:%.*]] = shl i32 [[TMP6]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32 addrspace(1)* [[ALIGNEDADDR]]
				; CHECK-NEXT: [[TMP9:%.*]] = and i32 [[TMP8]], [[INV_MASK]]
				; CHECK-NEXT: br label [[PARTWORD_CMPXCHG_LOOP:%.*]]
				; CHECK: partword.cmpxchg.loop:
				; CHECK-NEXT: [[TMP10:%.*]] = phi i32 [ [[TMP9]], [[TMP0:%.*]] ], [ [[TMP16:%.*]], [[PARTWORD_CMPXCHG_FAILURE:%.*]] ]
				; CHECK-NEXT: [[TMP11:%.*]] = or i32 [[TMP10]], [[TMP5]]
				; CHECK-NEXT: [[TMP12:%.*]] = or i32 [[TMP10]], [[TMP7]]
				; CHECK-NEXT: [[TMP13:%.*]] = cmpxchg i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[TMP12]], i32 [[TMP11]] seq_cst seq_cst
				; CHECK-NEXT: [[TMP14:%.*]] = extractvalue { i32, i1 } [[TMP13]], 0
				; CHECK-NEXT: [[TMP15:%.*]] = extractvalue { i32, i1 } [[TMP13]], 1
				; CHECK-NEXT: br i1 [[TMP15]], label [[PARTWORD_CMPXCHG_END:%.*]], label [[PARTWORD_CMPXCHG_FAILURE]]
				; CHECK: partword.cmpxchg.failure:
				; CHECK-NEXT: [[TMP16]] = and i32 [[TMP14]], [[INV_MASK]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp ne i32 [[TMP10]], [[TMP16]]
				; CHECK-NEXT: br i1 [[TMP17]], label [[PARTWORD_CMPXCHG_LOOP]], label [[PARTWORD_CMPXCHG_END]]
				; CHECK: partword.cmpxchg.end:
				; CHECK-NEXT: [[TMP18:%.*]] = lshr i32 [[TMP14]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP19:%.*]] = trunc i32 [[TMP18]] to i16
				; CHECK-NEXT: [[TMP20:%.*]] = insertvalue { i16, i1 } undef, i16 [[TMP19]], 0
				; CHECK-NEXT: [[TMP21:%.*]] = insertvalue { i16, i1 } [[TMP20]], i1 [[TMP15]], 1
				; CHECK-NEXT: [[EXTRACT:%.*]] = extractvalue { i16, i1 } [[TMP21]], 0
				; CHECK-NEXT: ret i16 [[EXTRACT]]
				;
				%gep = getelementptr i16, i16 addrspace(1)* %out, i64 4
				%res = cmpxchg i16 addrspace(1)* %gep, i16 %old, i16 %in seq_cst seq_cst
				%extract = extractvalue {i16, i1} %res, 0
				ret i16 %extract
				}
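The CHECK lines for the `and` case above follow a mask-and-shift pattern: the 16-bit operation is widened to the containing aligned 32-bit word, and the bits outside the target lane are set to one so that the `and` leaves them unchanged. Below is a minimal host-side C++ sketch of that idea. The function name `partwordAnd16` and the explicit `ByteOffset` parameter are hypothetical (the IR derives the offset from the low pointer bits), and little-endian lane numbering is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch: perform a 16-bit atomic AND on one half of a 32-bit word using
// only a 32-bit atomic AND. ByteOffset is 0 or 2 and plays the role of
// PTRLSB in the CHECK lines. Returns the old value of the 16-bit lane.
inline std::uint16_t partwordAnd16(std::atomic<std::uint32_t> &Word,
                                   unsigned ByteOffset, std::uint16_t Value) {
  unsigned ShiftAmt = ByteOffset * 8;                          // PTRLSB << 3
  std::uint32_t Mask = std::uint32_t(0xFFFF) << ShiftAmt;      // MASK
  std::uint32_t InvMask = ~Mask;                               // INV_MASK
  std::uint32_t ValShifted = std::uint32_t(Value) << ShiftAmt; // VALOPERAND_SHIFTED
  // Bits outside the lane are all ones, so the AND leaves them unchanged.
  std::uint32_t AndOperand = InvMask | ValShifted;             // ANDOPERAND
  std::uint32_t Old = Word.fetch_and(AndOperand, std::memory_order_seq_cst);
  return static_cast<std::uint16_t>(Old >> ShiftAmt);          // lshr + trunc
}
```

The `or` and `xor` cases are simpler because a zero in the other lane already leaves it untouched, which is why their CHECK lines pass the shifted value directly without combining it with the inverted mask.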

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-i8.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -mtriple=nvptx-unknown-unknown -S -atomic-expand %s | FileCheck %s
				; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s

				define i8 @test_atomicrmw_xchg_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_xchg_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw xchg i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw xchg i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_add_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_add_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw add i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw add i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_sub_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_sub_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw sub i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw sub i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_and_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_and_i8_global(
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[PTR:%.*]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[VALUE:%.*]] to i32
				; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[ANDOPERAND:%.*]] = or i32 [[INV_MASK]], [[VALOPERAND_SHIFTED]]
				; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw and i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[ANDOPERAND]] seq_cst
				; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i8
				; CHECK-NEXT: ret i8 [[TMP7]]
				;
				%res = atomicrmw and i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_nand_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw nand i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_or_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_or_i8_global(
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[PTR:%.*]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[VALUE:%.*]] to i32
				; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw or i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
				; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i8
				; CHECK-NEXT: ret i8 [[TMP7]]
				;
				%res = atomicrmw or i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_xor_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_xor_i8_global(
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[PTR:%.*]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[VALUE:%.*]] to i32
				; CHECK-NEXT: [[VALOPERAND_SHIFTED:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP5:%.*]] = atomicrmw xor i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[VALOPERAND_SHIFTED]] seq_cst
				; CHECK-NEXT: [[TMP6:%.*]] = lshr i32 [[TMP5]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP7:%.*]] = trunc i32 [[TMP6]] to i8
				; CHECK-NEXT: ret i8 [[TMP7]]
				;
				%res = atomicrmw xor i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_max_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_max_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw max i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw max i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_min_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_min_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw min i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw min i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_umax_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_umax_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw umax i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw umax i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_atomicrmw_umin_i8_global(i8 addrspace(1)* %ptr, i8 %value) {
				; CHECK-LABEL: @test_atomicrmw_umin_i8_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw umin i8 addrspace(1)* [[PTR:%.*]], i8 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i8 [[RES]]
				;
				%res = atomicrmw umin i8 addrspace(1)* %ptr, i8 %value seq_cst
				ret i8 %res
				}

				define i8 @test_cmpxchg_i8_global(i8 addrspace(1)* %out, i8 %in, i8 %old) {
				; CHECK-LABEL: @test_cmpxchg_i8_global(
				; CHECK-NEXT: [[GEP:%.*]] = getelementptr i8, i8 addrspace(1)* [[OUT:%.*]], i64 4
				; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint i8 addrspace(1)* [[GEP]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = and i64 [[TMP1]], -4
				; CHECK-NEXT: [[ALIGNEDADDR:%.*]] = inttoptr i64 [[TMP2]] to i32 addrspace(1)*
				; CHECK-NEXT: [[PTRLSB:%.*]] = and i64 [[TMP1]], 3
				; CHECK-NEXT: [[TMP3:%.*]] = shl i64 [[PTRLSB]], 3
				; CHECK-NEXT: [[SHIFTAMT:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: [[MASK:%.*]] = shl i32 255, [[SHIFTAMT]]
				; CHECK-NEXT: [[INV_MASK:%.*]] = xor i32 [[MASK]], -1
				; CHECK-NEXT: [[TMP4:%.*]] = zext i8 [[IN:%.*]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = shl i32 [[TMP4]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP6:%.*]] = zext i8 [[OLD:%.*]] to i32
				; CHECK-NEXT: [[TMP7:%.*]] = shl i32 [[TMP6]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32 addrspace(1)* [[ALIGNEDADDR]]
				; CHECK-NEXT: [[TMP9:%.*]] = and i32 [[TMP8]], [[INV_MASK]]
				; CHECK-NEXT: br label [[PARTWORD_CMPXCHG_LOOP:%.*]]
				; CHECK: partword.cmpxchg.loop:
				; CHECK-NEXT: [[TMP10:%.*]] = phi i32 [ [[TMP9]], [[TMP0:%.*]] ], [ [[TMP16:%.*]], [[PARTWORD_CMPXCHG_FAILURE:%.*]] ]
				; CHECK-NEXT: [[TMP11:%.*]] = or i32 [[TMP10]], [[TMP5]]
				; CHECK-NEXT: [[TMP12:%.*]] = or i32 [[TMP10]], [[TMP7]]
				; CHECK-NEXT: [[TMP13:%.*]] = cmpxchg i32 addrspace(1)* [[ALIGNEDADDR]], i32 [[TMP12]], i32 [[TMP11]] seq_cst seq_cst
				; CHECK-NEXT: [[TMP14:%.*]] = extractvalue { i32, i1 } [[TMP13]], 0
				; CHECK-NEXT: [[TMP15:%.*]] = extractvalue { i32, i1 } [[TMP13]], 1
				; CHECK-NEXT: br i1 [[TMP15]], label [[PARTWORD_CMPXCHG_END:%.*]], label [[PARTWORD_CMPXCHG_FAILURE]]
				; CHECK: partword.cmpxchg.failure:
				; CHECK-NEXT: [[TMP16]] = and i32 [[TMP14]], [[INV_MASK]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp ne i32 [[TMP10]], [[TMP16]]
				; CHECK-NEXT: br i1 [[TMP17]], label [[PARTWORD_CMPXCHG_LOOP]], label [[PARTWORD_CMPXCHG_END]]
				; CHECK: partword.cmpxchg.end:
				; CHECK-NEXT: [[TMP18:%.*]] = lshr i32 [[TMP14]], [[SHIFTAMT]]
				; CHECK-NEXT: [[TMP19:%.*]] = trunc i32 [[TMP18]] to i8
				; CHECK-NEXT: [[TMP20:%.*]] = insertvalue { i8, i1 } undef, i8 [[TMP19]], 0
				; CHECK-NEXT: [[TMP21:%.*]] = insertvalue { i8, i1 } [[TMP20]], i1 [[TMP15]], 1
				; CHECK-NEXT: [[EXTRACT:%.*]] = extractvalue { i8, i1 } [[TMP21]], 0
				; CHECK-NEXT: ret i8 [[EXTRACT]]
				;
				%gep = getelementptr i8, i8 addrspace(1)* %out, i64 4
				%res = cmpxchg i8 addrspace(1)* %gep, i8 %old, i8 %in seq_cst seq_cst
				%extract = extractvalue {i8, i1} %res, 0
				ret i8 %extract
				}
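
The part-word CHECK lines above widen the i8 `and`/`or`/`xor` and `cmpxchg` operations to a 32-bit operation on the containing aligned word. The following is a minimal C++ sketch of that arithmetic, not code from the patch: `PartWord`, `compute_part_word`, and `fetch_and_u8` are hypothetical names, and a little-endian byte layout is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical illustration of the part-word expansion (not from the patch).
struct PartWord {
    std::uintptr_t aligned_addr;  // ALIGNEDADDR: addr & -4
    std::uint32_t shift;          // SHIFTAMT: (addr & 3) * 8
    std::uint32_t mask;           // MASK: 255 << SHIFTAMT
};

inline PartWord compute_part_word(std::uintptr_t addr) {
    PartWord pw;
    pw.aligned_addr = addr & ~static_cast<std::uintptr_t>(3);
    pw.shift = static_cast<std::uint32_t>((addr & 3) << 3);
    pw.mask = std::uint32_t{0xFF} << pw.shift;
    return pw;
}

// Emulates `atomicrmw and i8` on byte `byte_index` of a 32-bit word.
inline std::uint8_t fetch_and_u8(std::atomic<std::uint32_t>& word,
                                 unsigned byte_index, std::uint8_t value) {
    assert(byte_index < 4);
    const std::uint32_t shift = byte_index * 8;
    const std::uint32_t mask = std::uint32_t{0xFF} << shift;
    // ANDOPERAND = INV_MASK | (zext(value) << SHIFTAMT): the neighbouring
    // bytes are ANDed with all-ones, so they are left unchanged.
    const std::uint32_t operand = ~mask | (std::uint32_t{value} << shift);
    const std::uint32_t old = word.fetch_and(operand);
    // lshr + trunc recover the old byte, as in the CHECK lines.
    return static_cast<std::uint8_t>(old >> shift);
}
```

`or` and `xor` need no inverted mask because a zero operand is already the identity for the neighbouring bytes, which is why only the `and` case builds `ANDOPERAND`.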

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=nvptx-unknown-unknown -mcpu=sm_30 -atomic-expand %s | FileCheck %s
				; RUN: opt -S -mtriple=nvptx-unknown-unknown -mcpu=sm_60 -atomic-expand %s | FileCheck %s
				; RUN: opt -S -mtriple=nvptx-unknown-unknown -mcpu=sm_75 -atomic-expand %s | FileCheck %s

				define float @test_atomicrmw_fadd_f32_flat(float* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_flat(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float* [[PTR]] to i32*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				tra (Not Done): Don't we want to preserve `atomicrmw fadd` in this case and lower it to `atom.add.f32`? Why do we want to expand here?
				jdoerfert (author, Done): Same as below.
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float* %ptr, float %value seq_cst
				ret float %res
				}

				define float @test_atomicrmw_fadd_f32_global(float addrspace(1)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(1)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(1)* [[PTR]] to i32 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(1)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float addrspace(1)* %ptr, float %value seq_cst
				ret float %res
				}

				define void @test_atomicrmw_fadd_f32_global_no_use(float addrspace(1)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_global_no_use(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(1)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(1)* [[PTR]] to i32 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(1)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret void
				;
				%res = atomicrmw fadd float addrspace(1)* %ptr, float %value seq_cst
				ret void
				}

				define float @test_atomicrmw_fadd_f32_local(float addrspace(3)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(3)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(3)* [[PTR]] to i32 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(3)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float addrspace(3)* %ptr, float %value seq_cst
				ret float %res
				}

				define half @test_atomicrmw_fadd_f16_flat(half* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f16_flat(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fadd half* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fadd half* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fadd_f16_global(half addrspace(1)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fadd half addrspace(1)* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fadd_f16_local(half addrspace(3)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f16_local(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(3)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fadd half addrspace(3)* %ptr, half %value seq_cst
				ret half %res
				}

				define double @test_atomicrmw_fadd_f64_flat(double* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f64_flat(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[PTR]] to i64*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				tra (Not Done): Ditto here and below. We do have `atom.add.f64`.
				jdoerfert (author, Done): To be honest, I don't even know why we do not match it. All I (tried) to do is add the limits with respect to size and alignment. Somehow that had more effect than I wanted. The new hooks already remove some of the weirdness we saw, but it seems something is missing here (maybe during the instruction "registration").
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fadd double* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fadd_f64_global(double addrspace(1)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f64_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(1)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(1)* [[PTR]] to i64 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(1)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fadd double addrspace(1)* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fadd_f64_local(double addrspace(3)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fadd_f64_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(3)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fadd double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(3)* [[PTR]] to i64 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(3)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fadd double addrspace(3)* %ptr, double %value seq_cst
				ret double %res
				}

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=nvptx64-unknown-unknown -mcpu=sm_30 -atomic-expand %s | FileCheck %s
				; RUN: opt -S -mtriple=nvptx64-unknown-unknown -mcpu=sm_75 -atomic-expand %s | FileCheck %s

				define float @test_atomicrmw_fadd_f32_flat(float* %ptr, float %value) {
				tra (Not Done): Function name `fadd` does not seem to match the instruction `fsub`.
				jdoerfert (author, Done): Good catch, copy & paste from the AMD tests ;) (@arsenm)
				; CHECK-LABEL: @test_atomicrmw_fadd_f32_flat(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float* [[PTR]] to i32*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				tra (Not Done): I must be missing something: I would think that we do not want to expand atomicrmw variants which we can lower to an existing instruction, but a lot of the tests show the opposite and expand atomics that have direct support in hardware. The patch subject seems to agree with my assumptions, but the tests appear to contradict it. Is that intentional? If so, what is it that I'm missing?
				jdoerfert (author, Done): It is not intentional to pessimize anything, as mentioned above. The problem is that I am neither a backend nor an NVPTX person, and my changes do seem to have unwanted effects I cannot even categorize.
				arsenm (Not Done): For the purpose of this change, it doesn't matter that this isn't optimal. These aren't implemented correctly, but fixing that is a separate change, and those changes will show up in the same tests here.
				tra (Not Done): OK. Looks like `atomicrmw fsub` currently fails to lower on NVPTX, so expanding it is an improvement. However, expanding `atomicrmw fadd` is a substantial regression and is likely to be a showstopper. Atomic FP32 addition is a commonly used instruction in various reduction kernels, so anything that prevents mapping it to the `atom.add.f32` instruction will be very noticeable. I realize that there are many moving parts involved in getting this to work properly. If a proper fix needs multiple patches, please try to commit them atomically to avoid the performance regression in between those changes. Also, if there are dependent patches, it would be great to arrange all of them as such in Phabricator, so it's easier to see the big picture.
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fsub float* %ptr, float %value seq_cst
				ret float %res
				}

				define float @test_atomicrmw_fsub_f32_global(float addrspace(1)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f32_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(1)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(1)* [[PTR]] to i32 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(1)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fsub float addrspace(1)* %ptr, float %value seq_cst
				ret float %res
				}

				define float @test_atomicrmw_fsub_f32_local(float addrspace(3)* %ptr, float %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f32_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load float, float addrspace(3)* [[PTR:%.*]], align 4
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub float [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast float addrspace(3)* [[PTR]] to i32 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i32 addrspace(3)* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fsub float addrspace(3)* %ptr, float %value seq_cst
				ret float %res
				}

				define half @test_atomicrmw_fsub_f16_flat(half* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f16_flat(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fsub half* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fsub half* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fsub_f16_global(half addrspace(1)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f16_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fsub half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fsub half addrspace(1)* %ptr, half %value seq_cst
				ret half %res
				}

				define half @test_atomicrmw_fsub_f16_local(half addrspace(3)* %ptr, half %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f16_local(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw fsub half addrspace(3)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret half [[RES]]
				;
				%res = atomicrmw fsub half addrspace(3)* %ptr, half %value seq_cst
				ret half %res
				}

				define double @test_atomicrmw_fsub_f64_flat(double* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f64_flat(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[PTR]] to i64*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fsub double* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fsub_f64_global(double addrspace(1)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f64_global(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(1)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(1)* [[PTR]] to i64 addrspace(1)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(1)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fsub double addrspace(1)* %ptr, double %value seq_cst
				ret double %res
				}

				define double @test_atomicrmw_fsub_f64_local(double addrspace(3)* %ptr, double %value) {
				; CHECK-LABEL: @test_atomicrmw_fsub_f64_local(
				; CHECK-NEXT: [[TMP1:%.*]] = load double, double addrspace(3)* [[PTR:%.*]], align 8
				; CHECK-NEXT: br label [[ATOMICRMW_START:%.*]]
				; CHECK: atomicrmw.start:
				; CHECK-NEXT: [[LOADED:%.*]] = phi double [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; CHECK-NEXT: [[NEW:%.*]] = fsub double [[LOADED]], [[VALUE:%.*]]
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast double addrspace(3)* [[PTR]] to i64 addrspace(3)*
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[NEW]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[LOADED]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = cmpxchg i64 addrspace(3)* [[TMP2]], i64 [[TMP4]], i64 [[TMP3]] seq_cst seq_cst
				; CHECK-NEXT: [[SUCCESS:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
				; CHECK-NEXT: [[NEWLOADED:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
				; CHECK-NEXT: [[TMP6]] = bitcast i64 [[NEWLOADED]] to double
				; CHECK-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; CHECK: atomicrmw.end:
				; CHECK-NEXT: ret double [[TMP6]]
				;
				%res = atomicrmw fsub double addrspace(3)* %ptr, double %value seq_cst
				ret double %res
				}
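
The expanded `fadd`/`fsub` CHECK lines above all follow the same compare-and-swap loop: load the current bits, compute the new floating-point value, and retry the integer `cmpxchg` until it succeeds. Below is a minimal C++ sketch of that loop for `fsub` on a 32-bit float, offered only as an illustration of the pattern and not as code from the patch. The names `float_to_bits`, `bits_to_float`, and `atomic_fsub_via_cas` are hypothetical, and it assumes `float` is 32 bits wide.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Bitcast helpers (the `bitcast float ... to i32` steps in the CHECK lines).
inline std::uint32_t float_to_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical emulation of `atomicrmw fsub float` via an integer CAS loop.
// Returns the value stored before the subtraction, like atomicrmw does.
inline float atomic_fsub_via_cas(std::atomic<std::uint32_t>& cell, float value) {
    std::uint32_t loaded = cell.load();  // initial load before the loop
    for (;;) {                           // atomicrmw.start
        const float old_value = bits_to_float(loaded);
        const std::uint32_t desired = float_to_bits(old_value - value);
        // On failure, compare_exchange_strong writes the current contents
        // back into `loaded`, which plays the role of NEWLOADED.
        if (cell.compare_exchange_strong(loaded, desired))
            return old_value;            // atomicrmw.end
    }
}
```

This is the shape the reviewers discuss: it is a reasonable fallback for `fsub`, which has no direct instruction, but it is heavier than a single native atomic add where the hardware provides one.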

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-nand.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s
				; RUN: opt -mtriple=nvptx64-unknown-unknown -S -atomic-expand %s | FileCheck %s

				define i32 @test_atomicrmw_nand_i32_flat(i32* %ptr, i32 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i32_flat(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i32* [[PTR:%.*]], i32 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i32 [[RES]]
				;
				%res = atomicrmw nand i32* %ptr, i32 %value seq_cst
				ret i32 %res
				}

				define i32 @test_atomicrmw_nand_i32_global(i32 addrspace(1)* %ptr, i32 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i32_global(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i32 addrspace(1)* [[PTR:%.*]], i32 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i32 [[RES]]
				;
				%res = atomicrmw nand i32 addrspace(1)* %ptr, i32 %value seq_cst
				ret i32 %res
				}

				define i32 @test_atomicrmw_nand_i32_local(i32 addrspace(3)* %ptr, i32 %value) {
				; CHECK-LABEL: @test_atomicrmw_nand_i32_local(
				; CHECK-NEXT: [[RES:%.*]] = atomicrmw nand i32 addrspace(3)* [[PTR:%.*]], i32 [[VALUE:%.*]] seq_cst
				; CHECK-NEXT: ret i32 [[RES]]
				;
				%res = atomicrmw nand i32 addrspace(3)* %ptr, i32 %value seq_cst
				ret i32 %res
				}

llvm/test/Transforms/AtomicExpand/NVPTX/lit.local.cfg

This file was added.

				if not 'NVPTX' in config.root.targets:
				    config.unsupported = True

llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=nvptx64-unknown-unknown -atomic-expand %s | FileCheck -check-prefix=CHECK %s
				tra (Not Done): Nit: no need for `-check-prefix`, as you are only using `CHECK` in the test.
				jdoerfert (author, Done): Fair, I think I copied this ;)

				define i32 @atomic_load_global_align1(i32 addrspace(1)* %ptr) {
				; CHECK-LABEL: @atomic_load_global_align1(
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32 addrspace(1)* [[PTR:%.*]] to i8 addrspace(1)*
				; CHECK-NEXT: [[TMP2:%.*]] = addrspacecast i8 addrspace(1)* [[TMP1]] to i8*
				; CHECK-NEXT: [[TMP3:%.*]] = alloca i32, align 4
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to i8*
				; CHECK-NEXT: call void @llvm.lifetime.start.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: call void @__atomic_load(i64 4, i8* [[TMP2]], i8* [[TMP4]], i32 5)
				; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[TMP3]], align 4
				; CHECK-NEXT: call void @llvm.lifetime.end.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: ret i32 [[TMP5]]
				;
				%val = load atomic i32, i32 addrspace(1)* %ptr seq_cst, align 1
				ret i32 %val
				}

				define void @atomic_store_global_align1(i32 addrspace(1)* %ptr, i32 %val) {
				; CHECK-LABEL: @atomic_store_global_align1(
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32 addrspace(1)* [[PTR:%.*]] to i8 addrspace(1)*
				; CHECK-NEXT: [[TMP2:%.*]] = addrspacecast i8 addrspace(1)* [[TMP1]] to i8*
				; CHECK-NEXT: [[TMP3:%.*]] = alloca i32, align 4
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to i8*
				; CHECK-NEXT: call void @llvm.lifetime.start.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: store i32 [[VAL:%.*]], i32* [[TMP3]], align 4
				; CHECK-NEXT: call void @__atomic_store(i64 4, i8* [[TMP2]], i8* [[TMP4]], i32 0)
				; CHECK-NEXT: call void @llvm.lifetime.end.p0i8(i64 4, i8* [[TMP4]])
				; CHECK-NEXT: ret void
				;
				store atomic i32 %val, i32 addrspace(1)* %ptr monotonic, align 1
				ret void
				}
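
A note on the trailing `i32` argument in the `__atomic_load` and `__atomic_store` calls above: it carries the memory ordering using the C/C++ `memory_order` numbering, so `seq_cst` appears as 5 and LLVM's `monotonic` (C++ `relaxed`) appears as 0. The short C++ check below illustrates that numbering; `ordering_arg` is a hypothetical helper name and not part of any library.

```cpp
#include <atomic>

// Hypothetical helper: the integer that a libcall-style interface would
// receive for a given C++ memory order.
constexpr int ordering_arg(std::memory_order order) {
    return static_cast<int>(order);
}

// The enumerators are declared in the order relaxed, consume, acquire,
// release, acq_rel, seq_cst, so they number 0 through 5.
static_assert(ordering_arg(std::memory_order_relaxed) == 0, "monotonic store");
static_assert(ordering_arg(std::memory_order_seq_cst) == 5, "seq_cst load");
```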