This is an archive of the discontinued LLVM Phabricator instance.

Differential D34555

[NVPTX] Add lowering of i128 params.
ClosedPublic

Authored by denzp on Jun 23 2017, 7:07 AM.

Details

Reviewers

jholewinski
jlebar
tra

Commits

rGd7a73824e46a: [NVPTX] Add lowering of i128 params.
rGb9fc48da8326: [NVPTX] Add lowering of i128 params.
rC308675: [NVPTX] Add lowering of i128 params.
rL308675: [NVPTX] Add lowering of i128 params.
rL307326: [NVPTX] Add lowering of i128 params.

Summary

This patch adds lowering support for i128 params. The changes are fairly small: i128 is handled as a "special case" of the integer type. With this patch, we lower i128 params the same way as 16-byte aggregates: .param .b8 _[16].

Currently, NVPTX can't deal with 128-bit integers:

in some cases it fails assertions such as ValVTs.size() == OutVals.size() && "Bad return value decomposition"
in other cases it emits PTX with .i128 or .u128 types (which are not valid [1])

[1] http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#fundamental-types
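
The rule from the summary can be sketched outside of LLVM. This is only a simplified model of the return-value branch of NVPTXAsmPrinter::printReturnValStr shown in the diff below, not the patch itself. The function name retvalDecl and its parameters are hypothetical, and only integer widths that are already valid PTX sizes are assumed.

```cpp
#include <cassert>
#include <string>

// Hypothetical model: build the PTX declaration of an integer return value.
// `bits` is the integer width, `align` the alignment in bytes (only used for
// the byte-array form).
std::string retvalDecl(unsigned bits, unsigned align) {
  assert(bits > 0 && "zero-width integers are not modeled");
  if (bits == 128) {
    // PTX has no 128-bit fundamental type, so i128 is declared like a
    // 16-byte aggregate.
    return ".param .align " + std::to_string(align) + " .b8 func_retval0[16]";
  }
  // Scalar return values are widened to at least 32 bits.
  unsigned size = bits < 32 ? 32 : bits;
  return ".param .b" + std::to_string(size) + " func_retval0";
}
```

With these assumptions, retvalDecl(128, 8) yields the same declaration that i128-retval.ll checks for, while narrower integers keep the scalar .b<size> form.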

Diff Detail

Event Timeline

denzp created this revision. Jun 23 2017, 7:07 AM

I'd like tra to look at this -- he's much more of an expert in selection DAG than I am -- but he's on vacation for another week. Are you OK waiting?

In D34555#789076, @jlebar wrote:

I'd like tra to look at this -- he's much more of an expert in selection DAG than I am -- but he's on vacation for another week. Are you OK waiting?

Of course! Code review should never be hurried :)

In general the patch looks OK.
One thing that you need to make sure of is that the alignment of i128 on the nvptx side always matches that on the host side. AFAICT, i128 alignment on x64 is 16 (see D28990), but this patch shows that it's 8 on the nvptx side. That would result in different in-memory representations of aggregates between host and device.

lib/Target/NVPTX/NVPTXAsmPrinter.cpp
403	Nit. I'd group together `(Ty->isIntegerTy() && !Ty->isIntegerTy(128))`
1428–1434	Please add a test case for this.
lib/Target/NVPTX/NVPTXISelLowering.cpp
1279	Same nit as above.
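
To illustrate the alignment concern, here is a minimal sketch assuming a struct with one i64 field followed by one i128 field. The helper names alignTo and i128OffsetAfterI64 are made up for this example (they are not the LLVM APIs), and the layout rule is the usual "round the offset up to the field alignment" one.

```cpp
#include <cassert>
#include <cstdint>

// Round `offset` up to the next multiple of `align` (a power of two).
std::uint64_t alignTo(std::uint64_t offset, std::uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0 && "power of two expected");
  return (offset + align - 1) & ~(align - 1);
}

// Byte offset of an i128 field placed right after a single i64 field,
// given the ABI alignment (in bytes) assumed for i128.
std::uint64_t i128OffsetAfterI64(std::uint64_t i128Align) {
  const std::uint64_t endOfI64 = 8; // the i64 occupies bytes [0, 8)
  return alignTo(endOfI64, i128Align);
}
```

With 8-byte alignment the i128 field starts at offset 8, with 16-byte alignment at offset 16, so host and device would disagree on where the field lives if their alignments differ.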

This revision now requires changes to proceed. Jul 5 2017, 10:47 AM

Fixed i128 params alignment
Added regression test for global variables
Couple minor code readability improvements

Thanks for the review, @tra! I fixed alignment and some lines according to your comments.

Nice. With alignment=16 we also benefit from 128-bit loads/stores.

This revision is now accepted and ready to land. Jul 6 2017, 9:37 AM

Could you please land it for me?

Closed by commit rL307326: [NVPTX] Add lowering of i128 params. (authored by mkuper). Jul 6 2017, 3:19 PM

This revision was automatically updated to reflect the committed changes.

I've reverted this in r307334 because it breaks a lot of clang tests with:

error: 'error' diagnostics seen but not expected:

(frontend): backend data layout 'e-i64:64-i128:128-v16:16-v32:32-n16:32:64' does not match expected target description 'e-i64:64-v16:16-v32:32-n16:32:64'
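
The two strings in the diagnostic differ only in the i128:128 entry. A small sketch of how one might pull that entry out of a data layout string for comparison follows. i128AbiAlignBits is a hypothetical helper, not an LLVM or clang API, and it only looks at the first number after "i128:".

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Return the ABI alignment (in bits) given by an explicit "i128:<abi>" entry
// of a '-'-separated data layout string, or 0 if there is no such entry.
unsigned i128AbiAlignBits(const std::string &layout) {
  const std::string prefix = "i128:";
  std::istringstream in(layout);
  std::string spec;
  while (std::getline(in, spec, '-')) {
    if (spec.compare(0, prefix.size(), prefix) == 0) {
      // std::stoul stops at the first non-digit, e.g. a second ':' field.
      unsigned bits =
          static_cast<unsigned>(std::stoul(spec.substr(prefix.size())));
      assert(bits % 8 == 0 && "alignment must be a whole number of bytes");
      return bits;
    }
  }
  return 0;
}
```

Applied to the two strings above, the backend layout reports 128 bits and the layout clang expects reports no explicit entry, which is the mismatch the tests trip over.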

Is it safe just to change data layout at clang side? Won't it break backward compatibility?

Attachment: clang-nvptx-i128-alignment.diff (1 KB)

This revision is now accepted and ready to land. Jul 7 2017, 4:21 AM

In D34555#802008, @denzp wrote:

Is it safe just to change data layout at clang side? Won't it break backward compatibility?

Clang and LLVM sides of this change should land together as they have to be in sync.

Other than that, AFAICT it should be safe as it affects only i128 which is not currently supported, so there should be no backward compatibility issues. That's in theory. You may want to ask this question on cfe-dev@.

This revision now requires changes to proceed. Jul 7 2017, 9:29 AM

In D34555#802254, @tra wrote:

In D34555#802008, @denzp wrote:

Is it safe just to change data layout at clang side? Won't it break backward compatibility?

Clang and LLVM sides of this change should land together as they have to be in sync.

Other than that, AFAICT it should be safe as it affects only i128 which is not currently supported, so there should be no backward compatibility issues. That's in theory. You may want to ask this question on cfe-dev@.

Looks like we can simply update Clang's data layout and it should not break anything: cfe-dev thread.

Do I need to create a "differential" for the clang patch? I attached the diff to my previous comment here, and changes there are quite trivial.

tra accepted this revision. Jul 20 2017, 11:37 AM

This revision is now accepted and ready to land. Jul 20 2017, 11:37 AM

Closed by commit rL308675: [NVPTX] Add lowering of i128 params. (authored by tra). Jul 20 2017, 2:17 PM

This revision was automatically updated to reflect the committed changes.

denzp mentioned this in D55144: [NVPTX] Add lowering of i128 numbers as struct fields. Nov 30 2018, 1:21 PM

tra mentioned this in rL348057: [NVPTX] Add lowering of i128 numbers as struct fields. Nov 30 2018, 4:24 PM

Revision Contents

Path	Size
lib/Target/NVPTX/NVPTXAsmPrinter.cpp	14 lines
lib/Target/NVPTX/NVPTXISelLowering.cpp	27 lines
test/CodeGen/NVPTX/i128-param.ll	70 lines
test/CodeGen/NVPTX/i128-retval.ll	35 lines

Diff 103725

lib/Target/NVPTX/NVPTXAsmPrinter.cpp

[...]	void NVPTXAsmPrinter::printReturnValStr(const Function *F, raw_ostream &O) {
bool isABI = (nvptxSubtarget->getSmVersion() >= 20);		bool isABI = (nvptxSubtarget->getSmVersion() >= 20);

if (Ty->getTypeID() == Type::VoidTyID)		if (Ty->getTypeID() == Type::VoidTyID)
return;		return;

O << " (";		O << " (";

if (isABI) {		if (isABI) {
if (Ty->isFloatingPointTy() || Ty->isIntegerTy()) {		if ((Ty->isFloatingPointTy() || Ty->isIntegerTy()) && !Ty->isIntegerTy(128)) {
		tra: Nit. I'd group together `(Ty->isIntegerTy() && !Ty->isIntegerTy(128))`
unsigned size = 0;		unsigned size = 0;
if (auto *ITy = dyn_cast<IntegerType>(Ty)) {		if (auto *ITy = dyn_cast<IntegerType>(Ty)) {
size = ITy->getBitWidth();		size = ITy->getBitWidth();
} else {		} else {
assert(Ty->isFloatingPointTy() && "Floating point type expected here");		assert(Ty->isFloatingPointTy() && "Floating point type expected here");
size = Ty->getPrimitiveSizeInBits();		size = Ty->getPrimitiveSizeInBits();
}		}
// PTX ABI requires all scalar return values to be at least 32		// PTX ABI requires all scalar return values to be at least 32
// bits in size. fp16 normally uses .b16 as its storage type in		// bits in size. fp16 normally uses .b16 as its storage type in
// PTX, so its size must be adjusted here, too.		// PTX, so its size must be adjusted here, too.
if (size < 32)		if (size < 32)
size = 32;		size = 32;

O << ".param .b" << size << " func_retval0";		O << ".param .b" << size << " func_retval0";
} else if (isa<PointerType>(Ty)) {		} else if (isa<PointerType>(Ty)) {
O << ".param .b" << TLI->getPointerTy(DL).getSizeInBits()		O << ".param .b" << TLI->getPointerTy(DL).getSizeInBits()
<< " func_retval0";		<< " func_retval0";
} else if (Ty->isAggregateType() || Ty->isVectorTy()) {		} else if (Ty->isAggregateType() || Ty->isVectorTy() || Ty->isIntegerTy(128)) {
unsigned totalsz = DL.getTypeAllocSize(Ty);		unsigned totalsz = DL.getTypeAllocSize(Ty);
unsigned retAlignment = 0;		unsigned retAlignment = 0;
if (!getAlign(*F, 0, retAlignment))		if (!getAlign(*F, 0, retAlignment))
retAlignment = DL.getABITypeAlignment(Ty);		retAlignment = DL.getABITypeAlignment(Ty);
O << ".param .align " << retAlignment << " .b8 func_retval0[" << totalsz		O << ".param .align " << retAlignment << " .b8 func_retval0[" << totalsz
<< "]";		<< "]";
} else		} else
llvm_unreachable("Unknown return type");		llvm_unreachable("Unknown return type");
[...]	void NVPTXAsmPrinter::emitPTXGlobalVariable(const GlobalVariable *GVar,

O << ".";		O << ".";
emitPTXAddressSpace(GVar->getType()->getAddressSpace(), O);		emitPTXAddressSpace(GVar->getType()->getAddressSpace(), O);
if (GVar->getAlignment() == 0)		if (GVar->getAlignment() == 0)
O << " .align " << (int)DL.getPrefTypeAlignment(ETy);		O << " .align " << (int)DL.getPrefTypeAlignment(ETy);
else		else
O << " .align " << GVar->getAlignment();		O << " .align " << GVar->getAlignment();

		// Special case for i128
		if (ETy->isIntegerTy(128)) {
		O << " .b8 ";
		getSymbol(GVar)->print(O, MAI);
		O << "[16]";
		return;
		}
		tra: Please add a test case for this.

if (ETy->isFloatingPointTy() || ETy->isIntegerTy() || ETy->isPointerTy()) {		if (ETy->isFloatingPointTy() || ETy->isIntegerTy() || ETy->isPointerTy()) {
O << " .";		O << " .";
O << getPTXFundamentalTypeStr(ETy);		O << getPTXFundamentalTypeStr(ETy);
O << " ";		O << " ";
getSymbol(GVar)->print(O, MAI);		getSymbol(GVar)->print(O, MAI);
return;		return;
}		}

[...]	if (isKernelFunction(*F)) {
CurrentFnSym->print(O, MAI);		CurrentFnSym->print(O, MAI);
O << "_param_" << paramIndex;		O << "_param_" << paramIndex;
}		}
continue;		continue;
}		}
}		}

if (!PAL.hasParamAttribute(paramIndex, Attribute::ByVal)) {		if (!PAL.hasParamAttribute(paramIndex, Attribute::ByVal)) {
if (Ty->isAggregateType() || Ty->isVectorTy()) {		if (Ty->isAggregateType() || Ty->isVectorTy() || Ty->isIntegerTy(128)) {
// Just print .param .align <a> .b8 .param[size];		// Just print .param .align <a> .b8 .param[size];
// <a> = PAL.getparamalignment		// <a> = PAL.getparamalignment
// size = typeallocsize of element type		// size = typeallocsize of element type
unsigned align = PAL.getParamAlignment(paramIndex);		unsigned align = PAL.getParamAlignment(paramIndex);
if (align == 0)		if (align == 0)
align = DL.getABITypeAlignment(Ty);		align = DL.getABITypeAlignment(Ty);

unsigned sz = DL.getTypeAllocSize(Ty);		unsigned sz = DL.getTypeAllocSize(Ty);
[...]

lib/Target/NVPTX/NVPTXISelLowering.cpp

[...]
/// LowerCall, and LowerReturn.		/// LowerCall, and LowerReturn.
static void ComputePTXValueVTs(const TargetLowering &TLI, const DataLayout &DL,		static void ComputePTXValueVTs(const TargetLowering &TLI, const DataLayout &DL,
Type *Ty, SmallVectorImpl<EVT> &ValueVTs,		Type *Ty, SmallVectorImpl<EVT> &ValueVTs,
SmallVectorImpl<uint64_t> *Offsets = nullptr,		SmallVectorImpl<uint64_t> *Offsets = nullptr,
uint64_t StartingOffset = 0) {		uint64_t StartingOffset = 0) {
SmallVector<EVT, 16> TempVTs;		SmallVector<EVT, 16> TempVTs;
SmallVector<uint64_t, 16> TempOffsets;		SmallVector<uint64_t, 16> TempOffsets;

		// Special case for i128 - decompose to (i64, i64)
		if (Ty->isIntegerTy(128)) {
		ValueVTs.push_back(EVT(MVT::i64));
		ValueVTs.push_back(EVT(MVT::i64));

		if (Offsets) {
		Offsets->push_back(StartingOffset + 0);
		Offsets->push_back(StartingOffset + 8);
		}

		return;
		}

ComputeValueVTs(TLI, DL, Ty, TempVTs, &TempOffsets, StartingOffset);		ComputeValueVTs(TLI, DL, Ty, TempVTs, &TempOffsets, StartingOffset);
for (unsigned i = 0, e = TempVTs.size(); i != e; ++i) {		for (unsigned i = 0, e = TempVTs.size(); i != e; ++i) {
EVT VT = TempVTs[i];		EVT VT = TempVTs[i];
uint64_t Off = TempOffsets[i];		uint64_t Off = TempOffsets[i];
// Split vectors into individual elements, except for v2f16, which		// Split vectors into individual elements, except for v2f16, which
// we will pass as a single scalar.		// we will pass as a single scalar.
if (VT.isVector()) {		if (VT.isVector()) {
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
[...]	std::string NVPTXTargetLowering::getPrototype(

std::stringstream O;		std::stringstream O;
O << "prototype_" << uniqueCallSite << " : .callprototype ";		O << "prototype_" << uniqueCallSite << " : .callprototype ";

if (retTy->getTypeID() == Type::VoidTyID) {		if (retTy->getTypeID() == Type::VoidTyID) {
O << "()";		O << "()";
} else {		} else {
O << "(";		O << "(";
if (retTy->isFloatingPointTy() || retTy->isIntegerTy()) {		if ((retTy->isFloatingPointTy() || retTy->isIntegerTy()) && !retTy->isIntegerTy(128)) {
		tra: Same nit as above.
unsigned size = 0;		unsigned size = 0;
if (auto *ITy = dyn_cast<IntegerType>(retTy)) {		if (auto *ITy = dyn_cast<IntegerType>(retTy)) {
size = ITy->getBitWidth();		size = ITy->getBitWidth();
} else {		} else {
assert(retTy->isFloatingPointTy() &&		assert(retTy->isFloatingPointTy() &&
"Floating point type expected here");		"Floating point type expected here");
size = retTy->getPrimitiveSizeInBits();		size = retTy->getPrimitiveSizeInBits();
}		}
// PTX ABI requires all scalar return values to be at least 32		// PTX ABI requires all scalar return values to be at least 32
// bits in size. fp16 normally uses .b16 as its storage type in		// bits in size. fp16 normally uses .b16 as its storage type in
// PTX, so its size must be adjusted here, too.		// PTX, so its size must be adjusted here, too.
if (size < 32)		if (size < 32)
size = 32;		size = 32;

O << ".param .b" << size << " _";		O << ".param .b" << size << " _";
} else if (isa<PointerType>(retTy)) {		} else if (isa<PointerType>(retTy)) {
O << ".param .b" << PtrVT.getSizeInBits() << " _";		O << ".param .b" << PtrVT.getSizeInBits() << " _";
} else if (retTy->isAggregateType() || retTy->isVectorTy()) {		} else if (retTy->isAggregateType() || retTy->isVectorTy() || retTy->isIntegerTy(128)) {
auto &DL = CS->getCalledFunction()->getParent()->getDataLayout();		auto &DL = CS->getCalledFunction()->getParent()->getDataLayout();
O << ".param .align " << retAlignment << " .b8 _["		O << ".param .align " << retAlignment << " .b8 _["
<< DL.getTypeAllocSize(retTy) << "]";		<< DL.getTypeAllocSize(retTy) << "]";
} else {		} else {
llvm_unreachable("Unknown return type");		llvm_unreachable("Unknown return type");
}		}
O << ") ";		O << ") ";
}		}
O << "_ (";		O << "_ (";

bool first = true;		bool first = true;

unsigned OIdx = 0;		unsigned OIdx = 0;
for (unsigned i = 0, e = Args.size(); i != e; ++i, ++OIdx) {		for (unsigned i = 0, e = Args.size(); i != e; ++i, ++OIdx) {
Type *Ty = Args[i].Ty;		Type *Ty = Args[i].Ty;
if (!first) {		if (!first) {
O << ", ";		O << ", ";
}		}
first = false;		first = false;

if (!Outs[OIdx].Flags.isByVal()) {		if (!Outs[OIdx].Flags.isByVal()) {
if (Ty->isAggregateType() || Ty->isVectorTy()) {		if (Ty->isAggregateType() || Ty->isVectorTy() || Ty->isIntegerTy(128)) {
unsigned align = 0;		unsigned align = 0;
const CallInst *CallI = cast<CallInst>(CS->getInstruction());		const CallInst *CallI = cast<CallInst>(CS->getInstruction());
// +1 because index 0 is reserved for return type alignment		// +1 because index 0 is reserved for return type alignment
if (!getAlign(*CallI, i + 1, align))		if (!getAlign(*CallI, i + 1, align))
align = DL.getABITypeAlignment(Ty);		align = DL.getABITypeAlignment(Ty);
unsigned sz = DL.getTypeAllocSize(Ty);		unsigned sz = DL.getTypeAllocSize(Ty);
O << ".param .align " << align << " .b8 ";		O << ".param .align " << align << " .b8 ";
O << "_";		O << "_";
[...]	if (!Outs[OIdx].Flags.isByVal()) {
SmallVector<EVT, 16> VTs;		SmallVector<EVT, 16> VTs;
SmallVector<uint64_t, 16> Offsets;		SmallVector<uint64_t, 16> Offsets;
ComputePTXValueVTs(*this, DL, Ty, VTs, &Offsets);		ComputePTXValueVTs(*this, DL, Ty, VTs, &Offsets);
unsigned ArgAlign =		unsigned ArgAlign =
getArgumentAlignment(Callee, CS, Ty, paramCount + 1, DL);		getArgumentAlignment(Callee, CS, Ty, paramCount + 1, DL);
unsigned AllocSize = DL.getTypeAllocSize(Ty);		unsigned AllocSize = DL.getTypeAllocSize(Ty);
SDVTList DeclareParamVTs = DAG.getVTList(MVT::Other, MVT::Glue);		SDVTList DeclareParamVTs = DAG.getVTList(MVT::Other, MVT::Glue);
bool NeedAlign; // Does argument declaration specify alignment?		bool NeedAlign; // Does argument declaration specify alignment?
if (Ty->isAggregateType() || Ty->isVectorTy()) {		if (Ty->isAggregateType() || Ty->isVectorTy() || Ty->isIntegerTy(128)) {
// declare .param .align <align> .b8 .param<n>[<size>];		// declare .param .align <align> .b8 .param<n>[<size>];
SDValue DeclareParamOps[] = {		SDValue DeclareParamOps[] = {
Chain, DAG.getConstant(ArgAlign, dl, MVT::i32),		Chain, DAG.getConstant(ArgAlign, dl, MVT::i32),
DAG.getConstant(paramCount, dl, MVT::i32),		DAG.getConstant(paramCount, dl, MVT::i32),
DAG.getConstant(AllocSize, dl, MVT::i32), InFlag};		DAG.getConstant(AllocSize, dl, MVT::i32), InFlag};
Chain = DAG.getNode(NVPTXISD::DeclareParam, dl, DeclareParamVTs,		Chain = DAG.getNode(NVPTXISD::DeclareParam, dl, DeclareParamVTs,
DeclareParamOps);		DeclareParamOps);
NeedAlign = true;		NeedAlign = true;
[...]	if (Ins.size() > 0) {
// Declare		// Declare
// .param .align 16 .b8 retval0[<size-in-bytes>], or		// .param .align 16 .b8 retval0[<size-in-bytes>], or
// .param .b<size-in-bits> retval0		// .param .b<size-in-bits> retval0
unsigned resultsz = DL.getTypeAllocSizeInBits(RetTy);		unsigned resultsz = DL.getTypeAllocSizeInBits(RetTy);
// Emit ".param .b<size-in-bits> retval0" instead of byte arrays only for		// Emit ".param .b<size-in-bits> retval0" instead of byte arrays only for
// these three types to match the logic in		// these three types to match the logic in
// NVPTXAsmPrinter::printReturnValStr and NVPTXTargetLowering::getPrototype.		// NVPTXAsmPrinter::printReturnValStr and NVPTXTargetLowering::getPrototype.
// Plus, this behavior is consistent with nvcc's.		// Plus, this behavior is consistent with nvcc's.
if (RetTy->isFloatingPointTy() || RetTy->isIntegerTy() ||		if ((RetTy->isFloatingPointTy() || RetTy->isIntegerTy() ||
RetTy->isPointerTy()) {		RetTy->isPointerTy()) && !RetTy->isIntegerTy(128)) {
// Scalar needs to be at least 32bit wide		// Scalar needs to be at least 32bit wide
if (resultsz < 32)		if (resultsz < 32)
resultsz = 32;		resultsz = 32;
SDVTList DeclareRetVTs = DAG.getVTList(MVT::Other, MVT::Glue);		SDVTList DeclareRetVTs = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue DeclareRetOps[] = { Chain, DAG.getConstant(1, dl, MVT::i32),		SDValue DeclareRetOps[] = { Chain, DAG.getConstant(1, dl, MVT::i32),
DAG.getConstant(resultsz, dl, MVT::i32),		DAG.getConstant(resultsz, dl, MVT::i32),
DAG.getConstant(0, dl, MVT::i32), InFlag };		DAG.getConstant(0, dl, MVT::i32), InFlag };
Chain = DAG.getNode(NVPTXISD::DeclareRet, dl, DeclareRetVTs,		Chain = DAG.getNode(NVPTXISD::DeclareRet, dl, DeclareRetVTs,
[...]	if (isImageOrSamplerVal(
assert(isKernelFunction(*F) &&		assert(isKernelFunction(*F) &&
"Only kernels can have image/sampler params");		"Only kernels can have image/sampler params");
InVals.push_back(DAG.getConstant(i + 1, dl, MVT::i32));		InVals.push_back(DAG.getConstant(i + 1, dl, MVT::i32));
continue;		continue;
}		}

if (theArgs[i]->use_empty()) {		if (theArgs[i]->use_empty()) {
// argument is dead		// argument is dead
if (Ty->isAggregateType()) {		if (Ty->isAggregateType() || Ty->isIntegerTy(128)) {
SmallVector<EVT, 16> vtparts;		SmallVector<EVT, 16> vtparts;

ComputePTXValueVTs(*this, DAG.getDataLayout(), Ty, vtparts);		ComputePTXValueVTs(*this, DAG.getDataLayout(), Ty, vtparts);
assert(vtparts.size() > 0 && "empty aggregate type not expected");		assert(vtparts.size() > 0 && "empty aggregate type not expected");
for (unsigned parti = 0, parte = vtparts.size(); parti != parte;		for (unsigned parti = 0, parte = vtparts.size(); parti != parte;
++parti) {		++parti) {
InVals.push_back(DAG.getNode(ISD::UNDEF, dl, Ins[InsIdx].VT));		InVals.push_back(DAG.getNode(ISD::UNDEF, dl, Ins[InsIdx].VT));
++InsIdx;		++InsIdx;
[...]

test/CodeGen/NVPTX/i128-param.ll

This file was added.

				; RUN: llc < %s -O0 -march=nvptx -mcpu=sm_20 | FileCheck %s

				target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

				; CHECK-LABEL: .visible .func callee(
				; CHECK-NEXT: .param .align 8 .b8 callee_param_0[16],
				; CHECK-NEXT: .param .align 8 .b8 callee_param_1[16],
				define void @callee(i128, i128, i128*) {
				; CHECK-DAG: ld.param.u64 %[[REG0:rd[0-9]+]], [callee_param_0];
				; CHECK-DAG: ld.param.u64 %[[REG1:rd[0-9]+]], [callee_param_0+8];
				; CHECK-DAG: ld.param.u64 %[[REG2:rd[0-9]+]], [callee_param_1];
				; CHECK-DAG: ld.param.u64 %[[REG3:rd[0-9]+]], [callee_param_1+8];

				; CHECK: mul.lo.s64 %[[REG4:rd[0-9]+]], %[[REG0]], %[[REG3]];
				; CHECK-NEXT: mul.hi.u64 %[[REG5:rd[0-9]+]], %[[REG0]], %[[REG2]];
				; CHECK-NEXT: add.s64 %[[REG6:rd[0-9]+]], %[[REG5]], %[[REG4]];
				; CHECK-NEXT: mul.lo.s64 %[[REG7:rd[0-9]+]], %[[REG1]], %[[REG2]];
				; CHECK-NEXT: add.s64 %[[REG8:rd[0-9]+]], %[[REG6]], %[[REG7]];
				; CHECK-NEXT: mul.lo.s64 %[[REG9:rd[0-9]+]], %[[REG0]], %[[REG2]];
				%a = mul i128 %0, %1

				store i128 %a, i128* %2
				ret void
				}

				; CHECK-LABEL: .visible .entry caller_kernel(
				; CHECK-NEXT: .param .align 8 .b8 caller_kernel_param_0[16],
				; CHECK-NEXT: .param .align 8 .b8 caller_kernel_param_1[16],
				define ptx_kernel void @caller_kernel(i128, i128, i128*) {
				start:
				; CHECK-DAG: ld.param.u64 %[[REG0:rd[0-9]+]], [caller_kernel_param_0];
				; CHECK-DAG: ld.param.u64 %[[REG1:rd[0-9]+]], [caller_kernel_param_0+8];
				; CHECK-DAG: ld.param.u64 %[[REG2:rd[0-9]+]], [caller_kernel_param_1];
				; CHECK-DAG: ld.param.u64 %[[REG3:rd[0-9]+]], [caller_kernel_param_1+8];

				; CHECK: { // callseq [[CALLSEQ_ID:[0-9]]], 0
				; CHECK: .param .align 8 .b8 param0[16];
				; CHECK-NEXT: st.param.b64 [param0+0], %[[REG0]];
				; CHECK-NEXT: st.param.b64 [param0+8], %[[REG1]];
				; CHECK: .param .align 8 .b8 param1[16];
				; CHECK-NEXT: st.param.b64 [param1+0], %[[REG2]];
				; CHECK-NEXT: st.param.b64 [param1+8], %[[REG3]];
				; CHECK: } // callseq [[CALLSEQ_ID]]
				call void @callee(i128 %0, i128 %1, i128* %2)

				ret void
				}

				; CHECK-LABEL: .visible .func caller_func(
				; CHECK-NEXT: .param .align 8 .b8 caller_func_param_0[16],
				; CHECK-NEXT: .param .align 8 .b8 caller_func_param_1[16],
				define void @caller_func(i128, i128, i128*) {
				start:
				; CHECK-DAG: ld.param.u64 %[[REG0:rd[0-9]+]], [caller_func_param_0];
				; CHECK-DAG: ld.param.u64 %[[REG1:rd[0-9]+]], [caller_func_param_0+8];
				; CHECK-DAG: ld.param.u64 %[[REG2:rd[0-9]+]], [caller_func_param_1];
				; CHECK-DAG: ld.param.u64 %[[REG3:rd[0-9]+]], [caller_func_param_1+8];

				; CHECK: { // callseq [[CALLSEQ_ID:[0-9]]], 0
				; CHECK: .param .align 8 .b8 param0[16];
				; CHECK: st.param.b64 [param0+0], %[[REG0]];
				; CHECK: st.param.b64 [param0+8], %[[REG1]];
				; CHECK: .param .align 8 .b8 param1[16];
				; CHECK: st.param.b64 [param1+0], %[[REG2]];
				; CHECK: st.param.b64 [param1+8], %[[REG3]];
				; CHECK: } // callseq [[CALLSEQ_ID]]
				call void @callee(i128 %0, i128 %1, i128* %2)

				ret void
				}
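
The mul/add sequence checked in @callee is the usual way to build the low 128 bits of a product from 64-bit halves. As a rough sketch, assuming REG0/REG1 hold the low/high halves of the first argument and REG2/REG3 those of the second (which is what the +0 and +8 parameter offsets suggest), the same computation in portable C++ could look like the following. U128, mulHiU64 and mulLow128 are hypothetical names, and overflow in the high half simply wraps, as it does for i128 multiplication.

```cpp
#include <cassert>
#include <cstdint>

// A 128-bit value as two 64-bit halves: lo at byte offset 0, hi at offset 8.
struct U128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// High 64 bits of the unsigned 64x64-bit product (the role of mul.hi.u64).
std::uint64_t mulHiU64(std::uint64_t a, std::uint64_t b) {
  const std::uint64_t mask = 0xFFFFFFFFull;
  const std::uint64_t a0 = a & mask, a1 = a >> 32;
  const std::uint64_t b0 = b & mask, b1 = b >> 32;
  const std::uint64_t p00 = a0 * b0, p01 = a0 * b1;
  const std::uint64_t p10 = a1 * b0, p11 = a1 * b1;
  // Sum of the three terms that land in bits [32, 64) of the full product.
  const std::uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);
  assert((mid >> 32) <= 2 && "three 32-bit terms carry at most 2");
  return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

// Low 128 bits of a * b, in the same order as the CHECK lines.
U128 mulLow128(U128 a, U128 b) {
  std::uint64_t hi = mulHiU64(a.lo, b.lo) + a.lo * b.hi; // mul.hi, mul.lo, add
  hi += a.hi * b.lo;                                     // mul.lo, add
  const std::uint64_t lo = a.lo * b.lo;                  // final mul.lo
  return U128{lo, hi};
}
```

The a.hi * b.hi term never appears because it only affects bits above 128, which is why the test expects three mul.lo and a single mul.hi.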

test/CodeGen/NVPTX/i128-retval.ll

This file was added.

				; RUN: llc < %s -O0 -march=nvptx64 -mcpu=sm_20 | FileCheck %s

				target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

				; CHECK-LABEL: .visible .func (.param .align 8 .b8 func_retval0[16]) callee(
				define i128 @callee(i128) {
				; CHECK-DAG: ld.param.u64 %[[REG0:rd[0-9]+]], [callee_param_0];
				; CHECK-DAG: ld.param.u64 %[[REG1:rd[0-9]+]], [callee_param_0+8];

				; CHECK: st.param.b64 [func_retval0+0], %[[REG0]];
				; CHECK: st.param.b64 [func_retval0+8], %[[REG1]];
				ret i128 %0
				}

				; CHECK-LABEL: .visible .func caller(
				define void @caller(i128, i128*) {
				start:
				; CHECK-DAG: ld.param.u64 %[[REG0:rd[0-9]+]], [caller_param_0];
				; CHECK-DAG: ld.param.u64 %[[REG1:rd[0-9]+]], [caller_param_0+8];
				; CHECK-DAG: ld.param.u64 %[[OUT:rd[0-9]+]], [caller_param_1];

				; CHECK: { // callseq 0, 0
				; CHECK: .param .align 8 .b8 retval0[16];
				; CHECK: call.uni (retval0),
				; CHECK: ld.param.b64 %[[REG2:rd[0-9]+]], [retval0+0];
				; CHECK: ld.param.b64 %[[REG3:rd[0-9]+]], [retval0+8];
				; CHECK: } // callseq 0
				%a = call i128 @callee(i128 %0)

				; CHECK-DAG: st.u64 [%[[OUT]]], %[[REG2]];
				; CHECK-DAG: st.u64 [%[[OUT]]+8], %[[REG3]];
				store i128 %a, i128* %1

				ret void
				}