This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Fixed vectorized LDG for f16.
ClosedPublic

Authored by tra on Apr 5 2018, 3:41 PM.


Details

Reviewers

jlebar
bixia

Commits

rGa28e598ebb60: [NVPTX] Fixed vectorized LDG for f16.
rL329456: [NVPTX] Fixed vectorized LDG for f16.

Summary

v2f16 is a special case in NVPTX. v4f16 may be loaded as a pair of v2f16,
and tryLDGLDU() did not previously handle that correctly.

Diff Detail

Build Status

Buildable 16833
Build 16833: arc lint + arc unit

Event Timeline

tra created this revision. Apr 5 2018, 3:41 PM

Herald added subscribers: hiraditya, sanjoy, jholewinski. Apr 5 2018, 3:41 PM

bixia added inline comments. Apr 5 2018, 4:21 PM

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
1244	I played with this and found that the only f16 vector that can reach here is v2f16, due to the way that the legalizer is set up for the target. If we can somehow make v4f16 reach here, I believe that we will still see an error in routine GetConvertOpcode, similar to what we see for v2f16 before this fix. For this reason, I prefer an assertion, something like this:

    if (EltVT == MVT::v2f16) {
      assert(EltVT.getVectorNumElements() == 2 && "missed legalizing vector-of-f16");
    } else {
      NumElts = EltVT.getVectorNumElements();
      EltVT = EltVT.getVectorElementType();
    }

lgtm modulo Bixia's comment.

llvm/test/CodeGen/NVPTX/ldg-invariant.ll
17	This could use a comma or something somewhere :)

This revision is now accepted and ready to land. Apr 5 2018, 4:41 PM

tra added inline comments. Apr 5 2018, 4:50 PM

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
1244	Please take a look at the test cases in ldg-invariant.ll in this patch. The v8f16 variant also gets here.

    Legalizing node: t6: v8f16,ch = load<(invariant load 16 from %ir.ptr, addrspace 1)> t0, t23, undef:i64
    ...
    Legalizing node: t24: v2f16,v2f16,v2f16,v2f16,ch = NVPTXISD::LoadV4<(invariant load 16 from %ir.ptr, addrspace 1)> t0, t23, undef:i64, Constant:i64<0>

Potentially v4f16 could also get here as a load of a pair of v2f16 elements. If you insist on an assertion, it should accept all three cases.

Updated test comments.

Harbormaster completed remote builds in B16799: Diff 141240. Apr 5 2018, 5:01 PM

bixia added inline comments. Apr 5 2018, 7:18 PM

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
1244	What do we do if NumElts % 2 != 0? Can we assert that this never happens, assuming the legalizer guarantees it?

Made the check for even number of elements an assertion.
Cosmetic typo fix in the tests.

@bixia PTAL.
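
The element-count logic discussed in this thread can be sketched outside LLVM as follows. This is a minimal model, not the patch itself: `SimpleVT`, `LoadShape`, and `normalizeLoadShape` are hypothetical names introduced for illustration, and the enum only stands in for the few `MVT` values the discussion mentions.

```cpp
#include <cassert>

// Hypothetical stand-in for the value types in the discussion; not LLVM's MVT.
enum class SimpleVT { f16, v2f16, i32 };

// Hypothetical result record: what the load produces after normalization.
struct LoadShape {
  SimpleVT EltVT;   // type of each value produced by the load
  unsigned NumElts; // number of values produced
};

// Models the check added to tryLDGLDU(): MemElt and MemNumElts describe the
// memory type, ResultVT is the type of the node's first result.
inline LoadShape normalizeLoadShape(SimpleVT MemElt, unsigned MemNumElts,
                                    SimpleVT ResultVT) {
  LoadShape Shape{MemElt, MemNumElts};
  // Vectors of f16 whose results are v2f16 are counted in v2f16 units.
  if (MemElt == SimpleVT::f16 && ResultVT == SimpleVT::v2f16) {
    assert(MemNumElts % 2 == 0 && "Vector must have even number of elements");
    Shape.EltVT = SimpleVT::v2f16;
    Shape.NumElts = MemNumElts / 2;
  }
  return Shape;
}
```

Under this model, a v8f16 load whose results are v2f16 yields four v2f16 values, a plain v2f16 load yields one, and a v4f16 load that was split into f16 results keeps its four f16 values. Without the assertion, an odd element count would be silently truncated by the integer division.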

bixia accepted this revision. Apr 6 2018, 1:58 PM

Closed by commit rL329456: [NVPTX] Fixed vectorized LDG for f16. (authored by tra). Apr 6 2018, 2:13 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp	6 lines
llvm/test/CodeGen/NVPTX/ldg-invariant.ll	45 lines

Diff 141411

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp

[1,233 lines hidden] bool NVPTXDAGToDAGISel::tryLDGLDU(SDNode *N) {
   SDNode *LD;
   SDValue Base, Offset, Addr;

   EVT EltVT = Mem->getMemoryVT();
   unsigned NumElts = 1;
   if (EltVT.isVector()) {
     NumElts = EltVT.getVectorNumElements();
     EltVT = EltVT.getVectorElementType();
+    // vectors of f16 are loaded/stored as multiples of v2f16 elements.
+    if (EltVT == MVT::f16 && N->getValueType(0) == MVT::v2f16) {
+      assert(NumElts % 2 == 0 && "Vector must have even number of elements");
+      EltVT = MVT::v2f16;
+      NumElts /= 2;
+    }
   }

   // Build the "promoted" result VTList for the load. If we are really loading
   // i8s, then the return type will be promoted to i16 since we do not expose
   // 8-bit registers in NVPTX.
   EVT NodeVT = (EltVT == MVT::i8) ? MVT::i16 : EltVT;
   SmallVector<EVT, 5> InstVTs;
   for (unsigned i = 0; i != NumElts; ++i) {
[2,454 lines hidden]

llvm/test/CodeGen/NVPTX/ldg-invariant.ll

 ; RUN: llc < %s -march=nvptx64 -mcpu=sm_35 | FileCheck %s

 ; Check that invariant loads from the global addrspace are lowered to
 ; ld.global.nc.

 ; CHECK-LABEL: @ld_global
 define i32 @ld_global(i32 addrspace(1)* %ptr) {
 ; CHECK: ld.global.nc.{{[a-z]}}32
   %a = load i32, i32 addrspace(1)* %ptr, !invariant.load !0
   ret i32 %a
 }

+; CHECK-LABEL: @ld_global_v2f16
+define half @ld_global_v2f16(<2 x half> addrspace(1)* %ptr) {
+; Load of v2f16 is weird. We consider it to be a legal type, which happens to be
+; loaded/stored as a 32-bit scalar.
+; CHECK: ld.global.nc.b32
+  %a = load <2 x half>, <2 x half> addrspace(1)* %ptr, !invariant.load !0
+  %v1 = extractelement <2 x half> %a, i32 0
+  %v2 = extractelement <2 x half> %a, i32 1
+  %sum = fadd half %v1, %v2
+  ret half %sum
+}
+
+; CHECK-LABEL: @ld_global_v4f16
+define half @ld_global_v4f16(<4 x half> addrspace(1)* %ptr) {
+; Larger f16 vectors may be split into individual f16 elements and multiple
+; loads/stores may be vectorized using f16 element type. Practically it's
+; limited to v4 variant only.
+; CHECK: ld.global.nc.v4.b16
+  %a = load <4 x half>, <4 x half> addrspace(1)* %ptr, !invariant.load !0
+  %v1 = extractelement <4 x half> %a, i32 0
+  %v2 = extractelement <4 x half> %a, i32 1
+  %v3 = extractelement <4 x half> %a, i32 2
+  %v4 = extractelement <4 x half> %a, i32 3
+  %sum1 = fadd half %v1, %v2
+  %sum2 = fadd half %v3, %v4
+  %sum = fadd half %sum1, %sum2
+  ret half %sum
+}
+
+; CHECK-LABEL: @ld_global_v8f16
+define half @ld_global_v8f16(<8 x half> addrspace(1)* %ptr) {
+; Larger vectors are, again, loaded as v4i32. PTX has no v8 variants of loads/stores,
+; so load/store vectorizer has to convert v8f16 -> v4 x v2f16.
+; CHECK: ld.global.nc.v4.b32
+  %a = load <8 x half>, <8 x half> addrspace(1)* %ptr, !invariant.load !0
+  %v1 = extractelement <8 x half> %a, i32 0
+  %v2 = extractelement <8 x half> %a, i32 2
+  %v3 = extractelement <8 x half> %a, i32 4
+  %v4 = extractelement <8 x half> %a, i32 6
+  %sum1 = fadd half %v1, %v2
+  %sum2 = fadd half %v3, %v4
+  %sum = fadd half %sum1, %sum2
+  ret half %sum
+}
+
 ; CHECK-LABEL: @ld_global_v2i32
 define i32 @ld_global_v2i32(<2 x i32> addrspace(1)* %ptr) {
 ; CHECK: ld.global.nc.v2.{{[a-z]}}32
   %a = load <2 x i32>, <2 x i32> addrspace(1)* %ptr, !invariant.load !0
   %v1 = extractelement <2 x i32> %a, i32 0
   %v2 = extractelement <2 x i32> %a, i32 1
   %sum = add i32 %v1, %v2
   ret i32 %sum
[31 lines hidden]
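
The three f16 CHECK lines in this test follow one pattern: an optional `.vN` vector suffix, then the bit width of each loaded value. The sketch below is only an illustration of that pattern. `ldgMnemonic` is a hypothetical helper, not an NVPTX backend function, and it always emits the `b` type letter, whereas the i32 tests accept any lowercase letter there.

```cpp
#include <string>

// Hypothetical helper: builds the instruction prefix that the f16 CHECK lines
// match. NumElts is the number of values loaded, BitsPerElt the width of each.
inline std::string ldgMnemonic(unsigned NumElts, unsigned BitsPerElt) {
  std::string Name = "ld.global.nc";
  // A single value is a scalar load and carries no vector suffix.
  if (NumElts > 1)
    Name += ".v" + std::to_string(NumElts);
  Name += ".b" + std::to_string(BitsPerElt);
  return Name;
}
```

With this model, v2f16 is one 32-bit value (`ldgMnemonic(1, 32)`), v4f16 is four 16-bit values (`ldgMnemonic(4, 16)`), and v8f16 is four 32-bit v2f16 values (`ldgMnemonic(4, 32)`), which matches the strings the test expects.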