This is an archive of the discontinued LLVM Phabricator instance.


Differential D74444

[NVPTX, LSV] Move the LSV optimization pass to later when the graph is cleaner
Closed, Public

Authored by nouiz on Feb 11 2020, 2:14 PM.


Details

Reviewers

tra
george.karpenkov

Commits

rG019ab61e25f2: [NVPTX, LSV] Move the LSV optimization pass to later when the graph is cleaner

Summary

This allows it to recognize more loads as consecutive when the loads' addresses are complex at the start.

Event Timeline

nouiz created this revision. Feb 11 2020, 2:14 PM

Herald added a project: Restricted Project. Feb 11 2020, 2:14 PM

Herald added subscribers: llvm-commits, jfb, hiraditya, jholewinski.

tra added inline comments. Feb 11 2020, 2:43 PM

llvm/test/CodeGen/NVPTX/vector-loads-complex.ll
12	Couple of general notes:
I'd attempt to reduce IR to the bare minimum necessary to demonstrate/test intended behavior. This IR is not bad, but I think it may be reduced further to the point where we end up with just one vectorized 2-element load.
It's a good practice to anchor the checks with fixed start/end points:
    CHECK-LABEL: <function_name>
    ... # other checks on whatever happens inside the function.
    CHECK: ret
It helps to delineate the search boundaries for FileCheck. Not a big deal now, as you only have one function, but I would not be surprised if more tests would be added over time.
14–18	Negative matching is tricky. As written, FileCheck will only ensure that there's no ld.global between the fourth instance of ld.global.v2.u8 and the end of the file. I'd start by attempting to place the checks close to the `load` statements they coalesce into. It's not always possible, but it probably works in this case. As long as the number of loads is fixed/predictable, positively matching all of them makes negative matching unnecessary, so I'd just drop it.
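
The search-window rule described in the comment above can be illustrated with a toy model. This is a minimal sketch, not FileCheck's implementation: `Directive`, `Kind`, and `runChecks` are hypothetical names, patterns are plain substrings, and only `CHECK` and `CHECK-NOT` are modeled. A `CHECK-NOT` is enforced only between the previous positive match and the next one (or the end of the input).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of two FileCheck directive kinds (hypothetical types, not LLVM's).
enum class Kind { Check, CheckNot };

struct Directive {
  Kind kind;
  std::string pattern; // plain substring; no regex support in this sketch
};

// Returns true if `input` satisfies the directives, processed in order.
bool runChecks(const std::string &input, const std::vector<Directive> &dirs) {
  std::size_t pos = 0;                  // end of the previous positive match
  std::vector<std::string> pendingNots; // CHECK-NOTs whose window is still open
  for (const Directive &d : dirs) {
    assert(!d.pattern.empty() && "patterns must be non-empty");
    if (d.kind == Kind::CheckNot) {
      pendingNots.push_back(d.pattern);
      continue;
    }
    std::size_t hit = input.find(d.pattern, pos);
    if (hit == std::string::npos)
      return false; // positive check not found
    // Pending CHECK-NOTs are only enforced inside the window [pos, hit).
    for (const std::string &n : pendingNots) {
      std::size_t bad = input.find(n, pos);
      if (bad != std::string::npos && bad + n.size() <= hit)
        return false;
    }
    pendingNots.clear();
    pos = hit + d.pattern.size();
  }
  // Trailing CHECK-NOTs cover everything from the last match to the end.
  for (const std::string &n : pendingNots)
    if (input.find(n, pos) != std::string::npos)
      return false;
  return true;
}
```

With two `CHECK: ld.global.v2.u8` lines followed by `CHECK-NOT: ld.global`, this model accepts an input in which a scalar `ld.global.u8` appears before the first vector load, because that region lies outside the negative check's window. Placing the negative check between a function-start anchor and the first positive check closes that gap in the model.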

I simplified the test. As it was way simpler, I added it to an existing test file.

Minor test nits. LGTM otherwise.
I'm curious: what does the generated PTX for the function look like before and after the patch?

llvm/test/CodeGen/NVPTX/vector-loads.ll
73 ↗	(On Diff #244243)	CHECK-LABEL
75 ↗	(On Diff #244243)	I'm not sure we need it. The check above verifies that the load did get vectorized, which was the purpose of this patch and the test. Any condition when CHECK-NOT may fail (e.g. we're loading more than one, or it's an unrelated load/store) is unlikely to be relevant. It will also only look in-between `ld.v2.u8` and `ret`, and such issues will not be detected before `ld.v2.u8`. If you want to ensure that `ld.v2.u8` is the only relevant load, then you may need to have a separate test case which would fail if it's found more than one `ld` of any kind. Or you can add another `CHECK-NOT: ld` between the start of the function and `CHECK: ld.v2.u8`. I think that may work, but I'm not completely sure. I'd just skip CHECK-NOT altogether.
92 ↗	(On Diff #244243)	CHECK: ld.v2.u8
98 ↗	(On Diff #244243)	CHECK: ret

This revision is now accepted and ready to land. Feb 12 2020, 12:18 PM

Before:

.visible .func foo_complex(

.param .b32 foo_complex_param_0

) // @foo_complex
{

.reg .b16       %rs<4>;
.reg .b32       %r<11>;

// %bb.0:

ld.param.u32    %r1, [foo_complex_param_0];
mov.u32         %r2, %tid.x;
mov.u32         %r3, %ctaid.x;
shl.b32         %r4, %r3, 9;
and.b32         %r5, %r4, -131072;
and.b32         %r6, %r4, 130560;
shl.b32         %r7, %r2, 1;
or.b32          %r8, %r6, %r7;
add.s32         %r9, %r1, %r5;
add.s32         %r10, %r9, %r8;

ld.u8           %rs1, [%r10+128];
ld.u8           %rs2, [%r10+129];
max.u16         %rs3, %rs1, %rs2;
st.u8           [%r10+129], %rs3;
ret;
// -- End function

}

After:

.visible .func foo_complex(

.param .b32 foo_complex_param_0

) // @foo_complex
{

.reg .b16       %rs<4>;
.reg .b32       %r<11>;

// %bb.0:

ld.param.u32    %r1, [foo_complex_param_0];
mov.u32         %r2, %tid.x;
mov.u32         %r3, %ctaid.x;
shl.b32         %r4, %r3, 9;
and.b32         %r5, %r4, -131072;
and.b32         %r6, %r4, 130560;
shl.b32         %r7, %r2, 1;
or.b32          %r8, %r6, %r7;
add.s32         %r9, %r1, %r5;
add.s32         %r10, %r9, %r8;

ld.v2.u8        {%rs1, %rs2}, [%r10+128];
max.u16         %rs3, %rs1, %rs2;
st.u8           [%r10+129], %rs3;
ret;
// -- End function

}
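
The only difference between the two listings is that the byte loads at `[%r10+128]` and `[%r10+129]` become a single `ld.v2.u8`. The idea of merging loads that share a base and sit at adjacent offsets can be sketched as follows. This is a toy model under loud assumptions, not the LoadStoreVectorizer's algorithm: `ByteLoad`, `VecLoad`, and `coalesce` are hypothetical names, bases are compared as plain strings, and alignment, aliasing, and legal vector widths are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One-byte load from `base + offset` (hypothetical type for illustration).
struct ByteLoad {
  std::string base;
  std::int64_t offset;
};

// A load of `width` consecutive bytes starting at `base + offset`.
struct VecLoad {
  std::string base;
  std::int64_t offset;
  unsigned width;
};

// Greedily merges byte loads that share a base and have adjacent offsets,
// up to `maxWidth` bytes per merged load.
std::vector<VecLoad> coalesce(std::vector<ByteLoad> loads, unsigned maxWidth = 2) {
  assert(maxWidth >= 1);
  std::sort(loads.begin(), loads.end(),
            [](const ByteLoad &a, const ByteLoad &b) {
              return a.base != b.base ? a.base < b.base : a.offset < b.offset;
            });
  std::vector<VecLoad> out;
  for (const ByteLoad &l : loads) {
    if (!out.empty()) {
      VecLoad &last = out.back();
      bool adjacent = last.base == l.base &&
                      last.offset + static_cast<std::int64_t>(last.width) == l.offset;
      if (adjacent && last.width < maxWidth) {
        ++last.width; // extend the previous load instead of emitting a new one
        continue;
      }
    }
    out.push_back({l.base, l.offset, 1});
  }
  return out;
}
```

In this model, two loads are merged only when their bases compare equal. That mirrors the summary's point: if the two addresses are written as different-looking expressions, adjacency is not visible until the expressions have been simplified to a common base.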

nouiz marked an inline comment as done. Feb 12 2020, 12:55 PM

nouiz added inline comments.

llvm/test/CodeGen/NVPTX/vector-loads.ll
73 ↗	(On Diff #244243)	Here I did as the other tests in this file. I'll fix all of them.

I did all the small changes.

tra accepted this revision. Feb 12 2020, 1:42 PM

Closed by commit rG019ab61e25f2: [NVPTX, LSV] Move the LSV optimization pass to later when the graph is cleaner (authored by nouiz, committed by tra). Feb 13 2020, 12:24 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp	7 lines
llvm/test/CodeGen/NVPTX/vector-loads-complex.ll	143 lines

Diff 243996

llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp

(270 preceding lines not shown; enclosing function: void NVPTXPassConfig::addIRPasses() {)
   addPass(createNVPTXAssignValidGlobalNamesPass());
   addPass(createGenericToNVVMPass());

   // NVPTXLowerArgs is required for correctness and should be run right
   // before the address space inference passes.
   addPass(createNVPTXLowerArgsPass(&getNVPTXTargetMachine()));
   if (getOptLevel() != CodeGenOpt::None) {
     addAddressSpaceInferencePasses();
-    if (!DisableLoadStoreVectorizer)
-      addPass(createLoadStoreVectorizerPass());
     addStraightLineScalarOptimizationPasses();
   }

   // === LSR and other generic IR passes ===
   TargetPassConfig::addIRPasses();
   // EarlyCSE is not always strong enough to clean up what LSR produces. For
   // example, GVN can combine
   //
   // %0 = add %a, %b
   // %1 = add %b, %a
   //
   // and
   //
   // %0 = shl nsw %a, 2
   // %1 = shl %a, 2
   //
   // but EarlyCSE can do neither of them.
-  if (getOptLevel() != CodeGenOpt::None)
+  if (getOptLevel() != CodeGenOpt::None) {
     addEarlyCSEOrGVNPass();
+    if (!DisableLoadStoreVectorizer)
+      addPass(createLoadStoreVectorizerPass());
+  }
 }

 bool NVPTXPassConfig::addInstSelector() {
   const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();

   addPass(createLowerAggrCopies());
   addPass(createAllocaHoisting());
   addPass(createNVPTXISelDag(getNVPTXTargetMachine(), getOptLevel()));
(92 following lines not shown)

llvm/test/CodeGen/NVPTX/vector-loads-complex.ll

This file was added.

				; RUN: llc -march=nvptx -mcpu=sm_30 < %s | FileCheck %s

				; This test that NVPTX can still vectorize loads even when the load's
				; address is complex.


				declare void @llvm.assume(i1) #3
				declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #0
				declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
				declare i32 @llvm.nvvm.shfl.sync.down.i32(i32, i32, i32, i32) #2

				define void @reduce_10(i8* nocapture readonly align 16 dereferenceable(134217728) %alloc0, i8* align 64 dereferenceable(1024) %alloc1) local_unnamed_addr #0 {
				entry:
				; CHECK: ld.global.v2.u8
				; CHECK: ld.global.v2.u8
				; CHECK: ld.global.v2.u8
				; CHECK: ld.global.v2.u8
				; CHECK-NOT: ld.global
				%arg0.1.typed = bitcast i8* %alloc0 to [1024 x [131072 x i8]]*
				%0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !range !3
				%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !range !4
				%2 = lshr i32 %1, 8
				%3 = shl nuw nsw i32 %1, 9
				%tile_origin.2 = and i32 %3, 130560
				%start_offset_x_mul = shl nuw nsw i32 %0, 1
				%4 = or i32 %tile_origin.2, %start_offset_x_mul
				%5 = zext i32 %4 to i64
				%6 = or i32 %4, 1
				%7 = zext i32 %6 to i64
				%8 = or i32 %4, 128
				%9 = zext i32 %8 to i64
				%10 = or i32 %4, 129
				%11 = zext i32 %10 to i64
				%12 = or i32 %4, 256
				%13 = zext i32 %12 to i64
				%14 = or i32 %4, 257
				%15 = zext i32 %14 to i64
				%16 = or i32 %4, 384
				%17 = zext i32 %16 to i64
				%18 = or i32 %4, 385
				%19 = zext i32 %18 to i64
				%20 = zext i32 %2 to i64
				%21 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %5
				%22 = load i8, i8* %21, align 2
				%23 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %7
				%24 = load i8, i8* %23, align 1
				%25 = icmp ult i8 %22, %24
				%26 = select i1 %25, i8 %24, i8 %22
				%27 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %9
				%28 = load i8, i8* %27, align 2
				%29 = icmp ult i8 %26, %28
				%30 = select i1 %29, i8 %28, i8 %26
				%31 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %11
				%32 = load i8, i8* %31, align 1
				%33 = icmp ult i8 %30, %32
				%34 = select i1 %33, i8 %32, i8 %30
				%35 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %13
				%36 = load i8, i8* %35, align 2
				%37 = icmp ult i8 %34, %36
				%38 = select i1 %37, i8 %36, i8 %34
				%39 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %15
				%40 = load i8, i8* %39, align 1
				%41 = icmp ult i8 %38, %40
				%42 = select i1 %41, i8 %40, i8 %38
				%43 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %17
				%44 = load i8, i8* %43, align 2
				%45 = icmp ult i8 %42, %44
				%46 = select i1 %45, i8 %44, i8 %42
				%47 = getelementptr inbounds [1024 x [131072 x i8]], [1024 x [131072 x i8]]* %arg0.1.typed, i64 0, i64 %20, i64 %19
				%48 = load i8, i8* %47, align 1
				%49 = icmp ult i8 %46, %48
				%50 = select i1 %49, i8 %48, i8 %46
				%lane_id = and i32 %0, 31
				%51 = zext i8 %50 to i32
				%52 = tail call i32 @llvm.nvvm.shfl.sync.down.i32(i32 -1, i32 %51, i32 16, i32 31)
				%53 = trunc i32 %52 to i8
				%54 = icmp ult i8 %50, %53
				%55 = select i1 %54, i8 %53, i8 %50
				%56 = zext i8 %55 to i32
				%57 = tail call i32 @llvm.nvvm.shfl.sync.down.i32(i32 -1, i32 %56, i32 8, i32 31)
				%58 = trunc i32 %57 to i8
				%59 = icmp ult i8 %55, %58
				%60 = select i1 %59, i8 %58, i8 %55
				%61 = zext i8 %60 to i32
				%62 = tail call i32 @llvm.nvvm.shfl.sync.down.i32(i32 -1, i32 %61, i32 4, i32 31)
				%63 = trunc i32 %62 to i8
				%64 = icmp ult i8 %60, %63
				%65 = select i1 %64, i8 %63, i8 %60
				%66 = zext i8 %65 to i32
				%67 = tail call i32 @llvm.nvvm.shfl.sync.down.i32(i32 -1, i32 %66, i32 2, i32 31)
				%68 = trunc i32 %67 to i8
				%69 = icmp ult i8 %65, %68
				%70 = select i1 %69, i8 %68, i8 %65
				%71 = zext i8 %70 to i32
				%72 = tail call i32 @llvm.nvvm.shfl.sync.down.i32(i32 -1, i32 %71, i32 1, i32 31)
				%73 = trunc i32 %72 to i8
				%74 = icmp ult i8 %70, %73
				%75 = select i1 %74, i8 %73, i8 %70
				%76 = icmp eq i32 %lane_id, 0
				br i1 %76, label %lane_id_is_zero-true, label %lane_id_is_zero-after

				lane_id_is_zero-after: ; preds = %atomic_op_loop_body, %entry
				ret void

				lane_id_is_zero-true: ; preds = %entry
				%output_element_address = getelementptr inbounds i8, i8* %alloc1, i64 %20
				%cas_new_output_address = alloca i32, align 4
				%77 = ptrtoint i8* %output_element_address to i64
				%78 = and i64 %77, 3
				%79 = and i64 %77, -4
				%80 = inttoptr i64 %79 to i32*
				%81 = ptrtoint i32* %cas_new_output_address to i64
				%82 = or i64 %78, %81
				%83 = inttoptr i64 %82 to i8*
				%cas_old_output = load i32, i32* %80, align 4
				br label %atomic_op_loop_body

				atomic_op_loop_body: ; preds = %atomic_op_loop_body, %lane_id_is_zero-true
				%cas_old_output21 = phi i32 [ %cas_old_output22, %atomic_op_loop_body ], [ %cas_old_output, %lane_id_is_zero-true ]
				store i32 %cas_old_output21, i32* %cas_new_output_address, align 4
				%.val42 = load i8, i8* %83, align 1
				%84 = icmp ult i8 %.val42, %75
				%85 = select i1 %84, i8 %75, i8 %.val42
				store i8 %85, i8* %83, align 1
				%cas_new_output = load i32, i32* %cas_new_output_address, align 4
				%86 = cmpxchg i32* %80, i32 %cas_old_output21, i32 %cas_new_output seq_cst seq_cst
				%cas_old_output22 = extractvalue { i32, i1 } %86, 0
				%success = extractvalue { i32, i1 } %86, 1
				br i1 %success, label %lane_id_is_zero-after, label %atomic_op_loop_body
				}

				attributes #0 = { nounwind }
				attributes #1 = { nounwind readnone }
				attributes #2 = { convergent inaccessiblememonly nounwind }

				!nvvm.annotations = !{!0, !1}
				!llvm.module.flags = !{!2}

				!0 = !{void (i8*, i8*)* @reduce_10, !"kernel", i32 1}
				!1 = !{void (i8*, i8*)* @reduce_10, !"reqntidx", i32 64}
				!2 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
				!3 = !{i32 0, i32 64}
				!4 = !{i32 0, i32 262144}
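
The test IR above builds its element indices with `or` (for example `or i32 %4, 1` and `or i32 %4, 129`) rather than `add`, which is part of what makes the addresses look complex. Given the `!range` metadata (`tid.x` in [0, 64), `ctaid.x` in [0, 262144)), the bits set in `%4` never overlap the constants being or-ed in, so each `or` is numerically an `add` and the paired loads are one byte apart. The following is a brute-force check of that arithmetic only; `baseIndex` and `orActsAsAdd` are hypothetical helper names, and the review does not state which LLVM pass, if any, relies on this fact.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the computation of %4 in the test IR.
std::uint32_t baseIndex(std::uint32_t tid, std::uint32_t ctaid) {
  assert(tid < 64 && ctaid < 262144);                // !range !3 and !4
  std::uint32_t tileOrigin = (ctaid << 9) & 130560u; // %tile_origin.2
  return tileOrigin | (tid << 1);                    // %4
}

// Returns true if (x | c) == x + c for every index x the kernel can produce.
bool orActsAsAdd(std::uint32_t c) {
  for (std::uint32_t ctaid = 0; ctaid < 262144; ++ctaid) {
    for (std::uint32_t tid = 0; tid < 64; ++tid) {
      std::uint32_t x = baseIndex(tid, ctaid);
      if ((x | c) != x + c)
        return false;
    }
  }
  return true;
}
```

The check holds for the constants the test uses (1, 128, 129, 256, 257, 384, 385), because `%4` only ever has bits 1 to 6 and 9 to 16 set. It fails for a constant such as 2, whose bit can already be set by `tid << 1`.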