This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] enable SpeculativeExecution in NVPTX
ClosedPublic

Authored by jingyue on Jul 14 2015, 2:56 PM.

Details

Reviewers

jholewinski
broune
eliben

Commits

rGe7981cee2427: [NVPTX] enable SpeculativeExecution in NVPTX
rL242437: [NVPTX] enable SpeculativeExecution in NVPTX

Summary

SpeculativeExecution enables a series of straight-line optimizations (such
as SLSR and NaryReassociate) on conditional code. For example,

if (...)
  ... b * s ...
if (...)
  ... (b + 1) * s ...

speculative execution can hoist b * s and (b + 1) * s from then-blocks,
so that we have

... b * s ...
if (...)
  ...
... (b + 1) * s ...
if (...)
  ...

Then, SLSR can rewrite (b + 1) * s to (b * s + s) because after
speculative execution b * s dominates (b + 1) * s.
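
The rewrite described above can be modeled in ordinary C++ as a minimal sketch. The function names, the `Sink` type, and `check_variants` are hypothetical and exist only for illustration; they are not part of the patch. Unsigned arithmetic is assumed here so that overflow wraps and the identity (b + 1) * s == b * s + s holds exactly.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the "..." uses in the summary: each variant
// records the values it would have used so they can be compared.
using Sink = std::vector<std::uint32_t>;

// Original shape: each multiply lives inside its own then-block, so
// neither multiply dominates the other.
Sink original(bool c0, bool c1, std::uint32_t b, std::uint32_t s) {
  Sink out;
  if (c0) out.push_back(b * s);
  if (c1) out.push_back((b + 1) * s);
  return out;
}

// After speculative execution: both multiplies are hoisted out of the
// then-blocks, so b * s now dominates (b + 1) * s.
Sink speculated(bool c0, bool c1, std::uint32_t b, std::uint32_t s) {
  Sink out;
  std::uint32_t m0 = b * s;
  if (c0) out.push_back(m0);
  std::uint32_t m1 = (b + 1) * s;
  if (c1) out.push_back(m1);
  return out;
}

// After SLSR: the second multiply is rewritten as an add on the first.
Sink strength_reduced(bool c0, bool c1, std::uint32_t b, std::uint32_t s) {
  Sink out;
  std::uint32_t m0 = b * s;
  if (c0) out.push_back(m0);
  std::uint32_t m1 = m0 + s;  // (b + 1) * s == b * s + s
  if (c1) out.push_back(m1);
  return out;
}

// Asserts that all three variants agree for every combination of conditions.
void check_variants(std::uint32_t b, std::uint32_t s) {
  for (int mask = 0; mask < 4; ++mask) {
    bool c0 = (mask & 1) != 0;
    bool c1 = (mask & 2) != 0;
    assert(original(c0, c1, b, s) == speculated(c0, c1, b, s));
    assert(original(c0, c1, b, s) == strength_reduced(c0, c1, b, s));
  }
}
```

Calling `check_variants(3, 7)` exercises all four condition combinations. The trade-off the sketch makes visible is that the hoisted multiply runs even when its condition is false, which is the cost speculation pays to expose the cheaper add.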

The performance impact of this change is significant. It speeds up the
benchmarks running EigenFloatContractionKernelInternal16x16
(https://bitbucket.org/eigen/eigen/src/ba68f42fa69e4f43417fe1e52669d4dd5d2b3bee/unsupported/Eigen/CXX11/src/Tensor/TensorContractionCuda.h?at=default#cl-526)
by roughly 2%. Some internal benchmarks that have the above code pattern
are improved by up to 40%. No significant slowdowns are observed on
Eigen CUDA microbenchmarks.

Event Timeline

jingyue updated this revision to Diff 29717. Jul 14 2015, 2:56 PM

jingyue retitled this revision from to [NVPTX] enable SpeculativeExecution in NVPTX.

jingyue updated this object.

jingyue added reviewers: jholewinski, broune.

jingyue added a subscriber: llvm-commits.

Herald added a subscriber: jholewinski. Jul 14 2015, 2:56 PM

+eliben

lgtm

This revision is now accepted and ready to land. Jul 16 2015, 9:58 AM

jingyue closed this revision. Jul 16 2015, 1:13 PM

Revision Contents

Path                                                                   Size
lib/Target/NVPTX/NVPTXTargetMachine.cpp                                1 line
test/Transforms/StraightLineStrengthReduce/NVPTX/speculative-slsr.ll   71 lines

Diff 29717

lib/Target/NVPTX/NVPTXTargetMachine.cpp

void NVPTXPassConfig::addIRPasses() {
  ...
  addPass(createSROAPass());
  addPass(createNVPTXLowerAllocaPass());
  addPass(createNVPTXFavorNonGenericAddrSpacesPass());
  // FavorNonGenericAddrSpaces shortcuts unnecessary addrspacecasts, and leave
  // them unused. We could remove dead code in an ad-hoc manner, but that
  // requires manual work and might be error-prone.
  addPass(createDeadCodeEliminationPass());
  addPass(createSeparateConstOffsetFromGEPPass());
+ addPass(createSpeculativeExecutionPass());
  // ReassociateGEPs exposes more opportunites for SLSR. See
  // the example in reassociate-geps-and-slsr.ll.
  addPass(createStraightLineStrengthReducePass());
  // SeparateConstOffsetFromGEP and SLSR creates common expressions which GVN or
  // EarlyCSE can reuse. GVN generates significantly better code than EarlyCSE
  // for some of our benchmarks.
  if (getOptLevel() == CodeGenOpt::Aggressive)
    addPass(createGVNPass());
  ...

test/Transforms/StraightLineStrengthReduce/NVPTX/speculative-slsr.ll

This file was added.

; RUN: llc < %s -march=nvptx64 -mcpu=sm_35 | FileCheck %s

target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

; CUDA code
; __global__ void foo(int b, int s) {
;   #pragma unroll
;   for (int i = 0; i < 4; ++i) {
;     if (cond(i))
;       use((b + i) * s);
;   }
; }
define void @foo(i32 %b, i32 %s) {
; CHECK-LABEL: .visible .entry foo(
entry:
; CHECK: ld.param.u32 [[s:%r[0-9]+]], [foo_param_1];
; CHECK: ld.param.u32 [[b:%r[0-9]+]], [foo_param_0];
  %call = tail call zeroext i1 @cond(i32 0)
  br i1 %call, label %if.then, label %for.inc

if.then: ; preds = %entry
  %mul = mul nsw i32 %b, %s
; CHECK: mul.lo.s32 [[a0:%r[0-9]+]], [[b]], [[s]]
  tail call void @use(i32 %mul)
  br label %for.inc

for.inc: ; preds = %entry, %if.then
  %call.1 = tail call zeroext i1 @cond(i32 1)
  br i1 %call.1, label %if.then.1, label %for.inc.1

if.then.1: ; preds = %for.inc
  %add.1 = add nsw i32 %b, 1
  %mul.1 = mul nsw i32 %add.1, %s
; CHECK: add.s32 [[a1:%r[0-9]+]], [[a0]], [[s]]
  tail call void @use(i32 %mul.1)
  br label %for.inc.1

for.inc.1: ; preds = %if.then.1, %for.inc
  %call.2 = tail call zeroext i1 @cond(i32 2)
  br i1 %call.2, label %if.then.2, label %for.inc.2

if.then.2: ; preds = %for.inc.1
  %add.2 = add nsw i32 %b, 2
  %mul.2 = mul nsw i32 %add.2, %s
; CHECK: add.s32 [[a2:%r[0-9]+]], [[a1]], [[s]]
  tail call void @use(i32 %mul.2)
  br label %for.inc.2

for.inc.2: ; preds = %if.then.2, %for.inc.1
  %call.3 = tail call zeroext i1 @cond(i32 3)
  br i1 %call.3, label %if.then.3, label %for.inc.3

if.then.3: ; preds = %for.inc.2
  %add.3 = add nsw i32 %b, 3
  %mul.3 = mul nsw i32 %add.3, %s
; CHECK: add.s32 [[a3:%r[0-9]+]], [[a2]], [[s]]
  tail call void @use(i32 %mul.3)
  br label %for.inc.3

for.inc.3: ; preds = %if.then.3, %for.inc.2
  ret void
}

declare zeroext i1 @cond(i32)

declare void @use(i32)

!nvvm.annotations = !{!0}

!0 = !{void (i32, i32)* @foo, !"kernel", i32 1}
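
The CHECK lines in this test expect one mul.lo.s32 followed by three add.s32 instructions, each add consuming the previous result. As a rough illustration of that expected shape, the arithmetic can be sketched in C++. The names `direct`, `chained`, and `check_chain` are hypothetical, and unsigned arithmetic is assumed so that overflow wraps rather than being undefined.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Direct form of the unrolled loop body: four independent multiplies,
// one per iteration, computing (b + i) * s.
std::array<std::uint32_t, 4> direct(std::uint32_t b, std::uint32_t s) {
  std::array<std::uint32_t, 4> out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = (b + static_cast<std::uint32_t>(i)) * s;
  return out;
}

// Shape the CHECK lines look for: a single multiply (a0), then a chain of
// adds where each value is the previous one plus s (a1, a2, a3).
std::array<std::uint32_t, 4> chained(std::uint32_t b, std::uint32_t s) {
  std::array<std::uint32_t, 4> out{};
  out[0] = b * s;
  for (std::size_t i = 1; i < out.size(); ++i)
    out[i] = out[i - 1] + s;
  return out;
}

// Asserts that the chained form produces the same values as the direct form.
void check_chain(std::uint32_t b, std::uint32_t s) {
  assert(direct(b, s) == chained(b, s));
}
```

For example, `chained(5, 3)` yields 15, 18, 21, 24, matching `direct(5, 3)`. The chain only becomes available once each product dominates the next, which is what hoisting out of the then-blocks provides.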