This is an archive of the discontinued LLVM Phabricator instance.

Differential D11926

[NVPTX] Use 32-bit divides instead of 64-bit divides where possible
Closed, Public

Authored by meheff on Aug 10 2015, 6:18 PM.

Details

Reviewers

jholewinski
eliben
jingyue

Summary

For NVPTX, try to use 32-bit division instead of 64-bit division when the dividend and divisor
both fit in 32 bits. This speeds up some internal benchmarks significantly. The underlying reason
is that many index computations are carried out in 64 bits but never actually exceed the
capacity of a 32-bit word.
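
As a rough illustration of the idea in the summary, the sketch below models in plain C++ the control flow that a 64-to-32-bit divide bypass produces for an unsigned divide: check whether both operands fit in 32 bits and, if so, divide at the narrower width. The function name `udiv64_bypass` is hypothetical. This is not the LLVM implementation, which rewrites the divide in the compiler rather than in source code; it only shows the shape of the result.

```cpp
#include <cstdint>

// Hypothetical helper (not LLVM code): models the fast/slow path split
// around an unsigned 64-bit divide.
inline uint64_t udiv64_bypass(uint64_t a, uint64_t b) {
  // If no bit above bit 31 is set in either operand, both fit in 32 bits.
  if (((a | b) >> 32) == 0) {
    // Fast path: narrow both operands, divide in 32 bits, widen the result.
    return static_cast<uint64_t>(static_cast<uint32_t>(a) /
                                 static_cast<uint32_t>(b));
  }
  // Slow path: the original 64-bit divide.
  return a / b;
}
```

Index-style operands such as `udiv64_bypass(1000000, 1024)` take the fast path, while a dividend such as `1ULL << 40` falls through to the 64-bit divide. Division by zero is undefined on both paths, as with the plain operator.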

Event Timeline

meheff updated this revision to Diff 31755. Aug 10 2015, 6:18 PM

meheff retitled this revision to [NVPTX] Use 32-bit divides instead of 64-bit divides where possible.

meheff updated this object.

meheff added reviewers: jingyue, jholewinski.

meheff added a subscriber: llvm-commits.

Herald added a subscriber: jholewinski. Aug 10 2015, 6:18 PM

Does any Eigen3 kernel (https://bitbucket.org/eigen/eigen/src/890ac1744b090c8de30aba2a33f4393e049d1559/unsupported/Eigen/CXX11/src/Tensor/?at=default) benefit from this improvement? If so, we can report some numbers there, so that people can understand how important this is for real-world CUDA programs.

Also, can you come up with some llc tests?

Thanks!

Test added. Unfortunately, the standalone Eigen3 benchmarks don't show much improvement with this patch because, I believe, they use 32-bit indices throughout. The huge speedup shows up in the larger-scale benchmarks that use Eigen with 64-bit indices.

LGTM

This revision is now accepted and ready to land. Aug 11 2015, 12:21 PM

lgtm

jingyue closed this revision. Aug 21 2015, 10:41 PM

Revision Contents

Path    Size

lib/Target/NVPTX/NVPTXISelLowering.cpp    4 lines

test/CodeGen/NVPTX/bypass-div.ll    80 lines

Diff 31848

lib/Target/NVPTX/NVPTXISelLowering.cpp

  NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,

    setBooleanContents(ZeroOrNegativeOneBooleanContent);
    setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);

    // Jump is Expensive. Don't create extra control flow for 'and', 'or'
    // condition branches.
    setJumpIsExpensive(true);

+   // Wide divides are _very_ slow. Try to reduce the width of the divide if
+   // possible.
+   addBypassSlowDiv(64, 32);

    // By default, use the Source scheduling
    if (sched4reg)
      setSchedulingPreference(Sched::RegPressure);
    else
      setSchedulingPreference(Sched::Source);

    addRegisterClass(MVT::i1, &NVPTX::Int1RegsRegClass);
    addRegisterClass(MVT::i16, &NVPTX::Int16RegsRegClass);

test/CodeGen/NVPTX/bypass-div.ll

; RUN: llc < %s -march=nvptx -mcpu=sm_35 | FileCheck %s

; 64-bit divides and rems should be split into a fast and slow path where
; the fast path uses a 32-bit operation.

define void @sdiv64(i64 %a, i64 %b, i64* %retptr) {
; CHECK-LABEL: sdiv64(
; CHECK: div.s64
; CHECK: div.u32
; CHECK: ret
  %d = sdiv i64 %a, %b
  store i64 %d, i64* %retptr
  ret void
}

define void @udiv64(i64 %a, i64 %b, i64* %retptr) {
; CHECK-LABEL: udiv64(
; CHECK: div.u64
; CHECK: div.u32
; CHECK: ret
  %d = udiv i64 %a, %b
  store i64 %d, i64* %retptr
  ret void
}

define void @srem64(i64 %a, i64 %b, i64* %retptr) {
; CHECK-LABEL: srem64(
; CHECK: rem.s64
; CHECK: rem.u32
; CHECK: ret
  %d = srem i64 %a, %b
  store i64 %d, i64* %retptr
  ret void
}

define void @urem64(i64 %a, i64 %b, i64* %retptr) {
; CHECK-LABEL: urem64(
; CHECK: rem.u64
; CHECK: rem.u32
; CHECK: ret
  %d = urem i64 %a, %b
  store i64 %d, i64* %retptr
  ret void
}

define void @sdiv32(i32 %a, i32 %b, i32* %retptr) {
; CHECK-LABEL: sdiv32(
; CHECK: div.s32
; CHECK-NOT: div.
  %d = sdiv i32 %a, %b
  store i32 %d, i32* %retptr
  ret void
}

define void @udiv32(i32 %a, i32 %b, i32* %retptr) {
; CHECK-LABEL: udiv32(
; CHECK: div.u32
; CHECK-NOT: div.
  %d = udiv i32 %a, %b
  store i32 %d, i32* %retptr
  ret void
}

define void @srem32(i32 %a, i32 %b, i32* %retptr) {
; CHECK-LABEL: srem32(
; CHECK: rem.s32
; CHECK-NOT: rem.
  %d = srem i32 %a, %b
  store i32 %d, i32* %retptr
  ret void
}

define void @urem32(i32 %a, i32 %b, i32* %retptr) {
; CHECK-LABEL: urem32(
; CHECK: rem.u32
; CHECK-NOT: rem.
  %d = urem i32 %a, %b
  store i32 %d, i32* %retptr
  ret void
}
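
The sdiv64 and srem64 tests above expect an unsigned 32-bit instruction (div.u32, rem.u32) on the fast path even though the source operation is signed. One plausible reading, sketched here in C++ with the hypothetical helpers `sdiv64_bypass` and `srem64_bypass`, is that the fast path is only taken when the upper 32 bits of both operands are zero. In that case the sign bits are zero too, so both values are non-negative and an unsigned narrow operation gives the same answer. This is an illustrative model under that assumption, not the LLVM code.

```cpp
#include <cstdint>

// Hypothetical helpers (not LLVM code): model why an unsigned 32-bit
// operation is safe on the fast path of a signed 64-bit divide or remainder.
inline bool both_fit_in_u32(int64_t a, int64_t b) {
  // Upper 32 bits of both operands are zero, which implies both sign bits
  // are zero, so both values are non-negative.
  const uint64_t bits = static_cast<uint64_t>(a) | static_cast<uint64_t>(b);
  return (bits >> 32) == 0;
}

inline int64_t sdiv64_bypass(int64_t a, int64_t b) {
  if (both_fit_in_u32(a, b)) {
    // Fast path: unsigned 32-bit divide, zero-extended back to 64 bits.
    return static_cast<int64_t>(static_cast<uint32_t>(a) /
                                static_cast<uint32_t>(b));
  }
  // Slow path: the original signed 64-bit divide.
  return a / b;
}

inline int64_t srem64_bypass(int64_t a, int64_t b) {
  if (both_fit_in_u32(a, b)) {
    // Fast path: unsigned 32-bit remainder, zero-extended back to 64 bits.
    return static_cast<int64_t>(static_cast<uint32_t>(a) %
                                static_cast<uint32_t>(b));
  }
  // Slow path: the original signed 64-bit remainder.
  return a % b;
}
```

Any negative operand has its upper bits set, so it always takes the slow path and keeps the signed semantics of the original operation.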