This is an archive of the discontinued LLVM Phabricator instance.

Differential D11926

[NVPTX] Use 32-bit divides instead of 64-bit divides where possible
Closed, Public

Authored by meheff on Aug 10 2015, 6:18 PM.

Details

Reviewers

jholewinski
eliben
jingyue

Summary

For NVPTX, try to use 32-bit division instead of 64-bit division when the dividend and divisor
both fit in 32 bits. This speeds up some internal benchmarks significantly. The underlying reason
is that many index computations are carried out in 64 bits but never actually exceed the
capacity of a 32-bit word.
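
As a rough illustration of the idea (not the patch itself), the runtime check can be sketched in plain C++. The function name `udiv64_with_bypass` is hypothetical, and the sketch covers only unsigned division; the actual transformation is applied by LLVM to the IR rather than written in source.

```cpp
#include <cstdint>

// Hypothetical helper: 64-bit unsigned division with a 32-bit fast path.
// The divisor is assumed to be nonzero, as for any integer division.
uint64_t udiv64_with_bypass(uint64_t dividend, uint64_t divisor) {
  // OR the operands together: if the upper 32 bits of the result are zero,
  // both values fit in a 32-bit word.
  if (((dividend | divisor) >> 32) == 0) {
    // Fast path: a narrow divide gives the same quotient.
    uint32_t q = static_cast<uint32_t>(dividend) / static_cast<uint32_t>(divisor);
    return static_cast<uint64_t>(q);
  }
  // Slow path: fall back to the full-width divide.
  return dividend / divisor;
}
```

The extra compare and branch only pay off when the wide divide is much slower than the narrow one and the operands are usually small, which is the situation the summary describes for index arithmetic.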

Diff Detail

Event Timeline

meheff updated this revision to Diff 31755. (Aug 10 2015, 6:18 PM)

meheff retitled this revision to [NVPTX] Use 32-bit divides instead of 64-bit divides where possible.

meheff updated this object.

meheff added reviewers: jingyue, jholewinski.

meheff added a subscriber: llvm-commits.

Herald added a subscriber: jholewinski. (Aug 10 2015, 6:18 PM)

Does any Eigen3 kernel (https://bitbucket.org/eigen/eigen/src/890ac1744b090c8de30aba2a33f4393e049d1559/unsupported/Eigen/CXX11/src/Tensor/?at=default) benefit from this improvement? If so, we can report some numbers there, so that people can understand how important this is for real-world CUDA programs.

Also, can you come up with some llc tests?

Thanks!

Test added. Unfortunately, the standalone Eigen3 benchmarks don't show much improvement with this patch because, I believe, they use 32-bit indices throughout. Where we see the huge speedup is in the larger-scale benchmarks that use Eigen with 64-bit indices.

LGTM

This revision is now accepted and ready to land. (Aug 11 2015, 12:21 PM)

lgtm

jingyue closed this revision. (Aug 21 2015, 10:41 PM)

Revision Contents

Path: lib/Target/NVPTX/NVPTXISelLowering.cpp
Size: 4 lines

Diff 31755

lib/Target/NVPTX/NVPTXISelLowering.cpp

NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,

   setBooleanContents(ZeroOrNegativeOneBooleanContent);
   setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);

   // Jump is Expensive. Don't create extra control flow for 'and', 'or'
   // condition branches.
   setJumpIsExpensive(true);

+  // Wide divides are _very_ slow. Try to reduce the width of the divide if
+  // possible.
+  addBypassSlowDiv(64, 32);
+
   // By default, use the Source scheduling
   if (sched4reg)
     setSchedulingPreference(Sched::RegPressure);
   else
     setSchedulingPreference(Sched::Source);

   addRegisterClass(MVT::i1, &NVPTX::Int1RegsRegClass);
   addRegisterClass(MVT::i16, &NVPTX::Int16RegsRegClass);