This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Use 32-bit divides instead of 64-bit divides where possible
ClosedPublic

Authored by meheff on Aug 10 2015, 6:18 PM.

Details

Summary

For NVPTX, try to use 32-bit division instead of 64-bit division when the dividend and divisor
both fit in 32 bits. This speeds up some internal benchmarks significantly. The underlying reason
is that many index computations are carried out in 64 bits but never actually exceed the
capacity of a 32-bit word.
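
As a rough source-level illustration of the idea in the summary (not the patch itself, which works inside the compiler and whose diff is not reproduced on this page), the runtime check can be sketched as below. The names udiv64_bypass and row_of are hypothetical, and the sketch assumes unsigned operands; signed division would need a different range test.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not taken from the patch): divide two unsigned 64-bit
// values, using a 32-bit divide when both operands fit in 32 bits.
inline uint64_t udiv64_bypass(uint64_t dividend, uint64_t divisor) {
    assert(divisor != 0 && "division by zero is undefined");
    // OR-ing the operands lets one test cover both: any bit set in the
    // upper 32 bits of either value makes the shifted result nonzero.
    if (((dividend | divisor) >> 32) == 0) {
        // Fast path: both values are representable in 32 bits, so the
        // narrower divide yields the same quotient.
        return static_cast<uint32_t>(dividend) / static_cast<uint32_t>(divisor);
    }
    // Slow path: at least one operand needs the full 64 bits.
    return dividend / divisor;
}

// Hypothetical index computation of the kind the summary mentions: the row
// of a linear index in a row-major matrix with num_cols columns. The types
// are 64-bit, but typical values stay well below 2^32.
inline uint64_t row_of(uint64_t linear_index, uint64_t num_cols) {
    return udiv64_bypass(linear_index, num_cols);
}
```

The extra compare and branch only pay off when the 64-bit divide is much slower than the 32-bit one, which the summary's benchmark results suggest is the case for this target.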

Event Timeline

meheff updated this revision to Diff 31755. Aug 10 2015, 6:18 PM
meheff retitled this revision to [NVPTX] Use 32-bit divides instead of 64-bit divides where possible.
meheff updated this object.
meheff added reviewers: jingyue, jholewinski.
meheff added a subscriber: llvm-commits.
jingyue edited edge metadata. Aug 10 2015, 8:25 PM

Does any Eigen3 kernel (https://bitbucket.org/eigen/eigen/src/890ac1744b090c8de30aba2a33f4393e049d1559/unsupported/Eigen/CXX11/src/Tensor/?at=default) benefit from this improvement? If so, we can report some numbers there, so that people can understand how important this is for real-world CUDA programs.

Also, can you come up with some llc tests?

Thanks!

meheff updated this revision to Diff 31848. Aug 11 2015, 12:17 PM
meheff edited edge metadata.

Test added. Unfortunately, the standalone Eigen3 benchmarks don't show much improvement with this patch because, I believe, they use 32-bit indices throughout. The huge speedup shows up in the larger-scale benchmarks that use Eigen with 64-bit indices.

jingyue accepted this revision. Aug 11 2015, 12:21 PM
jingyue edited edge metadata.

LGTM

This revision is now accepted and ready to land. Aug 11 2015, 12:21 PM
eliben accepted this revision. Aug 11 2015, 1:22 PM
eliben added a reviewer: eliben.
eliben added a subscriber: eliben.

lgtm

jingyue closed this revision. Aug 21 2015, 10:41 PM