For NVPTX, try to use 32-bit division instead of 64-bit division when both the dividend and
the divisor fit in 32 bits. This speeds up some internal benchmarks significantly. The underlying
reason is that many index computations are carried out in 64 bits but never actually exceed the
capacity of a 32-bit word.
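As a rough illustration of the idea (not the code from this patch), the runtime check can be sketched in C++ as below. The helper name `div_bypass` is hypothetical, and the sketch assumes unsigned operands and a nonzero divisor. The patch itself applies to NVPTX code generation rather than to source code.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only; not taken from the patch.
// Assumes unsigned operands and divisor != 0 (division by zero is
// undefined on both paths).
inline uint64_t div_bypass(uint64_t dividend, uint64_t divisor) {
  // OR the operands together: if no bit above bit 31 is set in the
  // result, both values fit in 32 bits.
  if (((dividend | divisor) >> 32) == 0) {
    // Fast path: narrow both operands and use a 32-bit division.
    return static_cast<uint32_t>(dividend) / static_cast<uint32_t>(divisor);
  }
  // Slow path: at least one operand needs more than 32 bits.
  return dividend / divisor;
}
```

The extra compare and branch only pay off where 64-bit division is much more expensive than 32-bit division, which is the situation the summary describes for NVPTX.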
Details
- Reviewers: jholewinski, eliben, jingyue
Diff Detail
Event Timeline
Comment Actions
Does any Eigen3 kernel (https://bitbucket.org/eigen/eigen/src/890ac1744b090c8de30aba2a33f4393e049d1559/unsupported/Eigen/CXX11/src/Tensor/?at=default) benefit from this improvement? If so, we can report some numbers there, so that people can understand how important this is for real-world CUDA programs.
Also, can you come up with some llc tests?
Thanks!
Comment Actions
Test added. Unfortunately, the standalone Eigen3 benchmarks don't show much improvement with this patch because, I believe, they use 32-bit indices throughout. The large speedup shows up in the larger-scale benchmarks that use Eigen with 64-bit indices.