This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Turn on Loop/SLP vectorization
ClosedPublic

Authored by bkramer on Apr 26 2018, 9:17 AM.

Details

Summary

Since PTX has grown a <2 x half> datatype vectorization has become more
important. The late LoadStoreVectorizer intentionally only does loads
and stores, but now arithmetic has to be vectorized for optimal
throughput too.

This is still very limited, SLP vectorization happily creates <2 x half>
if it's a legal type but there's still a lot of register moving
happening to get that fed into a vectorized store. Overall it's a small
performance win by reducing the amount of arithmetic instructions.

I haven't really checked what the loop vectorizer does to PTX code, the
cost model there might need some more tweaks. I didn't see it causing
harm though.

Diff Detail

Repository
rL LLVM

Event Timeline

bkramer created this revision.Apr 26 2018, 9:17 AM
tra accepted this revision.Apr 26 2018, 9:35 AM
This revision is now accepted and ready to land.Apr 26 2018, 9:35 AM
jlebar added inline comments.Apr 26 2018, 10:49 AM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

Does 1 have specific meaning? I don't see this in any of the comments, and that would be a pretty weird API... (Like, did you mean -1?)

bkramer added inline comments.Apr 26 2018, 10:51 AM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

1 is the default of the generic implementation, I just copied that and removed the check for vector. I'm not even sure if anyone ever checks the value or just compares it against zero.

jlebar added inline comments.Apr 26 2018, 10:59 AM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

I see a few places using the actual value returned: LoopStrengthReduce, LoopVectorize...

bkramer added inline comments.Apr 26 2018, 11:26 AM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

Right, LoopStrengthReduce calls it with (false), which was returning '1' before and after :)

I'm not sure about LoopVectorize, but conservatively limiting its vectorization factor is probably a good thing.

jlebar added inline comments.Apr 26 2018, 11:28 AM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

OK, I think I see what you're after: You want to make the minimal/safest change that gets vectorization on your testcases?

That sounds good to me, but can we make the comment on "return 1" express this motivation?

bkramer updated this revision to Diff 144172.Apr 26 2018, 11:44 AM

Extend getNumberOfRegisters comment.

jlebar accepted this revision.Apr 26 2018, 1:25 PM
hfinkel added inline comments.
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

LoopVectorize is using this number to control the amount of interleaving/unrolling it does. The idea being that inverleaving is beneficial until you create too much register pressure (i.e., until you start spilling).

I don't think that setting this to 1 is a good idea. Perfectly reasonable changes to the loop vectorizer in the future could cause this to completely disable vectorizer. You should set this, I suspect, to the maximum number of registers you can have per thread at full occupancy.

bkramer added inline comments.Apr 27 2018, 1:54 AM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

This is intentional. I want to keep LoopVectorizer as conservative as possible for now. I have benchmarks where SLP makes a clear improvement, for LV it's much less clear and it has the potential of causing huge regressions, especially since NVPTX TTI is not yet tuned for it.

Happy to reword the comment more in case that's still unclear and put in a note to revisit making the LV more aggressive. WDYT?

This revision was automatically updated to reflect the committed changes.
hfinkel added inline comments.Apr 27 2018, 6:50 AM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗(On Diff #144136)

Please give the TTI functions meaningful values. It's perfectly plausible that this can be used for something else in the future, and you're just asking for trouble when we do need to change this later because the delta is bound to be large.

That having been said, we certainly do want the LV to be conservative in terms of increasing register pressure. Luckily, we have a separate TTI function for that:

unsigned getMaxInterleaveFactor(unsigned VF) const;

make this function return 1 for all values of VF. I believe this is the default, but in this case, I recommend explicitly overriding it with a comment about explicitly wanting to be conservative about reducing register pressure. In terms of the LV, this should have the same current effect.

As I recall, the register file size on a Volta, for example, is 256kB/SM and you can have 2048 threads/SM, so that leaves 32 32-bit registers per thread at full occupancy. Thus, I'd recommend setting this number to 32.