This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Allow load/store vectorization.
Abandoned · Public

Authored by jlebar on May 24 2016, 4:09 PM.

Details

Reviewers
tra
hfinkel
Summary

This patch adds enough information to the NVPTX TTI so that the SLP
vectorizer will fire.

This gets us to parity with NVCC on the Eigen benchmark suite. (Without these
changes, we're 30+% slower on many benchmarks.)

Event Timeline

jlebar updated this revision to Diff 58348. May 24 2016, 4:09 PM
jlebar retitled this revision to [NVPTX] Allow load/store vectorization.
jlebar updated this object.
jlebar added reviewers: tra, hfinkel.
jlebar added a subscriber: jholewinski.
hfinkel added inline comments. May 24 2016, 4:21 PM
lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
130

This should say 'FIXME', because we should fix that behavior of the SLP vectorizer (I actually thought we had, but maybe we only did in the loop vectorizer).

146

Is the same true for ExtractElement?

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
59

Please comment on why 1 register.

tra added inline comments. May 24 2016, 4:21 PM
test/Transforms/SLPVectorizer/NVPTX/simple.ll
48

The IR above does not seem to have much to do with load/store vectorization. Perhaps it can be trimmed down.

51

Does this change vectorize loads as well? If so, it would be nice to add a test for that, too.

jlebar updated this revision to Diff 58364. May 24 2016, 5:50 PM
jlebar marked 3 inline comments as done.

Add test checking load vectorization, and make vectorization of very small
functions work.

jlebar added inline comments. May 24 2016, 5:51 PM
lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
130
146

Done and added a test, thank you.

test/Transforms/SLPVectorizer/NVPTX/simple.ll
48

Interestingly, toy cases are handled specially, so we need to test both. Added a fix to make toy cases vectorize, plus a test.

jlebar added a comment. Edited May 24 2016, 5:56 PM

We cannot currently vectorize loads across calls, unless those calls are vectorizable intrinsics.

This seems sort of broken to me even outside the context of NVPTX, because it means that if you do something like:

load 4 floats
do a bunch of vectorizable math
call a non-vectorizable function on each element of your bundle
store the 4 results

then we won't vectorize this (except maybe the store). It seems to me that we should calculate the cost of vectorizing this if we un-vectorize the calls. I dunno if that would be a useful optimization anywhere other than on nvptx, though.

The case above is important because none of our intrinsics are vectorizable, so this is basically any kernel that doesn't do exclusively +-/*. But I think I'm happy to look at that in a separate patch, since this is a substantial win as-is.

jlebar updated this revision to Diff 58366. May 24 2016, 6:04 PM

Fix comment.

hfinkel added inline comments. May 24 2016, 6:09 PM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
59

I'm still wondering why this is set to 1.

jlebar added a comment. Edited May 24 2016, 6:13 PM

(Accidental dup comment removed; I cannot figure out phabricator)

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
59

Sorry I missed this one.

The default TTI behavior is to say we have 1 general purpose register and 0 vector regs. Changing it to 1 and 1 seems like the minimal change.

The virtual/physical register configuration in the NVPTX backend is kind of weird. Although we never do register allocation, we still define 4 physical NVPTX registers for each type, except vector types, for which we define no registers at all. That's OK here, because this value is just an upper bound on the number of vector registers of any type.

tra edited edge metadata. May 25 2016, 9:54 AM

LGTM. I'll defer approval to Hal.

jlebar abandoned this revision. Jun 10 2016, 11:28 AM

Abandoning this in favor of D19501, which does a *much* better job of what we want.