This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorizer] Try different vectorization factors and set max vector register size based on target
ClosedPublic

Authored by spatel on Jul 5 2015, 9:22 PM.

Details

Summary

This patch is based on discussion on the llvmdev mailing list:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-July/087405.html

and also solves:
https://llvm.org/bugs/show_bug.cgi?id=17170

As mentioned on the dev list and bug report, the new loop on the vector register size may cause an unacceptable compile-time increase, so this may need to be shielded by some more aggressive optimization specification. If not, this patch could be extended to other SLP pattern matchers that hardcode the vector register size (see FIXME comments).

The AMDGPU XFAIL test either should be fixed or removed if it's not valid any more?

Diff Detail

Event Timeline

spatel updated this revision to Diff 29061.Jul 5 2015, 9:22 PM
spatel retitled this revision from to [SLPVectorizer] Try different vectorization factors and set max vector register size based on target .
spatel updated this object.
spatel added reviewers: hfinkel, nadav, rengolin, arsenm.
spatel added a subscriber: llvm-commits.
nadav edited edge metadata.Jul 6 2015, 10:17 AM

Sanjay, this patch looks okay. I think that the compile time hit should be minimal but I think that we need to measure just to be sure.

spatel added a comment.Jul 6 2015, 2:59 PM

Thanks, Nadav. I'm collecting some data using test-suite now. Should have something posted here tomorrow.

spatel added a comment.Jul 7 2015, 9:22 AM

I ran the benchmarking subset of test-suite on and targeting an AMD Jaguar system (has AVX, so 256-bit SLP was activated).

  1. The 256-bit vector store optimization fired 51 times on 16 different tests, so looking for the wider vector does appear to be a useful thing to do based on this sample of tests.
  2. There's no measurable perf difference on any of those tests; this is somewhat expected given that Jaguar has a double-pumped AVX implementation (128-bit data paths).
  3. The sum of average CC_Time (3 trials) for the tests was 121.21 seconds for the baseline (only check for 128-bit SLP) and 121.34 seconds for the new version (looks for 256-bit SLP before 128-bit), so about 0.1% longer, but that difference is in the noise for the compile times in this data set.

Hi Sanjay,

I think this patch is good to commit as-is, though I have one question (I'm ok with just adding TODO for now).

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
4020–4022

Shouldn't we update this threshold too? Otherwise, we won't be able to vectorize with VF=32 (and AVX2 might need <32 x i8> vectors).

However, increasing this value change *would* hurt compile time, so we need to be careful here.

spatel added a comment.Jul 7 2015, 6:01 PM

I think this patch is good to commit as-is, though I have one question (I'm ok with just adding TODO for now).

Thanks, Michael!

You're right; we need to increase that limit to vectorize more than 16 elements at a time. I'll make that a TODO and then add another cl::opt override, so we can experiment with that setting. This raises another problem: AVX has 256-bit registers, but it can't handle <32 x i8> ops, so creating those here would be useless. Using the data type rather than the register size could get us more optimizations while limiting the compile-time explosion.

Adding Tom:
I'd prefer not to have to XFAIL the AMDGPU test. Can you provide some guidance about what the expected behavior should be there? Thanks!

This revision was automatically updated to reflect the committed changes.