Page MenuHomePhabricator

[SLP] Truncate expressions to minimum required bit width
ClosedPublic

Authored by mssimpso on Dec 29 2015, 1:36 PM.

Details

Summary

This change attempts to produce vectorized integer expressions in bit widths
that are narrower than their scalar counterparts. By reducing the bit width
where possible, we can pack more isomorphic expressions into a single vector
and increase parallelism. The need for demotion arises especially on
architectures in which the small integer types (e.g., i8 and i16) are not legal
for scalar operations but can still be used in vectors.

Like similar work done within the loop vectorizer, we rely on InstCombine to
perform the actual type-shrinking. Here, we only insert the truncations that
are needed to seed InstCombine's type demotion. This introduces the limitation
that we can only rewrite single-use chains (every instruction in the expression
can have at most one use). We further limit ourselves to chains that are rooted
by instructions other than stores, since we cannot change the width of vector
memory operations. With these restrictions, only expression roots can be used
externally, and we sign extend them back to their original type after we
extract them from the vectors.

We use ComputeNumSignBits from ValueTracking to determine the minimum required
bit width of an expression. We update cost estimates to account for the
narrower types and sign extensions we add to the vectorized code.

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 43758.Dec 29 2015, 1:36 PM
mssimpso retitled this revision from to [SLP] Truncate expressions to minimum required bit width.
mssimpso updated this object.
mssimpso added reviewers: nadav, jmolloy, hfinkel, anemet.
mssimpso added subscribers: mcrosier, llvm-commits.
jmolloy requested changes to this revision.Jan 6 2016, 6:23 AM
jmolloy edited edge metadata.

Hi Matt,

A couple of comments, but generally looks fine.

James

lib/Transforms/Vectorize/SLPVectorizer.cpp
3256 ↗(On Diff #43758)

I'm not quite sure I understand why this is the case.

3299 ↗(On Diff #43758)

It feels like you should be able to benefit from DemandedBits here. That's what we do in the LoopVectorizer and it can trigger in many more situations than just ValueTracking.

This revision now requires changes to proceed.Jan 6 2016, 6:23 AM
Ayal added a subscriber: Ayal.Jan 6 2016, 11:39 AM
mssimpso updated this revision to Diff 44162.Jan 6 2016, 2:49 PM
mssimpso edited edge metadata.

Addressed James's comments

Thanks very much James! I've updated the patch and added some responses inline.

mssimpso added inline comments.Jan 6 2016, 2:51 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3260 ↗(On Diff #44162)

This work is a little more limited than what you've done in the loop vectorizer. Because we're working with expression trees, I thought it would be simpler to only truncate the roots of single-use chains. If the tree has no external users, it should be rooted by stores, which obviously can't be truncated. So we give up in that case. I've tried to explain my reasoning a little better in the comments, but I haven't yet thought much about the more general case.

In particular, the current patch doesn't do anything useful for the single-use chains ending in stores that have unnecessary extends and truncations in them that prevent us from getting past the cost model. Even though the extends and truncations in those chains may be removed by InstCombine after vectorization, they will currently still be counted by the model. I was thinking of adding this functionality as a follow-on, but I can add it to this patch if you like. What do you think?

3303 ↗(On Diff #44162)

My first inclination was to try and reuse the DemandedBits analysis you added to VectorUtils. However, demanded bits alone didn't help the case I was most interested in: the index computation of GEPs, which is promoted to 64 bits, all of them being demanded. Since all the bits are demanded by the GEP, we have to look at the number of leading sign bits instead. But I've updated the patch to also consider the demanded bits of the roots since they are all we truncate.

jmolloy added inline comments.Jan 7 2016, 12:48 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3260 ↗(On Diff #44162)

In particular, the current patch doesn't do anything useful for the single-use chains ending in stores that have unnecessary extends and truncations

OK, that's what I was missing.

I was thinking of adding this functionality as a follow-on, but I can add it to this patch if you like. What do you think?

I'm happy with it as a followon. I just didn't understand. Thanks!

3303 ↗(On Diff #44162)

Ah I see. Yes, I can see that both approaches together would probably get you the best result.

test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
1 ↗(On Diff #44162)

I'm not sure about this: Why are tests being removed? Unless we're fundamentally regressing performance, I think new tests should be added instead of existing tests modified to check new behaviour.

mssimpso added inline comments.Jan 7 2016, 6:22 AM
test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
1 ↗(On Diff #44162)

Sure, I'm happy to add the test back.

mssimpso updated this revision to Diff 44212.Jan 7 2016, 7:45 AM
mssimpso edited edge metadata.

Added back the test case with i32 subtraction.

The test now has a case with i32 subtraction that we vectorize and a case with i64 subtraction that we type-shrink to i32. The cost model currently finds the i64 case to be unprofitable, though I will fix that with a follow-on patch.

mssimpso marked 2 inline comments as done.Jan 8 2016, 12:15 PM
jmolloy accepted this revision.Jan 14 2016, 7:49 AM
jmolloy edited edge metadata.

LGTM

This revision is now accepted and ready to land.Jan 14 2016, 7:49 AM
This revision was automatically updated to reflect the committed changes.
nadav edited edge metadata.Jan 21 2016, 9:22 AM

Did you see any performance improvements due to this commit? Did you measure compile times?

Hi Nadav. Yes, I measured compile-time and performance of this patch together with D14829 (I included the results in comments for that revision). There were no compile-time regressions and modest performance improvements in two spec benchmarks (povray and h264ref). But I've temporarily reverted this patch due to a failing bot while I correct the issue.

nadav added a subscriber: nadav.Jan 21 2016, 10:08 AM

Sounds good. Thanks Matthew.