
Supports lowerInterleavedStore() in X86InterleavedAccess.
ClosedPublic

Authored by Farhana on Apr 28 2017, 1:17 PM.

Details

Summary

This is a follow-up to D24681, which added support for lowerInterleavedLoad() on X86.

This changeset supports lowerInterleavedStore(). It mainly provides the necessary infrastructure/utilities to put lowerInterleavedStore() in place. It does not try to support more patterns beyond what X86InterleavedAccess already supports (currently, interleaved accesses of 4 x 64-bit elements, handled by transpose_4x4()).
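To make the supported pattern concrete, here is a small Python sketch (not LLVM code; vectors are modeled as plain lists) of what a factor-4 interleaved store does with four 4-element vectors, which is the shape the 4x4 transpose handles:

```python
def interleave_store(vectors):
    """Interleave N equal-length vectors into one flat sequence:
    result[i*N + j] = vectors[j][i]."""
    n = len(vectors[0])
    return [vectors[j][i] for i in range(n) for j in range(len(vectors))]

v0 = [0, 1, 2, 3]
v1 = [10, 11, 12, 13]
v2 = [20, 21, 22, 23]
v3 = [30, 31, 32, 33]
print(interleave_store([v0, v1, v2, v3]))
# [0, 10, 20, 30, 1, 11, 21, 31, 2, 12, 22, 32, 3, 13, 23, 33]
```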

Diff Detail

Repository
rL LLVM

Event Timeline

Farhana created this revision. Apr 28 2017, 1:17 PM
RKSimon edited edge metadata. Apr 29 2017, 2:12 AM

Performance numbers?

Also, please add context to the diff.

test/CodeGen/X86/x86-interleaved-access.ll
143 ↗(On Diff #97146)

Please add these new tests to trunk now with the current codegen so your patch shows the diff.

test/Transforms/InterleavedAccess/X86/interleaved-accesses-64bits-avx.ll
105 ↗(On Diff #97146)

Please add these new tests to trunk now with the current codegen so your patch shows the diff.

And don't forget to subscribe to llvm-commits to get visibility on the mailing list!

RKSimon added inline comments. Apr 30 2017, 4:44 AM
lib/Target/X86/X86InterleavedAccess.cpp
212 ↗(On Diff #97146)

for (unsigned i = 0, e = Shuffles.size(); i < e; i++)

Farhana updated this revision to Diff 97365. May 1 2017, 4:47 PM

> Performance numbers?
>
> Also, please add context to the diff.

Hi Simon,

Thanks for your prompt review.

Regarding performance, I don't have real-world examples, so I tested it on a kernel where a loop contains only four scatters. On Broadwell, this support improves the performance of that loop by ~70%.

Farhana

> And don't forget to subscribe to llvm-commits to get visibility on the mailing list!

Thanks! I subscribed.

Also, I added context to the svn-diff.

> Please can you add the new tests to trunk now with the current codegen so your patch shows the diff.

DavidKreitzer added inline comments. May 10 2017, 10:25 AM
lib/Target/X86/X86InterleavedAccess.cpp
115 ↗(On Diff #97365)

This routine is making some assumptions about the nature of the re-interleave ShuffleVectorInst. The optimized code sequence that you are emitting is only correct for a shuffle mask that is exactly {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15}. That is the common case, but InterleavedAccessPass is more general than that.

Consider the following example:

define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
  %s0 = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %s1 = shufflevector <4 x double> %v2, <4 x double> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %interleaved.vec = shufflevector <8 x double> %s0, <8 x double> %s1, <16 x i32> <i32 12, i32 8, i32 4, i32 0, i32 13, i32 9, i32 5, i32 1, i32 14, i32 10, i32 6, i32 2, i32 15, i32 11, i32 7, i32 3>
  store <16 x double> %interleaved.vec, <16 x double>* %ptr, align 16
  ret void
}

This is identical to your store_factorf64_4 test case except that I've changed the mask for %interleaved.vec. With the current implementation, this example gets optimized and is compiled to the exact same code as your original test case, which is obviously incorrect behavior.
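Dave's point can be checked with a small Python sketch (not LLVM code; shufflevector is modeled on lists): the canonical re-interleave mask and the modified mask from the example above select different element orders, so emitting identical code for both is wrong.

```python
def shufflevector(a, b, mask):
    """Model of LLVM's shufflevector: pick elements from a ++ b by index."""
    src = a + b
    return [src[m] for m in mask]

v0, v1 = [0, 1, 2, 3], [4, 5, 6, 7]
v2, v3 = [8, 9, 10, 11], [12, 13, 14, 15]
s0 = shufflevector(v0, v1, list(range(8)))   # concat v0, v1
s1 = shufflevector(v2, v3, list(range(8)))   # concat v2, v3

canonical = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
modified  = [12, 8, 4, 0, 13, 9, 5, 1, 14, 10, 6, 2, 15, 11, 7, 3]

# The two masks produce different in-memory layouts, so the lowering
# must not treat them as interchangeable.
print(shufflevector(s0, s1, canonical))
print(shufflevector(s0, s1, modified))
```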

137 ↗(On Diff #97365)

++i is preferred

141 ↗(On Diff #97365)

This call to createSequentialMask looks incorrect. I think you want something like this:

unsigned NumSubVectorElements = getVectorNumElements(VecTy) / NumSubVectors;
createSequentialMask(Builder, i * NumSubVectorElements, NumSubVectorElements, 0);

What you have written only works because Factor, NumSubVectors, and NumSubVectorElements are all the same (i.e. 4) in the cases you currently support.
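A quick Python sketch (modeled on LLVM's createSequentialMask(Builder, Start, NumInts, NumUndefs) helper; not the actual C++ code) shows the mask the suggested call would build, and why deriving the start offset from the subvector element count generalizes where the Factor-based computation does not:

```python
def create_sequential_mask(start, num_ints, num_undefs):
    """Model of llvm::createSequentialMask: [start, start+1, ...,
    start+num_ints-1] followed by num_undefs undef slots (None here)."""
    return list(range(start, start + num_ints)) + [None] * num_undefs

# Hypothetical wider case: <128 x i8> split into 4 subvectors of 32 elements.
wide_elems, num_sub_vectors = 128, 4
num_sub_vector_elements = wide_elems // num_sub_vectors  # 32, not Factor

masks = [create_sequential_mask(i * num_sub_vector_elements,
                                num_sub_vector_elements, 0)
         for i in range(num_sub_vectors)]
print(masks[1][:4])  # second subvector's mask starts at element 32
```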

198 ↗(On Diff #97365)

I think it is incorrect to be using Factor here. SubVecTy is not <Factor x ShuffleEltTy> but rather <(Wide Vector # Elems / Factor) x ShuffleEltTy>. This only works because both computations evaluate to 4 in the cases you currently support.

To illustrate what I mean, suppose instead we were optimizing 4 interleaved i8 vectors with a wide vector type of <128 x i8>. In this case, the SubVecTy should be <32 x i8>, but your current code will compute it as <4 x i8>.
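The relationship Dave describes can be stated as a one-line formula; a minimal sketch (illustrative only) with both the currently supported case and the hypothetical <128 x i8> case:

```python
def sub_vec_num_elems(wide_num_elems, factor):
    # SubVecTy element count = wide vector element count / interleave factor,
    # which only coincides with Factor when both happen to equal 4.
    return wide_num_elems // factor

print(sub_vec_num_elems(16, 4))   # 4  (the currently supported 4 x 64-bit case)
print(sub_vec_num_elems(128, 4))  # 32 (the hypothetical <128 x i8> case)
```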

212 ↗(On Diff #97365)

++i is preferred

218 ↗(On Diff #97365)

Please add '.' at the end of each sentence.

270 ↗(On Diff #97365)

This array of indices is meaningless for the store optimization, right? Why not simply pass an empty ArrayRef to the X86InterleavedAccessGroup constructor?

Farhana updated this revision to Diff 100179. May 24 2017, 3:48 PM

Updated the changes with the feedback from Dave.

lib/Target/X86/X86InterleavedAccess.cpp
115 ↗(On Diff #97365)

Thanks Dave for pointing this out.

198 ↗(On Diff #97365)

These functions were making some assumptions. I agree that they should not have done so.

DavidKreitzer added inline comments. May 26 2017, 8:49 AM
lib/Target/X86/X86InterleavedAccess.cpp
115 ↗(On Diff #97365)

I like the way you fixed this, thanks!

270 ↗(On Diff #97365)

You didn't address this comment. Maybe the best change here would be to use "Indices.push_back(Mask[i]);" That would be consistent with the change you made at line 141:

DecomposedVectors.push_back(
    cast<ShuffleVectorInst>(Builder.CreateShuffleVector(
        Op0, Op1, createSequentialMask(Builder, Mask[i],
                                       SubVecTy->getVectorNumElements(), 0))));
202 ↗(On Diff #100179)

good catch!

218 ↗(On Diff #100179)

Use ShuffleTy instead of Shuffles[0]->getType()

test/CodeGen/X86/x86-interleaved-access.ll
143 ↗(On Diff #97146)

This is a good suggestion, Farhana. Can you please do this in a separate initial patch?

Farhana updated this revision to Diff 102377. Jun 13 2017, 12:00 PM

Updated diff with the latest changes in the trunk.

This update shows the improvements from this changeset in the following tests.

test/Transforms/InterleavedAccess/X86/interleaved-accesses-64bits-avx.ll
test/CodeGen/X86/x86-interleaved-access.ll
DavidKreitzer edited edge metadata. Jun 20 2017, 1:52 PM

Thanks for the fixes, Farhana. Other than a couple typos, this looks good.

lib/Target/X86/X86InterleavedAccess.cpp
272 ↗(On Diff #102377)

"index each" --> "index of each"
"interleave" --> "interleaved"

test/CodeGen/X86/x86-interleaved-access.ll
143 ↗(On Diff #97146)

Thanks! These are nice improvements in the codegen.

Farhana updated this revision to Diff 103387. Jun 21 2017, 8:22 AM

This update fixes the typos noted above.

DavidKreitzer accepted this revision. Jun 21 2017, 3:19 PM

Thanks, Farhana. LGTM.

This revision is now accepted and ready to land. Jun 21 2017, 3:19 PM

Hi Simon,

Do you have further comments/concerns?

Farhana

test/CodeGen/X86/x86-interleaved-access.ll
143 ↗(On Diff #97146)

A very good suggestion; I just did not understand it at first.

RKSimon accepted this revision. Jun 22 2017, 9:07 AM

> Hi Simon,
>
> Do you have further comments/concerns?
>
> Farhana

No more comments from me. LGTM.

This revision was automatically updated to reflect the committed changes.
dorit added a subscriber: dorit. Jun 25 2017, 12:44 AM