
Strided Memory Access Vectorization
Needs Review · Public

Authored by ashutosh.nema on Jun 14 2016, 11:13 PM.

Details

Summary

This patch demonstrates strided memory access vectorization.

Currently LLVM does not fully support strided access vectorization; some support is available via interleave vectorization.

There are two main overheads with strided vectorization:
• An overhead of consolidating data into an operable vector.
• An overhead of distributing the data elements after the operations.

Because of these overheads LLVM finds strided memory vectorization unprofitable and generates a serial code sequence instead of vectorized code.
GATHER & SCATTER instruction support for the consolidation & distribution operations is available only on a few (not all) targets.

In this approach we are trying to handle cases like this:
for (int i = 0; i < len; i++)
  a[i*3] = b[i*2] + c[i*3];

We model strided memory load & store using shuffle & load/mask-store operations.
• A load is modeled as loads followed by a shuffle.
• A store is modeled as a shuffle followed by a masked store.
• To minimize the number of load and store operations, we introduce a ‘SkipFactor’.

‘SkipFactor’:
• Multiple load operations are required to consolidate data into an operable vector.
• Between consecutive loads we skip by some offset to consolidate effectively.
• SkipFactor is the number of additional elements to advance past the previous vector load’s end address to reach the next vector load.
• SkipFactor allows all memory loads to be treated identically (from a valid-element perspective).
• SkipFactor = (Stride - (VF % Stride)) % Stride
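
For example, with VF = 4 and Stride = 3 (the case shown in the IR below), SkipFactor = (3 - (4 % 3)) % 3 = 2, so the second vector load starts 4 + 2 = 6 elements after the first; with VF = 4 and Stride = 2, SkipFactor = (2 - (4 % 2)) % 2 = 0 and the loads are back-to-back.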

How LOAD is modeled:
Load with stride 3 (i.e. the load of b[3*i]):

%5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
%6 = bitcast i32* %5 to <4 x i32>*
%stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
%7 = getelementptr i32, i32* %5, i64 6 
%8 = bitcast i32* %7 to <4 x i32>*
%stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
%strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32> %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
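
Here the first <4 x i32> load covers elements 0..3, of which lanes 0 and 3 hold stride-3 elements; the second load starts 6 elements later (4 + SkipFactor 2) and covers elements 6..9, whose stride-3 elements again sit in lanes 0 and 3, i.e. indices 4 and 7 of the concatenated shuffle operands. The shuffle mask <0, 3, 4, 7> therefore gathers the four strided elements into a single vector.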

How STORE is modeled:
Store with stride 3 (i.e. the store to c[3*i]):
%10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
%11 = bitcast i32* %10 to <4 x i32>*
%interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
%12 = getelementptr i32, i32* %10, i64 6
%13 = bitcast i32* %12 to <4 x i32>*
%interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
NOTE: For stride 3 the SkipFactor is 2, which is why the second GEP is offset by 6 (4 + 2).
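
In other words, lanes 0..3 of %StoreResult hold the four values destined for the strided positions 0, 3, 6 and 9. The first shuffle places lanes 0 and 1 at positions 0 and 3 of a <4 x i32> vector and the mask <true, false, false, true> writes only those two positions; the second masked store does the same for lanes 2 and 3 at the GEP offset of 6, covering positions 6 and 9.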

To enable this feature, the following LLVM changes are required:

  1. Identify strided memory accesses (this already exists, i.e. the interleave vectorizer does it).
  2. Costing changes (see the sketch after this list):

a. Identify the number of Load[s]/Store[s] & Shuffle[s] required to model the Load/Store operation, considering the SkipFactor.
b. Return the total cost by adding the Load[s]/Store[s] and Shuffle[s] costs.

  3. Transform:

a. Generate Shuffle[s] followed by Mask-Store[s] instructions to model a Store operation.
b. Generate Load[s] followed by Shuffle[s] instructions to model a Load operation.
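
As a rough illustration of the counting involved in the costing step, here is a minimal C++-style sketch; the helper names are hypothetical and only mirror the formulas described above, they are not the patch's actual functions:

unsigned validElemsPerMemOp(unsigned VF, unsigned Stride) {
  // Strided elements covered by one VF-wide consecutive load/store.
  return (VF + Stride - 1) / Stride;                 // ceil(VF / Stride)
}

unsigned memOpsRequired(unsigned VF, unsigned Stride) {
  // VF-wide loads/stores needed to cover all VF strided elements.
  unsigned Valid = validElemsPerMemOp(VF, Stride);
  return (VF + Valid - 1) / Valid;                   // ceil(VF / Valid)
}

unsigned skipFactor(unsigned VF, unsigned Stride) {
  // Elements skipped between the end of one vector load and the start of the next.
  return (Stride - (VF % Stride)) % Stride;
}

// Total cost ~= memOpsRequired(VF, Stride) * cost(single VF-wide load/store)
//             + number of shuffles * cost(single shuffle).
// E.g. VF = 4, Stride = 3: 2 memory ops and 1 combining shuffle, as in the IR above.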

Use the options below to enable this feature:
“-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization”

Gains observed with the prototype:
TSVC kernel S111 1.15x
TSVC kernel S1111 1.42x

Thanks,
Ashutosh

Diff Detail

Repository
rL LLVM

Event Timeline

ashutosh.nema retitled this revision from to Strided Memory Access Vectorization.
ashutosh.nema updated this object.
ashutosh.nema set the repository for this revision to rL LLVM.
ashutosh.nema added a subscriber: llvm-commits.

Adding Shahid as co-author.

rengolin added a subscriber: rengolin.
delena added inline comments. Jun 19 2016, 2:49 AM
lib/Target/X86/X86ISelLowering.h
686

Why 4?

lib/Target/X86/X86TargetTransformInfo.cpp
50

But it depends on element size..

lib/Transforms/Vectorize/LoopVectorize.cpp
1004

getPtrStride() does something similar

Thanks Elena for looking into this RFC.

lib/Target/X86/X86ISelLowering.h
686

This may not be required; ideally we should not put any limit. Things should be driven by costing.
Will check and remove this.

lib/Target/X86/X86TargetTransformInfo.cpp
50

I did not understand this comment completely.

Are you pointing at the case where the vectorizer can go above the target-supported width?
I.e.
double foo(double *A, double *B, int n) {
  double sum = 0;
#pragma clang loop vectorize_width(16)
  for (int i = 0; i < n; ++i)
    sum += A[i] + 5;
  return sum;
}

lib/Transforms/Vectorize/LoopVectorize.cpp
1004

Will check it.

Ayal added a subscriber: Ayal. Jun 20 2016, 10:00 AM

Part 1. To be continued.

The performance improvements are relative to scalar, or relative to vectorizing with scalar loads/stores packed into vectors? For your 1-line example above the latter may not be better, but for more computationally intensive loops it might.

include/llvm/Analysis/TargetTransformInfo.h
531–542

Why not simply call getInterleavedMemoryOpCost() with a single Index of 0 instead of specializing it to getStridedMemoryOpCost()?

Yes, we need to implement it for x86.

lib/Target/X86/X86TargetTransformInfo.cpp
40

This "ceiling(VF/Stride)" is not x86 specific.

getNumStridedElementsInConsecutiveVF()? Better to simply inline this short formula, or call it "ceiling".

50

requi[r]ed

This is counting how many vectors of VF-consecutive-elements-each are needed to cover a set of VF strided elements. To match the number of target loads or stores, the element size would need to be such that VF elements fit in a simd register.

55

This is a similar "ceiling(VF/VMem)" computation as before, better use the same formula for clarity. Perhaps the clearest is "(VF+Stride-1) / Stride".

(Or return getValidMemInSingleVFAccess(VF, Vmem) ... ;-)

61

Isn't this the same as getRequiredLoadStore() - 2?

"number of shuffle[s]"

What do you mean by "a[n] upper type"?

1158

Check if ! and return first.

1169

This may be different than getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace);
because we align each load on its first strided element; e.g., {0,3,6,9} can be loaded using 2 loads of {0,3}{6,9} rather than 3 consecutive loads of {0,11} = {0,3}{4,7}{8,11}.

1175

Explain this also above getShuffleRequiredForPromotion().

You can shuffle using a reduction binary tree, which may require fewer shuffles. In particular, if getRequiredLoadStore() is a power of 2, no additional promoting shuffles are needed.

number of shuffle[s]

can[']t

Give a full proper example, not one using *

1178

For store, don't we need to add the cost of expanding a vector of VF elements to be masked-stored stridedly?

1182

Why not simply multiply ShuffleCost by number of shuffles instead?

Isn't getShuffle[s]RequiredForPromotion() returning the TotalS[h]uffle[s], rather than the TotalSufflePerMemOp?

1185

How many shuffles total?

Reached this far, to be continued.

Thanks Ayal for looking into this RFC.

The performance improvements are relative to scalar,
or relative to vectorizing with scalar loads/stores packed
into vectors?

The performance improvements are relative to scalar.

For your 1-line example above the latter may not be better,
but for more computationally intensive loops it might.

Agree, with more compute in loops the performance will be much better.
I just kept it to demonstrate this feature.

include/llvm/Analysis/TargetTransformInfo.h
531–542

The approach to computing the cost is different, i.e. “getInterleavedMemoryOpCost” considers wider load[s] & store[s] and gets the cost, whereas we minimize the number of load[s] & store[s] and compute the cost from that.
It may not be a good idea to merge them, as the costing approach itself is different.
I’m open to it; if people still feel it’s better to merge them, I will do it.

lib/Target/X86/X86TargetTransformInfo.cpp
50

This identifies how many VF-wide vectors are required to cover a set of VF strided elements.
Agreed, there is a possibility that a VF-wide vector may not fit in the target SIMD register.
Currently this is left to the costing, which is not good, since in some cases the costing falls through to the basic TTI and returns a cost that may not be correct for the hardware.

This should be handled correctly by considering target support.

61

Sorry, the comment for this function is not very clear.
The intent of the function is to identify the number of shuffles required to promote a smaller vector type to a bigger vector type.
When the types differ we cannot shuffle the elements directly, because shufflevector expects both operands to have the same type;
i.e. a <4 x i32> & <2 x i32> shuffle is not possible, so the <2 x i32> must first be promoted to <4 x i32> by shuffling it with an undef vector.

1158

Sure.

1169

getMemoryOpCost is used to estimate the cost of a single load or store operation.
We then multiply it by the number of load[s] or store[s] required to get the total load/store cost.
I.e. for VF 4 & Stride 3, 2 loads/stores are required (MemOpRequired), so we take the cost of a single load/store and multiply it by 2.

1175

Sure, will provide more details here & at getShuffleRequiredForPromotion.
Good idea, a reduction binary tree can be used here; will try it.

1178

For a store we generate a shuffle followed by masked stores, so it’s not required.
It’s only required for a load: while loading, the mismatch in vector types means we can’t shuffle directly,
so the smaller vector type must be promoted using a shuffle; that is not the case for a store.

1182

Yes, this looks bad; will multiply directly.

1185

Example:
Load for stride 3 & VF 8:

1 extra shuffle is required for promotion:

Load#1: 0 * * 3 * * 6 *
Load#2: 9 * * 12 * * 15 *
Shuffle#1: ShuffleVector <8 x i32>Load#1, <8 x i32>Load#2, <6 x i32> <i32 0, i32 3, i32 6, i32 8, i32 11, i32 14>
Load#3: 18 * * 21 * * * *
Shuffle#2: ShuffleVector <6 x i32>Shuffle#1, <6 x i32>undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 undef, i32 undef>
Shuffle#3: ShuffleVector <8 x i32>Shuffle#2, <8 x i32>Load#3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 11>

Note: “Shuffle#2” is the extra shuffle generated for vector type promotion.
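
For reference, the formulas above give SkipFactor = (3 - (8 % 3)) % 3 = 1 for this case, so each load starts 8 + 1 = 9 elements after the previous one (at 0, 9 and 18), and ceil(8 / ceil(8 / 3)) = 3 loads are needed, matching the three loads shown.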