- Introduction
This patch adds a supporting pass just before the InterleavedLoadAccess pass to combine further interleaved patterns such that Interleaved loads can be more efficiently lowered by the InterleavedLoadAccess pass.
Currently patterns such as the following
%gep1 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 2 %gep2 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 3 %gep3 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 4 %gep4 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 5 %ld1 = load <4 x float>, <4 x float>* %gep1, align 16 %ld2 = load <4 x float>, <4 x float>* %gep2, align 16 %ld3 = load <4 x float>, <4 x float>* %gep3, align 16 %ld4 = load <4 x float>, <4 x float>* %gep4, align 16 %sv1 = shufflevector <4 x float> %ld1, <4 x float> %ld2, <4 x i32> <i32 0, i32 1, i32 4, i32 5> %sv2 = shufflevector <4 x float> %ld1, <4 x float> %ld2, <4 x i32> <i32 2, i32 3, i32 6, i32 7> %sv3 = shufflevector <4 x float> %ld3, <4 x float> %ld4, <4 x i32> <i32 0, i32 1, i32 4, i32 5> %sv4 = shufflevector <4 x float> %ld3, <4 x float> %ld4, <4 x i32> <i32 2, i32 3, i32 6, i32 7> %m0_3 = shufflevector <4 x float> %sv1, <4 x float> %sv3, <4 x i32> <i32 0, i32 2, i32 4, i32 6> %m4_7 = shufflevector <4 x float> %sv1, <4 x float> %sv3, <4 x i32> <i32 1, i32 3, i32 5, i32 7> %m8_11 = shufflevector <4 x float> %sv2, <4 x float> %sv4, <4 x i32> <i32 0, i32 2, i32 4, i32 6> %m12_15 = shufflevector <4 x float> %sv2, <4 x float> %sv4, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
are not transformed by any pass. Thus this interleaved load cannot be detected by the InterleavedLoadAccess pass. On AArch64 for example, the generated code is a series of loads and unzip instructions instead of the more efficient single ld4.
This pass transforms the example above into:
%gep1 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 2 %interleaved.wide.ptrcast = bitcast <4 x float>* %gep1 to <16 x float>* %interleaved.wide.load = load <16 x float>, <16 x float>* %interleaved.wide.ptrcast, align 16 %interleaved.shuffle = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12> %interleaved.shuffle1 = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 1, i32 5, i32 9, i32 13> %interleaved.shuffle2 = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14> %interleaved.shuffle3 = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>
This is seen by the InterleavedAccess and will be lowered to ld4 on Aarch64.
- Algorithm Outline
The pass works in two stages.
Stage 1: Analyse shufflevector instructions.
For each shufflevector instruction a VectorInfo structure is built the stores the required loads and information on which memory location is loaded into the vector elements.
With this information interleaved candidates are extracted and put into a candidates list.
Stage 2: Combine Loads
The candidates list is searched to find tuples that match an interleaved load. With this information a memory alias analysis is done to test if it is legal to combine the loads. If it is legal to combine the load and if benefits are expected (e.g.: intermediate values will be dead). a combined load the corresponding shufflevector instructions are emitted.
- Offset Computation and Representation
In order to also support to combine loads with indexed offsets (last element is not constant). Offsets are represented as first order polynomial class Polynomial;.
This class allows poofs for the relative locations of loaded data. Because all computations are done modular arithmetic the class also tracks possible errors and emits proved deltas if and only if the proof can be made.
A proof for the legality of operation on a polynomial is depicted in the code.
"combines them into wide loads detectable by InterleavedAccessPass"