Page MenuHomePhabricator

[CodeGen, AArch64] Combine Interleaved Loads which are not covered by the Vectorizer

Authored by marels on Sep 28 2018, 8:04 AM.


  • Introduction

This patch adds a supporting pass just before the InterleavedLoadAccess pass to combine further interleaved patterns such that Interleaved loads can be more efficiently lowered by the InterleavedLoadAccess pass.

Currently patterns such as the following

%gep1 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64  2
%gep2 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64  3
%gep3 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64  4
%gep4 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64  5
%ld1 = load <4 x float>, <4 x float>* %gep1, align 16
%ld2 = load <4 x float>, <4 x float>* %gep2, align 16
%ld3 = load <4 x float>, <4 x float>* %gep3, align 16
%ld4 = load <4 x float>, <4 x float>* %gep4, align 16
%sv1 = shufflevector <4 x float> %ld1, <4 x float> %ld2, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
%sv2 = shufflevector <4 x float> %ld1, <4 x float> %ld2, <4 x i32> <i32 2, i32 3, i32 6, i32 7>
%sv3 = shufflevector <4 x float> %ld3, <4 x float> %ld4, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
%sv4 = shufflevector <4 x float> %ld3, <4 x float> %ld4, <4 x i32> <i32 2, i32 3, i32 6, i32 7>
%m0_3   = shufflevector <4 x float> %sv1, <4 x float> %sv3, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%m4_7   = shufflevector <4 x float> %sv1, <4 x float> %sv3, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%m8_11  = shufflevector <4 x float> %sv2, <4 x float> %sv4, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%m12_15 = shufflevector <4 x float> %sv2, <4 x float> %sv4, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

are not transformed by any pass. Thus this interleaved load cannot be detected by the InterleavedLoadAccess pass. On AArch64 for example, the generated code is a series of loads and unzip instructions instead of the more efficient single ld4.

This pass transforms the example above into:

%gep1 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 2
%interleaved.wide.ptrcast = bitcast <4 x float>* %gep1 to <16 x float>*
%interleaved.wide.load = load <16 x float>, <16 x float>* %interleaved.wide.ptrcast, align 16
%interleaved.shuffle = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
%interleaved.shuffle1 = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 1, i32 5, i32 9, i32 13>
%interleaved.shuffle2 = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>
%interleaved.shuffle3 = shufflevector <16 x float> %interleaved.wide.load, <16 x float> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>

This is seen by the InterleavedAccess and will be lowered to ld4 on Aarch64.

  • Algorithm Outline

The pass works in two stages.

Stage 1: Analyse shufflevector instructions.

For each shufflevector instruction a VectorInfo structure is built the stores the required loads and information on which memory location is loaded into the vector elements.
With this information interleaved candidates are extracted and put into a candidates list.

Stage 2: Combine Loads

The candidates list is searched to find tuples that match an interleaved load. With this information a memory alias analysis is done to test if it is legal to combine the loads. If it is legal to combine the load and if benefits are expected (e.g.: intermediate values will be dead). a combined load the corresponding shufflevector instructions are emitted.

  • Offset Computation and Representation

In order to also support to combine loads with indexed offsets (last element is not constant). Offsets are represented as first order polynomial class Polynomial;.
This class allows poofs for the relative locations of loaded data. Because all computations are done modular arithmetic the class also tracks possible errors and emits proved deltas if and only if the proof can be made.

A proof for the legality of operation on a polynomial is depicted in the code.

Diff Detail


Event Timeline

marels created this revision.Sep 28 2018, 8:04 AM
john.brawn added inline comments.Oct 12 2018, 7:04 AM
112 ↗(On Diff #167463)

I don't think any of the dominator checks in this file are needed, so this isn't needed.

225–226 ↗(On Diff #167463)

I don't think this is needed. hasB() can check !B.empty(), and in isCompatibleTo() we can assume that the user of this class will be using polynomials of the same variable.

679 ↗(On Diff #167463)

LI is used only for the InsertionPoint in combine, and you can find that by picking the least element of VectorInfo::LIs e.g.

LoadInst *getFirstLoad() const { return *std::min_element(LIs.begin(), LIs.end()); }

so LI isn't needed here, so we don't need ElementInfo at all and can just have an array of Polynomial.

787 ↗(On Diff #167463)

Duplicates line 782.

1137 ↗(On Diff #167463)

All loads are required to be in the same block, so the dominator will just be the first load.

1229 ↗(On Diff #167463)

This isn't needed, as if this weren't the case the IR wouldn't be valid in the first place (I'm pretty sure).

1292–1294 ↗(On Diff #167463)

You can do this all at once with

if (auto SVI = dyn_cast<ShuffleVectorInst>(&I))
1295–1296 ↗(On Diff #167463)

Having a predetermined width multiple and using that as the basis for when to try this doesn't seem right. It would make more sense to use TTI->getInterleavedMemoryOpCost and check if we expect the cost to be lower than not doing the transform. I tried

VectorType *Ty = SVI->getType();
SmallVector<uint32_t, 4> Mask;
for (unsigned i = 0; i < Factor; i++)

VectorType *InterleaveTy = VectorType::get(Ty->getScalarType(), Ty->getNumElements() * Factor);
int InterleavedCost = TTI->getInterleavedMemoryOpCost(Instruction::Load, InterleaveTy,
                                                      Factor, Mask, 16, 0); // TODO: correct alignment, address space
int MemoryOpCost = TTI->getMemoryOpCost(Instruction::Load, Ty, 16, 0);
int ShuffleCost = TTI->getInstructionCost(SVI, TargetTransformInfo::TCK_Latency);
if (InterleavedCost > Factor * (MemoryOpCost + ShuffleCost)) {

and it seemed like it worked, though maybe here is the wrong place to do this check and maybe it should be in InterleavedLoadCombine::combine.

1303–1307 ↗(On Diff #167463)

SVI and PV should never be null if compute returned true, so these should be asserts that SVI and PV are non-null.

466 ↗(On Diff #167463)

Do we also need a similar call in initializeCodeGen in lib/CodeGen/CodeGen.cpp?

@john.brawn thanks for the input. I commented on some and will upload a new revision shortly

112 ↗(On Diff #167463)

I think some of the dominator checks can be removed, but at least one needs to remain (see comment in line 1229)

225–226 ↗(On Diff #167463)

I think it is necessary. The vector B stores the operations executed on V, but V is the value those operations are applied on. Example:

define @fn(i64 %a, i64 %b, u8 *%base)
%ptr1 = gep %base, %a 
%ptr2 = gep %base, %b

This leads to two incompatible polynomials:

1*%a + 0 and 
1*%b + 0


Pa = { .B = <{mul, 1}>, V = %a, A = 0}


Pb = { .B = <{mul, 1}>, V = %b, A = 0}

Currently {mul, 1} is stored in B thus B.empty() could be used. But this is not necessary but pushing {mul, 1} unnecessarily restricts the computation, and i think I will remove this.

679 ↗(On Diff #167463)

I am not sure if this field can be omitted:

  • The InsertionPoint is not always the first load. It is the load instruction loading the from the memory address where the first element of the interleave is located at. This insertion point is selected because at this position it is guaranteed that the Pointer Value from which that combined load has to load from is available and that this value can be used just by adding a pointer cast.
  • The LI field also guarantees that this load exists. This is not always the case consider for example the wired case:
%ptr1 = &a[0];
%ptr2 = &a[9];
%v1 =<12 x float> load %ptr1
%v2 =<12 x float> load %ptr2
%s1 = <4 x float> shufflevector %v1, %v2, <1, 5, 9, 16>
%s2 = <4 x float> shufflevector %v1, %v2, <2, 6, 10, 17>
%s3 = <4 x float> shufflevector %v1, %v2, <3, 7, 11, 18>
%s4 = <4 x float> shufflevector %v1, %v2, <4, 8, 15, 19>

This interleave would be detected in the first step, but the transformation would be aborted because the there is no load that loads from &a[1].

1137 ↗(On Diff #167463)

Thats true, I will replace it by some kind of

std::find_if(BB.begin(), BB.end(), [] {...});

Note that LIs is a set of pointer values, thus it is ordered by addresses to guarantee uniqueness. It is not ordered in respected to dominance so LIs.front() will not work here.

1229 ↗(On Diff #167463)

I think it is. Again consider a weird but possible case:

%ptr2 = &a[2]
%v2 = <14 x float> load %ptr2
%s3 = <4 x float> shufflevector %v2, undef, <0, 4, 8, 12>
%s4 = <4 x float> shufflevector %v2, undef, <1, 5, 9, 13>
%x = some-nonaliasing-use-eg-extract-element-instr of %s3 or %s4 
%ptr2 = &a[0];
%v1 = <14 x float> load %ptr2
%s1 = shufflevector %v1, undef, <0, 4, 8, 12>
%s2 = shufflevector %v1, undef, <1, 5, 9, 13>

Unless moving and proving that %ptr2 is movable before %v2 the insertion point must be at %v1. However because %v1 does not dominate %x the transformation would generate an invalid IR. This loop prevents this case. Also because the shuffle vector instructions are not necessarily in the same basic block I think that the dominator relation is the way to go.

Also not the extract element instructions are inserted just before the %s1 to %s3 respectively.

1292–1294 ↗(On Diff #167463)

yep (old C habit, sorry)

1295–1296 ↗(On Diff #167463)

Good point. I think the correct location for this is in InterlevedLoadCombine::combine which still can bail. Moreover this function has knowledge of all participating instructions that will be replaced and left dead. Thus we can more precisely determine the cost of all instructions. Also not that the Loads do not necessarily have to be symmetrical. Hence summing instead of multiplying will be better.

1303–1307 ↗(On Diff #167463)

When calling VectorInfo::compute() in general this is true for VI->PV but for VI->SVI only if the instruction is a ShuffleVectorInst which it is in this case. I will change this but I will also call computeFromSVI directly.

466 ↗(On Diff #167463)

Not sure too. But, I guess, adding it there will not harm.

@john.brawn thanks for the input. I commented on some and will upload a new revision shortly

225–226 ↗(On Diff #167463)

To clarify semantics:

  • hasB() -> means Is the first-order term zero
  • B.empty() -> means The first-order term is multiplied by one.

I think renaming the function to isFirstOrder makes this semantics more clearly.

marels updated this revision to Diff 169986.Oct 17 2018, 4:17 AM
marels marked 5 inline comments as done.


  • Change Code to reflect comments from John Brawn.
  • Add InterleavedLoadCombineImpl class to explicitly avoid state propagation between different invocations
  • Remove shared_ptr<> constructs in favor to lists and in-place object handling
  • Move isInterleavedLoad into VectorInfo (also changed name to isInterleaved)
marels updated this revision to Diff 169988.Oct 17 2018, 4:22 AM

Add context of all changed files. Sorry I missed it the previous update

john.brawn added inline comments.Oct 23 2018, 7:00 AM
1277–1278 ↗(On Diff #169988)

The comment here looks like it should be deleted.

1284–1285 ↗(On Diff #169988)

Now that the instruction cost is being computed this isn't needed, as that will already handle mis-sized vector types.

1298–1299 ↗(On Diff #169988)

The same for this comment.

679 ↗(On Diff #167463)

Could you add tests for these two cases? Currently if you do as I suggested then you get no failures (so it looks like it works). Also for the second it looks like it could be handled by inserting a getelementptr to generate the address &a[1] and load from that, though it doesn't look easy to do (seems like it would involve generating a getelementptr from a Polynomial in some manner) so I wouldn't worry about implementing it.

1229 ↗(On Diff #167463)

Could you add a test case for this?

marels updated this revision to Diff 171536.Oct 29 2018, 10:35 AM
  • removed VectorWidth related thing
  • removed some comments
  • added testcases to check correct behavior in weird but possible IR inputs

I also added missing colons in some of the test, and in the main loop I added a missing pop_back

@john.brawn: thanks for commenting

1277–1278 ↗(On Diff #169988)


1284–1285 ↗(On Diff #169988)

removed, including related code

1298–1299 ↗(On Diff #169988)


679 ↗(On Diff #167463)

I added a case for where this load does not exist.

Yes, one could add a corresponding GEP, but it might not only be complicated to reconstruct this from a polynomial, i cannot imagine that many real-life examples would profit.

1229 ↗(On Diff #167463)

sure & done

john.brawn accepted this revision.Nov 5 2018, 5:03 AM

A few errors in comments, but otherwise LGTM so long as you fix those errors before committing.

383 ↗(On Diff #171536)

"combines them into wide loads detectable by InterleavedAccessPass"

73 ↗(On Diff #171536)

"Scan the function for interleaved load candidates" (not "and for")

114–117 ↗(On Diff #171536)

This paragraph doesn't really make sense, particularly the second and fourth sentences. Could you reword it?

141 ↗(On Diff #171536)

"prove" not "proof"

143 ↗(On Diff #171536)

"the" not "the the"

547 ↗(On Diff #171536)

"Return an undefined polynomial if incompatible"

573 ↗(On Diff #171536)

Capitalise "Subtract"

This revision is now accepted and ready to land.Nov 5 2018, 5:03 AM
marels updated this revision to Diff 173658.Nov 12 2018, 5:44 AM
marels set the repository for this revision to rL LLVM.
  • fix typos
  • update commit message

Unless I do not receive any further comments, I think it is OK to commit this patch
Thank you for supporting

This revision was automatically updated to reflect the committed changes.
xbolva00 added inline comments.

self-comparison always evaluates to true [-Wtautological-compare]

marels added inline comments.Nov 19 2018, 10:42 AM

Thanks, I pushed a fix minutes ago

lebedev.ri added inline comments.

This does not build. (gcc8, release with asserts).

FAILED: lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/InterleavedLoadCombinePass.cpp.o 
/usr/bin/g++  -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Ilib/CodeGen -I/build/llvm/lib/CodeGen -I/usr/include/libxml2 -Iinclude -I/build/llvm/include -pipe -O2 -g0 -UNDEBUG -fPIC -fvisibility-inlines-hidden -Werror=date-time -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-class-memaccess -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wno-comment -fdiagnostics-color -ffunction-sections -fdata-sections -pipe -O2 -g0 -UNDEBUG -fPIC   -UNDEBUG  -fno-exceptions -fno-rtti -MD -MT lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/InterleavedLoadCombinePass.cpp.o -MF lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/InterleavedLoadCombinePass.cpp.o.d -o lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/InterleavedLoadCombinePass.cpp.o -c /build/llvm/lib/CodeGen/InterleavedLoadCombinePass.cpp
/build/llvm/lib/CodeGen/InterleavedLoadCombinePass.cpp: In member function ‘void {anonymous}::VectorInfo::print(llvm::raw_ostream&) const’:
/build/llvm/lib/CodeGen/InterleavedLoadCombinePass.cpp:1037:37: error: no match for ‘operator<<’ (operand types are ‘llvm::raw_ostream’ and ‘{anonymous}::Polynomial’)
       OS << ((i == 0) ? "[" : ", ") << EI[i].Ofs;

I will change this to

for (unsigned i = 0; i < getDimension(); i++) {
  OS << ((i == 0) ? "[" : ", ");

to fix the build.

Also, it is weird that such a class is hidden in such place.

marels added inline comments.Nov 19 2018, 11:02 AM

Just remove all the debug print functions.

Thank you for fixing & sorry for breaking the build.