This is an archive of the discontinued LLVM Phabricator instance.

Rewrite LSV to handle longer chains.
ClosedPublic

Authored by jlebar on May 4 2023, 12:40 PM.

Details

Summary

Rewrite load-store-vectorizer.

The motivation for this change is a workload generated by the XLA compiler
targeting NVIDIA GPUs.

This kernel has a few hundred i8 loads and stores. Merging is critical for
performance.

The current LSV doesn't merge these well because it only considers instructions
within a window of 64 loads+stores. This limit is necessary to contain the
O(n^2) behavior of the pass. I'm hesitant to increase the limit, because this
pass is already one of the slowest parts of compiling an XLA program.

So we rewrite basically the whole thing to use a new algorithm. Before, we
compared every load/store to every other to see if they're consecutive. The
insight (from tra@) is that this is redundant. If we know the offset from PtrA
to PtrB, then we don't need to compare PtrC to both of them in order to tell
whether C may be adjacent to A or B.

So that's what we do. When scanning a basic block, we maintain a list of
chains, where we know the offset from every element in the chain to the first
element in the chain. Each instruction gets compared only to the leaders of
all the chains.

In the worst case, this is still O(n^2), because all chains might be of length 1.
To prevent compile time blowup, we only consider the 64 most recently used
chains. Thus we do no more comparisons than before, but we have the potential
to make much longer chains.
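
Roughly, the bookkeeping looks like this (an illustrative sketch; ChainElem, addToChains, and getConstantOffset are made-up names, not the pass's actual code):

#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instruction.h"
#include <list>
#include <optional>
using namespace llvm;

// Each chain remembers every member's byte offset from the chain's first
// element (its "leader").
struct ChainElem {
  Instruction *Inst;
  APInt OffsetFromLeader;
};
using Chain = SmallVector<ChainElem, 8>;

// Stand-in for whatever computes a known constant byte distance between two
// accesses' pointers (GEP walking, known-bits, etc.).
std::optional<APInt> getConstantOffset(Instruction *Leader, Instruction *I);

// Each load/store is compared only against the leaders of the most recently
// used chains, never against every other access in the block.
void addToChains(Instruction *I, std::list<Chain> &MRUChains) {
  constexpr unsigned MaxChainsToTry = 64;
  unsigned Tried = 0;
  for (auto It = MRUChains.begin();
       It != MRUChains.end() && Tried < MaxChainsToTry; ++It, ++Tried) {
    if (std::optional<APInt> Offset = getConstantOffset(It->front().Inst, I)) {
      It->push_back({I, *Offset});                        // join this chain
      MRUChains.splice(MRUChains.begin(), MRUChains, It); // now most recent
      return;
    }
  }
  // No leader is at a known constant offset from I: start a new chain.
  MRUChains.push_front(Chain{{I, APInt(64, 0)}});
}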

This rewrite affects many tests. The changes to tests fall into two
categories.

  1. The old code had what appears to be a bug when deciding whether a misaligned vectorized load is fast. Suppose TTI reports that load <4 x i32> align 4 has relative speed 1, and suppose that load i32 align 4 has relative speed 32.

    The intent of the code seems to be that we prefer the scalar load, because it's faster. But the old code would choose the vectorized load. accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not even call into TTI to get the relative speed), because the scalar load is aligned.

    After this patch, we will prefer the scalar load if it's faster. (See the sketch after this list.)
  2. This patch changes the logic for how we vectorize. Usually this results in vectorizing more.
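
A rough sketch of the decision described in item 1 (the helper and its parameter names are made up for illustration; the TTI relative-speed query is the hook the pass consults):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/Support/Alignment.h"
using namespace llvm;

// Both the wide access and the scalar access it would replace get a real
// relative-speed number from TTI; the old code hard-coded 0 for the aligned
// scalar access, so the vectorized access always won this comparison.
bool vectorizedAccessIsWorthIt(const TargetTransformInfo &TTI,
                               LLVMContext &Ctx, unsigned VecBits,
                               unsigned ScalarBits, unsigned AddrSpace,
                               Align Alignment) {
  unsigned VecSpeed = 0, ScalarSpeed = 0;
  bool VecAllowed = TTI.allowsMisalignedMemoryAccesses(
      Ctx, VecBits, AddrSpace, Alignment, &VecSpeed);
  TTI.allowsMisalignedMemoryAccesses(Ctx, ScalarBits, AddrSpace, Alignment,
                                     &ScalarSpeed);
  // Vectorize only if the wide access is allowed and at least as fast as the
  // scalar accesses it replaces.
  return VecAllowed && VecSpeed >= ScalarSpeed;
}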

Explanation of changes to tests:

  • AMDGPU/adjust-alloca-alignment.ll: #1
  • AMDGPU/flat_atomic.ll: #2, we vectorize more.
  • AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this. Before, we'd vectorize in case 1 and not case 2. Now we vectorize in case 2 and not case 1. So we just move the call.
  • AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
  • AMDGPU/insertion-point.ll: #2 we vectorize more
  • AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1)
  • AMDGPU/multiple_tails.ll: #1
  • AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
  • AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
  • NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
  • NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
  • X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
  • X86/subchain-interleaved.ll: #2, we vectorize more
  • X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
  • X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical. It was doing add nuw %v0, -1, but this is equivalent to add nuw %v0, 0xffff'ffff, which is equivalent to asserting that %v0 == 0.
  • X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll

Diff Detail

Event Timeline

jlebar created this revision.May 4 2023, 12:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2023, 12:40 PM
jlebar requested review of this revision.May 4 2023, 12:40 PM
arsenm added a comment.May 4 2023, 2:05 PM

Now we only generate vectors of power-of-two size. Previously the code was inconsistent about whether this was a requirement.

It shouldn't be a requirement. Codegen support for non-power-of-2 vectors is better than ever and basically works now. We have native 96-bit loads and stores

jlebar added a comment.May 4 2023, 2:10 PM

Now we only generate vectors of power-of-two size. Previously the code was inconsistent about whether this was a requirement.

It shouldn't be a requirement. Codegen support for non-power-of-2 vectors is better than ever and basically works now. We have native 96-bit loads and stores

OK, thanks for that feedback. I will try to fix this.

arsenm added inline comments.May 4 2023, 2:12 PM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
648–652

This is reinventing DL.getTypeStoreSize

918

This shouldn't introduce addrspacecasts

976

Ditto

jlebar updated this revision to Diff 519669.May 4 2023, 3:01 PM

No longer require power-of-two chains

jlebar added a comment.May 4 2023, 3:02 PM

Now we only generate vectors of power-of-two size. Previously the code was inconsistent about whether this was a requirement.

It shouldn't be a requirement. Codegen support for non-power-of-2 vectors is better than ever and basically works now. We have native 96-bit loads and stores

OK, thanks for that feedback. I will try to fix this.

Done, but I see you have new feedback in the meantime. :)

jlebar updated this revision to Diff 519674.May 4 2023, 3:15 PM
jlebar marked 2 inline comments as done.

Address arsenm's feedback.

jlebar edited the summary of this revision. (Show Details)May 4 2023, 3:16 PM
jlebar edited the summary of this revision. (Show Details)

Thanks for the comments, Matt. I've addressed them.

It looks like I missed a bunch of AMDGPU codegen test failures. I didn't realize that our infra at Google was skipping these. I will have a look.

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
918

True, but it does have the potential to do a bitcast (e.g. i32 -> float) or an inttoptr (e.g. i64 -> ptr). Is that the wrong function to call?

Maybe you just meant to put this on the Value *Bitcast = Builder.CreateBitOrPointerCast( above -- that one I think can be just a BitCast. Changed.
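
For reference, a small illustration of the cast helpers being discussed (hypothetical arguments, not code from the patch):

#include "llvm/IR/IRBuilder.h"
using namespace llvm;

void castHelperExamples(IRBuilder<> &Builder, Value *V, Type *DestTy) {
  // May emit a bitcast (e.g. i32 -> float) or a ptrtoint/inttoptr
  // (e.g. i64 <-> ptr), but never an addrspacecast.
  Value *A = Builder.CreateBitOrPointerCast(V, DestTy);
  // Only reinterprets bits; for pointer types the address space must match.
  Value *B = Builder.CreateBitCast(V, DestTy);
  // For pointer-to-pointer casts, this one can emit an addrspacecast.
  Value *C = Builder.CreatePointerBitCastOrAddrSpaceCast(V, DestTy);
  (void)A; (void)B; (void)C;
}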

jlebar updated this revision to Diff 519748.May 4 2023, 11:42 PM

Update AMDGPU codegen tests, and fix two bugs they found.

jlebar edited the summary of this revision. (Show Details)May 4 2023, 11:42 PM
jlebar updated this revision to Diff 520871.May 9 2023, 5:34 PM

Get rid of unnecessary curly braces per style guide.

tra added a comment.May 16 2023, 4:23 PM

I went through the code, but didn't look at the test changes yet. While I have a reasonable idea of what's going on, a lot of the patch details went over my head.

Overall, the patch looks OK to me, with mostly minor style/comment nits and a few clarification questions.

Do you have any data on how much impact on compilation time we'll see with this patch?

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
167

Nit: empty line between functions

205–206

Naive question -- shouldn't this already be the case for the instructions within the same BB?

279

This raises more questions:

  • What *does* it return if it's given a 1-element chain?
  • Is it allowed not to split a chain (e.g if it's passed a 2-element chain)?
  • If it's given a 3-element chain, does it mean that it will return the input unchanged? What if we do want to split the chain? E.g. if we have 3 64-bit loads, for NVPTX it would be optimal to vectorize them into v2i64 and i64.
289

"Doesn't return length-1 chains." seems to be a common pattern. It would help to elaborate on that in one place. E.g. something along the lines of "all chain-split operations expect an input chain at least 2 elements long and will never produce singleton chains as output. In case the input chain cannot be split, the original chain is returned." Note that I didn't look at the implementation yet, so the details are probably wrong; edit as needed.

449–450

This should probably be mentioned at the point where the ToErase field is declared. I was wondering why it was not a local.

452

Is there a particular reason GEPs are special-cased here? Don't we want to erase other unused arguments of the instructions we're deleting? Are we guaranteed that GEPs are the only argument kinds we're dealing with here? E.g. we may conceivably have an inttoptr.

462

Nit: "..." may be a better indication that we're dealing with a range. A comma may be somewhat confusing, considering that it will be present in the IR instructions themselves.

547–558

Could it be merged into something like this:

auto [ChainBegin, ChainEnd] = [&] {
  if constexpr (IsLoad())
    return {C.begin(), C.end()};
  else
    return {C.rbegin(), C.rend()};
}();
678–679

Is the comment still valid? I'm failing to find where the recursion happens in the code.

784

Nit: drop is

787

Nit: "should not" or "will not" would probably work better.

1164–1168

It's not obvious what the optional APInt represents here. Perhaps the function needs a better name to reflect it.

1278–1279

Ditto.

1326–1327

We already know that we can't have them both be nullptr.
So the only case where the assertion would be false is when they are both non-null, and that is not possible, as I can't be both a load and a store.
Drop it?

We could also change the code a bit to make the intent a bit more obvious:

auto *LI = dyn_cast<LoadInst>(&I);
auto *SI = !LI ? dyn_cast<StoreInst>(&I) : nullptr;
if (!LI && !SI)
  continue;
1423–1428

There's a bit of a dissonance between LRU and the "most recently used".

AFAICT the intent here is the access policy, and looking at the code it appears that "most recently used" is correct.

Rename LRU -> MRU?

tra added inline comments.May 16 2023, 5:04 PM
llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll
1581

This looks like some sort of barrier which was probably important. It would be great if someone familiar with AMDGPU backend could double check if the removal of this instruction is OK.

1890

ditto.

2190

ditto.

llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
1513

ditto, though in this case it may have moved to line 1328 in the new version.

1775

ditto.

jlebar updated this revision to Diff 523133.May 17 2023, 12:11 PM
jlebar marked 14 inline comments as done.

Address tra's comments.

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
205–206

Yes, but vectorization of a load might break this invariant.

279

What *does* it return if it's given a 1-element chain?

Empty list.

Is it allowed not to split a chain (e.g if it's passed a 2-element chain)?

A two-element chain might or might not be split up, depending on whether the instructions read/write a contiguous block of memory. If it's split up, then we have two length-one chains. We do not return length-one chains, so we return the empty list.

If it's given a 3-element chain, does it mean that it will return the input unchanged?

If the three-element chain reads or writes contiguous memory, it will not be split up. If it does not read/write contiguous memory, it will be split up.

What if we do want to split the chain. E.g. if we have 3 64-bit loads, for NVPTX it would be optimal to vectorize them into v2i64 and i64.

This happens later, splitChainByAlignment.
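
In other words, roughly (an illustrative sketch reusing the hypothetical ChainElem/Chain shape from the summary, not the actual splitting code in the patch):

#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
#include <vector>
using namespace llvm;

struct ChainElem {
  Instruction *Inst;
  APInt OffsetFromLeader; // byte offset from the chain's first element
};
using Chain = SmallVector<ChainElem, 8>;

// Cut the chain wherever the next access does not start exactly where the
// previous one ends, then drop any length-1 pieces.
std::vector<Chain> splitByContiguity(const Chain &C, const DataLayout &DL) {
  std::vector<Chain> Ret;
  Chain Cur{C.front()};
  for (size_t I = 1, E = C.size(); I != E; ++I) {
    APInt PrevEnd =
        C[I - 1].OffsetFromLeader +
        DL.getTypeStoreSize(getLoadStoreType(C[I - 1].Inst)).getFixedValue();
    if (C[I].OffsetFromLeader == PrevEnd) {
      Cur.push_back(C[I]); // contiguous with the previous element
    } else {
      if (Cur.size() > 1)
        Ret.push_back(std::move(Cur)); // keep only pieces of length >= 2
      Cur = {C[I]};
    }
  }
  if (Cur.size() > 1)
    Ret.push_back(std::move(Cur));
  return Ret; // a 1-element or fully split 2-element input yields an empty list
}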

289

all chain split operations expect at least 2 elements long input chain

They work fine with a one-element chain; it's just easier to remove them before sending them to the split function.

In case input chain can not be split, the original chain is returned

This is not quite right. Our goal is not to split up chains; the goal is to keep long chains and break them up only when necessary.

I tried to explain the overall algorithm at the top of the file. Is there something that I can add at the top of the file that would make this more clear?

452

This is just matching the behavior of the original code.

I don't believe GEPs are special, but also I don't think we want to reimplement a full DCE pass?

678–679

Rewrote the comment to clarify that we don't recurse in the C++.

barannikov88 added inline comments.
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
240–241
barannikov88 added inline comments.May 17 2023, 1:14 PM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
62

System includes should go last. clang-format would do this if there were no blank lines.
https://llvm.org/docs/CodingStandards.html#include-style

198

This method looks redundant. getScalarType already checks for vector type.

234–239

This way BBI is not decremented to be incremented again at the end of the loop.

792

Does DL.getStackAlign() instead of 4 cause many negative differences in tests?

866

VecElemTy can't be a vector type, can it?

1303

Capitalize begin/end (the declaration already uses the correct style).

1404
jlebar updated this revision to Diff 523186.May 17 2023, 3:13 PM
jlebar marked 10 inline comments as done.

Address review comments.

Thank you for the reviews!

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
792

DL.getStackAlignment() assert-fails on x86, nvptx, and amdgpu:

llvm::DataLayout::getStackAlignment() const: StackNaturalAlign && "StackNaturalAlign must be defined"

Not sure what's going on, but I don't think this is a windmill I want to tilt at in this patch.

866

Yeah, I meant that VecTy == <32 x i1>. Tried to clarify the comment.

arsenm added inline comments.May 18 2023, 3:09 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
918

Sorry I got confused by the set of cast helpers. I thought these all ended up calling CreatePointerBitCastOrAddrSpaceCast at the end

1014

No else after return throughout here

arsenm added inline comments.May 18 2023, 3:10 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll
1581

I wouldn't look too closely at the waitcnt placement on a patch like this. This amounts to a scheduling change.

arsenm added inline comments.May 18 2023, 4:23 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
792

For AMDGPU we definitely do not want to promote stack objects above 4 alignment. We should in fact have optimizations to under-align any stack values to 4 as long as the address isn't captured.

barannikov88 added inline comments.May 18 2023, 4:38 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
792

For AMDGPU we definitely do not want to promote stack objects above 4 alignment.

AMDGPU has "S32" in the datalayout string, so this promotion shouldn't happen.

DL.getStackAlignment() assert-fails on x86, nvptx, and amdgpu:

This is probably because the datalayout in these tests does not have this "S" specifier.
Anyway, it was merely a suggestion, not a request for a change.

barannikov88 added inline comments.May 18 2023, 4:40 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
1279

Formatting is off (above, too).

jlebar updated this revision to Diff 523583.May 18 2023, 3:25 PM
jlebar marked 11 inline comments as done.

Review comments

Thanks for the comments.

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
1014

This one is necessary because it's an if constexpr.

I found two matches for return.*\n.*else that aren't constexpr and fixed those.

tra accepted this revision.May 18 2023, 3:57 PM

Tests look OK.

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
279

If the three-element chain reads or writes contiguous memory, it will not be split up. If it does not read/write contiguous memory, it will be split up.

I don't think it's possible to split a 3-element chain in a way that would not produce a 1-element part. The only choice is not to split and return an empty or a 3-element chain.

Anyway, I think I see what you mean. I'd prefer to see a description of what we *do* ("does not do X" leaves too much room for guessing, IMO), but I'll leave it up to you.

This revision is now accepted and ready to land.May 18 2023, 3:57 PM
arsenm added inline comments.May 19 2023, 9:07 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
178–179

Can combine the last two printed strings

250

Seems like a small size to use. I'd expect the number to be much larger in most cases

418

I find such printing redundant with regular pass manager debugging

433–453

This would be malformed IR; each block has to at least have a terminator.

452

There is RecursivelyDeleteTriviallyDeadInstructions

arsenm added inline comments.May 19 2023, 9:09 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
616–619

Why not just use getTypeStoreSize

arsenm added inline comments.May 19 2023, 9:12 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
900

Are we losing any invariant metadata after vectorization?

llvm/test/Transforms/LoadStoreVectorizer/NVPTX/many_loads_stores.ll
25

Don't use anonymous values in tests

arsenm added inline comments.May 19 2023, 9:16 AM
llvm/test/Transforms/LoadStoreVectorizer/NVPTX/many_loads_stores.ll
5–7

Don't use grep. If you don't want to generate checks, FileCheck now has a CHECK-COUNT-N syntax you can use.

jlebar updated this revision to Diff 523947.May 19 2023, 3:06 PM
jlebar marked 10 inline comments as done.

Address arsenm's comments

Thank you for the review!

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
616–619

I want to assert that SzBits % 8 == 0 here. getTypeStoreSize will just round up.

900

There is a propagateMetadata call below.

@arsenm are you happy with this, or would you like me to wait for further review from you?

@arsenm are you happy with this, or would you like me to wait for further review from you?

Since all the recent comments have been cosmetic, I'm planning to submit this tomorrow unless there are objections.

jlebar updated this revision to Diff 525769.May 25 2023, 1:12 PM

Attempt to fix Windows build and test failure in many_chains.ll.

jlebar updated this revision to Diff 525818.May 25 2023, 2:37 PM

Attempt to fix Windows build again.

jlebar updated this revision to Diff 526096.May 26 2023, 9:21 AM

OK, I really think MSVC will work this time. :)

jlebar updated this revision to Diff 526133.May 26 2023, 11:12 AM

Oh, there were *two different* MSVC problems??

jlebar updated this revision to Diff 526134.May 26 2023, 11:14 AM

There are *two different* MSVC problems?!

jlebar updated this revision to Diff 526177.May 26 2023, 1:16 PM

Try changing the captures to placate MSVC.

This revision was automatically updated to reflect the committed changes.

Buildbots are showing a failure in one test, which is the result of a bad merge (sorry). Will fix forward.

craig.topper added inline comments.
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
1511
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:1429:23: warning: comparison of integer expressions of different signedness: ‘int’ and ‘size_t’ {aka ‘long unsigned int’} [-Wsign-compare]
     for (int J = 0; J < MaxChainsToTry && ChainIter != MRU.end();
                     ~~^~~~~~~~~~~~~~~~
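
One straightforward way to address this (illustrative stand-in names, not necessarily the fix that landed) is to give the induction variable an unsigned type matching MaxChainsToTry:

#include <cstddef>
#include <list>

// MRU and the chain element type are stand-ins here; only the loop header's
// types matter for the warning.
void walkMostRecentChains(std::list<int> &MRU, size_t MaxChainsToTry) {
  auto ChainIter = MRU.begin();
  for (size_t J = 0; J < MaxChainsToTry && ChainIter != MRU.end();
       ++J, ++ChainIter) {
    // ... examine *ChainIter ...
  }
}
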
thurston added a subscriber: thurston.EditedMay 26 2023, 4:18 PM

edit: Craig beat me to it :-)

Sorry, I didn't see the -Wsign-compare error in the harbormaster builds before I submitted. Will send a fix.

seeing this
llvm/lib/IR/Instruction.cpp:125: bool llvm::Instruction::comesBefore(const llvm::Instruction*) const: Assertion `Parent == Other->Parent && "cross-BB instruction order comparison"' failed.

struggling to produce the ll to reproduce it, will take some time ...

Yikes.

I have an idea how this could happen; I think it may be the Instruction *ContextInst = GEPA->comesBefore(GEPB) ? GEPB : GEPA; line. Trying to write a test.

Yes, I can reproduce. Writing a fix...

https://reviews.llvm.org/D151630, please have a look, @ronlieb, and sorry for the breakage.

bjope added a subscriber: bjope.May 31 2023, 5:44 AM
bjope added inline comments.
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
792

Our downstream target suffered from this. Now it starts to dynamically align things on the stack that weren't aligned in the past, so we see regressions after this patch.

I wonder if it would be OK to do something like this as a workaround until these FIXMEs are sorted out:

@@ -822,11 +822,12 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
       // FIXME: We will upgrade the alignment of the alloca even if it turns out
       // we can't vectorize for some other reason.
       Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
+      Align PrefAlignment = DL.getPrefTypeAlign(Type::getInt32Ty(F.getContext()));
       if (AS == DL.getAllocaAddrSpace() && Alignment.value() % SizeBytes != 0 &&
-          IsAllowedAndFast(Align(StackAdjustedAlignment))) {
+          IsAllowedAndFast(PrefAlignment)) {
         Align NewAlign = getOrEnforceKnownAlignment(
             getLoadStorePointerOperand(C[CBegin].Inst),
-            Align(StackAdjustedAlignment), DL, C[CBegin].Inst, nullptr, &DT);
+            PrefAlignment, DL, C[CBegin].Inst, nullptr, &DT);
         if (NewAlign >= Alignment) {
           LLVM_DEBUG(dbgs()
                      << "LSV: splitByChain upgrading alloca alignment from "

So instead of the hard-coded 4-byte alignment, we ask DataLayout about the preferred alignment of a 32-bit value. I guess that is 4 for most targets; at least, all upstream lit tests seem to pass.

Well, I haven't thought much about it myself. I just tried to make a workaround that makes our downstream target happy while all in-tree lit tests still pass.

jlebar added a comment.Jun 1 2023, 7:29 AM

@bjope sorry for the breakage.

I am fine with this. Would you be willing to send a patch?

arsenm added inline comments.Jun 1 2023, 7:30 AM
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
792

Could also clamp by max of existing stack alignments and stack alignment

bjope added a comment.Jun 1 2023, 7:32 AM

@bjope sorry for the breakage.

I am fine with this. Would you be willing to send a patch?

Ok, great. I'll put something up for review.

bjope added a comment.Jun 2 2023, 9:04 AM

@bjope sorry for the breakage.

I am fine with this. Would you be willing to send a patch?

Ok, great. I'll put something up for review.

I've been hesitating a bit.
Mainly because the idea of using DL.getPrefTypeAlign(Type::getInt32Ty(F.getContext())) is pretty ugly as well (for example, it does not consider address spaces, etc.).

And I've realized that if I add "S16" to the data layout for my target, then getOrEnforceKnownAlignment won't increase the alignment to 4 when the origin is an alloca. So that would avoid the dynamic stack align that I've seen.

However, we could still get an increased align for non-alloca accesses. And here I got a bit confused. The code comments say "If we're loading/storing from an alloca, align it if possible.", but then the code only compares the address space with the alloca address space. Now we might end up changing the align of global variables as well.
I think that there should be a more fine-grained check that looks at the ptr operand to see if we actually access an alloca, if that code comment is to be honored.
Something like this:

// If we're loading/storing from an alloca, align it if possible.
Value *PtrOperand = getLoadStorePointerOperand(C[CBegin].Inst);
bool IsAllocaAccess = isa<AllocaInst>(PtrOperand->stripPointerCasts());
if (IsAllocaAccess && ...

Or is it the code comment that is misleading?

jlebar edited the summary of this revision. (Show Details)Aug 11 2023, 4:02 PM