This is an archive of the discontinued LLVM Phabricator instance.

Differential D155520

[LV] Complete load groups and release store groups in presence of dependency
Closed, Public

Authored by anna on Jul 17 2023, 3:10 PM.

Details

Reviewers

Ayal
fhahn
gilr

Commits

rG3cf24dbbdde0: [LV] Complete load groups and release store groups. Try 2.
rGeaf6117f3388: [LV] Complete load groups and release store groups in presence of dependency

Summary

This is a more complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

anna created this revision. Jul 17 2023, 3:10 PM

Herald added a project: Restricted Project. Jul 17 2023, 3:10 PM

Herald added subscribers: artagnon, StephenFan, hiraditya.

anna requested review of this revision. Jul 17 2023, 3:10 PM

Herald added subscribers: llvm-commits, wangpc.

Thanks for following-up on this!

llvm/lib/Analysis/VectorUtils.cpp
1165–1166	nit: while we're here, suffice to place this classification of GroupB and its insertion into either StoreGroups or LoadGroups, next to its creation above.
1190–1191	nit: one empty line suffices?

Would it be clearer to do something like:

auto GroupA = getInterleaveGroup(A);
if (GroupA && StoreGroups.contains(GroupA) &&
    !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) {
  LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to "
                       "dependence between " << *A << " and "<< *B << '\n');
  StoreGroups.remove(GroupA);
  releaseGroup(GroupA);
}
if (A->mayWriteToMemory() && GroupB && LoadGroups.contains(GroupB)) {
  bool CompleteGroupB = false;
  for (uint32_t Index = 0; Index < GroupB->getFactor(); ++Index) {
    Instruction *MemberOfGroupB = GroupB->getMember(Index);
    if (MemberOfGroupB &&
        !canReorderMemAccessesForInterleavedGroups(
            &*AI, &*AccessStrideInfo.find(MemberOfGroupB))) {
      CompleteGroupB = true;
      break;
    }
  }
  if (CompleteGroupB) {
    LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B
                      << " as complete.\n");
    CompletedLoadGroups.insert(GroupB);
    break;
  }
}

Would be nicer with an `if (std::any_of(...)) { ...; break; }`, with support from InterleaveGroup to Iterate over its members.

A test that requires both releasing GroupA and completing GroupB would help emphasize that the former must precede the latter, due to its `break`.
1202–1246	Suffice to place this `break` only when indicating that (load) GroupB is completed; can continue to expand (store) GroupB even if (store) GroupA is released?
1212	Should be `if (DependentInst)`?
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8	This additional/distinct testcase of preventing the sinking of a store is fixed by the same patch that compares all member of a load-groupB with a storeA, right?
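The `std::any_of` form suggested above could look roughly like the following sketch. It is a self-contained toy, not LLVM code: `ToyGroup`, its `members()` accessor, the integer access ids and the `CanReorder` callback are all hypothetical stand-ins for `InterleaveGroup`, the member-iteration support it would need, and `canReorderMemAccessesForInterleavedGroups()`.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for an interleave group. Slot i holds the id of the
// member at index i; a negative id marks a gap (where LLVM's getMember()
// would return nullptr).
struct ToyGroup {
  std::vector<int> Slots;

  // Returns only the slots that actually hold a member.
  std::vector<int> members() const {
    assert(!Slots.empty() && "a group has at least one slot");
    std::vector<int> Result;
    for (int Id : Slots)
      if (Id >= 0)
        Result.push_back(Id);
    return Result;
  }
};

// True if access A has a dependence with at least one member of Group,
// i.e. some member cannot be reordered with A.
bool dependsOnAnyMember(const ToyGroup &Group, int A,
                        const std::function<bool(int, int)> &CanReorder) {
  const std::vector<int> Members = Group.members();
  return std::any_of(Members.begin(), Members.end(),
                     [&](int Member) { return !CanReorder(A, Member); });
}
```

With a load group holding members 3 and 2 and a store 1 that conflicts only with 2, comparing the store against B = 3 alone reports no dependence, while `dependsOnAnyMember` reports one.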

anna marked an inline comment as done. Jul 19 2023, 9:10 AM

anna added inline comments.

llvm/lib/Analysis/VectorUtils.cpp
1190–1191	Actually, this wouldn't be enough AFAICT (it fixes the first test case added as a follow-on to the original review: pr63602_2). The test I added in `interleaved-accesses-sink-store-across-load.ll` won't be fixed with this. The reason is that we need to "release store group" if there is a dependency between AI and any load in the (load) GroupB.

So, this patch does two things:

1. Retains the order: the store group is released first and the LoadGroup is marked completed second (we just needed to do both; the order doesn't matter since I left the break where it is: line 1248).
2. Makes sure that we check all loads in GroupB for a dependency against `AI`, for both `GroupA` release and `GroupB` completion.

I've added more details in that test case comment below to show why #2 is needed.

> Would be nicer with an if (std::any_of(...)) { ...; break; }, with support from InterleaveGroup to Iterate over its members.

I had that locally, but when recording a `dependentInst` didn't find a good need for it.

anna added inline comments. Jul 19 2023, 9:10 AM

llvm/lib/Analysis/VectorUtils.cpp
1202–1246	yes, good point. Btw, I also found a third miscompile (which was the one I was initially trying to root-cause before running into the two miscompiles I'm trying to fix here :)). I want to fix that separately from this review after adding a test case. It is caused when `A` is a store that has a dependency with `GroupB` but `A` is not yet part of an interleave group.
1212	yes! thanks! I've updated the tests to corrected ones as well.
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8	Unfortunately, no. That only marks `GroupB` for completion, thereby preventing hoisting of a load to an earlier load group.

To show what happens in this test case: we traverse the memory accesses in reverse. So, the first interleave group created is:

store i32 %add, ptr %gep.iv.2  <-- first access
store i32 %mul, ptr %gep.iv.1.plus.2  <-- added into the (store) groupB.

Let's call this StoreGroup1.

We next come to B: `%l3 = load i32, ptr %gep.iv.2`. An interleave group is created with that inst. We create the next interleave group:

%l3 = load i32, ptr %gep.iv.2
%l2 = load i32, ptr %gep.iv.1.plus.2

Call this `LoadGroup1`.

With that same `B`, we continue the `A` accesses with A as `store i32 %mul, ptr %gep.iv.1.plus.2`. With the patch I have we will do two things:

1. Release `StoreGroup1` because we see the store has a dependency with `%l2`.
2. Mark `LoadGroup1` as complete (and break from this inner loop traversing `AI`).

If we were to only compare `BI` when deciding for store release:

if (GroupA && StoreGroups.contains(GroupA) && !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI))

then we will not release StoreGroup1, since BI (%l3) has no dependency. This means that even though the load group is correctly completed (and we will not hoist %l2 to the location of %l1), we will sink the store (store i32 %mul, ptr %gep.iv.1.plus.2) below the dependent load (%l2).

addressed most review comments: corrected if (DependentInst) check and couple other NFC.

anna added inline comments. Jul 19 2023, 9:45 AM

llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8	@Ayal For completeness, here's the diff showing why the fix suggested in the comments won't be enough: https://reviews.llvm.org/differential/diff/542078/. The test in `interleaved-accesses-sink-store-across-load.ll` shows that we still sunk the store incorrectly (see the masked.store at the end).

Ayal added inline comments. Jul 20 2023, 12:36 PM

llvm/lib/Analysis/VectorUtils.cpp
1160–1164	Avoid too early continue here, scan A's in search of potential GroupA's to release.
1161–1164	Ahh, when B belongs to a completed load group, no A can be added to it, but early-continuing here also prevents releasing the store group of any conflicting A. As outlined above: "Even if we don't create a group for B, we continue with the bottom-up algorithm to ensure we don't break any of B's dependences" - continuing the bottom-up algorithm even if the group of B is complete would fix avoid_sinking_store_across_load: considering %l2 as B would release the store group. But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed; here's an example, worth adding to the documentation:

// For example, assume we have the sequence of accesses shown below in a
// stride-3 loop:
//
// (1, 3) is a group | A[i] = a;   // (1)
//                     b = A[i+1]; // (2) |
//                   | A[i+2] = c; // (3)
//                     d = A[i];   // (4) | (2, 4) is a group
// but cannot have both (1,3) and (2,4) groups!
// Because the former sinks (1) to (3), the latter hoists (4) to (2),
// and there's a dependence between (1) and (4).
// Whenever B is a member of a load group, consider it along with
// all its members, because they will all be hoisted to B, or earlier.

Note: exact full overlap of load-after-store and store-after-store dependencies, as shown in these examples, are best optimized by eliminating the redundant load or first store, respectively. Partial overlap dependencies may be more involved to resolve, by converting to fixed-order-recurrences and/or hoisting a load above a conditional store.

Note: this algorithm of building interleave groups is worth revisiting, possibly by actually moving group members using VPlan, to both simplify and potentially optimize its decisions, e.g.: should a load group be built or a store group, if either is possible but not both? Keeping a conflicting member out of a group (thereby allowing several others to join) may be better than Completing it.
1190–1191	OK, very well. Perhaps the following would be clearer, provided it is correct:

auto DependentMember = [&](InterleaveGroup<Instruction> *Group,
                           StrideEntry *A) -> Instruction * {
  for (uint32_t Index = 0; Index < Group->getFactor(); ++Index) {
    Instruction *MemberOfGroupB = Group->getMember(Index);
    if (MemberOfGroupB && !canReorderMemAccessesForInterleavedGroups(
                              A, &*AccessStrideInfo.find(MemberOfGroupB)))
      return MemberOfGroupB;
  }
  return nullptr;
};
if (A->mayWriteToMemory()) { // Otherwise dependencies are tolerable.
  Instruction *DependentInst = nullptr;
  if (GroupB && LoadGroups.contains(GroupB)) // Check all GroupB members.
    DependentInst = DependentMember(GroupB, &*AI);
  else if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI))
    DependentInst = B;
  if (DependentInst) {
    auto GroupA = getInterleaveGroup(A);
    if (GroupA && StoreGroups.contains(GroupA)) {
      LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to "
                           "dependence between " << *A << " and "
                        << *DependentInst << '\n');
      StoreGroups.remove(GroupA);
      releaseGroup(GroupA);
    }
    if (GroupB && LoadGroups.contains(GroupB)) {
      LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B
                        << " as complete.\n");
      CompletedLoadGroups.insert(GroupB);
    }
  }
}
if (CompletedLoadGroups.contains(GroupB)) {
  // Skip trying to add A to B, continue to look for other conflicting A's in groups to be released.
  continue;
}
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8	Ah, that's what I was curious about! Thanks for the detailed and enlightening description! Led to the revised proposal above. nit: suggest to rename `%iv.2` as `%iv.1.plus.3`
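The control flow of the revised proposal above can be exercised outside LLVM with a small toy model. This is only a sketch under stated assumptions: accesses are plain integer ids, and `ToyStep`, `ToyActions`, `decide` and the `CanReorder` callback are hypothetical names that do not exist in LLVM. It records which instruction the dependence was found with, and which of the two actions (release A's store group, complete B's load group) would follow.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// One (A, B) step of the bottom-up scan, reduced to the facts the proposal
// looks at. All names here are hypothetical.
struct ToyStep {
  int A;                       // the earlier access being visited
  int B;                       // the later access whose group is being built
  bool AWrites;                // models A->mayWriteToMemory()
  bool AInStoreGroup;          // models GroupA && StoreGroups.contains(GroupA)
  std::vector<int> LoadGroupB; // members of B's load group; empty if none
};

struct ToyActions {
  int DependentInst = -1; // -1 means no dependence was found
  bool ReleaseStoreGroupA = false;
  bool CompleteLoadGroupB = false;
};

ToyActions decide(const ToyStep &S,
                  const std::function<bool(int, int)> &CanReorder) {
  assert(S.A >= 0 && S.B >= 0 && "access ids are non-negative");
  ToyActions Out;
  if (!S.AWrites) // Dependences with a non-writing A are tolerable.
    return Out;

  const bool BInLoadGroup = !S.LoadGroupB.empty();
  if (BInLoadGroup) {
    // Check every member of B's load group, not just B itself.
    for (int Member : S.LoadGroupB) {
      if (!CanReorder(S.A, Member)) {
        Out.DependentInst = Member;
        break;
      }
    }
  } else if (!CanReorder(S.A, S.B)) {
    Out.DependentInst = S.B;
  }

  if (Out.DependentInst < 0)
    return Out;
  Out.ReleaseStoreGroupA = S.AInStoreGroup;
  Out.CompleteLoadGroupB = BInLoadGroup;
  return Out;
}
```

When A is a grouped store that conflicts with a member of B's load group other than B, the model reports both actions, which is the situation the sink-store-across-load test was written for.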

anna marked an inline comment as done. Jul 21 2023, 10:09 AM

anna added inline comments.

llvm/lib/Analysis/VectorUtils.cpp
1161–1164	> But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed;

Yes, this is the exact case I faced for the last miscompile (which I haven't fixed yet). The store group wasn't formed yet and your added example for the doc is exactly what happens.
Note that when we switch around (3) and (4) in the example, we will not have a miscompile since we formed an interleave store group first (1,3). And then we come to load with B as `d = A[i]` and identify that the store group (1,3) must be released (since we will traverse backwards and reach A as `A[i] = a` which is a dependency with B and is part of a store group).
My idea to fix that was to record such stores and state they are the "last insertion point" (i.e. other stores can be sunk to the store, but we cannot sink store further down).

Also, thanks for catching the "early continue" here.

> here's an example, worth adding to the documentation:

I will add this documentation when I precommit the miscompile test, if that's okay with you (since it would show the IR example as well)? We will need a separate fix for it as well.
1190–1191	yes, this works and keeps the test cases results the same as we have currently.

addressed review comments.

anna marked an inline comment as done. Jul 21 2023, 10:17 AM

Hi Ayal, any more comments? Thanks.

In D155520#4532840, @anna wrote:

Hi Ayal, any more comments? Thanks.

This is fine, thanks Anna for following-up!
Added some minor nits.

llvm/lib/Analysis/VectorUtils.cpp
1161–1164	>> here's an example, worth adding to the documentation:

> I will add this documentation when I precommit the miscompile test, if that's okay with you (since it would show the IR example as well)? We will need a separate fix for it as well.

Sure, by all means.
1204
1206
1207
1213	nit: an empty line would help separate the setting of DependentInst from its use.
1217
1229

This revision is now accepted and ready to land. Jul 25 2023, 12:53 PM

Thank you for the review Ayal!

This revision was landed with ongoing or failed builds. Jul 25 2023, 2:32 PM

Closed by commit rGeaf6117f3388: [LV] Complete load groups and release store groups in presence of dependency (authored by anna).

This revision was automatically updated to reflect the committed changes.

anna added a commit: rGeaf6117f3388: [LV] Complete load groups and release store groups in presence of dependency.

I'm seeing compiler crashes in our (C++, ThinLTO) builds with this change. Working on a reproducer or fix.

Asan report from the crash I am experiencing: https://reviews.llvm.org/P8312

MatzeB added inline comments. Jul 26 2023, 10:56 AM

llvm/lib/Analysis/VectorUtils.cpp
1225–1226	For the crashes I see `GroupA == GroupB`, so any later access to `GroupB` after freeing `GroupA` here fails.

Uploaded a reduced input here: https://reviews.llvm.org/F28456762

Run with opt -O2 vectorize_loop_crash.ll.

You may need an asan-enabled build as regular builds don't reliably crash on use-after-free problems.

anna added inline comments. Jul 26 2023, 11:56 AM

llvm/lib/Analysis/VectorUtils.cpp
1225–1226	thank you for the reproducer! I will revert for now and fix it.

anna added a reverting change: rGe85fd3cbdd68: Revert "[LV] Complete load groups and release store groups in presence of…. Jul 26 2023, 12:07 PM

FWIW, I also ran into cases triggered by this commit, where compilation of one file has started taking an excessive amount of time (unsure if it completes or not). It can be reproduced with this file: https://martin.st/temp/RegisterBankEmitter-preproc.cpp

Compiling this file with `clang++ -target i686-w64-mingw32 -w -c -std=c++17 -O3 -fno-exceptions -fno-rtti RegisterBankEmitter-preproc.cpp` used to take ~6 seconds before this commit. After this commit, it hasn't completed after at least 3 minutes.

Thanks @mstorsjo. Just for info, the use-after-free error shows up as a hang in non-asan builds (so it's highly likely both reports are the same bug). I've reduced the first repro with bugpoint and it is due to GroupA == GroupB.

I reduced the testcase enough to show how GroupA can be same as GroupB, but there are two ways to fix this:

1. Identify that GroupA is the same as GroupB, and once we have released GroupA, iterate to the next B instruction (in the outer loop).
2. GroupA, which is being released, should never be the same as GroupB, and we ensure this by making sure that there's no dependency between *any of* the stores when inserting into GroupB. AFAICT, we don't do that since we check only between A and B when inserting (store) A into (store) GroupB.

I will go with the first fix (since it seems the easier way to do). I think if we were to make this a more robust algorithm, we should just remove the stores that are dependent rather than releasing the entire storeGroup.

anna reopened this revision. Jul 27 2023, 10:08 AM

This revision is now accepted and ready to land. Jul 27 2023, 10:08 AM

fixed use-after-free error (with added testcase). This bug wasn't there before the patch because we would break out of the inner loop accessing A, whereas now we were continuing to see which other A accesses needed to release the group.

Fixed the bug caused by this patch.

llvm/lib/Analysis/VectorUtils.cpp
1225	JFI, this break was preventing the use-after-free error before this patch (when groupA == groupB).

anna added inline comments. Jul 27 2023, 10:13 AM

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
38	Will remove this check line. We do vectorize. This testcase exists to catch use-after-free asan crash/hangs on regular build. So, we don't need any check lines.

Harbormaster completed remote builds in B248622: Diff 544837. Jul 27 2023, 1:11 PM

In D155520#4539500, @anna wrote:

fixed use-after-free error (with added testcase). This bug wasn't there before the patch because we would break out of the inner loop accessing A, whereas now we were continuing to see which other A accesses needed to release the group.

Hmm, this error seems to stem from a case of "Too many dependences, stopped recording" (LoopAccessAnalysis.cpp), which leads canReorderMemAccessesForInterleavedGroups() to answer false for any given pair of stores - even those joined together in the same interleave group. However, all members of an interleave group are effectively checked for independence upon insertion, implying they are pairwise reorderable, even in the absence of recorded dependencies.

An alternative outlined above is to augment canReorderMemAccessesForInterleavedGroups() with a pre-check if the two instructions belong to the same (store) group.

Having Dependences.size() surpass MaxDependences calls for an extensive testcase, given the latter defaults to 100.
An alternative is to reduce this threshold by setting max-dependences thereby helping to produce a simpler minimal testcase.
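To make the failure mode described above concrete, here is a minimal toy model, assuming a much simpler recorder than the real one. `ToyDependences` and `mustReleaseStoreGroupA` are hypothetical names; the only behavior borrowed from the discussion is that once recording is abandoned every pair is conservatively reported as non-reorderable, and that a same-group pre-check sidesteps this for members that were already vetted when they joined the group.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Toy dependence recorder (hypothetical). It gives up once the number of
// recorded dependences reaches the cap, mirroring the "Too many dependences,
// stopped recording" situation.
class ToyDependences {
  std::set<std::pair<int, int>> Recorded;
  std::size_t MaxDependences;
  bool Recording = true;

public:
  explicit ToyDependences(std::size_t Max) : MaxDependences(Max) {}

  void record(int Src, int Dst) {
    if (!Recording)
      return;
    Recorded.emplace(Src, Dst);
    if (Recorded.size() >= MaxDependences) {
      Recording = false; // Too many dependences: stop and forget them all.
      Recorded.clear();
    }
  }

  // Conservative query: without a usable record, no pair may be reordered.
  bool canReorder(int Src, int Dst) const {
    if (!Recording)
      return false;
    return Recorded.count(std::make_pair(Src, Dst)) == 0;
  }
};

// Decides whether A's store group has to be released because of B.
// Group ids are plain ints here; -1 means "not in any group".
bool mustReleaseStoreGroupA(const ToyDependences &Deps, int A, int GroupA,
                            int B, int GroupB) {
  assert(A != B && "A and B must be distinct accesses");
  if (GroupA < 0)
    return false;
  // Pre-check: members of one group were checked when they were inserted,
  // so a group is never released because of one of its own members.
  if (GroupA == GroupB)
    return false;
  return !Deps.canReorder(A, B);
}
```

With the cap exceeded, two stores of the same group no longer trigger a release, while stores from different groups still do.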

llvm/lib/Analysis/VectorUtils.cpp
1202–1246	Need to also move the definition of GroupA earlier.

Thanks Ayal for the root cause. I'll update the patch

addressed review comment for use-after-free error (updated test to show the LoopAccessAnalysis bailout no longer present).

Harbormaster completed remote builds in B249804: Diff 546506. Aug 2 2023, 1:19 PM

Ayal added inline comments. Aug 2 2023, 1:23 PM

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
26	Ahh, but this test does exceed the threshold leading to no dependencies being recorded, which is needed to reproduce the bug? Checking if GroupA!=GroupB above before calling canReorderMemAccessesForInterleavedGroups() fixes interleave group construction when this threshold is exceeded?

anna added inline comments. Aug 3 2023, 9:52 AM

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
26	Actually, this test does lead to no dependences being recorded. The CHECK line is incorrect (we do have more than 100 dependences). And yes, even with too many dependences, guarding it with the check allows us to avoid incorrectly releasing a group (technically had no dependences within it). I will update check line and the comment.

updated test case with correct check lines and comments.

Harbormaster completed remote builds in B250107: Diff 546917. Aug 3 2023, 12:03 PM

In D155520#4544942, @Ayal wrote:

...
Having Dependences.size() surpass MaxDependences calls for an extensive testcase, given the latter defaults to 100.
An alternative is to reduce this threshold by setting max-dependences thereby helping to produce a simpler minimal testcase.

The extensive test reproducer depends on current default value of MaxDependences.
How about also adding a minimal testcase such as something like:

; RUN: opt -passes=loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true --max-dependences=0 -S %s | FileCheck %s
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

define void @three_interleaved_stores(ptr %arr) {
; CHECK-LABEL: define void @three_interleaved_stores
; CHECK:   store <12 x i8>
entry:
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %i.plus.1 = add nuw nsw i64 %i, 1
  %i.plus.2 = add nuw nsw i64 %i, 2
  %gep.i.plus.0 = getelementptr inbounds i8, ptr %arr, i64 %i
  %gep.i.plus.1 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.1
  %gep.i.plus.2 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.2
  store i8 1, ptr %gep.i.plus.0
  store i8 1, ptr %gep.i.plus.1
  store i8 1, ptr %gep.i.plus.2
  %i.next = add nuw nsw i64 %i, 3
  %icmp = icmp ugt i64 %i, 1032
  br i1 %icmp, label %exit, label %loop

exit:
  ret void
}
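For readers less used to IR, the loop in this test case corresponds roughly to the scalar C++ below. This is a hand translation offered as a sketch, not something taken from the review: the function name `threeInterleavedStores` and the returned byte count are additions for illustration. Note that the exit test compares `%i`, not `%i.next`, so the body runs once more after `i` first exceeds 1032.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar rendering of @three_interleaved_stores. Returns the number of bytes
// written so the trip count can be checked; the IR function returns void.
std::size_t threeInterleavedStores(std::uint8_t *Arr) {
  assert(Arr != nullptr && "caller must provide a large enough buffer");
  std::uint64_t I = 0;
  std::size_t Written = 0;
  while (true) {
    Arr[I] = 1;     // store i8 1, ptr %gep.i.plus.0
    Arr[I + 1] = 1; // store i8 1, ptr %gep.i.plus.1
    Arr[I + 2] = 1; // store i8 1, ptr %gep.i.plus.2
    Written += 3;
    const bool Exit = I > 1032; // %icmp = icmp ugt i64 %i, 1032
    I += 3;                     // %i.next = add nuw nsw i64 %i, 3
    if (Exit)
      break;
  }
  return Written;
}
```

The three stores hit consecutive bytes and the index advances by 3, so they can form a single store group with factor 3; with a forced vector width of 4 that lines up with the `store <12 x i8>` the CHECK line expects.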

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
28
31–32

added Ayal's minimal reproducer. Addressed review comments.

Harbormaster completed remote builds in B250235: Diff 547078. Aug 3 2023, 9:42 PM

Ayal added inline comments. Aug 5 2023, 2:39 PM

llvm/lib/Analysis/VectorUtils.cpp
1195	BTW, unrelated to this patch: would be good to move the definition of canReorderMemAccessesForInterleavedGroups() from (being inlined in) VectorUtils.h to VectorUtils.cpp.

Looks good to me, thanks for fixing!

Worth checking with @MatzeB and @mstorsjo if this fixes their cases.

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
44	note: worth clarifying which of the two groups is invalidated (a shortcoming of existing debug prints) - this is checking that exactly one store group is invalidated, due to a dependence between a store of one group and a store of the other. Alternatively, can check the generated scalar stores and shuffles that feed the interleaved store.

This revision is now accepted and ready to land. Aug 5 2023, 3:01 PM

In D155520#4563355, @Ayal wrote:

Looks good to me, thanks for fixing!

Worth checking with @MatzeB and @mstorsjo if this fixes their cases.

At least in an initial test, things seem to work still, so I think it'd be safe to try to reland this, and I'd see if it breaks something else after testing all configurations.

Thanks, but there's no need to wait for me, land it. Given the previous test-case works this should be good. We have extensive nightly tests running with our codebase, but if something breaks again our oncall should contact you :)

Thanks everyone for the review and test cases. I'll try landing this again today.

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
44	Yeah, we'll need to read the source code in line with the debug statement:

Invalidated store group due to dependence between store ptr %load7, ptr %getelementptr, align 8 and store ptr null, ptr %getelementptr13, align 8

A is `store ptr %load7, ptr %getelementptr, align 8`. We know that the store group being invalidated is the one containing A, which means the store group invalidated is the right one (and dependentInst is not part of that group):

store ptr null, ptr %phi5, align 8
store ptr %load7, ptr %getelementptr, align 8
store ptr %load12, ptr %getelementptr11, align 8

I'll add a comment clarifying which store group is invalidated.

Also, side note: I have a patch which prints out the interleave groups at the end of this analysis. Looks like a generally useful thing to have (especially to make sure the analysis is right, even if we don't end up vectorizing).

Thanks for following up on this!

This revision was landed with ongoing or failed builds. Aug 8 2023, 3:10 PM

Closed by commit rG3cf24dbbdde0: [LV] Complete load groups and release store groups. Try 2. (authored by anna).

This revision was automatically updated to reflect the committed changes.

anna marked an inline comment as done.

anna added a commit: rG3cf24dbbdde0: [LV] Complete load groups and release store groups. Try 2..

I've added an x86 requirement to the test: https://github.com/llvm/llvm-project/commit/c09bdfe6f77afc1378edaa959d71993b038ca9a7

In D155520#4572269, @DavidSpickett wrote:

I've added an x86 requirement to the test: https://github.com/llvm/llvm-project/commit/c09bdfe6f77afc1378edaa959d71993b038ca9a7

Better move the test to llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-use-after-free.ll ?

I wonder why that test needs a triple given that the other one doesn't, maybe that line was left over from another one?

@anna feel free to move the test if that makes more sense.

Thanks, I've moved the test and removed the x86-registered-target

Revision Contents

Path	Size
llvm/
lib/
Analysis/
VectorUtils.cpp	95 lines
test/
Transforms/
LoopVectorize/
X86/
interleaved-accesses-hoist-load-across-store.ll	87 lines
interleaved-accesses-sink-store-across-load.ll	34 lines
interleaved-accesses-max-dependences.ll	28 lines
interleaved-accesses-use-after-free.ll	95 lines

Diff 548371

llvm/lib/Analysis/VectorUtils.cpp

for (auto BI = AccessStrideInfo.rbegin(), E = AccessStrideInfo.rend();

  // we don't break any of B's dependences.
  InterleaveGroup<Instruction> *GroupB = nullptr;
  if (isStrided(DesB.Stride) &&
      (!isPredicated(B->getParent()) || EnablePredicatedInterleavedMemAccesses)) {
    GroupB = getInterleaveGroup(B);
    if (!GroupB) {
      LLVM_DEBUG(dbgs() << "LV: Creating an interleave group with:" << *B
                        << '\n');
      GroupB = createInterleaveGroup(B, DesB.Stride, DesB.Alignment);
-   } else if (CompletedLoadGroups.contains(GroupB)) {
-     // Skip B if no new instructions can be added to its load group.
-     continue;
-   }
      if (B->mayWriteToMemory())
        StoreGroups.insert(GroupB);
      else
        LoadGroups.insert(GroupB);

Ayal (Unsubmitted, Not Done):

Ahh, when B belongs to a completed load group, no A can be added to it, but early-continuing here also prevents releasing the store group of any conflicting A. As outlined above: "Even if we don't create a group for B, we continue with the bottom-up algorithm to ensure we don't break any of B's dependences" - continuing the bottom-up algorithm even if the group of B is complete would fix avoid_sinking_store_across_load: considering %l2 as B would release the store group. But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed; here's an example, worth adding to the documentation:

// For example, assume we have the sequence of accesses shown below in a
// stride-3 loop:
//
// (1, 3) is a group | A[i] = a;   // (1)
//                     b = A[i+1]; // (2) |
//                   | A[i+2] = c; // (3)
//                     d = A[i];   // (4) | (2, 4) is a group
// but cannot have both (1,3) and (2,4) groups!
// Because the former sinks (1) to (3), the latter hoists (4) to (2),
// and there's a dependence between (1) and (4).
// Whenever B is a member of a load group, consider it along with
// all its members, because they will all be hoisted to B, or earlier.

Note: exact full overlap of load-after-store and store-after-store dependencies, as shown in these examples, are best optimized by eliminating the redundant load or first store, respectively. Partial overlap dependencies may be more involved to resolve, by converting to fixed-order-recurrences and/or hoisting a load above a conditional store.

Note: this algorithm of building interleave groups is worth revisiting, possibly by actually moving group members using VPlan, to both simplify and potentially optimize its decisions, e.g.: should a load group be built or a store group, if either is possible but not both? Keeping a conflicting member out of a group (thereby allowing several others to join) may be better than Completing it.
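The stride-3 example above can be checked by hand with a tiny simulation of one loop iteration. This is an illustrative sketch only: `ToyResult`, `originalOrder` and `bothGroupsFormed` are hypothetical names, and `Mem[0]`, `Mem[1]`, `Mem[2]` stand for `A[i]`, `A[i+1]`, `A[i+2]`. The second function applies both code motions at once, sinking (1) to (3) and hoisting (4) to (2).

```cpp
#include <array>
#include <cassert>

struct ToyResult {
  int B = 0;                 // value loaded by (2)
  int D = 0;                 // value loaded by (4)
  std::array<int, 3> Mem{};  // memory after the iteration
};

// The accesses in their original program order.
ToyResult originalOrder(std::array<int, 3> Mem, int NewA, int NewC) {
  assert(Mem[0] != NewA && "pick a value that differs from the old A[i]");
  ToyResult R;
  Mem[0] = NewA; // (1) A[i]   = a
  R.B = Mem[1];  // (2) b = A[i+1]
  Mem[2] = NewC; // (3) A[i+2] = c
  R.D = Mem[0];  // (4) d = A[i]
  R.Mem = Mem;
  return R;
}

// Both groups formed: loads (2, 4) execute at (2), stores (1, 3) at (3).
ToyResult bothGroupsFormed(std::array<int, 3> Mem, int NewA, int NewC) {
  assert(Mem[0] != NewA && "pick a value that differs from the old A[i]");
  ToyResult R;
  R.B = Mem[1];  // (2) b = A[i+1]
  R.D = Mem[0];  // (4) hoisted above (1): reads the stale A[i]
  Mem[0] = NewA; // (1) sunk down to (3)
  Mem[2] = NewC; // (3) A[i+2] = c
  R.Mem = Mem;
  return R;
}
```

Memory ends up identical in both versions, but `d` differs: the original order reads the freshly stored `a`, whereas the reordered version reads the old contents of `A[i]`.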

anna (Author, Unsubmitted, Done):

> But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed;

Yes, this is the exact case I faced for the last miscompile (which I haven't fixed yet). The store group wasn't formed yet and your added example for the doc is exactly what happens.
Note that when we switch around (3) and (4) in the example, we will not have a miscompile since we formed an interleave store group first (1,3). And then we come to load with B as d = A[i] and identify that the store group (1,3) must be released (since we will traverse backwards and reach A as A[i] = a which is a dependency with B and is part of a store group).
My idea to fix that was to record such stores and state they are the "last insertion point" (i.e. other stores can be sunk to the store, but we cannot sink store further down).

Also, thanks for catching the "early continue" here.

> here's an example, worth adding to the documentation:

I will add this documentation when I precommit the miscompile test, if that's okay with you (since it would show the IR example as well)? We will need a separate fix for it as well.

Ayal (Unsubmitted, Not Done):

>> here's an example, worth adding to the documentation:

> I will add this documentation when I precommit the miscompile test, if that's okay with you (since it would show the IR example as well)? We will need a separate fix for it as well.

Sure, by all means.

Ayal (Unsubmitted, Done):

LoadGroups.insert(GroupB);

- } else if (CompletedLoadGroups.contains(GroupB)) {

- // Skip B if no new instructions can be added to its load group.

- continue;

+ }

}

Avoid too early continue here, scan A's in search of potential GroupA's to release.

    }
+ }

Ayal (Unsubmitted, Done):

nit: while we're here, suffice to place this classification of GroupB and its insertion into either StoreGroups or LoadGroups, next to its creation above.

  for (auto AI = std::next(BI); AI != E; ++AI) {
    Instruction *A = AI->first;
    StrideDescriptor DesA = AI->second;

    // Our code motion strategy implies that we can't have dependences
    // between accesses in an interleaved group and other accesses located
    // between the first and last member of the group. Note that this also
    // means that a group can't have more than one member at a given offset.
    // The accesses in a group can have dependences with other accesses, but
    // we must ensure we don't extend the boundaries of the group such that
    // we encompass those dependent accesses.
    //
    // For example, assume we have the sequence of accesses shown below in a
    // stride-2 loop:
    //
    // (1, 2) is a group | A[i]   = a;  // (1)
    //                   | A[i-1] = b;  // (2) |
    //                     A[i-3] = c;  // (3)
    //                     A[i]   = d;  // (4) | (2, 4) is not a group
    //
    // Because accesses (2) and (3) are dependent, we can group (2) with (1)
    // but not with (4). If we did, the dependent access (3) would be within
    // the boundaries of the (2, 4) group.
-   if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) {
+   auto DependentMember = [&](InterleaveGroup<Instruction> *Group,

AyalUnsubmitted

Not Done

nit: one empty line suffices?

Would it be clearer to do something like:

auto GroupA = getInterleaveGroup(A);
if (GroupA && StoreGroups.contains(GroupA) &&
    !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) {
    LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to "
                         "dependence between " << *A << " and "<< *B << '\n');
  StoreGroups.remove(GroupA);
  releaseGroup(GroupA);
}
if (A->mayWriteToMemory() && GroupB && LoadGroups.contains(GroupB)) {
  bool CompleteGroupB = false;
  for (uint32_t Index = 0; Index < GroupB->getFactor(); ++Index) {
    Instruction *MemberOfGroupB = GroupB->getMember(Index);
    if (MemberOfGroupB &&
        !canReorderMemAccessesForInterleavedGroups(
            &*AI, &*AccessStrideInfo.find(MemberOfGroupB))) {
      CompleteGroupB = true;
      break;
    }
  }
  if (CompleteGroupB) {
    LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B
                      << " as complete.\n");
      CompletedLoadGroups.insert(GroupB);
    break;
  }
}

Would be nicer with an if (std::any_of(...)) { ...; break; }, with support from InterleaveGroup to Iterate over its members.

A test that requires both releasing GroupA and completing GroupB would help emphasize that the former must precede the latter, due to its break.
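A rough illustration of the `std::any_of` shape suggested above, written against toy types rather than LLVM's. `Access`, `Group`, `canReorder`, and `hasDependentMember` are hypothetical names used only for this sketch; the real code would call canReorderMemAccessesForInterleavedGroups() on entries of AccessStrideInfo, and, as noted above, InterleaveGroup would first need support for iterating over its members.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a memory access; not LLVM's Instruction.
struct Access {
  int Address;  // location touched in one iteration
  bool IsStore;
};

// Hypothetical stand-in for InterleaveGroup: one slot per index, where
// nullptr marks a gap (no member at that index).
struct Group {
  std::vector<const Access *> Members;
};

// Toy dependence test (an assumption of this sketch): two accesses may be
// reordered unless they touch the same address and at least one is a store.
bool canReorder(const Access &A, const Access &B) {
  return A.Address != B.Address || (!A.IsStore && !B.IsStore);
}

// Returns true if any non-null member of G conflicts with A.
bool hasDependentMember(const Group &G, const Access &A) {
  assert(!G.Members.empty() && "toy precondition: group has at least one slot");
  return std::any_of(G.Members.begin(), G.Members.end(),
                     [&](const Access *M) { return M && !canReorder(A, *M); });
}
```

The null check inside the lambda plays the same role as the `MemberOfGroupB &&` test in the explicit loop: gaps in the group are skipped rather than dereferenced.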


annaAuthorUnsubmitted

Done

Actually, this wouldn't be enough AFAICT (it fixes the first test case added as a follow-on to the original review: pr63602_2). The test I added in interleaved-accesses-sink-store-across-load.ll won't be fixed with this. The reason is we need to "release store group" if there is a dependency between AI and *any load in the (load) GroupB*.
So, this patch does two things:

1. Retains the order: the store group is released first and the load group is marked completed second (we just need to do both; the order doesn't matter since I left the break where it is: line 1248).
2. Makes sure that we check all loads in GroupB for a dependency against AI, for both GroupA release and GroupB completion.

I've added more details in that test case comment below to show why #2 is needed.

Would be nicer with an if (std::any_of(...)) { ...; break; }, with support from InterleaveGroup to Iterate over its members.

I had that locally, but when recording a dependentInst didn't find a good need for it.


AyalUnsubmitted

Done

OK, very well. Perhaps the following would be clearer, provided it is correct:

  auto DependentMember = [&](InterleaveGroup<Instruction> *Group,
                             StrideEntry *A) -> Instruction* {
    for (uint32_t Index = 0; Index < Group->getFactor(); ++Index) {
      Instruction *MemberOfGroupB = Group->getMember(Index);
      if (MemberOfGroupB &&
          !canReorderMemAccessesForInterleavedGroups(
              A, &*AccessStrideInfo.find(MemberOfGroupB)))
        return MemberOfGroupB;
    }
    return nullptr;
  };

if (A->mayWriteToMemory()) { // Otherwise dependencies are tolerable.
  Instruction *DependentInst = nullptr;
  if (GroupB && LoadGroups.contains(GroupB)) // Check all GroupB members.
    DependentInst = DependentMember(GroupB, &*AI);
  else if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI))
    DependentInst = B;
  if (DependentInst) {
    auto GroupA = getInterleaveGroup(A);
    if (GroupA && StoreGroups.contains(GroupA)) {
      LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to "
                           "dependence between " << *A << " and "<< *DependentInst << '\n');
      StoreGroups.remove(GroupA);
      releaseGroup(GroupA);
    }
    if (GroupB && LoadGroups.contains(GroupB)) {
      LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B
                        << " as complete.\n");
      CompletedLoadGroups.insert(GroupB);
    }
  }
}
if (CompletedLoadGroups.contains(GroupB)) {
  // Skip trying to add A to B, continue to look for other conflicting A's in groups to be released.
  continue;
}


annaAuthorUnsubmitted

Done

Yes, this works and keeps the test case results the same as we have currently.


// If a dependence exists and A is already in a group, we know that A StrideEntry *A) -> Instruction * {

// must be a store since A precedes B and WAR dependences are allowed. for (uint32_t Index = 0; Index < Group->getFactor(); ++Index) {

// Thus, A would be sunk below B. We release A's group to prevent this Instruction *MemberOfGroupB = Group->getMember(Index);

// illegal code motion. A will then be free to form another group with if (MemberOfGroupB && !canReorderMemAccessesForInterleavedGroups(

AyalUnsubmitted

Not Done

BTW, unrelated to this patch: would be good to move the definition of canReorderMemAccessesForInterleavedGroups() from (being inlined in) VectorUtils.h to VectorUtils.cpp.


// instructions that precede it. A, &*AccessStrideInfo.find(MemberOfGroupB)))

if (isInterleaved(A)) { return MemberOfGroupB;

InterleaveGroup<Instruction> *StoreGroup = getInterleaveGroup(A); }

return nullptr;

};

auto GroupA = getInterleaveGroup(A);

// If A is a load, dependencies are tolerable, there's nothing to do here.

// If both A and B belong to the same (store) group, they are independent,

AyalUnsubmitted

Done

Instruction *DependentInst = nullptr;

- // If GroupB is a load group, We have to compare AI against all

+ // If GroupB is a load group we have to compare AI against all

// members of GroupB because if any load within GroupB has a dependency


// even if dependencies have not been recorded.

// If both GroupA and GroupB are null, there's nothing to do here.

AyalUnsubmitted

Done

// members of GroupB because if any load within GroupB has a dependency

- // against AI, we need to mark GroupB as complete and also release the

+ // on AI, we need to mark GroupB as complete and also release the

// storeGroup (if A belongs to one). The former prevents incorrect


if (A->mayWriteToMemory() && GroupA != GroupB) {

AyalUnsubmitted

Done

// against AI, we need to mark GroupB as complete and also release the

- // storeGroup (if A belongs to one). The former prevents incorrect

+ // store GroupA (if A belongs to one). The former prevents incorrect

// hoisting of load B above store A while the latter prevents incorrect


Instruction *DependentInst = nullptr;

// If GroupB is a load group, we have to compare AI against all

// members of GroupB because if any load within GroupB has a dependency

// on AI, we need to mark GroupB as complete and also release the

// store GroupA (if A belongs to one). The former prevents incorrect

AyalUnsubmitted

Done

Should be if (DependentInst)?


annaAuthorUnsubmitted

Done

yes! thanks! I've updated the tests to corrected ones as well.


// hoisting of load B above store A while the latter prevents incorrect

AyalUnsubmitted

Done

nit: an empty line would help separate the setting of DependentInst from its use.


// sinking of store A below load B.

if (GroupB && LoadGroups.contains(GroupB))

DependentInst = DependentMember(GroupB, &*AI);

else if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI))

AyalUnsubmitted

Done

// A has a store dependence on B (or on some load within GroupB) and

- // is part of a storeGroup. Release A's group to prevent illegalt

+ // is part of a store group. Release A's group to prevent illegal

// sinking of A below B. A will then be free to form another group


DependentInst = B;

if (DependentInst) {

// A has a store dependence on B (or on some load within GroupB) and

// is part of a store group. Release A's group to prevent illegal

// sinking of A below B. A will then be free to form another group

// with instructions that precede it.

if (GroupA && StoreGroups.contains(GroupA)) {

LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to " LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to "

MatzeBUnsubmitted

Not Done

For the crashes I see GroupA == GroupB, so any later access to GroupB after freeing GroupA here fails.


annaAuthorUnsubmitted

Done

thank you for the reproducer! I will revert for now and fix it.
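A minimal sketch of the hazard described above, using toy types instead of LLVM's. `Group`, `GroupRegistry`, and `releaseThenComplete` are hypothetical names introduced for illustration; the early return stands in for the `GroupB != GroupA` condition that appears in the revised code, and nothing here is the actual InterleavedAccessInfo API.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <set>
#include <vector>

// Hypothetical stand-in for InterleaveGroup<Instruction>.
struct Group {
  int Id;
};

// Hypothetical owner of all live groups. release() destroys the group, so any
// pointer to it dangles afterwards.
class GroupRegistry {
  std::vector<std::unique_ptr<Group>> Live;

public:
  Group *create(int Id) {
    Live.push_back(std::make_unique<Group>(Group{Id}));
    return Live.back().get();
  }

  bool isLive(int Id) const {
    return std::any_of(
        Live.begin(), Live.end(),
        [Id](const std::unique_ptr<Group> &P) { return P->Id == Id; });
  }

  void release(Group *G) {
    assert(G && "cannot release a null group");
    Live.erase(std::remove_if(Live.begin(), Live.end(),
                              [G](const std::unique_ptr<Group> &P) {
                                return P.get() == G;
                              }),
               Live.end());
  }
};

// Release GroupA, then mark GroupB complete. When both pointers name the same
// group (or both are null), do nothing: releasing it would leave GroupB
// dangling. Returns true if the release/complete step ran.
bool releaseThenComplete(GroupRegistry &Registry, Group *GroupA, Group *GroupB,
                         std::set<int> &CompletedIds) {
  if (GroupA == GroupB)
    return false;
  if (GroupA)
    Registry.release(GroupA);
  if (GroupB)
    CompletedIds.insert(GroupB->Id); // Distinct from GroupA, so still live.
  return true;
}
```

Without the equality check, the `GroupB->Id` read would touch memory that `release` has already freed whenever A and B belong to the same group.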


"dependence between " << *A << " and "<< *B << '\n'); "dependence between "

<< *A << " and " << *DependentInst << '\n');

StoreGroups.remove(StoreGroup); StoreGroups.remove(GroupA);

AyalUnsubmitted

Done

// can be added to B's interleave group, because this would mean the

- // DependentInst would need to be moved across store A. Mark the

+ // DependentInst would move across store A. Mark the

// interleave group as complete.


releaseGroup(StoreGroup); releaseGroup(GroupA);

} }

// If B is a load and part of an interleave group, no earlier loads can // If B is a load and part of an interleave group, no earlier loads

// be added to B's interleave group, because this would mean the load B // can be added to B's interleave group, because this would mean the

// would need to be moved across store A. Mark the interleave group as // DependentInst would move across store A. Mark the interleave group

// complete. // as complete.

if (GroupB && isa<LoadInst>(B)) { if (GroupB && LoadGroups.contains(GroupB)) {

LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B

<< " as complete.\n"); << " as complete.\n");

CompletedLoadGroups.insert(GroupB); CompletedLoadGroups.insert(GroupB);

} }

}

// If a dependence exists and A is not already in a group (or it was }

// and we just released it), B might be hoisted above A (if B is a if (CompletedLoadGroups.contains(GroupB)) {

// load) or another store might be sunk below A (if B is a store). In // Skip trying to add A to B, continue to look for other conflicting A's

// either case, we can't add additional instructions to B's group. B // in groups to be released.

// will only form a group with instructions that it precedes. continue;

AyalUnsubmitted

Not Done

Suffice to place this break only when indicating that (load) GroupB is completed; can continue to expand (store) GroupB even if (store) GroupA is released?


annaAuthorUnsubmitted

Done

yes, good point.

Btw, I also found a third miscompile (which was the one I was initially trying to root-cause before running into the two miscompiles I'm trying to fix here :)).
I want to fix that separately from this review after adding a test case. It is caused when `A` is a store that has a dependency with GroupB but A is not yet part of an interleave group.


AyalUnsubmitted

Not Done

return nullptr;

};

- if (A->mayWriteToMemory()) { // Otherwise dependencies are tolerable.

+ // If A is a load, dependencies are tolerable, there's nothing to do here.

+ // If both A and B belong to the same (store) group, they are independent,

+ // even if dependencies have not been recorded.

+ // If both GroupA and GroupB are null, there's nothing to do here.

+ if (A->mayWriteToMemory() && GroupB != GroupA) {

Instruction *DependentInst = nullptr;

Need to also move the definition of GroupA earlier.


break;

annaAuthorUnsubmitted

Done

JFYI, this break was preventing the use-after-free error before this patch (when GroupA == GroupB).


} }

// At this point, we've checked for illegal code motion. If either A or B // At this point, we've checked for illegal code motion. If either A or B

// isn't strided, there's nothing left to do. // isn't strided, there's nothing left to do.

if (!isStrided(DesA.Stride) || !isStrided(DesB.Stride)) if (!isStrided(DesA.Stride) || !isStrided(DesB.Stride))

continue; continue;

// Ignore A if it's already in a group or isn't the same kind of memory // Ignore A if it's already in a group or isn't the same kind of memory


llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-hoist-load-across-store.ll

	; %l3 and the preceeding store access the same memory location. So, we cannot			; %l3 and the preceeding store access the same memory location. So, we cannot
	; have the loads %l1, %l2 and %l3 in the same interleave group since it would			; have the loads %l1, %l2 and %l3 in the same interleave group since it would
	; mean hoisting the load %l2 and %l3 across the store.			; mean hoisting the load %l2 and %l3 across the store.

	; Unlike the above case, since we go through the last load in program order and			; Unlike the above case, since we go through the last load in program order and
	; compare against the obstructing stores (%l2 versus the store) there is no			; compare against the obstructing stores (%l2 versus the store) there is no
	; dependency. However, the other load in %l2's interleave group (%l3) does			; dependency. However, the other load in %l2's interleave group (%l3) does
	; obstruct with the store.			; obstruct with the store.
	; FIXME: The test case is currently mis-compiled.
	define void @pr63602_2(ptr %arr) {			define void @pr63602_2(ptr %arr) {
	; CHECK-LABEL: define void @pr63602_2			; CHECK-LABEL: define void @pr63602_2
	; CHECK-SAME: (ptr [[ARR:%.*]]) {			; CHECK-SAME: (ptr [[ARR:%.*]]) {
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = mul i64 [[INDEX]], 3			; CHECK-NEXT: [[TMP0:%.*]] = mul i64 [[INDEX]], 3
	; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 4, [[TMP0]]			; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 4, [[TMP0]]
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[OFFSET_IDX]], 0			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[OFFSET_IDX]], 0
	; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[OFFSET_IDX]], 3			; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[OFFSET_IDX]], 3
	; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[OFFSET_IDX]], 6			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[OFFSET_IDX]], 6
	; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[OFFSET_IDX]], 9			; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[OFFSET_IDX]], 9
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[INDEX]], 3			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[INDEX]], 3
	; CHECK-NEXT: [[OFFSET_IDX2:%.*]] = add i64 1, [[TMP5]]			; CHECK-NEXT: [[OFFSET_IDX2:%.*]] = add i64 1, [[TMP5]]
	; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[OFFSET_IDX2]], 0			; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[OFFSET_IDX2]], 0
	; CHECK-NEXT: [[TMP7:%.*]] = add nuw nsw i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[OFFSET_IDX2]], 3
	; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[OFFSET_IDX2]], 6
	; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, ptr [[TMP8]], i32 -2			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[OFFSET_IDX2]], 9
	; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <12 x i32>, ptr [[TMP9]], align 4			; CHECK-NEXT: [[TMP10:%.*]] = add nuw nsw i64 [[TMP6]], 4
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP10]]
				; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[TMP11]], i32 0
				; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <12 x i32>, ptr [[TMP12]], align 4
	; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>			; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
	; CHECK-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 1, i32 4, i32 7, i32 10>			; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP1]]
	; CHECK-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 2, i32 5, i32 8, i32 11>			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP2]]
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP2]]			; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP3]]			; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i32> [[STRIDED_VEC]], i32 0
	; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x i32> [[STRIDED_VEC4]], i32 0
	; CHECK-NEXT: store i32 [[TMP14]], ptr [[TMP10]], align 4
	; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[STRIDED_VEC4]], i32 1
	; CHECK-NEXT: store i32 [[TMP15]], ptr [[TMP11]], align 4
	; CHECK-NEXT: [[TMP16:%.*]] = extractelement <4 x i32> [[STRIDED_VEC4]], i32 2
	; CHECK-NEXT: store i32 [[TMP16]], ptr [[TMP12]], align 4
	; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i32> [[STRIDED_VEC4]], i32 3
	; CHECK-NEXT: store i32 [[TMP17]], ptr [[TMP13]], align 4			; CHECK-NEXT: store i32 [[TMP17]], ptr [[TMP13]], align 4
	; CHECK-NEXT: [[TMP18:%.*]] = add <4 x i32> [[STRIDED_VEC3]], [[STRIDED_VEC]]			; CHECK-NEXT: [[TMP18:%.*]] = extractelement <4 x i32> [[STRIDED_VEC]], i32 1
	; CHECK-NEXT: [[TMP19:%.*]] = extractelement <4 x i32> [[TMP18]], i32 0			; CHECK-NEXT: store i32 [[TMP18]], ptr [[TMP14]], align 4
	; CHECK-NEXT: store i32 [[TMP19]], ptr [[TMP10]], align 4			; CHECK-NEXT: [[TMP19:%.*]] = extractelement <4 x i32> [[STRIDED_VEC]], i32 2
	; CHECK-NEXT: [[TMP20:%.*]] = extractelement <4 x i32> [[TMP18]], i32 1			; CHECK-NEXT: store i32 [[TMP19]], ptr [[TMP15]], align 4
	; CHECK-NEXT: store i32 [[TMP20]], ptr [[TMP11]], align 4			; CHECK-NEXT: [[TMP20:%.*]] = extractelement <4 x i32> [[STRIDED_VEC]], i32 3
	; CHECK-NEXT: [[TMP21:%.*]] = extractelement <4 x i32> [[TMP18]], i32 2			; CHECK-NEXT: store i32 [[TMP20]], ptr [[TMP16]], align 4
	; CHECK-NEXT: store i32 [[TMP21]], ptr [[TMP12]], align 4			; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[TMP6]], 2
	; CHECK-NEXT: [[TMP22:%.*]] = extractelement <4 x i32> [[TMP18]], i32 3			; CHECK-NEXT: [[TMP22:%.*]] = add nuw nsw i64 [[TMP7]], 2
	; CHECK-NEXT: store i32 [[TMP22]], ptr [[TMP13]], align 4			; CHECK-NEXT: [[TMP23:%.*]] = add nuw nsw i64 [[TMP8]], 2
				; CHECK-NEXT: [[TMP24:%.*]] = add nuw nsw i64 [[TMP9]], 2
				; CHECK-NEXT: [[TMP25:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP21]]
				; CHECK-NEXT: [[TMP26:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP22]]
				; CHECK-NEXT: [[TMP27:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP23]]
				; CHECK-NEXT: [[TMP28:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP24]]
				; CHECK-NEXT: [[TMP29:%.*]] = load i32, ptr [[TMP13]], align 4
				; CHECK-NEXT: [[TMP30:%.*]] = load i32, ptr [[TMP14]], align 4
				; CHECK-NEXT: [[TMP31:%.*]] = load i32, ptr [[TMP15]], align 4
				; CHECK-NEXT: [[TMP32:%.*]] = load i32, ptr [[TMP16]], align 4
				; CHECK-NEXT: [[TMP33:%.*]] = insertelement <4 x i32> poison, i32 [[TMP29]], i32 0
				; CHECK-NEXT: [[TMP34:%.*]] = insertelement <4 x i32> [[TMP33]], i32 [[TMP30]], i32 1
				; CHECK-NEXT: [[TMP35:%.*]] = insertelement <4 x i32> [[TMP34]], i32 [[TMP31]], i32 2
				; CHECK-NEXT: [[TMP36:%.*]] = insertelement <4 x i32> [[TMP35]], i32 [[TMP32]], i32 3
				; CHECK-NEXT: [[TMP37:%.*]] = load i32, ptr [[TMP25]], align 4
				; CHECK-NEXT: [[TMP38:%.*]] = load i32, ptr [[TMP26]], align 4
				; CHECK-NEXT: [[TMP39:%.*]] = load i32, ptr [[TMP27]], align 4
				; CHECK-NEXT: [[TMP40:%.*]] = load i32, ptr [[TMP28]], align 4
				; CHECK-NEXT: [[TMP41:%.*]] = insertelement <4 x i32> poison, i32 [[TMP37]], i32 0
				; CHECK-NEXT: [[TMP42:%.*]] = insertelement <4 x i32> [[TMP41]], i32 [[TMP38]], i32 1
				; CHECK-NEXT: [[TMP43:%.*]] = insertelement <4 x i32> [[TMP42]], i32 [[TMP39]], i32 2
				; CHECK-NEXT: [[TMP44:%.*]] = insertelement <4 x i32> [[TMP43]], i32 [[TMP40]], i32 3
				; CHECK-NEXT: [[TMP45:%.*]] = add <4 x i32> [[TMP36]], [[TMP44]]
				; CHECK-NEXT: [[TMP46:%.*]] = extractelement <4 x i32> [[TMP45]], i32 0
				; CHECK-NEXT: store i32 [[TMP46]], ptr [[TMP13]], align 4
				; CHECK-NEXT: [[TMP47:%.*]] = extractelement <4 x i32> [[TMP45]], i32 1
				; CHECK-NEXT: store i32 [[TMP47]], ptr [[TMP14]], align 4
				; CHECK-NEXT: [[TMP48:%.*]] = extractelement <4 x i32> [[TMP45]], i32 2
				; CHECK-NEXT: store i32 [[TMP48]], ptr [[TMP15]], align 4
				; CHECK-NEXT: [[TMP49:%.*]] = extractelement <4 x i32> [[TMP45]], i32 3
				; CHECK-NEXT: store i32 [[TMP49]], ptr [[TMP16]], align 4
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16			; CHECK-NEXT: [[TMP50:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
	; CHECK-NEXT: br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP50]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 17, 16			; CHECK-NEXT: br label [[SCALAR_PH]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 49, [[MIDDLE_BLOCK]] ], [ 1, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 49, [[MIDDLE_BLOCK]] ], [ 1, [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i64 [ 52, [[MIDDLE_BLOCK]] ], [ 4, [[ENTRY]] ]			; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i64 [ 52, [[MIDDLE_BLOCK]] ], [ 4, [[ENTRY]] ]
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[IV_1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_1_NEXT:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[IV_1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_1_NEXT:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[IV_2:%.*]] = phi i64 [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ], [ [[IV_2_NEXT:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[IV_2:%.*]] = phi i64 [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ], [ [[IV_2_NEXT:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[IV_1_NEXT]] = add nuw nsw i64 [[IV_1]], 3			; CHECK-NEXT: [[IV_1_NEXT]] = add nuw nsw i64 [[IV_1]], 3
	; CHECK-NEXT: [[IV_1_PLUS_4:%.*]] = add nuw nsw i64 [[IV_1]], 4			; CHECK-NEXT: [[IV_1_PLUS_4:%.*]] = add nuw nsw i64 [[IV_1]], 4
	; CHECK-NEXT: [[GEP_IV_1_PLUS_4:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_4]]			; CHECK-NEXT: [[GEP_IV_1_PLUS_4:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_4]]
	; CHECK-NEXT: [[L1:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_4]], align 4			; CHECK-NEXT: [[L1:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_4]], align 4
	; CHECK-NEXT: [[GEP_IV_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_2]]			; CHECK-NEXT: [[GEP_IV_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_2]]
	; CHECK-NEXT: store i32 [[L1]], ptr [[GEP_IV_2]], align 4			; CHECK-NEXT: store i32 [[L1]], ptr [[GEP_IV_2]], align 4
	; CHECK-NEXT: [[IV_1_PLUS_2:%.*]] = add nuw nsw i64 [[IV_1]], 2			; CHECK-NEXT: [[IV_1_PLUS_2:%.*]] = add nuw nsw i64 [[IV_1]], 2
	; CHECK-NEXT: [[GEP_IV_1_PLUS_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_2]]			; CHECK-NEXT: [[GEP_IV_1_PLUS_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_2]]
	; CHECK-NEXT: [[L3:%.*]] = load i32, ptr [[GEP_IV_2]], align 4			; CHECK-NEXT: [[L3:%.*]] = load i32, ptr [[GEP_IV_2]], align 4
	; CHECK-NEXT: [[L2:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_2]], align 4			; CHECK-NEXT: [[L2:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_2]], align 4
	; CHECK-NEXT: [[ADD:%.*]] = add i32 [[L3]], [[L2]]			; CHECK-NEXT: [[ADD:%.*]] = add i32 [[L3]], [[L2]]
	; CHECK-NEXT: store i32 [[ADD]], ptr [[GEP_IV_2]], align 4			; CHECK-NEXT: store i32 [[ADD]], ptr [[GEP_IV_2]], align 4
	; CHECK-NEXT: [[IV_2_NEXT]] = add nuw nsw i64 [[IV_2]], 3			; CHECK-NEXT: [[IV_2_NEXT]] = add nuw nsw i64 [[IV_2]], 3
	; CHECK-NEXT: [[ICMP:%.*]] = icmp ugt i64 [[IV_2]], 50			; CHECK-NEXT: [[ICMP:%.*]] = icmp ugt i64 [[IV_2]], 50
	; CHECK-NEXT: br i1 [[ICMP]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]			; CHECK-NEXT: br i1 [[ICMP]], label [[EXIT:%.*]], label [[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
	; CHECK: exit:			; CHECK: exit:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%iv.1 = phi i64 [ 1, %entry ], [ %iv.1.next, %loop ]			%iv.1 = phi i64 [ 1, %entry ], [ %iv.1.next, %loop ]

llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
	; RUN: opt -passes=loop-vectorize -mcpu=skx -enable-masked-interleaved-mem-accesses=1 -force-vector-width=4 -force-vector-interleave=1 -S %s | FileCheck %s			; RUN: opt -passes=loop-vectorize -mcpu=skx -enable-masked-interleaved-mem-accesses=1 -force-vector-width=4 -force-vector-interleave=1 -S %s | FileCheck %s
	target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"			target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
	target triple = "x86_64-apple-macos"			target triple = "x86_64-apple-macos"

	; This is currently miscompiled.			; %l2 load and the preceeding store has a dependency. We should not sink
	; %l2 load and the preceeding store has a dependency. However, we currently sink
	; that store into the last store (by creating an interleaved store group). This			; that store into the last store (by creating an interleaved store group). This
	; means the loaded %l2 has incorrect value.			; means the loaded %l2 will have incorrect value.
AyalUnsubmitted

Not Done

This additional/distinct testcase of preventing the sinking of a store is fixed by the same patch that compares all members of a load-groupB with a storeA, right?

annaAuthorUnsubmitted

Done

Unfortunately, no. That only marks `GroupB` for completion, thereby preventing hoisting of a load to an earlier load group.
To show what happens in this test case: we reverse traverse the memory accesses. So, the first interleave group created is:

  store i32 %add, ptr %gep.iv.2 <-- first access
  store i32 %mul, ptr %gep.iv.1.plus.2 <-- added into the (store) groupB. Let's call this StoreGroup1.

We next come to B: `%l3 = load i32, ptr %gep.iv.2`. An interleave group is created with that inst. We create the next interleave group:

  %l3 = load i32, ptr %gep.iv.2
  %l2 = load i32, ptr %gep.iv.1.plus.2

Call this `LoadGroup1`.
With that same `B`, we continue `A` accesses with A as `store i32 %mul, ptr %gep.iv.1.plus.2`.
With the patch I have we will do two things:

1. Release `StoreGroup1` because we see the store has a dependency with `%l2`.
2. Mark `LoadGroup1` as complete (and break from this inner loop traversing `AI`).

If we were to only compare `BI` when deciding for store release:

  if (GroupA && StoreGroups.contains(GroupA) &&
      !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI))

then we will not release StoreGroup1, since BI (%l3) has no dependency. This means that even though the load group is correctly completed (and we will not hoist %l2 to the location of %l1), we will sink the store (store i32 %mul, ptr %gep.iv.1.plus.2) below the dependent load (%l2).

annaAuthorUnsubmitted

Done

@Ayal For completeness, here's the diff showing why the fix suggested in the comments won't be enough: https://reviews.llvm.org/differential/diff/542078/. The test in `interleaved-accesses-sink-store-across-load.ll` shows that we still sunk the store incorrectly (see the masked.store at the end).

AyalUnsubmitted

Not Done

Ah, that's what I was curious about! Thanks for the detailed and enlightening description! Led to the revised proposal above.

nit: suggest to rename `%iv.2` as `%iv.1.plus.3`
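The walkthrough in this thread can be replayed with a small model. This is only a sketch under stated assumptions: accesses are reduced to element offsets from `%iv.1` (taking `%iv.2` to be `%iv.1 + 3`, in line with the rename suggested above), and `Access`, `conflicts`, and `mustReleaseStoreGroup` are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one access in a loop iteration, identified by its
// element offset from %iv.1.
struct Access {
  const char *Name;
  int Offset;
  bool IsStore;
};

// Toy dependence test (an assumption of this sketch): a store conflicts with
// any access to the same offset.
bool conflicts(const Access &Store, const Access &Other) {
  return Store.IsStore && Store.Offset == Other.Offset;
}

// Decide whether the store's group must be released. LoadGroup.front() plays
// the role of B, the last load in program order. With CheckAllMembers false
// only B is compared against the store; with it true every member is.
bool mustReleaseStoreGroup(const Access &Store,
                           const std::vector<Access> &LoadGroup,
                           bool CheckAllMembers) {
  assert(!LoadGroup.empty() && "load group needs at least B itself");
  if (!CheckAllMembers)
    return conflicts(Store, LoadGroup.front());
  for (const Access &Member : LoadGroup)
    if (conflicts(Store, Member))
      return true;
  return false;
}
```

With the store of `%mul` at offset 2 and a load group holding `%l3` (offset 3) followed by `%l2` (offset 2), the B-only comparison reports no reason to release the store group, while the all-members comparison does, which is the difference the test case is meant to expose.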
	; We do not release this store group correctly because the next interleave group
	; chosen compares only the memory access of last load in program (%l3) against the dependent store location
	; (%gep.iv.1.plus.2) and they are different, thereby incorrectly assuming no
	; dependency. We need to compare against all loads in that interleaved group
	; (%l2 is part of it).
	define void @avoid_sinking_store_across_load(ptr %arr) {			define void @avoid_sinking_store_across_load(ptr %arr) {
	; CHECK-LABEL: define void @avoid_sinking_store_across_load			; CHECK-LABEL: define void @avoid_sinking_store_across_load
	; CHECK-SAME: (ptr [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {			; CHECK-SAME: (ptr [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 1, i64 4, i64 7, i64 10>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 1, i64 4, i64 7, i64 10>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_IND2:%.*]] = phi <4 x i64> [ <i64 4, i64 7, i64 10, i64 13>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT3:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_IND2:%.*]] = phi <4 x i64> [ <i64 4, i64 7, i64 10, i64 13>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT3:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = mul i64 [[INDEX]], 3			; CHECK-NEXT: [[TMP0:%.*]] = mul i64 [[INDEX]], 3
	; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 1, [[TMP0]]			; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 1, [[TMP0]]
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[OFFSET_IDX]], 0			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[OFFSET_IDX]], 0
	; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 4			; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 4
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP2]]			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[TMP2]]
	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[TMP3]], i32 -2			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[TMP3]], i32 0
	; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <12 x i32>, ptr [[TMP4]], align 4			; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <12 x i32>, ptr [[TMP4]], align 4
	; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>			; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
	; CHECK-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 1, i32 4, i32 7, i32 10>
	; CHECK-NEXT: [[STRIDED_VEC5:%.*]] = shufflevector <12 x i32> [[WIDE_VEC]], <12 x i32> poison, <4 x i32> <i32 2, i32 5, i32 8, i32 11>
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[ARR]], <4 x i64> [[VEC_IND2]]			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[ARR]], <4 x i64> [[VEC_IND2]]
	; CHECK-NEXT: [[TMP6:%.*]] = add nuw nsw <4 x i64> [[VEC_IND]], <i64 2, i64 2, i64 2, i64 2>			; CHECK-NEXT: [[TMP6:%.*]] = add nuw nsw <4 x i64> [[VEC_IND]], <i64 2, i64 2, i64 2, i64 2>
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[ARR]], <4 x i64> [[TMP6]]			; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[ARR]], <4 x i64> [[TMP6]]
	; CHECK-NEXT: [[TMP8:%.*]] = mul <4 x i32> [[STRIDED_VEC5]], <i32 25, i32 25, i32 25, i32 25>			; CHECK-NEXT: [[TMP8:%.*]] = mul <4 x i32> [[STRIDED_VEC]], <i32 25, i32 25, i32 25, i32 25>
	; CHECK-NEXT: call void @llvm.masked.scatter.v4i32.v4p0(<4 x i32> [[TMP8]], <4 x ptr> [[TMP7]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>)			; CHECK-NEXT: call void @llvm.masked.scatter.v4i32.v4p0(<4 x i32> [[TMP8]], <4 x ptr> [[TMP7]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>)
	; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[STRIDED_VEC4]], [[STRIDED_VEC]]			; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x ptr> [[TMP7]], i32 0
	; CHECK-NEXT: call void @llvm.masked.scatter.v4i32.v4p0(<4 x i32> [[TMP9]], <4 x ptr> [[TMP5]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>)			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, ptr [[TMP9]], i32 0
				; CHECK-NEXT: [[WIDE_VEC4:%.*]] = load <12 x i32>, ptr [[TMP10]], align 4
				; CHECK-NEXT: [[STRIDED_VEC5:%.*]] = shufflevector <12 x i32> [[WIDE_VEC4]], <12 x i32> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
				; CHECK-NEXT: [[STRIDED_VEC6:%.*]] = shufflevector <12 x i32> [[WIDE_VEC4]], <12 x i32> poison, <4 x i32> <i32 1, i32 4, i32 7, i32 10>
				; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[STRIDED_VEC6]], [[STRIDED_VEC5]]
				; CHECK-NEXT: call void @llvm.masked.scatter.v4i32.v4p0(<4 x i32> [[TMP11]], <4 x ptr> [[TMP5]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>)
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
	; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i64> [[VEC_IND]], <i64 12, i64 12, i64 12, i64 12>			; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i64> [[VEC_IND]], <i64 12, i64 12, i64 12, i64 12>
	; CHECK-NEXT: [[VEC_IND_NEXT3]] = add <4 x i64> [[VEC_IND2]], <i64 12, i64 12, i64 12, i64 12>			; CHECK-NEXT: [[VEC_IND_NEXT3]] = add <4 x i64> [[VEC_IND2]], <i64 12, i64 12, i64 12, i64 12>
	; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16			; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
	; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 17, 16			; CHECK-NEXT: br label [[SCALAR_PH]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 49, [[MIDDLE_BLOCK]] ], [ 1, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 49, [[MIDDLE_BLOCK]] ], [ 1, [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i64 [ 52, [[MIDDLE_BLOCK]] ], [ 4, [[ENTRY]] ]			; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i64 [ 52, [[MIDDLE_BLOCK]] ], [ 4, [[ENTRY]] ]
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[IV_1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_1_NEXT:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[IV_1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_1_NEXT:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[IV_2:%.*]] = phi i64 [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ], [ [[IV_2_NEXT:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[IV_2:%.*]] = phi i64 [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ], [ [[IV_2_NEXT:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[IV_1_NEXT]] = add nuw nsw i64 [[IV_1]], 3			; CHECK-NEXT: [[IV_1_NEXT]] = add nuw nsw i64 [[IV_1]], 3
	; CHECK-NEXT: [[IV_1_PLUS_4:%.*]] = add nuw nsw i64 [[IV_1]], 4			; CHECK-NEXT: [[IV_1_PLUS_4:%.*]] = add nuw nsw i64 [[IV_1]], 4
	; CHECK-NEXT: [[GEP_IV_1_PLUS_4:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_4]]			; CHECK-NEXT: [[GEP_IV_1_PLUS_4:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_4]]
	; CHECK-NEXT: [[L1:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_4]], align 4			; CHECK-NEXT: [[L1:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_4]], align 4
	; CHECK-NEXT: [[GEP_IV_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_2]]			; CHECK-NEXT: [[GEP_IV_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_2]]
	; CHECK-NEXT: [[IV_1_PLUS_2:%.*]] = add nuw nsw i64 [[IV_1]], 2			; CHECK-NEXT: [[IV_1_PLUS_2:%.*]] = add nuw nsw i64 [[IV_1]], 2
	; CHECK-NEXT: [[GEP_IV_1_PLUS_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_2]]			; CHECK-NEXT: [[GEP_IV_1_PLUS_2:%.*]] = getelementptr inbounds i32, ptr [[ARR]], i64 [[IV_1_PLUS_2]]
	; CHECK-NEXT: [[MUL:%.*]] = mul i32 [[L1]], 25			; CHECK-NEXT: [[MUL:%.*]] = mul i32 [[L1]], 25
	; CHECK-NEXT: store i32 [[MUL]], ptr [[GEP_IV_1_PLUS_2]], align 4			; CHECK-NEXT: store i32 [[MUL]], ptr [[GEP_IV_1_PLUS_2]], align 4
	; CHECK-NEXT: [[L2:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_2]], align 4			; CHECK-NEXT: [[L2:%.*]] = load i32, ptr [[GEP_IV_1_PLUS_2]], align 4
	; CHECK-NEXT: [[L3:%.*]] = load i32, ptr [[GEP_IV_2]], align 4			; CHECK-NEXT: [[L3:%.*]] = load i32, ptr [[GEP_IV_2]], align 4
	; CHECK-NEXT: [[ADD:%.*]] = add i32 [[L3]], [[L2]]			; CHECK-NEXT: [[ADD:%.*]] = add i32 [[L3]], [[L2]]
	; CHECK-NEXT: store i32 [[ADD]], ptr [[GEP_IV_2]], align 4			; CHECK-NEXT: store i32 [[ADD]], ptr [[GEP_IV_2]], align 4
	; CHECK-NEXT: [[IV_2_NEXT]] = add nuw nsw i64 [[IV_2]], 3			; CHECK-NEXT: [[IV_2_NEXT]] = add nuw nsw i64 [[IV_2]], 3
	; CHECK-NEXT: [[ICMP:%.*]] = icmp ugt i64 [[IV_2]], 50			; CHECK-NEXT: [[ICMP:%.*]] = icmp ugt i64 [[IV_2]], 50
	; CHECK-NEXT: br i1 [[ICMP]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]			; CHECK-NEXT: br i1 [[ICMP]], label [[EXIT:%.*]], label [[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
	; CHECK: exit:			; CHECK: exit:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%iv.1 = phi i64 [ 1, %entry ], [ %iv.1.next, %loop ]			%iv.1 = phi i64 [ 1, %entry ], [ %iv.1.next, %loop ]
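
To illustrate the comment at the top of this test, here is a minimal standalone sketch (not the VectorUtils.cpp code) of why comparing the store against only the last load of a group misses the dependence, while comparing against every member catches it. `Access`, `mayConflict`, and the two `conflictsWith...` helpers are hypothetical names for a toy model that assumes each access advances by the same positive stride per iteration. The element offsets mirror the test: %l1 reads iv.1+4, %l2 reads iv.1+2, %l3 reads iv.2, with iv.1 starting at 1 and iv.2 at 4.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (assumption): on iteration i an access touches element
// Start + Stride * i.
struct Access {
  int64_t Start;
  int64_t Stride;
  bool IsWrite;
};

// Hypothetical stand-in for a dependence query: two accesses conflict if at
// least one writes and they can touch the same element in some iteration.
bool mayConflict(const Access &A, const Access &B) {
  assert(A.Stride > 0 && B.Stride > 0 && "toy model assumes positive strides");
  if (!A.IsWrite && !B.IsWrite)
    return false; // two reads never conflict
  if (A.Stride != B.Stride)
    return true; // be conservative when strides differ
  return (A.Start - B.Start) % A.Stride == 0;
}

// Incomplete check: only looks at the last load in program order.
bool conflictsWithLastMemberOnly(const Access &A,
                                 const std::vector<Access> &GroupB) {
  return !GroupB.empty() && mayConflict(A, GroupB.back());
}

// Complete check: looks at every member of the load group.
bool conflictsWithAnyMember(const Access &A,
                            const std::vector<Access> &GroupB) {
  return std::any_of(GroupB.begin(), GroupB.end(),
                     [&](const Access &M) { return mayConflict(A, M); });
}
```

With the loads listed in program order as {5, 3, 4} (stride 3) and the store at element 3, the last-member check reports no conflict because %l3 starts at 4, while the any-member check finds %l2 at element 3.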

llvm/test/Transforms/LoopVectorize/interleaved-accesses-max-dependences.ll

This file was added.

				; RUN: opt -passes=loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true -max-dependences=0 -S %s | FileCheck %s
				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; None of these stores have dependences between them, so we can successfully
				; interleave them even though the max-dependences threshold is 0.
				define void @three_interleaved_stores(ptr %arr) {
				; CHECK-LABEL: define void @three_interleaved_stores
				; CHECK: store <12 x i8>
				entry:
				br label %loop

				loop:
				%i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
				%i.plus.1 = add nuw nsw i64 %i, 1
				%i.plus.2 = add nuw nsw i64 %i, 2
				%gep.i.plus.0 = getelementptr inbounds i8, ptr %arr, i64 %i
				%gep.i.plus.1 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.1
				%gep.i.plus.2 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.2
				store i8 1, ptr %gep.i.plus.0
				store i8 1, ptr %gep.i.plus.1
				store i8 1, ptr %gep.i.plus.2
				%i.next = add nuw nsw i64 %i, 3
				%icmp = icmp ugt i64 %i, 1032
				br i1 %icmp, label %exit, label %loop

				exit:
				ret void
				}
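
As a rough illustration of what the `store <12 x i8>` check expects: with a forced vector width of 4 and three stores per iteration, the interleaved store carries 4 x 3 = 12 lanes. The sketch below computes the lane order obtained when three 4-wide member vectors are concatenated and then interleaved. `interleaveMask` is a hypothetical standalone helper written for this note, not a call into LLVM.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: Member k's value for vector lane j sits at index
// k * VF + j in the concatenated vector; the interleaved store wants the
// members of one scalar iteration next to each other in memory.
std::vector<int> interleaveMask(unsigned VF, unsigned Factor) {
  assert(VF > 0 && Factor > 0 && "need at least one lane and one member");
  std::vector<int> Mask;
  Mask.reserve(VF * Factor);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    for (unsigned Member = 0; Member < Factor; ++Member)
      Mask.push_back(static_cast<int>(Member * VF + Lane));
  return Mask;
}
```

For VF = 4 and Factor = 3 this yields 12 indices, beginning 0, 4, 8 (the three members of the first scalar iteration).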

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll

This file was added.

; REQUIRES: asserts

; RUN: opt -passes=loop-vectorize -debug-only=loop-accesses,vectorutils -force-vector-width=4 -disable-output %s 2>&1 | FileCheck %s

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"

target triple = "x86_64-redhat-linux-gnu"

%struct.foo = type { ptr, ptr, ptr }

%struct.pluto = type <{ %struct.wombat, %struct.spam, %struct.wibble, [6 x i8] }>

%struct.wombat = type { %struct.barney }

%struct.barney = type { %struct.widget }

%struct.widget = type { %struct.hoge }

%struct.hoge = type { %struct.pluto.0 }

%struct.pluto.0 = type { %struct.foo }

%struct.spam = type { %struct.barney.1 }

%struct.barney.1 = type { %struct.ham }

%struct.ham = type { %struct.bar }

%struct.bar = type { %struct.barney.2 }

%struct.barney.2 = type { %struct.hoge.3 }

%struct.hoge.3 = type { ptr, ptr, ptr }

%struct.wibble = type { %struct.spam.4 }

%struct.spam.4 = type { [2 x %struct.zot] }

%struct.zot = type { %struct.bar.5 }

%struct.bar.5 = type { i8 }

%struct.baz = type { i64, %struct.pluto }

; CHECK: LAA: Found a loop in test: bb4

; CHECK: Too many dependences, stopped recording

Ayal (Not Done):

Ahh, but this test does exceed the threshold leading to no dependencies being recorded, which is needed to reproduce the bug?

Checking if GroupA!=GroupB above before calling canReorderMemAccessesForInterleavedGroups() fixes interleave group construction when this threshold is exceeded?

anna (Author, Done):

Actually, this test does lead to no dependences being recorded. The CHECK line is incorrect (we do have more than 100 dependences).
And yes, even with too many dependences, guarding it with the check lets us avoid incorrectly releasing a group (one that technically had no dependences within it). I will update the CHECK line and the comment.
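
The exchange above can be pictured with a small standalone sketch. It assumes, as the thread describes, that the dependence query conservatively answers "may depend" for every pair once recording has stopped. `StoreAccess`, `canReorder`, and `releasedGroups` are hypothetical names, and the loop is a simplification for illustration, not the actual interleave-group analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy model of a store; Group == -1 means "not in any group".
struct StoreAccess {
  int Group;
};

// Stand-in for the dependence query. Assumption: once recording has stopped,
// every pair is conservatively reported as not reorderable; otherwise this
// toy pretends no dependence exists at all.
bool canReorder(bool DependencesRecorded) { return DependencesRecorded; }

// Returns the ids of groups that would be released. Stores are in program
// order; each earlier store A is compared against each later store B.
std::set<int> releasedGroups(const std::vector<StoreAccess> &Stores,
                             bool DependencesRecorded, bool SkipSameGroup) {
  std::set<int> Released;
  for (std::size_t B = 0; B < Stores.size(); ++B) {
    for (std::size_t A = 0; A < B; ++A) {
      assert(Stores[A].Group >= -1 && "toy model: -1 marks an ungrouped store");
      if (Stores[A].Group < 0)
        continue; // nothing to release for an ungrouped store
      if (SkipSameGroup && Stores[A].Group == Stores[B].Group)
        continue; // the GroupA != GroupB guard discussed above
      if (!canReorder(DependencesRecorded))
        Released.insert(Stores[A].Group);
    }
  }
  return Released;
}
```

With three stores that all belong to group 0 and no dependences recorded, the unguarded loop releases group 0 because of its own members, while the guarded loop leaves it intact.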

; If no dependences are recorded because there are too many, LoopAccessAnalysis

; just conservatively returns true for any pair of instructions compared (even

Ayal (Done):

; CHECK: Too many dependences, stopped recording

- ; If too many dependences are recorded, LoopAccessAnalysis just conservatively

+ ; If no dependences are recorded because there are too many, LoopAccessAnalysis just conservatively

; returns true for any pair of instructions compared (even those belonging to

; those belonging to the same store group). This tests make sure that we do not

; incorrectly release a store group which had no dependences between its

; members, even if we have no dependences recorded because there are too many.

Ayal (Done):

; the same store group). This tests make sure that we do not incorrectly release

; a store group which had no dependences between its members, even if we have

- ; too many dependences recorded.

+ ; no dependences recorded because there are too many.

; CHECK: LV: Creating an interleave group with: store ptr null, ptr %phi5, align 8

; CHECK: LV: Creating an interleave group with: store ptr null, ptr %phi5, align 8

; CHECK: LV: Inserted: store ptr %load12, ptr %getelementptr11, align 8

; CHECK: into the interleave group with store ptr null, ptr %phi5

; CHECK: LV: Inserted: store ptr %load7, ptr %getelementptr, align 8

; CHECK: into the interleave group with store ptr null, ptr %phi5

anna (Author, Done):

Will remove this check line. We do vectorize. This test case exists to catch the use-after-free, which shows up as an ASan crash, or as a hang on a regular build, so we don't need any check lines.

; CHECK: LV: Creating an interleave group with: store ptr null, ptr %getelementptr13, align 8

; CHECK: LV: Inserted: store ptr null, ptr %phi6, align 8

; CHECK: into the interleave group with store ptr null, ptr %getelementptr13

; CHECK: LV: Invalidated store group due to dependence between store ptr %load7, ptr %getelementptr, align 8 and store ptr null, ptr %getelementptr13, align 8

; CHECK-NOT: LV: Invalidated store group due to dependence between

Ayal (Done):

note: worth clarifying which of the two groups is invalidated (a shortcoming of existing debug prints) - this is checking that exactly one store group is invalidated, due to a dependence between a store of one group and a store of the other. Alternatively, can check the generated scalar stores and shuffles that feed the interleaved store.

anna (Author, Done):

Yeah, we'll need to read the source code alongside the debug statement:

Invalidated store group due to dependence between   store ptr %load7, ptr %getelementptr, align 8 and   store ptr null, ptr %getelementptr13, align 8

A is store ptr %load7, ptr %getelementptr, align 8.
We know that the store group being invalidated is the one containing A, which means the invalidated store group is the right one (and dependentInst is not part of that group):

store ptr null, ptr %phi5, align 8
store ptr %load7, ptr %getelementptr, align 8
store ptr %load12, ptr %getelementptr11, align 8

I'll add a comment clarifying which store group is invalidated.

Also, a side note: I have a patch that prints out the interleave groups at the end of this analysis. It looks like a generally useful thing to have (especially to make sure the analysis is right, even if we don't end up vectorizing).

; Note: The (only) invalidated store group is the one containing A (store ptr %load7, ptr %getelementptr, align 8) which is:

; Group with instructions:

; store ptr null, ptr %phi5, align 8

; store ptr %load7, ptr %getelementptr, align 8

; store ptr %load12, ptr %getelementptr11, align 8

define void @test(ptr %arg, ptr %arg1) local_unnamed_addr #0 {

bb:

br label %bb2

bb2: ; preds = %bb4, %bb

%phi = phi ptr [ %arg, %bb ], [ %phi3, %bb4 ]

%phi3 = phi ptr [ %arg1, %bb ], [ null, %bb4 ]

br label %bb4

bb4: ; preds = %bb4, %bb2

%phi5 = phi ptr [ %getelementptr15, %bb4 ], [ %phi, %bb2 ]

%phi6 = phi ptr [ %getelementptr14, %bb4 ], [ %phi3, %bb2 ]

%load = load i64, ptr %phi5, align 8

store i64 %load, ptr %phi, align 8

store i64 0, ptr %phi3, align 8

%load7 = load ptr, ptr %phi6, align 8

%load8 = load ptr, ptr %phi5, align 8

store ptr %load8, ptr %phi6, align 8

%getelementptr = getelementptr %struct.foo, ptr %phi5, i64 0, i32 1

%load9 = load ptr, ptr %phi5, align 8

store ptr %load9, ptr %phi6, align 8

%load10 = load ptr, ptr %phi5, align 8

store ptr %load10, ptr %phi6, align 8

store ptr null, ptr %phi5, align 8

store ptr %load7, ptr %getelementptr, align 8

%getelementptr11 = getelementptr %struct.pluto, ptr %phi5, i64 0, i32 1

%load12 = load ptr, ptr %phi6, align 8

%getelementptr13 = getelementptr %struct.pluto, ptr %phi6, i64 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 2

store ptr null, ptr %phi6, align 8

store ptr null, ptr %getelementptr13, align 8

store ptr %load12, ptr %getelementptr11, align 8

store ptr null, ptr %phi5, align 8

%getelementptr14 = getelementptr inbounds %struct.baz, ptr %phi6, i64 1

%getelementptr15 = getelementptr %struct.baz, ptr %phi5, i64 1

%icmp = icmp eq ptr %phi6, %phi

br i1 %icmp, label %bb2, label %bb4

}

; Function Attrs: memory(readwrite, inaccessiblemem: none)

declare void @foo() local_unnamed_addr #0

; Function Attrs: memory(argmem: readwrite)

declare void @pluto() local_unnamed_addr #1

attributes #0 = { memory(readwrite, inaccessiblemem: none) }

attributes #1 = { memory(argmem: readwrite) }

Revision Contents

Diff 548371

llvm/lib/Analysis/VectorUtils.cpp

llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-hoist-load-across-store.ll

llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll

llvm/test/Transforms/LoopVectorize/interleaved-accesses-max-dependences.ll

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
