This is an archive of the discontinued LLVM Phabricator instance.

[LV] Complete load groups and release store groups in presence of dependency
ClosedPublic

Authored by anna on Jul 17 2023, 3:10 PM.

Details

Summary

This is a more complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups.

Diff Detail

Event Timeline

anna created this revision.Jul 17 2023, 3:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2023, 3:10 PM
anna requested review of this revision.Jul 17 2023, 3:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2023, 3:10 PM
Ayal added a comment.Jul 19 2023, 4:26 AM

Thanks for following-up on this!

llvm/lib/Analysis/VectorUtils.cpp
1165–1166

nit: while we're here, suffice to place this classification of GroupB and its insertion into either StoreGroups or LoadGroups, next to its creation above.

1190–1191

nit: one empty line suffices?

Would it be clearer to do something like:

auto GroupA = getInterleaveGroup(A);
if (GroupA && StoreGroups.contains(GroupA) &&
    !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) {
    LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to "
                         "dependence between " << *A << " and "<< *B << '\n');
  StoreGroups.remove(GroupA);
  releaseGroup(GroupA);
}
if (A->mayWriteToMemory() && GroupB && LoadGroups.contains(GroupB)) {
  bool CompleteGroupB = false;
  for (uint32_t Index = 0; Index < GroupB->getFactor(); ++Index) {
    Instruction *MemberOfGroupB = GroupB->getMember(Index);
    if (MemberOfGroupB &&
        !canReorderMemAccessesForInterleavedGroups(
            &*AI, &*AccessStrideInfo.find(MemberOfGroupB))) {
      CompleteGroupB = true;
      break;
    }
  }
  if (CompleteGroupB) {
    LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B
                      << " as complete.\n");
      CompletedLoadGroups.insert(GroupB);
    break;
  }
}

Would be nicer with an if (std::any_of(...)) { ...; break; }, with support from InterleaveGroup to Iterate over its members.

A test that requires both releasing GroupA and completing GroupB would help emphasize that the former must precede the latter, due to its break.

1202–1246

Suffice to place this break only when indicating that (load) GroupB is completed; can continue to expand (store) GroupB even if (store) GroupA is released?

1212

Should be if (DependentInst)?

llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8

This additional/distinct testcase of preventing the sinking of a store is fixed by the same patch that compares all member of a load-groupB with a storeA, right?

anna marked an inline comment as done.Jul 19 2023, 9:10 AM
anna added inline comments.
llvm/lib/Analysis/VectorUtils.cpp
1190–1191

Actually, this wouldn't be enough AFAICT (it fixes the first test case added as a follow-on to the original review: pr63602_2). The test I added in interleaved-accesses-sink-store-across-load.ll won't be fixed with this. The reason is we need to "release store group" if there is a dependency between AI and *any load in the (load) GroupB*.
So, this patch does two things:

  1. retains the order that store group is released first and LoadGroup is marked completed second (we just needed to do both, the order doesn't matter since I left the break where it is: line 1248).
  2. Make sure that we check all loads in GroupB for a dependency against AI, for both GroupA release and GroupB completion.

I've added more details in that test case comment below to show why #2 is needed.

Would be nicer with an if (std::any_of(...)) { ...; break; }, with support from InterleaveGroup to Iterate over its members.

I had that locally, but when recording a dependentInst didn't find a good need for it.

anna added inline comments.Jul 19 2023, 9:10 AM
llvm/lib/Analysis/VectorUtils.cpp
1202–1246

yes, good point.

Btw, I also found a third miscompile (which was the one I was initially trying to root-cause before running into the two miscompiles I'm trying to fix here :)).
I want to fix that separately from this review after adding a test case. It is caused when`A` is a store that has a dependency with GroupB but A is not yet part of an interleave group.

1212

yes! thanks! I've updated the tests to corrected ones as well.

llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8

Unfortunately, no. That only marks GroupB for completion, thereby preventing hoisting of a load to an earlier load group.

To show what happens in this test case: we reverse traverse the memory access. So, the first interleave group created is:

store i32 %add, ptr %gep.iv.2 <-- first access
store i32 %mul, ptr %gep.iv.1.plus.2 <-- added into the (store) groupB.

Let's call this StoreGroup1.

We next come to B: %l3 = load i32, ptr %gep.iv.2
Interleave group created with that inst.
We create next interleave group:

%l3 = load i32, ptr %gep.iv.2
%l2 = load i32, ptr %gep.iv.1.plus.2

Call this LoadGroup1.

Wit that same B, we continue A accesses with A as store i32 %mul, ptr %gep.iv.1.plus.2.

With the patch I have we will do two things:

  1. release StoreGroup1 because we see the store has a dependency with %l2
  2. Mark LoadGroup1 as complete (and break from this inner loop traversing AI).

If we were to *only* compare BI when deciding for store release:

if (GroupA && StoreGroups.contains(GroupA) &&
    !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI))

then, we will not release StoreGroup1 since BI (%l3) has no dependency.

Which means even though the load group is correctly completed (and we will not hoist %l2 to location of %l1), we will sink the store (store i32 %mul, ptr %gep.iv.1.plus.2) below the dependent load (%l2).

anna updated this revision to Diff 542062.Jul 19 2023, 9:12 AM

addressed most review comments: corrected if (DependentInst) check and couple other NFC.

anna added inline comments.Jul 19 2023, 9:45 AM
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8

@Ayal For completeness, here's the diff showing why the fix suggested in the comments won't be enough: https://reviews.llvm.org/differential/diff/542078/. The test in interleaved-accesses-sink-store-across-load.ll shows that we still sunk the store incorrectly (see the masked.store at the end).

Ayal added inline comments.Jul 20 2023, 12:36 PM
llvm/lib/Analysis/VectorUtils.cpp
1160–1164

Avoid too early continue here, scan A's in search of potential GroupA's to release.

1161–1164

Ahh, when B belongs to a completed load group, no A can be added to it, but early-continuing here also prevents releasing the store group of any conflicting A. As outlined above: "Even if we don't create a group for B, we continue with the bottom-up algorithm to ensure we don't break any of B's dependences" - continuing the bottom-up algorithm even if the group of B is complete would fix avoid_sinking_store_across_load: considering %l2 as B would release the store group. But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed; here's an example, worth adding to the documentation:

// For example, assume we have the sequence of accesses shown below in a
// stride-3 loop:
//
// (1, 3) is a group | A[i] = a;   // (1)
//                     b = A[i+1]; // (2) |
//                   | A[i+2] = c; // (3)
//                     d = A[i];   // (4) | (2, 4) is a group
// but cannot have both (1,3) and (2,4) groups!
// Because the former sinks (1) to (3), the latter hoists (4) to (2),
// and there's a dependence between (1) and (4).
// Whenever B is a member of a load group, consider it along with
// all its members, because they will all be hoisted to B, or earlier.

Note: exact full overlap of load-after-store and store-after-store dependencies, as shown in these examples, are best optimized by eliminating the redundant load or first store, respectively. Partial overlap dependencies may be more involved to resolve, by converting to fixed-order-recurrences and/or hoisting a load above a conditional store.

Note: this algorithm of building interleave groups is worth revisiting, possibly by actually moving group members using VPlan, to both simplify and potentially optimize its decisions, e.g.: should a load group be built or a store group, if either is possible but not both? Keeping a conflicting member out of a group (thereby allowing several others to join) may be better than Completing it.

1190–1191

OK, very well. Perhaps the following would be clearer, provided it is correct:

  auto DependentMember = [&](InterleaveGroup<Instruction> *Group,
                             StrideEntry *A) -> Instruction* {
    for (uint32_t Index = 0; Index < Group->getFactor(); ++Index) {
      Instruction *MemberOfGroupB = Group->getMember(Index);
      if (MemberOfGroupB &&
          !canReorderMemAccessesForInterleavedGroups(
              A, &*AccessStrideInfo.find(MemberOfGroupB)))
        return MemberOfGroupB;
    }
    return nullptr;
  };

if (A->mayWriteToMemory()) { // Otherwise dependencies are tolerable.
  Instruction *DependentInst = nullptr;
  if (GroupB && LoadGroups.contains(GroupB)) // Check all GroupB members.
    DependentInst = DependentMember(GroupB, &*AI);
  else if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI))
    DependentInst = B;
  if (DependentInst) {
    auto GroupA = getInterleaveGroup(A);
    if (GroupA && StoreGroups.contains(GroupA)) {
      LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to "
                           "dependence between " << *A << " and "<< *DependentInst << '\n');
      StoreGroups.remove(GroupA);
      releaseGroup(GroupA);
    }
    if (GroupB && LoadGroups.contains(GroupB)) {
      LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B
                        << " as complete.\n");
      CompletedLoadGroups.insert(GroupB);
    }
  }
}
if (CompletedLoadGroups.contains(GroupB)) {
  // Skip trying to add A to B, continue to look for other conflicting A's in groups to be released.
  continue;
}
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll
8

Ah, that's what I was curious about! Thanks for the detailed and enlightening description! Led to the revised proposal above.

nit: suggest to rename %iv.2 as %iv.1.plus.3

anna marked an inline comment as done.Jul 21 2023, 10:09 AM
anna added inline comments.
llvm/lib/Analysis/VectorUtils.cpp
1161–1164

But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed;

Yes, this is the exact case I faced for the last miscompile (which I haven't fixed yet). The store group wasn't formed yet and your added example for the doc is exactly what happens.
Note that when we switch around (3) and (4) in the example, we will not have a miscompile since we formed an interleave store group first (1,3). And then we come to load with B as d = A[i] and identify that the store group (1,3) must be released (since we will traverse backwards and reach A as A[i] = a which is a dependency with B and is part of a store group).
My idea to fix that was to record such stores and state they are the "last insertion point" (i.e. other stores can be sunk to the store, but we cannot sink store further down).

Also, thanks for catching the "early continue" here.

here's an example, worth adding to the documentation:

I will add this documentation when I precommit the miscompile test, if that's okay with you (since it would show the IR example as well)? We will need a separate fix for it as well.

1190–1191

yes, this works and keeps the test cases results the same as we have currently.

anna updated this revision to Diff 542979.Jul 21 2023, 10:17 AM
anna marked an inline comment as done.

addressed review comments.

anna marked an inline comment as done.Jul 21 2023, 10:17 AM
anna added a comment.Jul 25 2023, 11:38 AM

Hi Ayal, any more comments? Thanks.

Ayal accepted this revision.Jul 25 2023, 12:53 PM

Hi Ayal, any more comments? Thanks.

This is fine, thanks Anna for following-up!
Added some minor nits.

llvm/lib/Analysis/VectorUtils.cpp
1161–1164

here's an example, worth adding to the documentation:

I will add this documentation when I precommit the miscompile test, if that's okay with you (since it would show the IR example as well)? We will need a separate fix for it as well.

Sure, by all means.

1204
1206
1207
1213

nit: an empty line would help separate the setting of DependentInst from it use.

1217
1229
This revision is now accepted and ready to land.Jul 25 2023, 12:53 PM
anna marked 7 inline comments as done.Jul 25 2023, 1:32 PM

Thank you for the review Ayal!

This revision was landed with ongoing or failed builds.Jul 25 2023, 2:32 PM
This revision was automatically updated to reflect the committed changes.
MatzeB added a subscriber: MatzeB.EditedJul 26 2023, 10:16 AM

I'm seeing compiler crashes in our (C++, ThinLTO) builds with this change. Working on a reproducer or fix.

Asan report from the crash I am experiencing: https://reviews.llvm.org/P8312

MatzeB added inline comments.Jul 26 2023, 10:56 AM
llvm/lib/Analysis/VectorUtils.cpp
1225–1226

For the crashes I see GroupA == GroupB so any later access to GroupB after freeing GroupA here fail.

Uploaded a reduced input here: https://reviews.llvm.org/F28456762

Run with opt -O2 vectorize_loop_crash.ll.

You may need an asan-enabled build as regular builds don't reliably crash on use-after-free problems.

anna added inline comments.Jul 26 2023, 11:56 AM
llvm/lib/Analysis/VectorUtils.cpp
1225–1226

thank you for the reproducer! I will revert for now and fix it.

FWIW, I also ran into cases triggered by this commit, where compilation of one file has started taking an excessive amount of time (unsure if it completes or not). It can be reproduced with this file: https://martin.st/temp/RegisterBankEmitter-preproc.cpp

With this file, compiled with clang++ -target i686-w64-mingw32 -w -c -std=c++17 -O3 -fno-exceptions -fno-rtti RegisterBankEmitter-preproc.cpp, used to take ~6 seconds before this commit. After this commit, it doesn't complete within at least 3 minutes.

anna added a comment.Jul 27 2023, 6:35 AM

Thanks @mstorsjo and just for info, the use-after-free error shows up as a hang in non-asan builds (so its highly likely both reports are the same bug). I've reduced the first repro with bugpoint and it is due to GroupA == GroupB.

anna added a comment.Jul 27 2023, 8:58 AM

I reduced the testcase enough to show how GroupA can be same as GroupB, but there are two ways to fix this:

  • Identify that GroupA is same as GroupB and once we released groupA, we should iterate to the next B instruction (in the outer loop).
  • GroupA which is being released should never be same as GroupB and we do this by making sure that there's no dependency between *any of* the stores when inserting into GroupB. AFAICT, we don't do that since we check only between A and B when inserting (store) A into (store) groupB.

I will go with the first fix (since it seems the easier way to do). I think if we were to make this a more robust algorithm, we should just remove the stores that are dependent rather than releasing the entire storeGroup.

anna reopened this revision.Jul 27 2023, 10:08 AM
This revision is now accepted and ready to land.Jul 27 2023, 10:08 AM
anna updated this revision to Diff 544837.Jul 27 2023, 10:08 AM

fixed use-after-free error (with added testcase). This bug wasn't there before the patch because we would break out of the inner loop accessing A, whereas now we were continuing to see which other A accesses needed to release the group.

anna requested review of this revision.Jul 27 2023, 10:09 AM

Fixed the bug caused by this patch.

llvm/lib/Analysis/VectorUtils.cpp
1225

JFI, this break was preventing the use-after-free error before this patch (when groupA == groupB).

anna added inline comments.Jul 27 2023, 10:13 AM
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
38

Will remove this check line. We do vectorize. This testcase exists to catch use-after-free asan crash/hangs on regular build. So, we don't need any check lines.

Ayal added a comment.Jul 30 2023, 7:53 AM

fixed use-after-free error (with added testcase). This bug wasn't there before the patch because we would break out of the inner loop accessing A, whereas now we were continuing to see which other A accesses needed to release the group.

Hmm, this error seems to stem from a case of "Too many dependences, stopped recording" (LoopAccessAnalysis.cpp), which leads canReorderMemAccessesForInterleavedGroups() to answer false for any given pair of stores - even those joined together in the same interleave group. However, all members of an interleave group are effectively checked for independence upon insertion, implying they are pairwise reorderable, even in the absence of recorded dependencies.

An alternative outlined above is to augment canReorderMemAccessesForInterleavedGroups() with a pre-check if the two instructions belong to the same (store) group.

Having Dependences.size() surpass MaxDependences calls for an extensive testcase, given the latter defaults to 100.
An alternative is to reduce this threshold by setting max-dependences thereby helping to produce a simpler minimal testcase.

llvm/lib/Analysis/VectorUtils.cpp
1202–1246

Need to also move the definition of GroupA earlier.

anna added a comment.Aug 2 2023, 9:28 AM

Thanks Ayal for the root cause. I'll update the patch

anna updated this revision to Diff 546506.Aug 2 2023, 9:31 AM

addressed review comment for use-after-free error (updated test to show the LoopAccessAnalysis bailout no longer present).

Ayal added inline comments.Aug 2 2023, 1:23 PM
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
26

Ahh, but this test does exceed the threshold leading to no dependencies being recorded, which is needed to reproduce the bug?

Checking if GroupA!=GroupB above before calling canReorderMemAccessesForInterleavedGroups() fixes interleave group construction when this threshold is exceeded?

anna added inline comments.Aug 3 2023, 9:52 AM
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
26

Actually, this test does lead to no dependences being recorded. The CHECK line is incorrect (we do have more than 100 dependences).
And yes, even with too many dependences, guarding it with the check allows us to avoid incorrectly releasing a group (technically had no dependences within it). I will update check line and the comment.

anna updated this revision to Diff 546917.Aug 3 2023, 10:10 AM

updated test case with correct check lines and comments.

Ayal added a comment.Aug 3 2023, 3:17 PM

...
Having Dependences.size() surpass MaxDependences calls for an extensive testcase, given the latter defaults to 100.
An alternative is to reduce this threshold by setting max-dependences thereby helping to produce a simpler minimal testcase.

The extensive test reproducer depends on current default value of MaxDependences.
How about also adding a minimal testcase such as something like:

; RUN: opt -passes=loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true --max-dependences=0 -S %s | FileCheck %s
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

define void @three_interleaved_stores(ptr %arr) {
; CHECK-LABEL: define void @three_interleaved_stores
; CHECK:   store <12 x i8>
entry:
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %i.plus.1 = add nuw nsw i64 %i, 1
  %i.plus.2 = add nuw nsw i64 %i, 2
  %gep.i.plus.0 = getelementptr inbounds i8, ptr %arr, i64 %i
  %gep.i.plus.1 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.1
  %gep.i.plus.2 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.2
  store i8 1, ptr %gep.i.plus.0
  store i8 1, ptr %gep.i.plus.1
  store i8 1, ptr %gep.i.plus.2
  %i.next = add nuw nsw i64 %i, 3
  %icmp = icmp ugt i64 %i, 1032
  br i1 %icmp, label %exit, label %loop

exit:
  ret void
}
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
28
31–32
anna updated this revision to Diff 547078.Aug 3 2023, 6:48 PM
anna marked 2 inline comments as done.

added Ayal's minimal reproducer. Addressed review comments.

Ayal added inline comments.Aug 5 2023, 2:39 PM
llvm/lib/Analysis/VectorUtils.cpp
1195

BTW, unrelated to this patch: would be good to move the definition of canReorderMemAccessesForInterleavedGroups() from (being inlined in) VectorUtils.h to VectorUtils.cpp.

Ayal accepted this revision.Aug 5 2023, 3:01 PM

Looks good to me, thanks for fixing!

Worth checking with @MatzeB and @mstorsjo if this fixes their cases.

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
44

note: worth clarifying which of the two groups is invalidated (a shortcoming of existing debug prints) - this is checking that exactly one store group is invalidated, due to a dependence between a store of one group and a store of the other. Alternatively, can check the generated scalar stores and shuffles that feed the interleaved store.

This revision is now accepted and ready to land.Aug 5 2023, 3:01 PM

Looks good to me, thanks for fixing!

Worth checking with @MatzeB and @mstorsjo if this fixes their cases.

At least in an initial test, things seem to work still, so I think it'd be safe to try to reland this, and I'd see if it breaks something else after testing all configurations.

MatzeB added a comment.Aug 8 2023, 8:42 AM

Thanks, but there's no need to wait for me, land it. Given the previous test-case works this should be good. We have extensive nightly tests running with our codebase, but if something breaks again our oncall should contact you :)

anna added a comment.Aug 8 2023, 1:25 PM

Thanks everyone for the review and test cases. I'll try landing this again today.

llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll
44

Yeah, we'll need to read the source code in line with the debug statement:

Invalidated store group due to dependence between   store ptr %load7, ptr %getelementptr, align 8 and   store ptr null, ptr %getelementptr13, align 8

A is store ptr %load7, ptr %getelementptr, align 8.
We know that the store group being invalidated is the one containing A, which means the store group invalidated is the right one (and dependentInst is not part of that group) :

store ptr null, ptr %phi5, align 8
store ptr %load7, ptr %getelementptr, align 8
store ptr %load12, ptr %getelementptr11, align 8

I'll add a comment clarifying which store group is invalidated.

Also, side note: I have a patch which prints out the interleave groups at the end of this analysis. Looks like a generally useful thing to have (especially to make sure the analysis is right, even if we don't end up vectorizing).

fhahn added a comment.Aug 8 2023, 1:29 PM

Thanks for following up on this!

This revision was landed with ongoing or failed builds.Aug 8 2023, 3:10 PM
This revision was automatically updated to reflect the committed changes.
anna marked an inline comment as done.
Ayal added a comment.Aug 9 2023, 4:07 AM

Better move the test to llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-use-after-free.ll ?

I wonder why that test needs a triple given that the other one doesn't, maybe that line was left over from another one?

@anna feel free to move the test if that makes more sense.

anna added a comment.Aug 9 2023, 1:03 PM

Thanks, I've moved the test and removed the x86-registered-target