This is a more complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for following-up on this!
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1165–1166 | nit: while we're here, suffice to place this classification of GroupB and its insertion into either StoreGroups or LoadGroups, next to its creation above. | |
1190–1191 | nit: one empty line suffices? Would it be clearer to do something like: auto GroupA = getInterleaveGroup(A); if (GroupA && StoreGroups.contains(GroupA) && !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) { LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to " "dependence between " << *A << " and "<< *B << '\n'); StoreGroups.remove(GroupA); releaseGroup(GroupA); } if (A->mayWriteToMemory() && GroupB && LoadGroups.contains(GroupB)) { bool CompleteGroupB = false; for (uint32_t Index = 0; Index < GroupB->getFactor(); ++Index) { Instruction *MemberOfGroupB = GroupB->getMember(Index); if (MemberOfGroupB && !canReorderMemAccessesForInterleavedGroups( &*AI, &*AccessStrideInfo.find(MemberOfGroupB))) { CompleteGroupB = true; break; } } if (CompleteGroupB) { LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B << " as complete.\n"); CompletedLoadGroups.insert(GroupB); break; } } Would be nicer with an if (std::any_of(...)) { ...; break; }, with support from InterleaveGroup to Iterate over its members. A test that requires both releasing GroupA and completing GroupB would help emphasize that the former must precede the latter, due to its break. | |
1203–1243 | Suffice to place this break only when indicating that (load) GroupB is completed; can continue to expand (store) GroupB even if (store) GroupA is released? | |
1212 | Should be if (DependentInst)? | |
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll | ||
8 | This additional/distinct testcase of preventing the sinking of a store is fixed by the same patch that compares all member of a load-groupB with a storeA, right? |
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1190–1191 | Actually, this wouldn't be enough AFAICT (it fixes the first test case added as a follow-on to the original review: pr63602_2). The test I added in interleaved-accesses-sink-store-across-load.ll won't be fixed with this. The reason is we need to "release store group" if there is a dependency between AI and *any load in the (load) GroupB*.
I've added more details in that test case comment below to show why #2 is needed.
I had that locally, but when recording a dependentInst didn't find a good need for it. |
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1203–1243 | yes, good point. Btw, I also found a third miscompile (which was the one I was initially trying to root-cause before running into the two miscompiles I'm trying to fix here :)). | |
1212 | yes! thanks! I've updated the tests to corrected ones as well. | |
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll | ||
8 | Unfortunately, no. That only marks GroupB for completion, thereby preventing hoisting of a load to an earlier load group. To show what happens in this test case: we reverse traverse the memory access. So, the first interleave group created is: store i32 %add, ptr %gep.iv.2 <-- first access store i32 %mul, ptr %gep.iv.1.plus.2 <-- added into the (store) groupB. Let's call this StoreGroup1. We next come to B: %l3 = load i32, ptr %gep.iv.2 %l3 = load i32, ptr %gep.iv.2 %l2 = load i32, ptr %gep.iv.1.plus.2 Call this LoadGroup1. Wit that same B, we continue A accesses with A as store i32 %mul, ptr %gep.iv.1.plus.2. With the patch I have we will do two things:
If we were to *only* compare BI when deciding for store release: if (GroupA && StoreGroups.contains(GroupA) && !canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) then, we will not release StoreGroup1 since BI (%l3) has no dependency. Which means even though the load group is correctly completed (and we will not hoist %l2 to location of %l1), we will sink the store (store i32 %mul, ptr %gep.iv.1.plus.2) below the dependent load (%l2). |
addressed most review comments: corrected if (DependentInst) check and couple other NFC.
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll | ||
---|---|---|
8 | @Ayal For completeness, here's the diff showing why the fix suggested in the comments won't be enough: https://reviews.llvm.org/differential/diff/542078/. The test in interleaved-accesses-sink-store-across-load.ll shows that we still sunk the store incorrectly (see the masked.store at the end). |
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1160–1164 | Avoid too early continue here, scan A's in search of potential GroupA's to release. | |
1161–1164 | Ahh, when B belongs to a completed load group, no A can be added to it, but early-continuing here also prevents releasing the store group of any conflicting A. As outlined above: "Even if we don't create a group for B, we continue with the bottom-up algorithm to ensure we don't break any of B's dependences" - continuing the bottom-up algorithm even if the group of B is complete would fix avoid_sinking_store_across_load: considering %l2 as B would release the store group. But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed; here's an example, worth adding to the documentation: // For example, assume we have the sequence of accesses shown below in a // stride-3 loop: // // (1, 3) is a group | A[i] = a; // (1) // b = A[i+1]; // (2) | // | A[i+2] = c; // (3) // d = A[i]; // (4) | (2, 4) is a group // but cannot have both (1,3) and (2,4) groups! // Because the former sinks (1) to (3), the latter hoists (4) to (2), // and there's a dependence between (1) and (4). // Whenever B is a member of a load group, consider it along with // all its members, because they will all be hoisted to B, or earlier. Note: exact full overlap of load-after-store and store-after-store dependencies, as shown in these examples, are best optimized by eliminating the redundant load or first store, respectively. Partial overlap dependencies may be more involved to resolve, by converting to fixed-order-recurrences and/or hoisting a load above a conditional store. Note: this algorithm of building interleave groups is worth revisiting, possibly by actually moving group members using VPlan, to both simplify and potentially optimize its decisions, e.g.: should a load group be built or a store group, if either is possible but not both? Keeping a conflicting member out of a group (thereby allowing several others to join) may be better than Completing it. | |
1190–1191 | OK, very well. Perhaps the following would be clearer, provided it is correct: auto DependentMember = [&](InterleaveGroup<Instruction> *Group, StrideEntry *A) -> Instruction* { for (uint32_t Index = 0; Index < Group->getFactor(); ++Index) { Instruction *MemberOfGroupB = Group->getMember(Index); if (MemberOfGroupB && !canReorderMemAccessesForInterleavedGroups( A, &*AccessStrideInfo.find(MemberOfGroupB))) return MemberOfGroupB; } return nullptr; }; if (A->mayWriteToMemory()) { // Otherwise dependencies are tolerable. Instruction *DependentInst = nullptr; if (GroupB && LoadGroups.contains(GroupB)) // Check all GroupB members. DependentInst = DependentMember(GroupB, &*AI); else if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) DependentInst = B; if (DependentInst) { auto GroupA = getInterleaveGroup(A); if (GroupA && StoreGroups.contains(GroupA)) { LLVM_DEBUG(dbgs() << "LV: Invalidated store group due to " "dependence between " << *A << " and "<< *DependentInst << '\n'); StoreGroups.remove(GroupA); releaseGroup(GroupA); } if (GroupB && LoadGroups.contains(GroupB)) { LLVM_DEBUG(dbgs() << "LV: Marking interleave group for " << *B << " as complete.\n"); CompletedLoadGroups.insert(GroupB); } } } if (CompletedLoadGroups.contains(GroupB)) { // Skip trying to add A to B, continue to look for other conflicting A's in groups to be released. continue; } | |
llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-sink-store-across-load.ll | ||
8 | Ah, that's what I was curious about! Thanks for the detailed and enlightening description! Led to the revised proposal above. nit: suggest to rename %iv.2 as %iv.1.plus.3 |
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1161–1164 |
Yes, this is the exact case I faced for the last miscompile (which I haven't fixed yet). The store group wasn't formed yet and your added example for the doc is exactly what happens. Also, thanks for catching the "early continue" here.
I will add this documentation when I precommit the miscompile test, if that's okay with you (since it would show the IR example as well)? We will need a separate fix for it as well. | |
1190–1191 | yes, this works and keeps the test cases results the same as we have currently. |
This is fine, thanks Anna for following-up!
Added some minor nits.
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1161–1164 |
Sure, by all means. | |
1205 | ||
1207 | ||
1208 | ||
1214 | nit: an empty line would help separate the setting of DependentInst from it use. | |
1218 | ||
1230 |
I'm seeing compiler crashes in our (C++, ThinLTO) builds with this change. Working on a reproducer or fix.
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1226–1227 | For the crashes I see GroupA == GroupB so any later access to GroupB after freeing GroupA here fail. |
Uploaded a reduced input here: https://reviews.llvm.org/F28456762
Run with opt -O2 vectorize_loop_crash.ll.
You may need an asan-enabled build as regular builds don't reliably crash on use-after-free problems.
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1226–1227 | thank you for the reproducer! I will revert for now and fix it. |
FWIW, I also ran into cases triggered by this commit, where compilation of one file has started taking an excessive amount of time (unsure if it completes or not). It can be reproduced with this file: https://martin.st/temp/RegisterBankEmitter-preproc.cpp
With this file, compiled with clang++ -target i686-w64-mingw32 -w -c -std=c++17 -O3 -fno-exceptions -fno-rtti RegisterBankEmitter-preproc.cpp, used to take ~6 seconds before this commit. After this commit, it doesn't complete within at least 3 minutes.
Thanks @mstorsjo and just for info, the use-after-free error shows up as a hang in non-asan builds (so its highly likely both reports are the same bug). I've reduced the first repro with bugpoint and it is due to GroupA == GroupB.
I reduced the testcase enough to show how GroupA can be same as GroupB, but there are two ways to fix this:
- Identify that GroupA is same as GroupB and once we released groupA, we should iterate to the next B instruction (in the outer loop).
- GroupA which is being released should never be same as GroupB and we do this by making sure that there's no dependency between *any of* the stores when inserting into GroupB. AFAICT, we don't do that since we check only between A and B when inserting (store) A into (store) groupB.
I will go with the first fix (since it seems the easier way to do). I think if we were to make this a more robust algorithm, we should just remove the stores that are dependent rather than releasing the entire storeGroup.
fixed use-after-free error (with added testcase). This bug wasn't there before the patch because we would break out of the inner loop accessing A, whereas now we were continuing to see which other A accesses needed to release the group.
Fixed the bug caused by this patch.
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1225 | JFI, this break was preventing the use-after-free error before this patch (when groupA == groupB). |
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll | ||
---|---|---|
37 ↗ | (On Diff #544837) | Will remove this check line. We do vectorize. This testcase exists to catch use-after-free asan crash/hangs on regular build. So, we don't need any check lines. |
Hmm, this error seems to stem from a case of "Too many dependences, stopped recording" (LoopAccessAnalysis.cpp), which leads canReorderMemAccessesForInterleavedGroups() to answer false for any given pair of stores - even those joined together in the same interleave group. However, all members of an interleave group are effectively checked for independence upon insertion, implying they are pairwise reorderable, even in the absence of recorded dependencies.
An alternative outlined above is to augment canReorderMemAccessesForInterleavedGroups() with a pre-check if the two instructions belong to the same (store) group.
Having Dependences.size() surpass MaxDependences calls for an extensive testcase, given the latter defaults to 100.
An alternative is to reduce this threshold by setting max-dependences thereby helping to produce a simpler minimal testcase.
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1203–1243 | Need to also move the definition of GroupA earlier. |
addressed review comment for use-after-free error (updated test to show the LoopAccessAnalysis bailout no longer present).
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll | ||
---|---|---|
25 ↗ | (On Diff #546506) | Ahh, but this test does exceed the threshold leading to no dependencies being recorded, which is needed to reproduce the bug? Checking if GroupA!=GroupB above before calling canReorderMemAccessesForInterleavedGroups() fixes interleave group construction when this threshold is exceeded? |
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll | ||
---|---|---|
25 ↗ | (On Diff #546506) | Actually, this test does lead to no dependences being recorded. The CHECK line is incorrect (we do have more than 100 dependences). |
The extensive test reproducer depends on current default value of MaxDependences.
How about also adding a minimal testcase such as something like:
; RUN: opt -passes=loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true --max-dependences=0 -S %s | FileCheck %s target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128" define void @three_interleaved_stores(ptr %arr) { ; CHECK-LABEL: define void @three_interleaved_stores ; CHECK: store <12 x i8> entry: br label %loop loop: %i = phi i64 [ 0, %entry ], [ %i.next, %loop ] %i.plus.1 = add nuw nsw i64 %i, 1 %i.plus.2 = add nuw nsw i64 %i, 2 %gep.i.plus.0 = getelementptr inbounds i8, ptr %arr, i64 %i %gep.i.plus.1 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.1 %gep.i.plus.2 = getelementptr inbounds i8, ptr %arr, i64 %i.plus.2 store i8 1, ptr %gep.i.plus.0 store i8 1, ptr %gep.i.plus.1 store i8 1, ptr %gep.i.plus.2 %i.next = add nuw nsw i64 %i, 3 %icmp = icmp ugt i64 %i, 1032 br i1 %icmp, label %exit, label %loop exit: ret void }
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll | ||
---|---|---|
27 ↗ | (On Diff #546917) | |
30–31 ↗ | (On Diff #546917) |
llvm/lib/Analysis/VectorUtils.cpp | ||
---|---|---|
1195 | BTW, unrelated to this patch: would be good to move the definition of canReorderMemAccessesForInterleavedGroups() from (being inlined in) VectorUtils.h to VectorUtils.cpp. |
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll | ||
---|---|---|
43 ↗ | (On Diff #547078) | note: worth clarifying which of the two groups is invalidated (a shortcoming of existing debug prints) - this is checking that exactly one store group is invalidated, due to a dependence between a store of one group and a store of the other. Alternatively, can check the generated scalar stores and shuffles that feed the interleaved store. |
At least in an initial test, things seem to work still, so I think it'd be safe to try to reland this, and I'd see if it breaks something else after testing all configurations.
Thanks, but there's no need to wait for me, land it. Given the previous test-case works this should be good. We have extensive nightly tests running with our codebase, but if something breaks again our oncall should contact you :)
Thanks everyone for the review and test cases. I'll try landing this again today.
llvm/test/Transforms/LoopVectorize/interleaved-accesses-use-after-free.ll | ||
---|---|---|
43 ↗ | (On Diff #547078) | Yeah, we'll need to read the source code in line with the debug statement: Invalidated store group due to dependence between store ptr %load7, ptr %getelementptr, align 8 and store ptr null, ptr %getelementptr13, align 8 A is store ptr %load7, ptr %getelementptr, align 8. store ptr null, ptr %phi5, align 8 store ptr %load7, ptr %getelementptr, align 8 store ptr %load12, ptr %getelementptr11, align 8 I'll add a comment clarifying which store group is invalidated. Also, side note: I have a patch which prints out the interleave groups at the end of this analysis. Looks like a generally useful thing to have (especially to make sure the analysis is right, even if we don't end up vectorizing). |
I've added an x86 requirement to the test: https://github.com/llvm/llvm-project/commit/c09bdfe6f77afc1378edaa959d71993b038ca9a7
Better move the test to llvm/test/Transforms/LoopVectorize/X86/interleaved-accesses-use-after-free.ll ?
I wonder why that test needs a triple given that the other one doesn't, maybe that line was left over from another one?
@anna feel free to move the test if that makes more sense.
Ahh, when B belongs to a completed load group, no A can be added to it, but early-continuing here also prevents releasing the store group of any conflicting A. As outlined above: "Even if we don't create a group for B, we continue with the bottom-up algorithm to ensure we don't break any of B's dependences" - continuing the bottom-up algorithm even if the group of B is complete would fix avoid_sinking_store_across_load: considering %l2 as B would release the store group. But other cases may evade this "avoid early-continue here" fix - when the conflict to release a store group is encountered before the store group is formed; here's an example, worth adding to the documentation:
Note: exact full overlap of load-after-store and store-after-store dependencies, as shown in these examples, are best optimized by eliminating the redundant load or first store, respectively. Partial overlap dependencies may be more involved to resolve, by converting to fixed-order-recurrences and/or hoisting a load above a conditional store.
Note: this algorithm of building interleave groups is worth revisiting, possibly by actually moving group members using VPlan, to both simplify and potentially optimize its decisions, e.g.: should a load group be built or a store group, if either is possible but not both? Keeping a conflicting member out of a group (thereby allowing several others to join) may be better than Completing it.