This patch is motivated by pr48057 (https://bugs.llvm.org/show_bug.cgi?id=48057), where an output dependency is not detected since loop interchange did not check a store instruction with itself.
Details
- Reviewers
- Whitney - bmahjour - Meinersbur 
- Group Reviewers
- Restricted Project 
- Commits
- rGabc8ca65c3de: [LoopInterchange] Detect output dependency of a store instruction with itself
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
As mentioned in the LoopWG call, I don't think this has anything to do with dominance relations or diverging branches. Instead store i16 %conv9.i, i16* @e may be executed multiple times and the last written value written to @e must the value stored after the loop (the problem should occur even if the store is executed every time, i.e. not diverging). This is a classic output-dependency (write-after-write). I suggest to have a look at the dependency analysis and why it missed this case.
Thanks Michael! I did look futher into it, and I do agree that there is output dependency that writes into @e in each iteration. Nevertheless if the store is executed every time (not diverging), IMHO if I'm not mistaken, it seems that this output dependency does not invalid loop interchange since we only care about the final value written into @e, which is the value of array b[d][0] at the final interation. This value b[d=0][0] does not change before and after interchange. So the output dependence is this example seems to be okay, please correct me if I'm wrong.
Regarding dependency analysis: for this example (e.g., test1() in pr48057.ll), no dependence is detected, since for all mem instructions (on lines 57, 63 ,65 in pr48057.ll), the memory locations do not alias.
I'd appreciate if you could let me know if what I said makes sense to you, thank you very much!
Couldn't the same problem happen in theory without control flow divergence? For example consider a loop like this:
for (c = 0; c <= 7; c++) {
  for (d = 4; d; d--)
    e = ((b[d+2][c]) ? b[d][0] : e);where the ternary operator turns into a select instruction in LLVM IR.
Traditional dependency analysis only considers pairwises dependencies ("hazards") between to execution, which is formulated like "StmtA(i) and StmtB(j) are dependent if they may access the same memory at least on of them is a write" (Read-after-Read is not a dependency). In principle an output-dependency does not matter if overwritten anyway before being used. However, DependenceAnalysis doesn't even have an API to consider that. It also does not make a difference between "unconditionally overwrite" (in literature called a "kill") and a conditional or partial write:
bool Dependence::isOutput() const {
  return Src->mayWriteToMemory() && Dst->mayWriteToMemory();
}Regarding dependency analysis: for this example (e.g., test1() in pr48057.ll), no dependence is detected, since for all mem instructions (on lines 57, 63 ,65 in pr48057.ll), the memory locations do not alias.
store i16 %conv9.i, i16* @e hazards with itself from different iterations (same memory, both are "write"). It would not be very different if it was an array and indexed e[f(d,c)] where f is a speculable/const function (and potentially always returns 0). If no dependence is detected, I'd consider it a bug.
| llvm/test/Transforms/LoopInterchange/pr48057.ll | ||
|---|---|---|
| 13–22 | Instead of using the fuzzer output, it would be nice to reduce it making it more similar to what a user would write (e.g. avoid globals, explicit loop variables, etc). Eg.: void test1(char b[][5], int *e) {
   for (int c = 0; c < 8; ++c) 
     for (int d = 0; d < 5; ++d)
       if(b[c][d])
         *e = c + d;
 } | |
Thanks for the comment! IIUC the source code you wrote would likely result in control-flow divergence as well, even it is "select" form in the source code. The inner loop would look like the following where e is represented by the phi node %cond:
for.body2: ; preds = %for.cond1.preheader, %cond.end %indvars.iv = phi i64 [ 4, %for.cond1.preheader ], [ %indvars.iv.next, %cond.end ] %cond12 = phi i16 [ %cond1.lcssa57, %for.cond1.preheader ], [ %cond, %cond.end ] %4 = add nuw nsw i64 %indvars.iv, 2 %arrayidx4 = getelementptr inbounds [8 x [8 x i8]], [8 x [8 x i8]]* @b, i64 0, i64 %4, i64 %indvars.iv9 %5 = load i8, i8* %arrayidx4, align 1, !tbaa !9 %tobool5.not = icmp eq i8 %5, 0 br i1 %tobool5.not, label %cond.end, label %cond.true cond.true: ; preds = %for.body2 %arrayidx8 = getelementptr inbounds [8 x [8 x i8]], [8 x [8 x i8]]* @b, i64 0, i64 %indvars.iv, i64 0 %6 = load i8, i8* %arrayidx8, align 8, !tbaa !9 %conv9 = sext i8 %6 to i16 br label %cond.end cond.end: ; preds = %for.body2, %cond.true %cond = phi i16 [ %conv9, %cond.true ], [ %cond12, %for.body2 ] %indvars.iv.next = add nsw i64 %indvars.iv, -1 %tobool.not = icmp eq i64 %indvars.iv.next, 0 br i1 %tobool.not, label %for.inc12, label %for.body2, !llvm.loop !10
Secondly, I've manually changed the IR to make it use "select" instructions and not divergent (doing if-conversion essentially). This way loop interchange would bail out from "unable to recognize reduction or induction phis", since e is a phi node and the operation on e is like a "reduction" but the "reduction operator" is the select instruction, so it is not a real reduction and loop interchange bails. Hence I think this case is under our control.
Just to summarize, IMO I guess we would need to handle output dependency under control-flow dependence anyways. I'm looking forward to your thoughts.
store i16 %conv9.i, i16* @e hazards with itself from different iterations (same memory, both are "write"). It would not be very different if it was an array and indexed e[f(d,c)] where f is a speculable/const function (and potentially always returns 0). If no dependence is detected, I'd consider it a bug.
Thanks Michael, when we check dependency in loop interchange, we check instructions in pair and we did not check an instruction with itself. That's why we did not detect the output dependency that store i16 %conv9.i, i16* @e hazards with itself. I've updated the check so I could detect the output dependency now (patch not updated yet).
In principle an output-dependency does not matter if overwritten anyway before being used. However, DependenceAnalysis doesn't even have an API to consider that. It also does not make a difference between "unconditionally overwrite" (in literature called a "kill") and a conditional or partial write:
I agree with you. The problem in this bug is the "conditional or partial write" you mentioned, in other words, output dependency under control-flow divergence. I can add code to detect "output dependency under control-flow divergence" in loop interchange, or in DependenceAnalysis.cpp. IIUC you prefer to do it in DependenceAnalysis.cpp and add new API there, so I'm working on an design and I'd appreciate it if we could further discuss it maybe on the next loop opt meeting.
I don't see the update yet.
The problem in this bug is the "conditional or partial write" you mentioned, in other words, output dependency under control-flow divergence. I can add code to detect "output dependency under control-flow divergence" in loop interchange, or in DependenceAnalysis.cpp.
The problem is more complex than that, it's not just control-flow divergence (by which I think you mean the write being conditional; if conditional on a loop-invariant condition it could be solved with a loop unswitch), but any kind of "non-kill" memory write (e.g. partial overwrite of only some of the bytes, unpredictable/non-constant indices or base addresses, reuse of previous values in eg. an overlapping memmove call, etc.) and potential reads before the kill instruction. If the address can be accessed by multiple instructions, this is an NP-complete problem.
You are free to tackle that problem maybe just for special cases, but I suggest to try that in a separate patch.
Updated the patch for now, such that during dependence analysis in loop interchange, it takes into account a store instruction with itself, thus can determine the output dependency. If we detect a "Scalar" output dependency under control-flow divergence, we would just bail.
Nevertheless this change exposed another problem. It fails three lit test cases under loop interchange: lcssa-preheader.ll, perserve-lcssa.ll, pr45743-move-from-inner-preheader.ll. They fail because there is a store instruction inside the innermost loop like this:
; inside the inner loop, @Array is a 2D array %Address = gep @Array, 0, InnerIndvar, OuterIndvar store 0, %Address
We should be able to analyze that there is no output dependency between this store and itself since the direction vector should be an all-zero vector. However, da returns a direction vector of "* *" for this output dependency meaning we don't know about dependency directions and we bail from loop interchange, which we should not.
According to our discussion, I've updated the test cases and retitled this patch to "Detect output dependency of a store instruction with itself". I revised three test files to make dependence analysis work more reasonably: lcssa-preheader.ll, perserve-lcssa.ll, pr45743-move-from-inner-preheader.ll.
I removed the legacy test case lcssa_08() in lcssa-preheader.ll since it says we should not move the instruction %wide.trip.count = zext i32 %m to i64 to BB outer.header otherwise lcssa would break. The original patch (https://reviews.llvm.org/D75943) claimed it should be moved into outer.preheader, which was the initial behavior of this test.
In the early days our capability to deal with lcssa phis was limited. More recently after this patch https://reviews.llvm.org/rG8393b9fd1f36d9273fa0720872e3996495aacc1c, we do move %wide.trip.count = zext i32 %m to i64` to BB outer.header, because we've dealt with lcssa phis more appropriately. This to some extent invalidates the original test lcssa_08(), hence I removed it in this patch.
| llvm/test/Transforms/LoopInterchange/lcssa-preheader.ll | ||
|---|---|---|
| 5 ↗ | (On Diff #407746) | This is a valid test case to interchange, and looks like after this patch, it won't be interchanged anymore. I believe @congzhe is looking into seeing if we can treat scalar output dependencies as interchange-preventing so that we can still get scenarios like this one. | 
| llvm/test/Transforms/LoopInterchange/lcssa-preheader.ll | ||
|---|---|---|
| 5 ↗ | (On Diff #407746) | Thanks Bardia, likely I was not clear enough during our last discussion -- just to clarify the fact that this test case wont be interchanged after this patch is due to the insufficient capability of DependenceAnalysis/Delinearization which gives [* *] as the direction vector. Remember that we now detect output dependency and take the store instruction into consideration, which results in a direction vector of [* *]. IIUC this problem seems independent of the current patch -- this patch fixes the deficiency that output dependencies are not considered, if it exposes other problems in other aspects I'm thinking to fix them in other patches maybe? Previously I thought to remove this test since it seems not very meaningful. But actually with a trivial change in the test (similar to the trival changes in the two other tests in this patch), DependenceAnalysis could work properly and this test can be interchanged. I'll update the patch accordingly. Regarding your comment that if we can treat scalar output dependencies as interchange-preventing so that we can still get scenarios like this one: I would like to first clarify again that scalar dependency is not related to this test/scenario, what fails this test is that dependence analysis gives [* *] as the direction vector. I've also followed your suggestion and checked that if we treat scalar output dependencies as interchange-preventing, we would fail the following 10 tests. I looked into a few of them and IMHO it seems that scalar output dependencies seem to be legal in terms of interchange. Please correct me if I'm wrong. Failed Tests (10): LLVM :: Transforms/LoopInterchange/inner-only-reductions.ll LLVM :: Transforms/LoopInterchange/innermost-latch-uses-values-in-middle-header.ll LLVM :: Transforms/LoopInterchange/interchange-no-deps.ll LLVM :: Transforms/LoopInterchange/lcssa.ll LLVM :: Transforms/LoopInterchange/outer-header-jump-to-inner-latch.ll LLVM :: Transforms/LoopInterchange/phi-ordering.ll LLVM :: Transforms/LoopInterchange/pr43176-move-to-new-latch.ll LLVM :: Transforms/LoopInterchange/pr43797-lcssa-for-multiple-outer-loop-blocks.ll LLVM :: Transforms/LoopInterchange/reductions-across-inner-and-outer-loop.ll LLVM :: Transforms/LoopInterchange/vector-gep-operand.ll | 
Updated a previous failing test case in lcssa-preheader.ll.
Previously this test could not be interchanged since dependence analysis would return [* *] as the direction vector for the output dependency detected, now it could be interchanged.
| llvm/test/Transforms/LoopInterchange/lcssa-preheader.ll | ||
|---|---|---|
| 5 ↗ | (On Diff #407746) | 
 Sure, DependenceAnalysis may have issues, but I'm wondering if those same "issues" are the reason this patch prevents interchange for the motivating test case. For instance, for lcssa_08 the reason we see [* *] is because of delinearization issues which could be improved in future (you can test and verify this with -da-disable-delinearization-checks). I think keeping this test and adding -da-disable-delinearization-checks is better than removing it all together. Also with -da-disable-delinearization-checks does the motivating test case still pass with this patch? 
 Thanks for trying this. 
 Could you elaborate on why you think they should not be interchange preventing? This whole patch is premised on detecting output dependencies and treating them as interchange preventing. | 
| llvm/test/Transforms/LoopInterchange/lcssa-preheader.ll | ||
|---|---|---|
| 5 ↗ | (On Diff #407746) | 
 I fully agree, we can interchange this case after adding -da-disable-delinearization-checks. I'll update the patch shortly. 
 Just to clarify a bit, the current patch does not prevent interchange for any case, what this patch does is to fix the deficiency that we did not detect output dependency before. I'm hoping this is one step closer to solving the motivating bug. 
 If we take a look at no_mem_instrs() in Transforms/LoopInterchange/interchange-no-deps.ll, the output dependency on store i64 %indvars.iv, i64* %ptr, align 4 is a scalar output dependency and is supposed to be legal to interchange, if we treat scalar output dependencies as interchange-preventing, this test would fail. Similar situation occurs for lcssa_05() and lcssa_06() in Transforms/LoopInterchange/lcssa.ll. | 
Updated test lcssa_08() by adding -da-disable-delinearization-check such that it could be interchanged.
| llvm/test/Transforms/LoopInterchange/lcssa-preheader.ll | ||
|---|---|---|
| 5 ↗ | (On Diff #407746) | 
 Does that mean this patch does not solve the motivating problem (pr48057) either? If that's the case, I think we should try to understand the problem better before proceeding further. I noticed a different miscompile due to incorrect handling of scalar dependencies and reported it here: https://github.com/llvm/llvm-project/issues/54176. Scalar dependencies are special in that they carry across all iterations of the loop, so they must be handled with care. For correctness, I think we should treat them more conservatively by default and only allow interchange when it can be proven to be safe. I'm planning to take a deeper look into the legality logic of loop interchange later this week. | 
With the recent updates we are no longer losing test coverage. Per discussion in the Loop Opt WG call, I'll approve to get this in as a small incremental improvement. I'll look into https://github.com/llvm/llvm-project/issues/54176 and follow up with more patches.
clang-format: please reformat the code
- if (!LIL.canInterchangeLoops(InnerLoopId, OuterLoopId, DependencyMatrix, DT)) { + if (!LIL.canInterchangeLoops(InnerLoopId, OuterLoopId, DependencyMatrix, + DT)) {