Motivating example:

```c
for (j = 0; j < n; j++) {
  for (k = 0; k < l; k++) {
    z[j] += x[k] * y[n*k + j];
  }
}
```

In the code above we should be able to use LoopInterchange to interchange the j and k loops, as that allows the innermost loop to be vectorized. However, in the current state LICM promotes z[j] to a scalar, effectively changing the code to:

```c
for (j = 0; j < n; j++) {
  int tmp = z[j];
  for (k = 0; k < l; k++) {
    tmp += x[k] * y[n*k + j];
  }
  z[j] = tmp;
}
```

After LICM the loops are no longer tightly nested, so LoopInterchange cannot work. The patch looks for removablePHIs in the inner loop header and sinks the loads and hoists the stores of LICM-promoted variables when doing so helps loop interchange. Used the lit test from fhahn's patch here: https://reviews.llvm.org/D53027?vs=on&id=168822#toc
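For reference, a hedged sketch of the nest this patch is meant to enable (the function wrapper and element types are illustrative additions, not part of the original example):

```cpp
// Illustrative only: the j/k nest from the summary after interchange.
// The innermost j loop now walks y with stride 1 and updates z[j]
// without a loop-carried scalar, so it is vectorizable.
void interchanged(int n, int l, int *z, const int *x, const int *y) {
  for (int k = 0; k < l; k++)
    for (int j = 0; j < n; j++)
      z[j] += x[k] * y[n * k + j];
}
```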
Unit Tests
Time | Test
---|---
70 ms | x64 debian > Flang.Lower/OpenACC::acc-reduction.f90
Event Timeline
@fhahn I used the lit test from your patch, here: https://reviews.llvm.org/D53027?vs=on&id=168822#toc
Adding more reviewers. (Better for someone who's spent more time with this pass recently to look.)
Can you summarize the algorithm in the description, in particular what a "removablePHI" is and what distinguishes an inner-only reduction from an inner reduction?
Line | llvm/test/Transforms/LoopInterchange/inner-only-reductions.ll
---|---
149 | [nit] unrelated change
In addition to Michael's comment, I'd like to add several comments:
- This patch "undoes" LICM within loop interchange in order to form a tightly nested loopnest. This sounds like a phase ordering problem, i.e., you could just place loop interchange before LICM and then you'd be able to interchange the loopnest. If it is a phase ordering problem then I'm not sure it makes sense to undo pass A within pass B, because things could quickly get messy if we choose to develop in this way.
- Also, have you considered the "loopnest" version of loop passes? For example, if you use the loopnest invariant code motion (LNICM) pass instead of the LICM pass in the pipeline, z[j] will not be hoisted into int tmp = z[j] (in the example from your summary), because LNICM only hoists a variable if it is invariant across the entire loopnest.
- In D53027, support for inner-only reductions was removed because of miscompilation of certain cases, as well as profitability concerns, since interchange would move loads and stores from the outer loop into the inner loop. Have you thought about addressing these problems?
One possible miscompilation, as mentioned in https://reviews.llvm.org/D53027#1272786: interchanging the following code would miscompile it, given that y is a global variable.

```c
for (unsigned i = 0; i < N; i++) {
  y = 0;
  for (unsigned j = 0; j < N; j++) {
    A[i][j] += 1;
    y += A[i][j];
  }
}
```
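To spell out the hazard (my own sketch of a naive interchange, not output from the pass): in the original nest y is reset at the start of every i iteration, so it finishes as the sum of the last row of A; after interchange there is no per-i point at which to reset y, so any placement of y = 0 yields a different final value.

```cpp
// Hypothetical naively interchanged nest (illustrative): with "y = 0"
// hoisted out entirely, the global y accumulates over the whole matrix
// instead of over the last row, changing the observable result.
y = 0;
for (unsigned j = 0; j < N; j++) {
  for (unsigned i = 0; i < N; i++) {
    A[i][j] += 1;
    y += A[i][j];
  }
}
```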
@Meinersbur Thank you for the review.
By an inner-only reduction, I mean that the outer loop is not involved in the reduction at all. For example:

```c
for (j = 0; j < n; j++) {
  for (k = 0; k < l; k++) {
    z[j] += x[k] * y[n*k + j];
  }
}
```
In the code above, the reduction on each z[j] is performed only by the inner k loop; the outer loop is not involved in the reduction at all. Loop interchange can handle such cases if the load and store of z[j] remain within the inner loop. But due to LICM, the load and store are moved into the outer loop, creating a PHI in the inner loop that is not part of a reduction across the outer loop. To handle this, we sink the loads and hoist the stores back into the inner loop. This eliminates the inner-loop PHI and also makes the two loops tightly nested, allowing the loops to be interchanged.
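For contrast, a minimal sketch (reusing the names from the example above) of a reduction that is not inner-only, i.e. one that is also carried across the outer loop:

```cpp
// Illustrative: s accumulates across BOTH loops, so the inner-header PHI
// is part of a reduction over the outer loop as well. LoopInterchange
// already models this case; the patch targets the z[j] case above, where
// the outer loop carries no part of the reduction.
int s = 0;
for (int j = 0; j < n; j++)
  for (int k = 0; k < l; k++)
    s += x[k] * y[n * k + j];
```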
In the patch we keep track of such Load-LoadPHI-StorePHI-Store chains. In the places where we check whether we can handle a PHI, we match whether it is a "removablePHI", i.e. a LoadPHI or a StorePHI that can be removed by moving the loads and stores inwards. In the end, if we decide that loop interchange is profitable, we move the loads and stores inwards and let LoopInterchange do the rest.
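Roughly, matching such a chain could look like the sketch below; the helper name and the exact checks are illustrative assumptions on my part, not the patch's actual code, and real code would need more guards (multiple exits, non-LCSSA forms, aliasing):

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Hypothetical helper: does InnerPHI look like the header PHI of a
// Load-LoadPHI-StorePHI-Store chain left behind by LICM promotion?
static bool looksLikePromotedChain(PHINode *InnerPHI, Loop *Inner) {
  BasicBlock *Preheader = Inner->getLoopPreheader();
  BasicBlock *Latch = Inner->getLoopLatch();
  BasicBlock *Exit = Inner->getExitBlock();
  if (!Preheader || !Latch || !Exit || InnerPHI->getNumIncomingValues() != 2)
    return false;

  // The PHI's start value must be a load hoisted out of the inner loop.
  auto *Ld = dyn_cast<LoadInst>(InnerPHI->getIncomingValueForBlock(Preheader));
  if (!Ld || Inner->contains(Ld))
    return false;

  // The recurrence on the backedge must be live-out through an LCSSA PHI
  // whose only user is a store back to the same pointer, sunk by LICM.
  Value *Recur = InnerPHI->getIncomingValueForBlock(Latch);
  for (User *U : Recur->users()) {
    auto *ExitPHI = dyn_cast<PHINode>(U);
    if (!ExitPHI || ExitPHI->getParent() != Exit || !ExitPHI->hasOneUser())
      continue;
    auto *St = dyn_cast<StoreInst>(*ExitPHI->user_begin());
    if (St && !Inner->contains(St) &&
        St->getPointerOperand() == Ld->getPointerOperand())
      return true;
  }
  return false;
}
```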
@congzhe Thank you for the review.
- If we just force a run of LoopInterchangePass before LICM then I do see the interchanged loop, but if the programmer writes the program in a different manner it might not work. For example:

```c
for (j = 0; j < n; j++) {
  for (k = 0; k < l; k++) {
    z[j] += x[k] * y[n*k + j];
  }
}
```

would work, but this would not:

```c
for (j = 0; j < n; j++) {
  int tmp = z[j];
  for (k = 0; k < l; k++) {
    tmp += x[k] * y[n*k + j];
  }
  z[j] = tmp;
}
```
- Argument 1 holds here too: if the programmer writes the program a little differently, the interchange won't happen.
- I don't think there should be miscompiles, as we are undoing LICM only in very specific cases. With regards to profitability, we still go through the same profitability analysis. The mentioned code does not miscompile. Here are the debug logs:

```
Processing LoopList of size = 2
Found 2 Loads and Stores to analyze
Found anti dependency between Src and Dst
 Src: %1 = load i32, ptr %arrayidx5, align 4, !tbaa !5
 Dst: store i32 %add, ptr %arrayidx5, align 4, !tbaa !5
Processing InnerLoopId = 1 and OuterLoopId = 0
Inner loop PHI is not part of reductions across the outer loop.
Only inner loops with induction or reduction PHI nodes are supported currently.
Not legal because of current transform limitation
Not interchanging loops. Cannot prove legality.
```
I haven't done an extensive analysis of the performance characteristics of this change. Can you suggest a way to check whether this patch causes any degradations? The pass is OFF by default unless one enables it explicitly.
Line | llvm/lib/Transforms/Scalar/LoopInterchange.cpp
---|---
275–280 | Can you document this data structure and its fields?
385 | I think "RemovablePHI" is the wrong name. The PHIs are not just removed, but demoted. Suggestion: "SinkablePHIs" ("sink" as in "sink into loop body").
561 | In the loop optimization group call we discussed whether the CacheCost analysis will have the wrong result because it only sees the promoted reduction register, not the additional loads/stores in the innermost loop. After thinking about it, CacheCost is defined in terms of the number of touched cache lines: this does not change for the outer loop, and the inner loop only accesses the same cache line repeatedly, so I don't think it matters.
725 | [nit] unrelated change
871 | There are other instructions that can access memory, e.g. CallInst (as in memset(...)). tightlyNested might already check for such instructions; if so, please add a comment about this here.
872 | DI might not be necessary there. Since the idea is to undo what LICM does, you can use the same-strength analysis as LICM. LICM uses AliasSetTracker and MemorySSA, i.e. just asking AliasAnalysis/MemorySSA whether the pointers cannot alias (which DI does internally as well) should be sufficient.
880 | ++ modifies the iterator. Might have consequences if getIterator() returns a reference. (See the sketch after this table.)
909 | If I am not mistaken, there is no previous check that all accesses are simple. That is, it will crash in debug builds if someone uses e.g. std::atomic. I think it should also return false in this case.
914–916 | [style]
934 | Leftover code
970 | It seems unnecessary to clone the instruction and create a new one. Why not use the existing instruction and move it into the loop?
979 | [nit] Leftover code
982–984 | Shouldn't the LoopExitPHI be removed as well?
1085 | Unless you removed it yourself, shouldn't all instructions have a parent?
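To illustrate the iterator concern at line 880 (a minimal sketch; `I` stands for whatever instruction the code starts from):

```cpp
#include <cassert>
#include <iterator>
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

void iteratorExample(Instruction *I) {
  // std::next copies the iterator, leaving the original value available.
  BasicBlock::iterator Next = std::next(I->getIterator());
  // By contrast, ++ advances the iterator object itself.
  BasicBlock::iterator It = I->getIterator();
  ++It; // It now equals Next; the pre-increment value is gone.
  assert(It == Next);
}
```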
@Meinersbur Sorry I was on vacation for a couple of months. I've made the changes you recommended. PTAL
Line | llvm/lib/Transforms/Scalar/LoopInterchange.cpp
---|---
871 | You are right. tightlyNested will later check for the presence of other instructions that may read or write memory, and we do not proceed with loop interchange in that case.
909 | At this time, populateDependencyMatrix has already ensured that all the loads and stores are simple.
970 | Not sure why I was cloning earlier. Fixed it.
982–984 | Yep. It should have been "if there are no users". Fixed it.
1085 | Seems like a mistake I made earlier. Fixed.
@Meinersbur Ping! Is it possible to get this patch in before the release/17.x branch is created?