This is an archive of the discontinued LLVM Phabricator instance.

[Loop-Interchange] Allow inner-loop only reductions
Needs ReviewPublic

Authored by quic_aankit on Feb 16 2023, 3:30 PM.

Details

Summary
Motivating example:

    for (j = 0; j < n; j++) {
        for (k = 0; k < l; k++) {
            z[j] += x[k]*y[n*k+j];
        }
    }

In the code above we should be able to use LoopInterchange to interchange the
j and k loop as it allows the innermost loop to be vectorized. However in the
current state the LICM promotes z[j] to scalar effective changing the code to:

    for (j = 0; j < n; j++) {
        int tmp = z[j];
        for (k = 0; k < l; k++) {
            tmp += x[k]*y[n*k+j];
        }
        z[j] = tmp;
    }

After LICM the loops are not tightly nested and hence LoopInterchange cannot
work.

The patch aims to look for removablePHIs in the InnerLoop Header
and sink the load and hoist store of LICM promoted variables if they
help in loop-interchange.

Used lit-test from fhahn's patch here: https://reviews.llvm.org/D53027?vs=on&id=168822#toc

Diff Detail

Event Timeline

quic_aankit created this revision.Feb 16 2023, 3:30 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2023, 3:30 PM
quic_aankit requested review of this revision.Feb 16 2023, 3:30 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2023, 3:30 PM
quic_aankit retitled this revision from Allow inner-loop only reductions to [Loop-Interchange] Allow inner-loop only reductions.Feb 16 2023, 3:36 PM
quic_aankit edited the summary of this revision. (Show Details)Feb 16 2023, 3:40 PM

@fhahn @eli.friedman Can you please review this patch?

efriedma added subscribers: ram-NK, Narutoworld.

Adding more reviewers. (Better for someone who's spent more time with this pass recently to look.)

quic_aankit added a comment.EditedMar 23 2023, 11:24 PM

Ping! Thank you @efriedma for adding more reviewers

Meinersbur added a comment.EditedApr 26 2023, 12:24 PM

Can you summarize the algorithm in the description, in particular what a "removablePHIs" is and what distinguishes a inner-only reduction from an inner reduction?

llvm/test/Transforms/LoopInterchange/inner-only-reductions.ll
149

[nit] unrelated change

In addition to Michael's comment, I'd like to suggest several comments:

  1. This patch "undo" LICM within loop interchange in order to form a tightly nest loopnest. This sounds like a phase ordering problem, i.e., you could just place loop interchange before LICM and then you'll be able to interchange the loopnest. If it is a phase ordering problem then I'm not sure if it makes perfect sense to undo pass A within pass B, because things could quickly get messy if we choose to develop in this way.
  1. Also, have you considered the "loopnest" version of loop passes? For example if you use the loopnest invariant code motion (LNICM) pass instead of LICM pass in the pipeline, you'll not hoist z[j] into int tmp = z[j] (in the example from your summary), because LNICM will only do hoisting if a variable is invariant across the entire loopnest.
  1. In D53027, support for inner-only reductions is removed because of miscompilation of certain cases, as well as profitability concerns since interchange would move loads and stores from the outer loop into the inner loop. Have you thought about addressing these problems?

One possible miscompilation, as mentioned in https://reviews.llvm.org/D53027#1272786, is that it would miscompile if we interchange the following code, given that y is a global variable.

for (unsigned i = 0; i < N; i++) {
  y = 0;
  for (unsigned j = 0; j < N; j++) {
    A[i][j] += 1;
    y += A[i][j];
  }
}
congzhe added a project: Restricted Project.Apr 26 2023, 8:04 PM

Summarize the algorithm in the description, in particular what a "removablePHIs" is and what distinguishes a inner-only reduction from an inner reduction?

@Meinersbur. Thank you for the review.

By inner-only reduction, I mean that the outer loop is not involved in the
reduction at all. For eg:

for (j = 0; j < n; j++) {
    for (k = 0; k < l; k++) {
        z[j] += x[k]*y[n*k+j];
    }
}

In the code above, reduction is performed on each z[j] only by the inner k loop. The
outer loop is not involved in the reduction at all. Loop interchange can handle such
cases if the load and store of z[j] remain within the inner loop. But due to LICM, the load
and store are moved into the outer loop, thereby creating a PHI in the inner-loop that
is not part of reduction across the outerloop. To handle this, we sink the loads
and hoist the stores to the inner loop. This eliminates the inner loop PHI and also
makes the two loops tightly nested, allowing us to loop interchange.

In the patch we try to keep track of such Load-LoadPHI-StorePHI-Store chains.
In places where we check if we can handle a PHI, we match if it's a "removablePHI" -a LoadPHI
or the StorePHI that can be removed by moving the loads and stores inwards

In the end, if we do decide that loop-interchange is profitable we move the loads
and store inwards and let LoopInterchange do the rest.

In addition to Michael's comment, I'd like to suggest several comments:

  1. This patch "undo" LICM within loop interchange in order to form a tightly nest loopnest. This sounds like a phase ordering problem, i.e., you could just place loop interchange before LICM and then you'll be able to interchange the loopnest. If it is a phase ordering problem then I'm not sure if it makes perfect sense to undo pass A within pass B, because things could quickly get messy if we choose to develop in this way.
  1. Also, have you considered the "loopnest" version of loop passes? For example if you use the loopnest invariant code motion (LNICM) pass instead of LICM pass in the pipeline, you'll not hoist z[j] into int tmp = z[j] (in the example from your summary), because LNICM will only do hoisting if a variable is invariant across the entire loopnest.
  1. In D53027, support for inner-only reductions is removed because of miscompilation of certain cases, as well as profitability concerns since interchange would move loads and stores from the outer loop into the inner loop. Have you thought about addressing these problems?

One possible miscompilation, as mentioned in https://reviews.llvm.org/D53027#1272786, is that it would miscompile if we interchange the following code, given that y is a global variable.

for (unsigned i = 0; i < N; i++) {
  y = 0;
  for (unsigned j = 0; j < N; j++) {
    A[i][j] += 1;
    y += A[i][j];
  }
}

@congzhe Thank you for the review.

  1. If we just force a run of LoopInterchangePass before LICM then I do see the interchanged loop, but then if the programmer tries to write the program in a different manner, it might not work. For eg:
for (j = 0; j < n; j++) {
    for (k = 0; k < l; k++) {
        z[j] += x[k]*y[n*k+j];
    }
}

would work, but this would not:

for (j = 0; j < n; j++) {
    int tmp = z[j];
    for (k = 0; k < l; k++) {
        tmp += x[k]*y[n*k+j];
    }
    z[j] = tmp;
}
  1. Arguement 1 holds here too, if the programmer writes the program a little differently then the interchange won't happen.
  1. I don't think there should be miscompiles as we are just undoing LICM only in very specific cases. With regards to profitability, we still go through the same profitability analysis The mentioned code does not miscompile. Here are the debug logs:
	Processing LoopList of size = 2
	Found 2 Loads and Stores to analyze
	Found anti dependency between Src and Dst
	Src:  %1 = load i32, ptr %arrayidx5, align 4, !tbaa !5
	Dst:  store i32 %add, ptr %arrayidx5, align 4, !tbaa !5
	Processing InnerLoopId = 1 and OuterLoopId = 0
	Inner loop PHI is not part of reductions across the outer loop.
	Only inner loops with induction or reduction PHI nodes are supported currently.
	Not legal because of current transform limitation
	Not interchanging loops. Cannot prove legality.

I haven't done an extensive analysis of the performance characteristics of this
change. Can you please suggest some way on how I can check if this patch causes
any degradations? The pass is OFF by default unless one enables it explicitly.

Meinersbur added inline comments.May 17 2023, 12:48 PM
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
275–280

Can you document this data structure and its fields?

385

I think "RemovablePHI" is the wrong name. The PHIs are not just removed, but demoted. Suggestion: "SinkablePHIs" ("sink" as in "sink into loop body").

561

In the loop optimization group call we discussed whether the CacheCost analysis will have the wrong result because it only sees the promoted reduction register, not the additional load/stores in the innermost loop.

After thinking about it, CacheCost is defined in number of touched cache lines: This does not change for the outer loop, and the inner loop only accesses the same cache line repeatedly, so I don't think it matters.

725

[nit] unrelated change

871

There are other instructions that can access memory, e.g. CallInst (as in memset(...)). tightlyNested might already check for such instructions; if so, please add a comment about this here.

872

DI might not be necessary there. Since the idea is to undo what LICM does, you can use the same-strength analysis as LICM. LICM uses AliasSetTracker and MemorySSA, i.e. just asking AliasAnalysis/MemorySSA whether the pointers cannot alias (which DI does internally as well) should be sufficient.

880

++ modifies the iterator. Might have consequences if getIterator() returns a reference.

909

If I am not mistaken, there is no previous check whether all accesses are simple already. That is, it will crash ind debug builds if someone uses e.g. std::atomic. I think it should also return False in this case.

914–916

[style]

934

Leftover code

970

It seems unnecessary to clone the instruction and create a new one? Why not using the existing instruction and move it into the loop?

979

[nit] Leftover code

982–984

Shouldn't the LoopExitPHI be removed as well?

1085

Unless you removed it yourself, shouldn't all instructions have a parent?

@Meinersbur Sorry I was on vacation for a couple of months. I've made the changes you recommended. PTAL

quic_aankit marked 6 inline comments as done.Jul 12 2023, 1:21 PM
quic_aankit added inline comments.
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
871

You are right. TightlyNested will later check for presence of other instructions that may read or write to memory and we do not proceed with loop interchange in that case.

909

At this time, populateDependencyMatrix has already ensured that all the loads and stores are simple.

970

Not sure why I was cloning earlier. Fixed it.

982–984

Yep. It should have been if there are no users. Fixed it.

1085

Seems like a mistake I made earlier. Fixed.

quic_aankit marked an inline comment as done.

@Meinersbur Ping! Is it possible to get this patch in before the release/17.x branch is created?

Ping for review!