This is an archive of the discontinued LLVM Phabricator instance.

[LoopInterchange] Support loop interchange with floating point reductions
ClosedPublic

Authored by congzhe on Jan 16 2022, 8:50 PM.

Details

Summary

Enabled loop interchange support for floating point reductions if fastmath is enabled.

Previously when we encouter a floating point PHI node in the outer loop exit block, we bailed out since we could not detect floating point reductions in the early days. Now we remove this limiation since we are able to detect floating point reductions.

Diff Detail

Unit TestsFailed

Event Timeline

congzhe created this revision.Jan 16 2022, 8:50 PM
congzhe requested review of this revision.Jan 16 2022, 8:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 16 2022, 8:50 PM
congzhe edited the summary of this revision. (Show Details)Jan 16 2022, 8:51 PM
Meinersbur added inline comments.
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
903

I couldn't find a reference for unsafe-fp-math, but I would have guessed that it forces all instructions to be fast-math, even if not marked as fast. If already marked as fast, then no additional tag unsafe-fp-math should be necessary.

Instead of isFast, would allowReassoc be sufficient (or whatever flag controls commutativity)?

The logic here will check only one fp instruction. What if two instructions are involved? Such as:

  for (...)
    for (...) {
      sum += A[i];
      sum += B[j];
   }
print(sum);
congzhe updated this revision to Diff 401088.Jan 18 2022, 8:44 PM
congzhe added inline comments.Jan 18 2022, 9:04 PM
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
903

Thanks a lot for the review! I updated the patch accordingly.

Instead of using isFast(), I now use hasAllowReassoc().

Instead of only checking the variable obtained from followLCSSA(&PHI), now I added a simple data flow analysis areAllInstsReassoc() that checks whether all insturctions involved in the FP reduction allow reassociation. Hopefully this is sufficient for the two-fp-instruction case you provided (and cases where multiple fp instructions are involved).

Regarding "unsafe-fp-math" VS "fast": thank you and I see your point. And yes, I did see cases where the function has the "unsafe-fp-math" attribute but its instructions do not (please see example 1 below). Nevertheless I did also see test cases where instructions are marked "fast" but the surrounding function does not have the "unsafe-fp-math" attribute (please see example 2 below). It seems like there is some minor inconsistency between the attribute and the flag, so I just wanted to be conservatively correct by checking both -- if we have the attribute then that's fine, otherwise we continue checking instruction flags. However, if you think the logic can be simpilfied I'll be glad to update the code.

Example 1 (clang/test/CodeGen/fp-options-to-fast-math-flags.c):

Note that the fast flag is not generated even if compiled with -ffast-math.

float test(float a) {
  return a + fn(a);
}

// CHECK-FAST: [[CALL_RES:%.+]] = call reassoc nnan ninf nsz arcp afn float @fn(float noundef {{%.+}})
// CHECK-FAST: {{%.+}} = fadd reassoc nnan ninf nsz arcp afn float {{%.+}}, [[CALL_RES]]

Example 2 (llvm/test/Transforms/InstSimplify/ConstProp/math-1.ll):

define double @f_acos() {
; CHECK-LABEL: @f_acos(
; CHECK-NEXT:    ret double 0.000000e+00
;
  %res = tail call fast double @acos(double 1.0)
  ret double %res
}

There is the function findInnerReductionPhi that calls RecurrenceDescriptor::isReductionPHI to determine whether an operation is a reduction and also already checks fast-math flags. Isn't that check already sufficient?

Instead of using isFast(), I now use hasAllowReassoc().

I am not a floating-point-expert, so I was asking which flags are actually required. I think at least "nsz" would be necessary are well because (+0) * (-0) is (+0) but (-0) * (+0) is (-0). See https://en.wikipedia.org/wiki/Signed_zero. Similar non-commutativity rules may apply to NaN/Inf. When in doubt, isFast should be the safe option.

Regarding "unsafe-fp-math" VS "fast":

IIRC, the per-operation flags have superseded the per-function unsafe-fp-math and the latter only used for compatibility. Please look up a reference or how other passes handle this. See e.g. https://lists.llvm.org/pipermail/llvm-dev/2012-October/054980.html. Unfortunately it doesn't mention how to handle unsafe-fp-math.

congzhe updated this revision to Diff 402606.Jan 24 2022, 11:40 AM
congzhe added a reviewer: Meinersbur.

@Meinersbur Thank you for the comments, I've updated the patch accordingly.

I've made use of RecurrenceDescriptor::isReductionPHI, especially RecurrenceDescriptor::getExactFPMathInst() to determine whether it is safe to reorer FP instructions. This is similar to what we've done in loop vectorization.

However, it seems that there is a bug in RecurrenceDescriptor, as it fails to detect cases where there's two fadd instructions involved in the FP reduction and only one of them has the "fast" flag (please refer to test5 in reductions-across-inner-and-outer-loop.ll below). In this case RecurrenceDescriptor::getExactFPMathInst() is supposed to return the fadd instruction that does not have the "fast" flag meaning it is unsafe to reorder them, but it returns NULL meaning it determines it is safe to reorder them.

I've changed IVDescriptor.cpp to fix the bug and temporarily included the change in this patch, but I'll post this part of change as another phabricator patch and rebase this patch on top.

Instead of using isFast(), I now use hasAllowReassoc().

Since we now use helper functions from IVDescriptor, this is less of an issue for now. Nevertheless in RecurrenceDescriptor::isRecurrenceInstr() it seems like they only check I->hasAllowReassoc().

Regarding "unsafe-fp-math" VS "fast":

it does look like "unsafe-fp-math" is deprecated and it is not dealt with by any mid-end passes, so I removed the handling of "unsafe-fp-math".

The IVDescriptor patch was posted here:
https://reviews.llvm.org/D118073.

congzhe added a comment.EditedJan 26 2022, 9:16 AM

Hi Michael @Meinersbur, I just checked that if we remove the check

if (PHI.getType()->isFloatingPointTy() &&
        RedDesc[followLCSSA(&PHI)].getExactFPMathInst() != nullptr)

then test5() in reductions-across-inner-and-outer-loop.ll would fail, the loop would be interchanged although reordering should not be allowed.

As a follow-up to our discussion: I understand your point that RecurrenceDescriptor::isReductionPHI() is supposed to return true if FP reductions can be reordered, and false otherwise. Nevertheless as the above test case showed, it may not work as expected (please correct me if I'm mistaken) . If we take the loop vectorization pass as an example, the way it decides whether reductions can be reordered is as follows (in function LoopVectorizationLegality::canVectorizeInstrs()):

RecurrenceDescriptor RedDes;
if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes, DB, AC,
                                                 DT)) {
  Requirements->addExactFPMathInst(RedDes.getExactFPMathInst());
  AllowedExit.insert(RedDes.getLoopExitInstr());
  Reductions[Phi] = RedDes;
  continue;
}

Similarly they also used getExactFPMathInst() to get the 1st FP instruction that does not allow reordering, as an indication whether the FP reduction can be reordered, even if RecurrenceDescriptor::isReductionPHI() returns true. So probably the isReductionPHI() did not work as we expected. If you wish, I'll be glad to modify this API to make it return true if FP reductions can be reordered and false otherwise, and change the code in vectorization as well.

I'd very much appreciate it if you could let me know your comments.

congzhe updated this revision to Diff 404134.Jan 28 2022, 12:27 PM

Rebased for now on my IVDescriptor patch https://reviews.llvm.org/D118073.

I applied the patch myself the check.

The other uses of IVDescriptors is LoopVectorizationLegality.

if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes, DB, AC,
                                         DT)) {
  Requirements->addExactFPMathInst(RedDes.getExactFPMathInst());
  AllowedExit.insert(RedDes.getLoopExitInstr());
  Reductions[Phi] = RedDes;
  continue;
}

addExactFPMathInst as the comments:

/// This holds vectorization requirements that must be verified late in
/// the process. The requirements are set by legalize and costmodel. Once
/// vectorization has been determined to be possible and profitable the
/// requirements can be verified by looking for metadata or compiler options.
/// For example, some loops require FP commutativity which is only allowed if
/// vectorization is explicitly specified or if the fast-math compiler option
/// has been provided.
/// Late evaluation of these requirements allows helpful diagnostics to be
/// composed that tells the user what need to be done to vectorize the loop. For
/// example, by specifying #pragma clang loop vectorize or -ffast-math. Late
/// evaluation should be used only when diagnostics can generated that can be
/// followed by a non-expert user.
class LoopVectorizationRequirements {
public:
  /// Track the 1st floating-point instruction that can not be reassociated.
  void addExactFPMathInst(Instruction *I);

I suggest to do the same with LoopInterchange. That is, RD.getFastMathFlags() contains requirements that still needs to be checked.

I applied the patch myself the check.

The other uses of IVDescriptors is LoopVectorizationLegality.

if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes, DB, AC,
                                         DT)) {
  Requirements->addExactFPMathInst(RedDes.getExactFPMathInst());
  AllowedExit.insert(RedDes.getLoopExitInstr());
  Reductions[Phi] = RedDes;
  continue;
}

addExactFPMathInst as the comments:

/// This holds vectorization requirements that must be verified late in
/// the process. The requirements are set by legalize and costmodel. Once
/// vectorization has been determined to be possible and profitable the
/// requirements can be verified by looking for metadata or compiler options.
/// For example, some loops require FP commutativity which is only allowed if
/// vectorization is explicitly specified or if the fast-math compiler option
/// has been provided.
/// Late evaluation of these requirements allows helpful diagnostics to be
/// composed that tells the user what need to be done to vectorize the loop. For
/// example, by specifying #pragma clang loop vectorize or -ffast-math. Late
/// evaluation should be used only when diagnostics can generated that can be
/// followed by a non-expert user.
class LoopVectorizationRequirements {
public:
  /// Track the 1st floating-point instruction that can not be reassociated.
  void addExactFPMathInst(Instruction *I);

I suggest to do the same with LoopInterchange. That is, RD.getFastMathFlags() contains requirements that still needs to be checked.

Thanks Michael, this is what I described as well (I updated my previous reply, just in case you did not notice: https://reviews.llvm.org/D117450#3272927).

Are you suggesting me to modify RedDesc[followLCSSA(&PHI)].getExactFPMathInst() != nullptr to something like !RedDesc[followLCSSA(&PHI)].getFastMathFlags().isfast() in areOuterLoopExitPHIsSupported()?

Thanks Michael, this is what I described as well (I updated my previous reply, just in case you did not notice: https://reviews.llvm.org/D117450#3272927).

Sorry, I missed that. Probably because I had the Phabricator page already loaded when you submitted the comment.

Are you suggesting me to modify RedDesc[followLCSSA(&PHI)].getExactFPMathInst() != nullptr to something like !RedDesc[followLCSSA(&PHI)].getFastMathFlags().isfast() in areOuterLoopExitPHIsSupported()?

I suggest this in findInnerReductionPhi:

if (RecurrenceDescriptor::isReductionPHI(PHI, L, RD)) {
  if (RD.needsExactFPMath())
    return nullptr;
  return PHI;
}

It might be more fine-grained using RD.getFastMathFlags() and check against compiler flags (if RecurrenceDescriptor did not already take them into account). LoopVectorize has additional hints/compiler flags that act like -ffast-math just for vectorization.

congzhe updated this revision to Diff 405526.Feb 2 2022, 10:15 PM
congzhe added a comment.EditedFeb 2 2022, 10:34 PM

I suggest this in findInnerReductionPhi:

if (RecurrenceDescriptor::isReductionPHI(PHI, L, RD)) {
  if (RD.needsExactFPMath())
    return nullptr;
  return PHI;
}

Thanks, I've updated the patch accordingly. I used RD.getExactFPMathInst() instead, since needsExactFPMath() does not belong to RecurrenceDescriptor.

It might be more fine-grained using RD.getFastMathFlags() and check against compiler flags (if RecurrenceDescriptor did not already take them into account). LoopVectorize has additional hints/compiler flags that act like -ffast-math just for vectorization.

I do agree that it would be more fine-grained using RD.getFastMathFlags() and check against flags. I see SLP checks noNaNs() for FMax/Fmin, and loop vectorizer relies on getExactFPMathInst() and additional hints. I'm thinking maybe at the moment we could just rely on RD.getExactFPMathInst() to be safe, and we could make it finer-grained later. I'd appreciate it if you could let me know your thoughts.

Meinersbur accepted this revision.Feb 4 2022, 9:58 AM

Thanks, I've updated the patch accordingly. I used RD.getExactFPMathInst() instead, since needsExactFPMath() does not belong to RecurrenceDescriptor.

Correct, unfortunately. I found needsExactFPMath more descriptive. If you agree, you could a definition just like InstDesc has.

I do agree that it would be more fine-grained using RD.getFastMathFlags() and check against flags. I see SLP checks noNaNs() for FMax/Fmin, and loop vectorizer relies on getExactFPMathInst() and additional hints. I'm thinking maybe at the moment we could just rely on RD.getExactFPMathInst() to be safe, and we could make it finer-grained later.

Agreed.

LGTM. Please address the nitpicks before committing.

Thank you for your contribution!

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
885

[nit] unrelated whitespace change

899–900

[style] In my interpretation of the coding standard, the braces should stay here. See the "Use braces for the outer if since the nested for is braced." example.

This revision is now accepted and ready to land.Feb 4 2022, 9:58 AM
congzhe updated this revision to Diff 406279.Feb 6 2022, 1:50 PM

Thanks for all the comments, I've addressed them and will land the patch shortly.

This revision was landed with ongoing or failed builds.Feb 6 2022, 2:09 PM
This revision was automatically updated to reflect the committed changes.