This is an archive of the discontinued LLVM Phabricator instance.

[Transforms][LICM] Add the ability to undo unprofitable reassociation
ClosedPublic

Authored by pawosm01 on Jun 6 2023, 9:09 AM.

Details

Summary

Consider the following piece of code:

void innermost_loop(int i, double d1, double d2, double delta, int n, double cells[n])
{
  int j;
  const double d1d = d1 * delta;
  const double d2d = d2 * delta;

  for (j = 0; j <= i; j++)
    cells[j] = d1d * cells[j + 1] + d2d * cells[j];
}

When compiling at -Ofast level, after the "Reassociate expressions"
pass, this code is transformed into an equivalent of:

int j;

for (j = 0; j <= i; j++)
  cells[j] = (d1 * cells[j + 1] + d2 * cells[j]) * delta;

Effectively, the computation of those loop invariants isn't done
before the loop anymore, we have one extra multiplication on each
loop iteration instead. Sadly, this results in a significant
performance hit.

Similarly, specifically crafted user code will also experience
inability to hoist those invariants.

This patch is solving this issue by adding the ability to undo such
reassociation into the LICM pass. Note that for doing such
transformation this pass requires the same conditions as the
"Reassociate expressions" pass, namely, the involved binary operators
must have the reassociations allowed (e.g. by specifying the fast
attribute) and they must have single use only.

Some parts of this patch were suggested by Nikita Popov.

Diff Detail

Event Timeline

pawosm01 created this revision.Jun 6 2023, 9:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2023, 9:09 AM
pawosm01 requested review of this revision.Jun 6 2023, 9:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2023, 9:09 AM
pawosm01 updated this revision to Diff 528981.Jun 6 2023, 12:09 PM

Making it slightly less readable, per clang-format request.

mgabka added a subscriber: mgabka.Jun 7 2023, 12:36 AM

Back to the drawing board: my patch can’t cope with IR emitted by flang-new: Instruction does not dominate all uses

pawosm01 updated this revision to Diff 531108.Jun 13 2023, 4:17 PM

Fixing the problem I've found while using this patch with a larger codebase.

pawosm01 updated this revision to Diff 533183.Jun 21 2023, 2:05 AM

I had to simplify this logic. After some time even I couldn't understand it.

huntergr added inline comments.Jul 4 2023, 2:00 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2762–2763

This seems a little odd, like you might be processing more candidates than you originally identified and want to prevent wrapping?

I think it might be better to see if you can add candidate operations to a SmallVector worklist and process them directly from that instead of walking back through the IR a second time.

qcolombet added inline comments.Jul 4 2023, 6:42 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2685

Could you add a comment on what this function is doing and what it means to return true or false?

It'll make the review easier.

pawosm01 updated this revision to Diff 537892.Jul 6 2023, 2:59 PM

Reaction to the review comments.

pawosm01 marked 2 inline comments as done.Jul 6 2023, 3:00 PM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2685

I'll try.

2762–2763

Yeah, that simplifies the code indeed.

huntergr added inline comments.Jul 14 2023, 2:26 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2713–2714

Can just 'return false;' here instead of taking the extra step before returning.

Same with the other checks in this loop.

2738

I don't think we need a count here; just a boolean 'CandidateFound' would suffice.

2750–2760

Using a loop over the ops with a break after the substitution is performed feels a bit unwieldy. Could you try recording the Ops in a known order in Changes (probably using std::swap as you did before), so you know that e.g. the first one is always loop invariant? I don't think we need to maintain the order of operations for an fmul so you could overwrite both, but if there is a good reason to maintain the order then you could just make the nested pair in changes a tuple instead and record the index of the invariant operand.

pawosm01 updated this revision to Diff 540712.Jul 15 2023, 11:26 AM
pawosm01 marked 2 inline comments as done.

post-review changes

pawosm01 marked 3 inline comments as done.Jul 15 2023, 11:56 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2713–2714

Yes, same in the other places.

2738

It isn't as easy as it seems. As it turns, this transformation shouldn't be applied for every situation where at least one candidate is found.

There are two problems here:

  1. a simple expression like inv1 * var1 * inv2 will be found as a candidate and transformed into var1 * (inv1 * inv2) where (inv1 * inv2) part will be hoisted. But this expression itself can be a part of a longer expression, e.g. inv1 * var1 * inv2 + inv1 * var2 * inv2 + inv1 * var3 * inv2 which before the introduction of my change could easily be transformed into (var1 + var2 + var3) * (inv1 * inv2) with (inv1 * inv2) hoisted. The transformation I'm proposing should rather focus on the expressions described in the comment block placed before this function which usually require undoing the actions of some previous pass (namely, the Reasociate pass), and leave simpler cases alone and make the existing test cases supporting them just pass without any awfully messy change. Hence, I've added AnyAdds variable to make sure the intended expression patterns are being found. The previous oversimlified condition eliminated such cases by a sheer coincidence.
  1. with the expression long enough and small enough trip count there is a possibility that the transformation leading to the hoisting can cause performance degradation. If most of the operations in a particularly long expression involves non-invariants, there will be more multiplications introduced than operations hoisted. By changing the type of the Candidates counter to a signed integer and decrementing it whenever there is no invariant in the operation (hence no candidate for hoisting), the applicability of this transformation can be limited to the situations where the numbers of non-invariant operations and hoisting candidates are at least equal. A crude estimate and naïve one, I can admit that, but anything better would overcomplicate the logic here, e.g. by introducing instruction cost estimation and trip count estimation, the cost of doing it can easily outgrow any benefits.
2750–2760

yeah, I've simplified this.

huntergr added inline comments.Jul 20 2023, 1:39 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2710–2712

It may be worthwhile considering an upper limit on the number of multiplies you support -- depending on the trip count of the loop, you may actually be introducing extra work for no gain (especially for a target with wide vectors that may complete everything in one vector iteration).

2738

Yes, I see that if you have multiplies with no loop-invariant operands in the chain of adds then you need to add more multiplies inside the loop, which seems to defeat the purpose of the optimization. So perhaps we should avoid performing this transformation if there is such a multiply in the chain.

It might be possible to perform it with only 1 such multiply since you'll be reordering the operations and may reduce the latency, but that would require benchmarking to confirm and is dependent on the workload and target, but I think the safe side is to bail out.

pawosm01 updated this revision to Diff 542461.Jul 20 2023, 5:46 AM
pawosm01 marked 3 inline comments as done.

Following the reviewer's suggestions.

pawosm01 marked 2 inline comments as done.Jul 20 2023, 5:48 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2710–2712

I've imposed a reasonable upper limit below.

This revision is now accepted and ready to land.Jul 20 2023, 6:08 AM
nikic added a subscriber: nikic.Jul 20 2023, 6:38 AM
nikic added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2685

deassociate -> reassociate

2697–2698

Here and elsewhere: Remove all unnecessary parentheses, apply de Morgan as needed.

2708

I think you just reinvented Use? That's how you reference a specific operand of an instruction.

pawosm01 updated this revision to Diff 542877.Jul 21 2023, 5:51 AM
pawosm01 marked an inline comment as done.

Post-review changes again.

pawosm01 marked 3 inline comments as done.Jul 21 2023, 5:53 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2708

Ok, I've changed it to make use of Use, I hope you'll like it.

pawosm01 marked an inline comment as done.Jul 21 2023, 6:03 AM
nikic added inline comments.Jul 21 2023, 6:45 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2725

This should be &Op->getOperandUse(Ops[L.isLoopInvariant(Ops[0]) ? 0 : 1]) or so. But I think it would be less awkward to do something like this, so you don't go back and forth between operand indices, Values and Uses:

if (Op->getOpcode() != Instruction::FMul) {
  if (OpNext->getOpcode() != Instruction::FMul)
    return false;
  std::swap(Op, OpNext);
}
if (!Op->hasOneUse() || !Op->hasAllowReassoc() || L.isLoopInvariant(Op))
  return false;
Use &U0 = Op->getOperandUse(0);
Use &U1 = Op->getOperandUse(1);
if (L.isLoopInvariant(U0))
  Changes.push_back(&U0);
else if (L.isLoopInvariant(U1))
  Changes.push_back(&U1);
else
  return false;
pawosm01 updated this revision to Diff 542923.Jul 21 2023, 8:09 AM

more post-review changes.

pawosm01 marked an inline comment as done.Jul 21 2023, 8:10 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2725

ok, thanks for your valuable input.

pawosm01 marked an inline comment as done.Jul 21 2023, 8:11 AM
nikic added inline comments.Jul 25 2023, 4:05 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2708

Do we need to require that the fadd is reassoc as well?

2742

This copies the FMF flags from the top-level fmul, but is that correct? We have multiple fadds and fmuls involved in the transform, why is it safe to use the flags from that one in particular?

(The test coverage for this is not great, because you just use "fast" everywhere.)

pawosm01 marked 2 inline comments as done.Jul 25 2023, 6:15 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2708

Yes, this match specifically requires that the expression is of the form that is described in the comment written above this function: ((A1 * B1) + (A2 * B2) + ...). I wrote this function to solve the problem at hand, to which I know that my solution provides noticeable performance benefit. Trying to make it more general would require a whole lot more research and performance impact analysis that would go far beyond what I wanted to fix. I'm offering this solution which I based on the other hoist... functions in this file, I believe if one finds a reason for extending this (to handle their own performance issue at hand), they could do that basing on what I did propose.

2742

Yes I assume it's correct. What is happening here is that the fmul instruction from which the flag is taken is effectively being copied to every subexpression in the parenthesis, it is reasonable to copy the fast flag too. As checking for associations being allowed is implicitly checking the fast flag, nothing harmful is happening here.

pawosm01 marked 2 inline comments as done.Jul 25 2023, 6:16 AM
nikic added inline comments.Jul 25 2023, 9:27 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2708

Let me rephrase this: Isn't this missing a reassoc check on the fadd? We check the root fmul above, and we check the inner fmuls below, but we don't seem to check anything about the fadd.

2738

I didn't quite get your argument in 1. for why it's a bad idea to transform inv1 * var1 * inv2 into var1 * (inv1 * inv2). Why is there a problem if it's part of a longer expression? Wouldn't we still get (var1 + var2 + var3) * (inv1 * inv2) as the end result if inv1 * inv2 is hoisted out of each part and then CSEd?

As far as I can tell, the transform should be profitable regardless of whether there is an fadd or not.

2742

The transform only requires the reassoc flag on each instruction, while this is also copying all other FMF flags. It's not obvious to me that this is correct.

One could probably make an argument along the lines of "if the final fmul is nnan, then that implies that the fadd chain result is not nan. As the result would be nan if any operand were nan, we can propagate the nnan flag up the chain". This sounds plausible. But I'm not sure the same logic holds up for ninf and other FMF flags.

pawosm01 updated this revision to Diff 544424.Jul 26 2023, 10:16 AM

more changes

pawosm01 marked 3 inline comments as done.Jul 26 2023, 10:21 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2708

I assumed it would be checked anyhow, when OpNext is processed, but yeah, it is more clear to check it early, yet also it should be checked whether it is FAdd in the first place

2738

Followed your observation, I've added such ability, plus a simple test case covering it.

2742

Ok, so let's copy it from the affected instruction itself...

pawosm01 marked 3 inline comments as done.Jul 26 2023, 10:22 AM
nikic added inline comments.Jul 27 2023, 5:21 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2707

As a final note, I would rewrite this loop into a worklist iteration, something along these lines:

SmallVector<BinaryOperator *> Worklist;
Worklist.push_back(VariantOp);
while (!Worklist.empty()) {
  BinaryOperator *BO = Worklist.pop_back_val();
  if (!BO->hasOneUse())
    return false;
  BinaryOperator *Op0, *Op1;
  if (match(BO, m_FAdd(m_BinOp(Op0), m_BinOp(Op1))) &&
      I->hasAllocReassoc()) {
    Worklist.push_back(Op0);
    Worklist.push_back(Op1);
    continue;
  }
  if (BO->getOpcode() != Instruction::FMul || !BO->hasAllocReassoc() ||
      L.isLoopInvariant(BO))
    return false;
  Use &U0 = BO->getOperandUse(0);
  Use &U1 = BO->getOperandUse(1);
   if (L.isLoopInvariant(U0))
     Changes.push_back(&U0);
   else if (L.isLoopInvariant(U1))
     Changes.push_back(&U1);
   else
     return false;
   if (Changes.size() > 5U) // A reasonable upper limit
     return false;
}

This has the benefit of not relying on specific structure for the fadds (is there something that guarantees that these form a linear chain?) and I think is also less confusing because the fadd and fmul handling is pretty cleanly separated.

I also wanted to ask where the number 5 for the cutoff comes from. Is that just chosen arbitrarily?

2747

dyn_cast + assert == cast

pawosm01 updated this revision to Diff 545071.Jul 28 2023, 2:11 AM
pawosm01 edited the summary of this revision. (Show Details)

Following reviewer's advice.

pawosm01 marked an inline comment as done.Jul 28 2023, 2:17 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2707

Well, I deliberately wanted to avoid recursion, even if explicitly using stack (here named a Worklist). But this looks so neat that let's have it this way.

2747

ah, I didn't notice that, I'll get back to it soon.

pawosm01 marked an inline comment as done.Jul 28 2023, 2:17 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2707

I also wanted to ask where the number 5 for the cutoff comes from. Is that just chosen arbitrarily?

This was chosen arbitrarily indeed, the choice was based on an observation of the problem at hand.

pawosm01 updated this revision to Diff 545083.Jul 28 2023, 3:04 AM

a missing change instantiated

pawosm01 marked an inline comment as done.Jul 28 2023, 3:05 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2747

Done.

nikic accepted this revision.Jul 28 2023, 6:28 AM

LGTM

llvm/lib/Transforms/Scalar/LICM.cpp
2743

nit: Can move this IRBuilder outside the loop.

paulwalker-arm added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2707

Perhaps it's worth replacing the hardwired value with a command line option that defaults to 5?

pawosm01 updated this revision to Diff 545292.Jul 28 2023, 3:23 PM

Added new option, as suggested by the reviewer.

pawosm01 marked an inline comment as done.Jul 28 2023, 3:26 PM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2707

Agreed. I've added one.

pawosm01 marked an inline comment as done.Jul 28 2023, 3:26 PM
pawosm01 updated this revision to Diff 545295.Jul 28 2023, 3:36 PM

One more change.

pawosm01 marked an inline comment as done.Jul 28 2023, 3:37 PM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2743

Ah, yeah, good catch; corrected.

pawosm01 marked an inline comment as done.Jul 28 2023, 3:37 PM
paulwalker-arm accepted this revision.Jul 31 2023, 8:56 AM

Thanks for adding the command line option @pawosm01.

llvm/lib/Transforms/Scalar/LICM.cpp
133

Up to you but I think "licm-max-num-fp-reassociations" or "licm-fp-reassociation-cap" would be more in keeping with the existing naming of licm options.

pawosm01 updated this revision to Diff 545766.Jul 31 2023, 1:00 PM

following reviewer's advice

pawosm01 marked an inline comment as done.Jul 31 2023, 1:01 PM
pawosm01 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
133

yes, "licm-max-num-fp-reassociations" seems a better name for an option.

pawosm01 marked an inline comment as done.Jul 31 2023, 1:01 PM
bgraur added a subscriber: bgraur.Aug 8 2023, 10:38 AM

Heads-up: this is likely causing a miscompile. We're working on a reproducer.

bgraur added a comment.Aug 9 2023, 7:28 AM

Heads-up: this is likely causing a miscompile. We're working on a reproducer.

This was a false alarm. Please disregard.