This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Optimize addition/subtraction operations of splats of vscale multiplied by a constant
AcceptedPublic

Authored by igor.kirillov on Jul 13 2023, 9:43 AM.

Details

Summary

This patch enhances the InstCombine pass to optimize addition and
subtraction operations involving splat values optionally multiplied
by a constant or shifted by a constant. The transformation combines
the operations as follows:

(A +/- splat(B)) +/- splat(C) -> A +/- splat(B +/- C)

This optimization improves the performance of vectorized code with
interleave factor > 1, where indices are part of an expression.
For example, in cases such as the following loop:

for (int i = A; i < B; ++i)
  out[i + offset] = (i + X) * Y;

Diff Detail

Event Timeline

igor.kirillov created this revision.Jul 13 2023, 9:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2023, 9:43 AM
igor.kirillov requested review of this revision.Jul 13 2023, 9:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2023, 9:43 AM
paulwalker-arm added inline comments.Jul 18 2023, 4:42 AM
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1684

This function seems to handle cases where the opcode is not relevant. However, you only care about two specific opcodes so this doesn't looks like the correct resting place for this code. Placing it somewhere more specific to add and sub might allow you to simplify the logic.

1689

Should this be C = getSplatValue(LHS);? if so then perhaps this highlights some missing tests?

1705–1706

Not sure this naming is correct because the opcode says nothing about the signedness of the data.

goldstein.w.n added inline comments.
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1684

Although you could imagine extending this function for any assosiative binop, i.e xor, or, and, mul...
Probably cleaner would to be split this to a helper function so that if you don't have supported opcodes you cna just return nullptr. Then there won't be so much nesting.

1709

Can you add a comment explaining this. Its not exactly clear how/why the positive bools impl the opcodes they do.

igor.kirillov marked 2 inline comments as done.

Move code to a separate function and completely refactor it

igor.kirillov marked 2 inline comments as done.Jul 24 2023, 3:49 PM
igor.kirillov added inline comments.
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1684

This function seems to handle cases where the opcode is not relevant. However, you only care about two specific opcodes, so this doesn't looks like the correct resting place for this code. Placing it somewhere more specific to add and sub might allow you to simplify the logic.

@paulwalker-arm, Yes, that was my first idea, but I took a look at InstCombineAddSub.cpp, and it has a minimal amount of code that is vector specific, so I decided to put the logic into foldVectorBinop.

@goldstein.w.n, Put everything into a separate function. I am unsure LoopVectorize could generate such code (with xor etc.).

1689

Yes, indeed! But the algorithm now is completely different.

1705–1706

No more signedness

paulwalker-arm added inline comments.Aug 3 2023, 9:09 AM
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1617

It's worth verifying but I don't think you need the commutative version here because constants should always be canonicalised to the RHS.

1621

If I'm nit picking I think you should use VScale consistently and not Vscale.

1629–1632

This seems wasteful given the number of times you're matching m_SplatVscale. Perhaps drop the first match from all the if blocks and then once you've figured out the add/sub combo you can have a single block that does:

if (!match(SplatB, m_SplatVscale) || !match(SplatC, m_SplatVscale))
  return nullptr;

Perhaps you don't even need m_SplatVscale because getSplatValue returns null on failure so the check could be:

if (!B || !match(B, m_ConstMultipliedVscale) ||
    !C || !match(C, m_ConstMultipliedVscale))
  return nullptr;
1682–1683

I still think this is wasteful as we're going to run through several redundant match calls before eventually hitting the null return for every bin op other than Add and Sub. To me it seems better to call this directly within visitAdd and visitSub. You'll see there's related precedent here with foldBinOpShiftWithShift amongst others.

igor.kirillov marked 3 inline comments as done.
  • Vscale -> VScale
  • Move foldVScaleSplatAddSub call to visitAdd and visitSub
  • Refactor pattern detection algorithm
igor.kirillov marked 2 inline comments as done.Aug 7 2023, 5:36 AM
igor.kirillov added inline comments.
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1629–1632

What do you think about this version? My first version was using getSplatValue, but then I had to decide the exact order of A, B and C. Using the current approach, we can first match A, SpatB, SplatC and then match the exact add/sub operations without calling splat matcher several times.

Matt added a subscriber: Matt.Aug 7 2023, 3:44 PM
paulwalker-arm accepted this revision.Aug 9 2023, 9:37 AM

A couple of minor requests but otherwise looks good.

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1618

by

1629–1632

Seems like a reasonable compromise to me.

1660–1661

Better to move the initialisation here (i.e. Value *B = ) since there's no other uses.

This revision is now accepted and ready to land.Aug 9 2023, 9:37 AM
nikic requested changes to this revision.Aug 9 2023, 9:57 AM
nikic added reviewers: nikic, goldstein.w.n.
nikic added a subscriber: nikic.

Please make sure to pre-commit test coverage (https://llvm.org/docs/TestingGuide.html#precommit-workflow-for-tests). This is also missing some tests, in particular multi-use and negative tests.

Why are the LoopVectorize changes seen here not regressions (caused by a multi-use transform)? Why does this transform only operate on multiplies of vscale?

paulwalker-arm added a comment.EditedAug 9 2023, 10:14 AM

The rational for making these vscale specific transformations is that vscale is effectively a constant with targets that support scalable vectors typically having instructions where vscale is implied. This is the same reason why there are no multi-use checks. Whilst the changes to LoopVectorize output look like regressions the extra IR is loop invariant that will be hoisted and generally leads to better generated code. This is especially true for cases where vector loops are interleaved.

nikic added a comment.Aug 9 2023, 11:33 AM

The rational for making these vscale specific transformations is that vscale is effectively a constant with targets that support scalable vectors typically having instructions where vscale is implied. This is the same reason why there are no multi-use checks. Whilst the changes to LoopVectorize output look like regressions the extra IR is loop invariant that will be hoisted and generally leads to better generated code. This is especially true for cases where vector loops are interleaved.

Is this something that we expect to hold true for all architectures with scalable vectors? I tried to look at sve and rvv codegen here: https://llvm.godbolt.org/z/9xdWrbPKv I can see how the sve case is always beneficial, even for multi-use cases, but the rvv case looks like it would only be an improvement for the one-use case.

I guess I'm making a generalisation here based on scalable vectors needing to calculate addresses and the like based on vscale and so am assuming specialised add/sub instructions will always exist. My eyes are not tuned in to reading rvv asm, can @craig.topper or @reames help clarify whether this combine also makes sense from a RISCV point of view.

I should add there's a growing need to sinking some of the vscale related logic back into loops to help with instruction selection, but regardless of this I still think it's preferable for this logic to be scalar rather than vector computations.

igor.kirillov marked 4 inline comments as done.
  • Added test with @use applied to nested add
  • Minor fixes

Please make sure to pre-commit test coverage (https://llvm.org/docs/TestingGuide.html#precommit-workflow-for-tests). This is also missing some tests, in particular multi-use and negative tests.

@nikic Added a test with use. Could you clarify what you mean by negative tests in this context?
I am aware of pre-commits. I just wasn't sure we wanted this patch at all. That's why I will pre-commit them just before landing the patch.

Ping about the RISC-V conversation and more test requirements

nikic added a comment.Aug 29 2023, 7:28 AM

Please make sure to pre-commit test coverage (https://llvm.org/docs/TestingGuide.html#precommit-workflow-for-tests). This is also missing some tests, in particular multi-use and negative tests.

@nikic Added a test with use. Could you clarify what you mean by negative tests in this context?
I am aware of pre-commits. I just wasn't sure we wanted this patch at all. That's why I will pre-commit them just before landing the patch.

Negative test = test where no transform happens. Generally speaking, you want one negative test for every condition in the transform, such that each test fails exactly one condition.

I guess I'm making a generalisation here based on scalable vectors needing to calculate addresses and the like based on vscale and so am assuming specialised add/sub instructions will always exist. My eyes are not tuned in to reading rvv asm, can @craig.topper or @reames help clarify whether this combine also makes sense from a RISCV point of view.

To make the question more precise: If you have spat(vscale * C1) + splat(vscale * C2), does it make sense to transform it into splat(vscale * (C1+C2)) if neither splat(vscale * C1) nor splat(vscale * C2) are going away?

Please make sure to pre-commit test coverage (https://llvm.org/docs/TestingGuide.html#precommit-workflow-for-tests). This is also missing some tests, in particular multi-use and negative tests.

@nikic Added a test with use. Could you clarify what you mean by negative tests in this context?
I am aware of pre-commits. I just wasn't sure we wanted this patch at all. That's why I will pre-commit them just before landing the patch.

Negative test = test where no transform happens. Generally speaking, you want one negative test for every condition in the transform, such that each test fails exactly one condition.

I guess I'm making a generalisation here based on scalable vectors needing to calculate addresses and the like based on vscale and so am assuming specialised add/sub instructions will always exist. My eyes are not tuned in to reading rvv asm, can @craig.topper or @reames help clarify whether this combine also makes sense from a RISCV point of view.

To make the question more precise: If you have spat(vscale * C1) + splat(vscale * C2), does it make sense to transform it into splat(vscale * (C1+C2)) if neither splat(vscale * C1) nor splat(vscale * C2) are going away?

RISC-V does not have any special add/sub instructions for vscale values. We define vscale as vector length in bytes divided by 8 and the vector length should always be divisible by 8. We have a read only status register called vlenb that returns the vector length in bytes. In the base case vscale is always expanded to a read of vlenb followed by dividing by 8 (shifted right by 3). If vscale is multiplied by a constant, we will try to combine the right shift with the multiplier so that we don't shift right only to immediately shift left again.

So I think @nikic is correct that for RISC-V this only makes sense for the one use case.