This is an archive of the discontinued LLVM Phabricator instance.

[InstSimplify] Remove the remainder loop if we know the mask is always true
ClosedPublic

Authored by Allen on Jul 11 2023, 5:13 AM.

Details

Summary

We check the loop trip count is known a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of mask scalable loop can also be removed
If we know the mask is always going to be true for every vector iteration.
Depend on the assume of power-of-two vscale on D155350

proofs: https://alive2.llvm.org/ce/z/bT62Wa

Fix https://github.com/llvm/llvm-project/issues/63616.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
nikic requested changes to this revision.Jul 11 2023, 6:45 AM
nikic added a subscriber: nikic.
nikic added inline comments.
llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
492

This needs to be either a data layout property or a function attribute. Though it would probably best to change LangRef to require that vscale is always a power of two -- I think consensus has shifted towards non-pow2 vscales not being necessary.

This revision now requires changes to proceed.Jul 11 2023, 6:45 AM
Allen updated this revision to Diff 539082.Jul 11 2023, 7:12 AM
Allen marked 2 inline comments as done.

address comments

Allen marked 4 inline comments as done.Jul 11 2023, 7:18 AM

Added new case in file llvm/test/Transforms/InstCombine/AArch64/rem-mul-shl.ll because it need a option -mtriple=aarch64-none-linux-gnu

llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
492

according https://reviews.llvm.org/D141486, the document already clarify that the VScale will be known to be a power of 2

llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1900

yes, apply your comment, thanks

1901

Apply your comment, thanks

1903

Yes, you are right,thanks

nikic requested changes to this revision.Jul 11 2023, 7:40 AM

(Removing from review queue per previous comment -- you need to expose the pow2 property in a TTI-independent manner via one of LangRef, DataLayout or function attribute.)

This revision now requires changes to proceed.Jul 11 2023, 7:40 AM
Allen marked an inline comment as done.Jul 11 2023, 8:26 AM

(Removing from review queue per previous comment -- you need to expose the pow2 property in a TTI-independent manner via one of LangRef, DataLayout or function attribute.)

Thanks for your comment, if I understand correctly, do you mean I need add a new function attribute vscale_pow2 for example?

paulwalker-arm added a subscriber: paulwalker-arm.EditedJul 11 2023, 9:05 AM

@nikic What's the rational for not being allow to use TTI during instcombine? TTI is used for target specific combines.

I don't like the idea of having to decorate every function with information that is essentially constant for the target. Changing the LangRef seems like a backward step given LLVM already supports non-power-of-two values of vscale, which will be much harder to re-add once lost. That said, if no target supports non-power-of-two values of vscale then I'll not fight to keep such support if that's the consensus.

As a halfway house, what if we changed the definition of vscale_range to imply vscale is power-of-two. Sure that's still a loss of functionality but it's smaller and critically maintains the path for supporting arbitrary vscale values whilst ensuring existing targets can encode the power-of-two-ness within the IR without needing any code changes.

@nikic What's the rational for not being allow to use TTI during instcombine? TTI is used for target specific combines.

The general rationale is that it's a target-independent canonicalization pass. This isn't really relevant for this particular case (because the query is about legality rather than profitability).

The issue that is more relevant here is layering. The proper way to perform this fold is to teach isKnownToBeAPowerOfTwo() from ValueTracking about vscale. If you do that, you most likely don't even need any special code in InstCombine. However, we definitely don't want a TTI dependency in ValueTracking (which is used from literally everywhere, so we'd either have TTI dependencies everywhere, or get reduced functionality in most places). The information needs to be available in a target-independent way.

I don't like the idea of having to decorate every function with information that is essentially constant for the target.

DataLayout would provide a way to avoid decorating individual functions.

Changing the LangRef seems like a backward step given LLVM already supports non-power-of-two values of vscale, which will be much harder to re-add once lost. That said, if no target supports non-power-of-two values of vscale then I'll not fight to keep such support if that's the consensus.

We shouldn't keep dead code around due to sunk costs. The dead code has ongoing costs (like, we wouldn't even have this conversation without it). Of course, is there are (current or foreseeable future) targets that have non-pow2 vscale then we should certainly keep it, but if not, then imho we should get rid of this at the root.

As a halfway house, what if we changed the definition of vscale_range to imply vscale is power-of-two. Sure that's still a loss of functionality but it's smaller and critically maintains the path for supporting arbitrary vscale values whilst ensuring existing targets can encode the power-of-two-ness within the IR without needing any code changes.

Sounds reasonable to me.

Allen added a comment.EditedJul 11 2023, 8:17 PM

So now can I just delete the checking TTI.isVScaleKnownToBeAPowerOfTwo() with above halfway house, which is assume the m_VScale() is power-of-two values for all targets ? or use a DataLayout here.

So now can I just delete the checking TTI.isVScaleKnownToBeAPowerOfTwo() with above halfway house, which is assume the m_VScale() is power-of-two values for all targets ? or use a DataLayout here.

To implement my suggestion you'll need to update the LangRef to document that vscale_range implies vscale is a power-of-two. This is best done as a separate patch because you'll also need to updating the parsing of the attribute to reject values that are not a power-of-two. This patch should include the RISC-V folk so we've go agreement between the current targets that support scalable vectors.

At that point you should be able to implement the combine (or update ValueTracking) using purely information within the IR without any uses of TTI.

I'm curious how such an optimisation is related to other passes such as ValueTracking or ScalarEvolution. Indeed, if we can use information about the op0 such as the LSBs or as a ScalarEvolutionValue, we could support non constant value.

Allen added a comment.Jul 12 2023, 6:11 PM

Thanks @v01dXYZ and @paulwalker-arm, I'll I'm going to reimplement it based on ValueTracking.

nit: the title says reminder not remainder

Allen updated this revision to Diff 539860.Jul 12 2023, 11:28 PM
Allen retitled this revision from [InstCombine] Remove the reminder loop if we know the mask is always true to [InstCombine] Remove the remainder loop if we know the mask is always true.
Allen edited the summary of this revision. (Show Details)

I think this is a really useful patch @Allen - thank you! It's definitely a step in the right direction. I do have a few comments on the patch ...

llvm/lib/Analysis/ValueTracking.cpp
2014 ↗(On Diff #539860)

Perhaps this is better done in a separate patch? That way we can see what tests are affected by this change alone, since it is quite significant. Also, in the same patch you will need to update the LangRef to say that using the vscale_range attribute implies vscale is a power of 2.

Then, in a follow-on patch you can add the InstCombineMulDivRem.cpp change, which will show the tests that have changed purely due to the urem optimisation.

llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1945

I could be wrong, but I suspect this is a fairly expensive function to call. Perhaps it's worth restructuring the code to only call it when you know Op0 is a power of 2? For example, something like:

if (match(Op0, m_Power2(RemC)) {
  KnownBits Known = computeKnownBits(Op1, 0, &I);
  ...
}
1950

I think you can avoid the and by simply returning null, i.e.

return ConstantInt::getNullValue(Ty);
llvm/test/Transforms/InstCombine/AArch64/rem-mul-shl.ll
39 ↗(On Diff #539860)

Based on your ValueTracking change I think you also need a negative test when vscale_range is set to something like (1,15)

paulwalker-arm added inline comments.Jul 13 2023, 2:31 AM
llvm/test/Transforms/InstCombine/AArch64/rem-mul-shl.ll
39 ↗(On Diff #539860)

Based on my suggestion such a test cannot be written because it will be bad IR that will generate an error once the LangRef change is made, which needs to happen before we can have patches that rely on the new behaviour.

paulwalker-arm added inline comments.Jul 13 2023, 2:38 AM
llvm/lib/Analysis/ValueTracking.cpp
2019–2023 ↗(On Diff #539860)

Based on my suggestion you shouldn't need to check the values of the vscale_range attribute. Simply having the attribute is enough to imply the power-of-two-ness.

nikic requested changes to this revision.Jul 13 2023, 2:40 AM

(Removing from review queue as this needs a LangRef change first.)

This revision now requires changes to proceed.Jul 13 2023, 2:40 AM
Matt added a subscriber: Matt.Jul 13 2023, 2:35 PM
Allen updated this revision to Diff 540848.Jul 16 2023, 7:28 PM
Allen marked an inline comment as done.
Allen edited the summary of this revision. (Show Details)

rebase after the separated commit D155350

Allen marked 4 inline comments as done.Jul 16 2023, 7:31 PM
Allen added inline comments.
llvm/lib/Analysis/ValueTracking.cpp
2014 ↗(On Diff #539860)

Done, thanks

llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1945

apply your comment, thanks

1950

there is compilation problem if I return ConstantInt::getNullValue(Ty) directly

error: cannot convert 'llvm::Constant*' to 'llvm::Instruction*' in return
paulwalker-arm added inline comments.Jul 17 2023, 3:29 AM
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1950

I believe this suggests that rather than an InstCombine this transformation should live in InstSimplify (e.g. simplifyURemInst).

Allen updated this revision to Diff 540957.Jul 17 2023, 5:03 AM
Allen marked 2 inline comments as done.
Allen retitled this revision from [InstCombine] Remove the remainder loop if we know the mask is always true to [InstSimplify] Remove the remainder loop if we know the mask is always true.

refactor with InstSimplify according comment

Allen marked an inline comment as done.Jul 17 2023, 5:05 AM
Allen added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1950

Apply your comment, thanks

nikic added a comment.Jul 17 2023, 6:03 AM

Please add an alive2 proof to the patch description.

llvm/lib/Analysis/InstructionSimplify.cpp
1286 ↗(On Diff #540957)

This check looks a bit roundabout. I think what you actually want to check is that Known.getMaxValue().ule(*RemC)?

goldstein.w.n added inline comments.Jul 17 2023, 9:08 AM
llvm/lib/Analysis/InstructionSimplify.cpp
1283 ↗(On Diff #540957)

move the match of Op0 against a constant to before the far more expensive isKnownToBeAPowerOfTwo check on Op1.

1288 ↗(On Diff #540957)

Since we are already here, might as well also add an else if(Known.getMinValue().ugt(*RemC)) { return Op0; }

Proofs: https://alive2.llvm.org/ce/z/FkTMoy

Allen updated this revision to Diff 541364.Jul 18 2023, 12:44 AM
Allen marked an inline comment as done.
nikic added inline comments.Jul 18 2023, 12:55 AM
llvm/lib/Analysis/InstructionSimplify.cpp
1285 ↗(On Diff #541364)

This should be just RemC->uge(Known.getMaxValue()). Same below.

1288 ↗(On Diff #540957)

It looks like these proofs also work with urem replaced by srem: https://alive2.llvm.org/ce/z/-8RHjq So we should move this into simplifyRem and support both.

Allen marked an inline comment as done.Jul 18 2023, 12:59 AM
Allen added inline comments.
llvm/lib/Analysis/InstructionSimplify.cpp
1283 ↗(On Diff #540957)

Done, thanks

1286 ↗(On Diff #540957)

I still use the getActiveBits because the value of Known.getMaxValue() is not a power-of-two value.
For example, we can easily get the the ConstantRange(8, 16) from the vscale_range(1,16), but then the
Known = ConstantRange(8, 16)->toKnownBits() will get the value of Known.getMaxValue() is 31 instead of 16,
so I can't compare them with its value directly.

I also change the boundary test for test case urem_vscale_range and urem_shl_vscale_out_of_range

nikic added inline comments.Jul 18 2023, 1:17 AM
llvm/lib/Analysis/InstructionSimplify.cpp
1286 ↗(On Diff #540957)

Okay, I see. Would it work if you used computeConstantRange() instead of computeKnownBits()? That should give an accurate range for vscale.

Allen updated this revision to Diff 541459.Jul 18 2023, 4:53 AM
Allen marked an inline comment as done.

refactor with computeConstantRange according comment

Allen marked 4 inline comments as done.Jul 18 2023, 4:56 AM
Allen added inline comments.
llvm/lib/Analysis/InstructionSimplify.cpp
1286 ↗(On Diff #540957)

thanks, the new API computeConstantRange works

Allen updated this revision to Diff 541971.Jul 19 2023, 5:43 AM
Allen marked an inline comment as done.

update the Upper bound

Can you add the alive2 links to the summary / commit message?

goldstein.w.n added inline comments.Jul 19 2023, 10:32 AM
llvm/lib/Analysis/ValueTracking.cpp
8420 ↗(On Diff #541971)

Does vscale also enforce that C is within BitWidth range? Otherwise do we need a check here?

Allen updated this revision to Diff 542418.Jul 20 2023, 4:01 AM

Add checking the range of shift number according comment

Allen added a comment.Jul 20 2023, 4:05 AM

I don't add alive2 because vscale is not supported, https://github.com/AliveToolkit/alive2/issues/923

I don't add alive2 because vscale is not supported, https://github.com/AliveToolkit/alive2/issues/923

The patch doesn't rely on alive2, just on the power of 2 nature of the arguments.
You can use: https://alive2.llvm.org/ce/z/FkTMoy + adding the srem versions.

nikic added inline comments.Jul 20 2023, 9:22 AM
llvm/lib/Analysis/InstructionSimplify.cpp
1286 ↗(On Diff #540957)

Hrm, it looks like computeConstantRange() only works with an additional special case for shl of vscale. That's ... not great, and probably fragile, because the same thing will happen with more complex patterns.

I think now that I understand why you did this, I would prefer to go back to your previous patch that used getActiveBits(). Just add a comment that we need to use getActiveBits() to make use of the additional power of two knowledge.

Allen marked 2 inline comments as done.Jul 21 2023, 12:18 AM

It seems the alive2 doesn't support the semantics of vscale, https://github.com/AliveToolkit/alive2/issues/923

I don't add alive2 because vscale is not supported, https://github.com/AliveToolkit/alive2/issues/923

The patch doesn't rely on alive2, just on the power of 2 nature of the arguments.
You can use: https://alive2.llvm.org/ce/z/FkTMoy + adding the srem versions.

The case in the link can't be optimized with this patch because we can't infer the operands of urem is power-of-two with isKnownToBeAPowerOfTwo now, so I'll try it with a separate patch

Allen updated this revision to Diff 542781.EditedJul 21 2023, 12:22 AM
Allen edited the summary of this revision. (Show Details)

address comment
a) revert the checking with getActiveBits
b) as the rem --> and done in D155350, so this transformation live in simplifyAndInst instead of simplifyURemInst

It seems the alive2 doesn't support the semantics of vscale, https://github.com/AliveToolkit/alive2/issues/923

I don't add alive2 because vscale is not supported, https://github.com/AliveToolkit/alive2/issues/923

The patch doesn't rely on alive2, just on the power of 2 nature of the arguments.
You can use: https://alive2.llvm.org/ce/z/FkTMoy + adding the srem versions.

The case in the link can't be optimized with this patch because we can't infer the operands of urem is power-of-two with isKnownToBeAPowerOfTwo now, so I'll try it with a separate patch

Yes, but the link show that the transform is semantically equivilent. The case in the link covers anything we detect in the patch (assuming non-buggy codes).

goldstein.w.n accepted this revision.Jul 24 2023, 4:26 PM

ping ?

Can you

  1. Add some tests that aren't reliant on vscale (just some simple cases is fine).
  2. would still like alive2 link.

Code change looks fine to me.

goldstein.w.n requested changes to this revision.Jul 24 2023, 4:29 PM

(Sorry, misclicked approve earlier).

This revision now requires changes to proceed.Jul 24 2023, 4:29 PM
Allen updated this revision to Diff 543839.Jul 24 2023, 11:39 PM

Add 2 new cases with alive2 link proof

Allen marked an inline comment as done.EditedJul 25 2023, 12:02 AM

Can you

  1. Add some tests that aren't reliant on vscale (just some simple cases is fine).
  2. would still like alive2 link.

Code change looks fine to me.

hi @goldstein.w.n, I added 2 new cases and_add_shl and and_add_shl_overlap with alive2 link according your comment.
Because the canonicalizeLowbitMask always fold 1 << x into ~(-1 << x), so I also add extra code to match this.
I also try to debug something like https://alive2.llvm.org/ce/z/r5XLZj, and find the assume works on op0_p2 instead of pow2 itself, so there is some different to handle that case.

Allen edited the summary of this revision. (Show Details)Jul 26 2023, 6:02 PM
goldstein.w.n added inline comments.Jul 28 2023, 10:14 AM
llvm/lib/Analysis/InstructionSimplify.cpp
2138 ↗(On Diff #543839)

m_Not instead of m_Xor(v, -1).

Also the comment doesn't quite match the codes.

Allen updated this revision to Diff 545320.Jul 28 2023, 6:25 PM
Allen marked an inline comment as done.Jul 28 2023, 6:28 PM
Allen added inline comments.
llvm/lib/Analysis/InstructionSimplify.cpp
2138 ↗(On Diff #543839)

Apply your comment, thanks. (Also adjust the comment)

goldstein.w.n added inline comments.Jul 28 2023, 9:56 PM
llvm/test/Transforms/InstCombine/and-add-shl.ll
40 ↗(On Diff #545320)

You only have test for the first case (need 1 for not case). Also can you precommit the tests?

Allen updated this revision to Diff 545348.Jul 29 2023, 2:03 AM
Allen marked an inline comment as done.

precommit test on D156591

Allen marked an inline comment as done.Jul 29 2023, 2:07 AM
Allen added inline comments.
llvm/test/Transforms/InstCombine/and-add-shl.ll
40 ↗(On Diff #545320)

add a new case and_not_shl for not case, and also precommit tests on D156591, thanks.

Allen updated this revision to Diff 545516.Jul 30 2023, 11:53 PM
Allen marked an inline comment as done.

Fixes test Transforms/LoopVectorize/AArch64/sve-widen-phi.ll

Allen updated this revision to Diff 545903.Jul 31 2023, 7:44 PM

rebase to top

This revision was not accepted when it landed; it landed in state Needs Review.Jul 31 2023, 8:23 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
nikic added a comment.Aug 1 2023, 12:09 AM

proofs: https://alive2.llvm.org/ce/z/FkTMoy

These proofs are for a different transform (urem x) than what was implemented (and x - 1).

llvm/lib/Analysis/InstructionSimplify.cpp
2146 ↗(On Diff #545908)

I don't think this second fold should be added. This is something that can be handled via simple range propagation. In fact, IPSCCP does handle this already. We could make CVP handle it as well, if we wanted.

proofs: https://alive2.llvm.org/ce/z/FkTMoy

These proofs are for a different transform (urem x) than what was implemented (and x - 1).

sorry :(

Figured was okay as urem by pow2 == and by mask.

Allen edited the summary of this revision. (Show Details)Aug 1 2023, 1:04 AM
Allen updated this revision to Diff 545994.Aug 1 2023, 3:32 AM

Remove the 2nd fold, and update the alive2 link to use and directly, https://alive2.llvm.org/ce/z/bT62Wa

Allen reopened this revision.Aug 1 2023, 3:35 AM
Allen marked an inline comment as done.
Allen added inline comments.
llvm/lib/Analysis/InstructionSimplify.cpp
2146 ↗(On Diff #545908)

Thanks, I'll try this with CVP , and now adopt the 2nd fold

nikic added inline comments.Aug 1 2023, 4:02 AM
llvm/lib/Analysis/InstructionSimplify.cpp
83 ↗(On Diff #545994)

Unnecessary change.

llvm/test/Transforms/InstCombine/rem-mul-shl.ll
887 ↗(On Diff #545994)

Please also add tests that are directly in the add -1 form (as that's what is actually being folded).

We should also test the case where the constant operand is not a power of two, as it is a pre-condition of your transform. (Actually, we don't really need a power of two, it would be sufficient if it does not have any bits that may be part of the mask set. But it's a requirement for your current implementation.)

Allen updated this revision to Diff 546018.Aug 1 2023, 6:13 AM
Allen marked an inline comment as done.

Address comments

Allen updated this revision to Diff 546019.Aug 1 2023, 6:18 AM
Allen marked 2 inline comments as done.
nikic accepted this revision.Aug 1 2023, 6:21 AM

LGTM

llvm/test/Transforms/InstCombine/rem-mul-shl.ll
901 ↗(On Diff #546019)

This one does have low bits set (it would be a negative test that cannot be folded).

An example that can be folded is constant 3072 (3 * 1024).

This revision is now accepted and ready to land.Aug 1 2023, 6:21 AM
nikic added inline comments.Aug 1 2023, 6:21 AM
llvm/lib/Analysis/InstructionSimplify.cpp
2121 ↗(On Diff #546019)

Shift -> X to match the comment. This value doesn't need to be a shift.

nikic added inline comments.Aug 1 2023, 6:23 AM
llvm/lib/Analysis/InstructionSimplify.cpp
2121 ↗(On Diff #546019)

Ignore this comment, X is not the same value. (Shift it not a great name, but I'm not sure what to call it right now.)

Allen updated this revision to Diff 546022.Aug 1 2023, 6:26 AM

thanks, change the const value to 3072 according comment for test and_add_shl_vscale_not_power2

Allen updated this revision to Diff 546042.Aug 1 2023, 7:11 AM

rebase to top

This revision was landed with ongoing or failed builds.Aug 1 2023, 7:23 AM
This revision was automatically updated to reflect the committed changes.

This commit caused misoptimizations in the WMA decoder in ffmpeg, observed on all architectures. The misoptimization can be observed with https://martin.st/temp/wma-preproc.c, compiled with clang -target aarch64-linux-gnu -c -O3 wma-preproc.c -o libavcodec/wma.o.

For a full runtime reproducible case:

$ git clone git://source.ffmpeg.org/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=../fate-samples
$ make -j$(nproc)
$ make fate-rsync
$ make fate-wmapro-2ch

If it takes a long time to fix, I'd appreciate reverting it in the meantime.

Allen added a comment.Aug 2 2023, 4:17 AM

Sorry for the trouble.
Would you show how do I check whether the current function is normal based on the running result?
In addition, I see a library file named libswscale/libswscale.a. I don't know if this is simulating the sve feature. If so, the current optimization based on vscale is a power-of-two value. I don't know whether this assumption will affect the results.

Sorry for the trouble.
Would you show how do I check whether the current function is normal based on the running result?

With the steps outlined above, cloning ffmpeg and compiling it, if you run make -j$(nproc) fate-wmapro-2ch, it should print one TEST line and exit with a 0 exit code if the code was correctly compiled, or print an error if it was misoptimized.

In addition, I see a library file named libswscale/libswscale.a. I don't know if this is simulating the sve feature. If so, the current optimization based on vscale is a power-of-two value. I don't know whether this assumption will affect the results.

That is an entirely unrelated library for scaling and color conversion of video frames, it has nothing to do with the ARM SVE feature.

Allen added a comment.Aug 2 2023, 4:35 AM

To document this issue, I submitted an record for this, https://github.com/llvm/llvm-project/issues/64339, I'll continue to follow up on this issue there.