In D146199 we check whether the loop trip count is known to be a power of 2 to determine whether the tail loop can be eliminated.
However, the remainder loop of a masked scalable loop can also be removed
if we know the mask is always going to be true for every vector iteration.
This depends on the power-of-two vscale assumption from D155350.
llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
492 ↗ (On Diff #539044): This needs to be either a data layout property or a function attribute. Though it would probably be best to change LangRef to require that vscale is always a power of two -- I think consensus has shifted towards non-pow2 vscales not being necessary.
Added a new case in llvm/test/Transforms/InstCombine/AArch64/rem-mul-shl.ll because it needs the option -mtriple=aarch64-none-linux-gnu.
llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
492 ↗ (On Diff #539044): According to https://reviews.llvm.org/D141486, the documentation already clarifies that vscale will be known to be a power of 2.
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1900 ↗ (On Diff #539029): Yes, applied your comment, thanks.
1901 ↗ (On Diff #539029): Applied your comment, thanks.
1903 ↗ (On Diff #539029): Yes, you are right, thanks.
(Removing from review queue per previous comment -- you need to expose the pow2 property in a TTI-independent manner via one of LangRef, DataLayout or function attribute.)
Thanks for your comment. If I understand correctly, do you mean I need to add a new function attribute, e.g. vscale_pow2?
@nikic What's the rationale for not being allowed to use TTI during InstCombine? TTI is used for target-specific combines.
I don't like the idea of having to decorate every function with information that is essentially constant for the target. Changing the LangRef seems like a backward step given LLVM already supports non-power-of-two values of vscale, which will be much harder to re-add once lost. That said, if no target supports non-power-of-two values of vscale then I'll not fight to keep such support if that's the consensus.
As a halfway house, what if we changed the definition of vscale_range to imply vscale is power-of-two? Sure, that's still a loss of functionality, but it's smaller and critically maintains the path for supporting arbitrary vscale values whilst ensuring existing targets can encode the power-of-two-ness within the IR without needing any code changes.
The general rationale is that it's a target-independent canonicalization pass. This isn't really relevant for this particular case (because the query is about legality rather than profitability).
The issue that is more relevant here is layering. The proper way to perform this fold is to teach isKnownToBeAPowerOfTwo() from ValueTracking about vscale. If you do that, you most likely don't even need any special code in InstCombine. However, we definitely don't want a TTI dependency in ValueTracking (which is used from literally everywhere, so we'd either have TTI dependencies everywhere, or get reduced functionality in most places). The information needs to be available in a target-independent way.
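To make the layering suggestion concrete, here is a minimal sketch of how ValueTracking could learn the power-of-two property of vscale from IR-level information alone, under the proposed rule that the vscale_range attribute implies it. The helper name and placement are illustrative, not the actual patch; the real change would extend isKnownToBeAPowerOfTwo() in llvm/lib/Analysis/ValueTracking.cpp.

  #include "llvm/IR/Function.h"
  #include "llvm/IR/IntrinsicInst.h"

  using namespace llvm;

  // Hypothetical helper, shown standalone for readability.
  static bool isVScaleKnownPowerOfTwo(const Value *V, const Instruction *CtxI) {
    const auto *II = dyn_cast<IntrinsicInst>(V);
    if (!II || II->getIntrinsicID() != Intrinsic::vscale)
      return false;
    // Under the proposed LangRef change, simply carrying the vscale_range
    // attribute implies vscale is a power of two, so no TTI query is needed
    // and ValueTracking stays target-independent.
    const Function *F = CtxI ? CtxI->getFunction() : nullptr;
    return F && F->hasFnAttribute(Attribute::VScaleRange);
  }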
> I don't like the idea of having to decorate every function with information that is essentially constant for the target.
DataLayout would provide a way to avoid decorating individual functions.
> Changing the LangRef seems like a backward step given LLVM already supports non-power-of-two values of vscale, which will be much harder to re-add once lost. That said, if no target supports non-power-of-two values of vscale then I'll not fight to keep such support if that's the consensus.
We shouldn't keep dead code around due to sunk costs. The dead code has ongoing costs (for example, we wouldn't even be having this conversation without it). Of course, if there are (current or foreseeable future) targets that have non-pow2 vscale then we should certainly keep it, but if not, then imho we should get rid of this at the root.
> As a halfway house, what if we changed the definition of vscale_range to imply vscale is power-of-two? Sure, that's still a loss of functionality, but it's smaller and critically maintains the path for supporting arbitrary vscale values whilst ensuring existing targets can encode the power-of-two-ness within the IR without needing any code changes.
Sounds reasonable to me.
So can I now just delete the TTI.isVScaleKnownToBeAPowerOfTwo() check with the above halfway house, i.e. assume m_VScale() is a power-of-two value for all targets? Or should I use DataLayout here?
To implement my suggestion you'll need to update the LangRef to document that vscale_range implies vscale is a power of two. This is best done as a separate patch because you'll also need to update the parsing of the attribute to reject values that are not a power of two. That patch should include the RISC-V folk so we've got agreement between the current targets that support scalable vectors.
At that point you should be able to implement the combine (or update ValueTracking) using purely information within the IR without any uses of TTI.
I'm curious how such an optimisation relates to other analyses such as ValueTracking or ScalarEvolution. Indeed, if we can use information about op0 such as its low bits, or model it as a ScalarEvolution value, we could support non-constant values.
Thanks @v01dXYZ and @paulwalker-arm, I'm going to reimplement it based on ValueTracking.
I think this is a really useful patch @Allen - thank you! It's definitely a step in the right direction. I do have a few comments on the patch ...
llvm/lib/Analysis/ValueTracking.cpp
2014 ↗ (On Diff #539860): Perhaps this is better done in a separate patch? That way we can see what tests are affected by this change alone, since it is quite significant. Also, in the same patch you will need to update the LangRef to say that using the vscale_range attribute implies vscale is a power of 2. Then, in a follow-on patch you can add the InstCombineMulDivRem.cpp change, which will show the tests that have changed purely due to the urem optimisation.
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1945 ↗ (On Diff #539860): I could be wrong, but I suspect this is a fairly expensive function to call. Perhaps it's worth restructuring the code to only call it when you know Op0 is a power of 2? For example, something like:

  if (match(Op0, m_Power2(RemC))) {
    KnownBits Known = computeKnownBits(Op1, 0, &I);
    ...
  }

1950 ↗ (On Diff #539860): I think you can avoid the and by simply returning null, i.e. return ConstantInt::getNullValue(Ty);
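Putting these two suggestions together, a hedged sketch of how the restructured fold could look in visitURem-style code follows. The operand roles and the exact bound check are my reading of the thread, not the committed code; computeKnownBits, isKnownToBeAPowerOfTwo and replaceInstUsesWith are the existing InstCombiner member helpers.

  // Fragment intended for InstCombinerImpl::visitURem(BinaryOperator &I),
  // with Op0/Op1 its operands. Illustrative sketch only.
  const APInt *RemC;
  if (match(Op0, m_Power2(RemC))) {
    // Only pay for the known-bits query once the cheap constant match hits.
    KnownBits Known = computeKnownBits(Op1, /*Depth=*/0, &I);
    // If Op1 is also a power of two and cannot exceed the power-of-two
    // dividend, it divides it exactly, so the remainder is zero and no
    // 'and' needs to be materialised.
    if (isKnownToBeAPowerOfTwo(Op1, /*OrZero=*/false, /*Depth=*/0, &I) &&
        RemC->uge(Known.getMaxValue()))
      return replaceInstUsesWith(I, ConstantInt::getNullValue(I.getType()));
  }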
llvm/test/Transforms/InstCombine/AArch64/rem-mul-shl.ll
39 ↗ (On Diff #539860): Based on your ValueTracking change I think you also need a negative test when vscale_range is set to something like (1,15)
llvm/test/Transforms/InstCombine/AArch64/rem-mul-shl.ll
39 ↗ (On Diff #539860): Based on my suggestion such a test cannot be written because it will be bad IR that will generate an error once the LangRef change is made, which needs to happen before we can have patches that rely on the new behaviour.
llvm/lib/Analysis/ValueTracking.cpp
2019–2023 ↗ (On Diff #539860): Based on my suggestion you shouldn't need to check the values of the vscale_range attribute. Simply having the attribute is enough to imply the power-of-two-ness.
llvm/lib/Analysis/ValueTracking.cpp
2014 ↗ (On Diff #539860): Done, thanks.

llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1945 ↗ (On Diff #539860): Applied your comment, thanks.
1950 ↗ (On Diff #539860): There is a compilation problem if I return ConstantInt::getNullValue(Ty) directly:
  error: cannot convert 'llvm::Constant*' to 'llvm::Instruction*' in return
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1950 ↗ (On Diff #539860): I believe this suggests that rather than an InstCombine this transformation should live in InstSimplify (e.g. simplifyURemInst).
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
1950 ↗ (On Diff #539860): Applied your comment, thanks.
Please add an alive2 proof to the patch description.
llvm/lib/Analysis/InstructionSimplify.cpp
1286: This check looks a bit roundabout. I think what you actually want to check is that Known.getMaxValue().ule(*RemC)?
llvm/lib/Analysis/InstructionSimplify.cpp
1285: This should be just RemC->uge(Known.getMaxValue()). Same below.
1288: It looks like these proofs also work with urem replaced by srem: https://alive2.llvm.org/ce/z/-8RHjq So we should move this into simplifyRem and support both.
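The srem generalisation follows from a simple property of powers of two. A tiny illustrative check using only APInt (the helper name is made up; the logic is just the argument behind the proofs above):

  #include "llvm/ADT/APInt.h"
  using namespace llvm;

  // When the dividend C and the divisor X are both powers of two (in the
  // unsigned sense) and X <= C, then X divides C exactly, so the remainder
  // is zero. Divisibility does not depend on the signed interpretation of
  // the bit patterns, which is why the fold holds for srem as well as urem.
  static bool remOfPow2PairIsZero(const APInt &C, const APInt &X) {
    return C.isPowerOf2() && X.isPowerOf2() && X.ule(C);
  }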
llvm/lib/Analysis/InstructionSimplify.cpp
1283: Done, thanks.
1286: I still use getActiveBits because the value of Known.getMaxValue() is not a power-of-two value. I also changed the boundary tests for the test cases urem_vscale_range and urem_shl_vscale_out_of_range.
llvm/lib/Analysis/InstructionSimplify.cpp
1286: Okay, I see. Would it work if you used computeConstantRange() instead of computeKnownBits()? That should give an accurate range for vscale.
llvm/lib/Analysis/InstructionSimplify.cpp
1286: Thanks, the new API computeConstantRange works.
llvm/lib/Analysis/ValueTracking.cpp
8420 ↗ (On Diff #541971): Does vscale also enforce that C is within BitWidth range? Otherwise do we need a check here?
I didn't add an alive2 proof because vscale is not supported: https://github.com/AliveToolkit/alive2/issues/923
The proof doesn't rely on vscale, just on the power-of-2 nature of the arguments.
You can use: https://alive2.llvm.org/ce/z/FkTMoy + adding the srem versions.
llvm/lib/Analysis/InstructionSimplify.cpp
1286: Hrm, it looks like computeConstantRange() only works with an additional special case for shl of vscale. That's ... not great, and probably fragile, because the same thing will happen with more complex patterns. I think now that I understand why you did this, I would prefer to go back to your previous patch that used getActiveBits(). Just add a comment that we need to use getActiveBits() to make use of the additional power of two knowledge.
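The requested comment essentially documents the following reasoning. A small self-contained illustration (the concrete numbers, helper name and parameters are made up for illustration, assuming vscale_range(1,16), a shift amount of 6 and a dividend of 1024):

  #include "llvm/ADT/APInt.h"
  #include "llvm/Support/KnownBits.h"
  using namespace llvm;

  // For X = shl(vscale, 6) with vscale_range(1,16), known bits only say
  // "bits 0-5 and bits above 10 are zero", so Known.getMaxValue() is
  // 0b11111000000 = 1984. A plain "max <= 1024" check therefore fails even
  // though the real maximum is 1024. Comparing active-bit counts instead,
  // and relying on X being a power of two (power-of-two vscale shifted
  // left), recovers the fold:
  static bool divisorFitsPow2Dividend(const KnownBits &XKnown, const APInt &C) {
    // Requires that X is separately known to be a power of two. Then
    // X <= 2^(countMaxActiveBits(X) - 1) and C == 2^(getActiveBits(C) - 1),
    // so the comparison below implies X <= C.
    return C.isPowerOf2() && XKnown.countMaxActiveBits() <= C.getActiveBits();
  }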
It seems alive2 doesn't support the semantics of vscale: https://github.com/AliveToolkit/alive2/issues/923
The case in the link can't be optimized with this patch because we can't currently infer that the operand of the urem is a power of two with isKnownToBeAPowerOfTwo, so I'll try it in a separate patch.
Addressed comments:
a) reverted to the check using getActiveBits
b) as the rem --> and fold is done in D155350, this transformation now lives in simplifyAndInst instead of simplifyURemInst
Yes, but the link shows that the transform is semantically equivalent. The case in the link covers anything we detect in the patch (assuming non-buggy code).
Can you:
- Add some tests that aren't reliant on vscale (just some simple cases is fine)?
- I would still like an alive2 link.

Code change looks fine to me.
Hi @goldstein.w.n, I added 2 new cases, and_add_shl and and_add_shl_overlap, with an alive2 link according to your comment.
Because canonicalizeLowbitMask always folds 1 << x into ~(-1 << x), I also added extra code to match that form.
I also tried to debug something like https://alive2.llvm.org/ce/z/r5XLZj, and found that the assume works on op0_p2 instead of pow2 itself, so that case needs to be handled differently.
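For reference, a hedged sketch of the and-based fold and the two low-bit-mask forms being matched. The function name, parameters and the way the bound on Y is obtained are illustrative; per the comments above, the real code lives in simplifyAndInst and derives the bound from vscale_range.

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/PatternMatch.h"
  using namespace llvm;
  using namespace llvm::PatternMatch;

  // Returns a zero constant when `C & Mask` is provably zero. Mask is a
  // low-bit mask built from Y, either as ((1 << Y) + -1) or in the
  // canonicalizeLowbitMask form ~(-1 << Y); both only cover bits [0, Y-1].
  // MaxY is an inclusive upper bound on Y (e.g. derived from vscale_range),
  // so the mask can never reach a set bit of C if C has at least MaxY
  // trailing zero bits.
  static Value *foldAndOfLowBitMaskSketch(const APInt &C, Value *Mask,
                                          unsigned MaxY, Type *Ty) {
    Value *Y;
    bool IsLowBitMask =
        match(Mask, m_Add(m_Shl(m_One(), m_Value(Y)), m_AllOnes())) ||
        match(Mask, m_Not(m_Shl(m_AllOnes(), m_Value(Y))));
    if (IsLowBitMask && C.countr_zero() >= MaxY)
      return Constant::getNullValue(Ty);
    return nullptr;
  }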
llvm/lib/Analysis/InstructionSimplify.cpp
2138: m_Not instead of m_Xor(v, -1). Also the comment doesn't quite match the code.
llvm/lib/Analysis/InstructionSimplify.cpp
2138: Applied your comment, thanks. (Also adjusted the comment.)
llvm/test/Transforms/InstCombine/and-add-shl.ll
40: You only have a test for the first case (we need one for the not case). Also, can you precommit the tests?
These proofs are for a different transform (urem x) than what was implemented (and x - 1).
llvm/lib/Analysis/InstructionSimplify.cpp
2146: I don't think this second fold should be added. This is something that can be handled via simple range propagation. In fact, IPSCCP does handle this already. We could make CVP handle it as well, if we wanted.
Removed the 2nd fold, and updated the alive2 link to use and directly: https://alive2.llvm.org/ce/z/bT62Wa
llvm/lib/Analysis/InstructionSimplify.cpp
2146: Thanks, I'll try this with CVP, and have dropped the 2nd fold for now.
llvm/lib/Analysis/InstructionSimplify.cpp
83: Unnecessary change.

llvm/test/Transforms/InstCombine/rem-mul-shl.ll
887: Please also add tests that are directly in the add -1 form (as that's what is actually being folded). We should also test the case where the constant operand is not a power of two, as it is a pre-condition of your transform. (Actually, we don't really need a power of two, it would be sufficient if it does not have any bits that may be part of the mask set. But it's a requirement for your current implementation.)
LGTM
llvm/test/Transforms/InstCombine/rem-mul-shl.ll
901: This one does have low bits set (it would be a negative test that cannot be folded). An example that can be folded is constant 3072 (3 * 1024).
llvm/lib/Analysis/InstructionSimplify.cpp
2121: Shift -> X to match the comment. This value doesn't need to be a shift.
llvm/lib/Analysis/InstructionSimplify.cpp
2121: Ignore this comment, X is not the same value. (Shift is not a great name, but I'm not sure what to call it right now.)
Thanks, changed the constant value to 3072 according to the comment, for the test and_add_shl_vscale_not_power2.
This commit caused misoptimizations in the WMA decoder in ffmpeg, observed on all architectures. The misoptimization can be observed with https://martin.st/temp/wma-preproc.c, compiled with clang -target aarch64-linux-gnu -c -O3 wma-preproc.c -o libavcodec/wma.o.
For a full runtime reproducible case:
$ git clone git://source.ffmpeg.org/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=../fate-samples
$ make -j$(nproc)
$ make fate-rsync
$ make fate-wmapro-2ch
If it takes a long time to fix, I'd appreciate reverting it in the meantime.
Sorry for the trouble.
Could you show me how to check whether the compiled function behaves correctly based on the run results?
In addition, I see a library file named libswscale/libswscale.a. I don't know if this is simulating the SVE feature. If so, the current optimization assumes vscale is a power-of-two value, and I don't know whether this assumption will affect the results.
With the steps outlined above, cloning ffmpeg and compiling it, if you run make -j$(nproc) fate-wmapro-2ch, it should print one TEST line and exit with a 0 exit code if the code was correctly compiled, or print an error if it was misoptimized.
> In addition, I see a library file named libswscale/libswscale.a. I don't know if this is simulating the SVE feature. If so, the current optimization assumes vscale is a power-of-two value, and I don't know whether this assumption will affect the results.
That is an entirely unrelated library for scaling and color conversion of video frames; it has nothing to do with the ARM SVE feature.
To document this issue, I submitted a record for it: https://github.com/llvm/llvm-project/issues/64339. I'll continue to follow up on the issue there.