This patch enables canonicalization of SPF_ABS and SPF_ABS to the abs intrinsic. I'll comment in-line for remaining issues...
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp | ||
---|---|---|
1070 | I'm wondering what to do about this one use check. Should we canonicalize even if the compare has other uses? We are folding cmp+sub+select into abs or abs+sub. However, both cmp and sub can potentially have extra uses. | |
llvm/test/Transforms/InstCombine/abs_abs.ll | ||
94 | This is the main remaining problem: What happens here is that we see abs(x) > 0, convert this to abs(x) != 0 (based on abs range) and then to x != 0. The last step happens due to the fold I added in rGada8a17d945c17c5603e24824f642ca199412adf, previously this worked. At that point we are left with something like x != 0 ? abs(x) : -abs(x), which is no longer a proper abs. What do you think the best way to handle this would be? Should this pattern be added to the SPF_ABS recognition? Just explicitly match it in InstCombine? |
I would expect the tests in llvm to only be subset of the possible second order effects that can come up from changing this. Perhaps abs is simple enough, but you might want to run the llvm test suite to see if there are any other interesting binary differences.
clang/test/CodeGen/arm-mve-intrinsics/vmaxaq.c | ||
---|---|---|
3 ↗ | (On Diff #290127) | Hmm. These needn't be running -O3. I'll see to removing that. Looks OK though, from what I understand about how these get lowered. |
I've looked at some of the diffs on test suite. The main change I'm seeing is that loops containing abs now get unrolled more. I'm not sure whether that's expected/desired or not.
I've also spotted two possible folds (these aren't regressions though, they aren't recognized in select form either):
- abs(x, false) < 0 => x == INT_MIN
- https://alive2.llvm.org/ce/z/Mu93Wx
To give an example of what I mean by more unrolling, https://editor.mergely.com/z5GcKJGU/ is the diff for Bitcode/simd_ops/CMakeFiles/simd_ops_test_op_pabsd_237. This is a loop with llvm.abs.v4i32 operations that now has double the unrolling factor.
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp | ||
---|---|---|
1070 | Here's how dropping the one use check would look like: https://gist.github.com/nikic/3dece504951cba82415fed2d6aa876bf Nominally, we are only increasing instruction count for the combination of nabs with extra uses on both icmp and sub. |
LGTM - I think we should give this a try as-is (with the one-use check still there), see if anything regresses, then ease/remove the use check as a follow-on.
As noted, we may need to adjust cost models to account for the size/speed difference that's showing up in unrolling/inlining. That's probably because we assume that an intrinsic is expanded to a single instruction vs. the current cmp+sub+select being 3 instructions?
Okay, let's do that.
As noted, we may need to adjust cost models to account for the size/speed difference that's showing up in unrolling/inlining. That's probably because we assume that an intrinsic is expanded to a single instruction vs. the current cmp+sub+select being 3 instructions?
The default abs cost model already uses icmp+select+sub. Though X86 has custom cost model for the vector variants, which is generally cheaper, so it makes sense that more unrolling is seen there. I expect that this part of the change already happened when the vector intrinsics were switched over recently.
This broke a few tests for me (generating code that now gives the fail result at runtime).
I'm not entirely sure which bit is the culprit, but the difference in output (that breaks tests) for one object file is available at https://martin.st/temp/vc1_block-diff-aarch64.txt, and https://martin.st/temp/vc1_block-diff-armv7.txt. For the aarch64 version, it looks like some conditionals are inverted, like these changes:
- b.ge .LBB1_124 + b.hs .LBB1_124
and
- csel w14, w15, w14, gt + csel w14, w15, w14, hi
(with no seemingly related changes that would change the roles of the registers).
The input files for reproducing the issues are available at https://martin.st/temp/vc1_block-aarch64.c and https://martin.st/temp/vc1_block-armv7.c, which can be compiled with clang -target {aarch64,armv7}-w64-mingw32 -c -O2 vc1_block-{aarch64,armv7}.c.
I know this has already been reverted but just FYI that I've bisected a ~2% regression in SPEC2017 x264_r on AArch64 to this commit. Presumably this is due to the extra unrolling / cost modelling issue already mentioned?
That would be my guess. I'm not familiar with the unroller or inliner, but I don't even see them using the cost model at 1st glance. Are they purely counting IR instructions?
If I missed the cost model calls, the cost model is still not correct for code-size:
$ cat cost.ll declare i32 @llvm.abs.i32(i32, i1) define i32 @abs(i32 %x, i32 %y) { %I32 = call i32 @llvm.abs.i32(i32 %x, i1 false) %negx = sub i32 0, %x %cmp = icmp sgt i32 %x, -1 %sel = select i1 %cmp, i32 %x, i32 %negx ret i32 %I32 } $ opt -cost-model -cost-kind=code-size -analyze -mtriple=aarch64-- cost.ll Printing analysis 'Cost Model Analysis' for function 'abs': Cost Model: Found an estimated cost of 1 for instruction: %I32 = call i32 @llvm.abs.i32(i32 %x, i1 false) Cost Model: Found an estimated cost of 1 for instruction: %negx = sub i32 0, %x Cost Model: Found an estimated cost of 1 for instruction: %cmp = icmp sgt i32 %x, -1 Cost Model: Found an estimated cost of 1 for instruction: %sel = select i1 %cmp, i32 %x, i32 %negx
So we're assuming the size cost is either 1 or 3 depending on how we encoded abs() in IR, but neither is correct for AArch64 IIUC:
$ llc -o - -mtriple=aarch64-- cost.ll cmp w0, #0 // =0 cneg w0, w0, mi ret
Based on an initial look, the changes in comparison predicates here are probably a red herring. If I understand right, those predicates are switching from signed to unsigned comparison (e.g. gt to hi). I also see similar changes in IR. However, all the cases I looked at are actually correct (e.g. abs(x) > abs(y) can be compared signed or unsigned for poisoning abs). Assuming that this code is clean under ubsan, the problem is likely something else.
I just checked, and the code is indeed clean under ubsan.
The changes in the asm is more than just a few places, so it's quite a bit of work to dissect which one of them (if the changes are localized into smaller individual changes) is the one that's causing the error...
Reopening this so we don't forget...
I believe @spatel is working on the cost modelling. I did not have much luck tracking down the miscompile, at least did not spot anything incriminating in the llvm-diff.
Yes, I'm working through the mess of the TTI cost model with things like:
c963bde0152a
It's a slog...
D90554 / f7eac51b9b
I think that change makes this patch ready to try again. If we do see regressions, then it should now be easy to adjust target-specific cost modeling of abs() intrinsics to fix those.
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp | ||
---|---|---|
1070 | I think we shouldn't have one-use checks here, yes. |
We still need to track down the miscompile that @mstorsjo reported before reapplying this patch.
This needs someone with access to AArch64 hardware to look into https://reviews.llvm.org/D87188#2281093 to make progress. I don't have any AArch64 hardware, and judging by the time it takes to build cmake on an AArch64 machine in the GCC compile farm, there's no way I'm going to be doing LLVM builds there.
@mstorsjo please can you be more specific what kind of tests are starting to fail?
Do those tests just check that the code compiled into an identical assembly?
I reduced vc1_block-aarch64.c, and got:
; ModuleID = 'input.ll' source_filename = "vc1_block-aarch64.c" target datalayout = "e-m:w-p:64:64-i32:32-i64:64-i128:128-n32:64-S128" target triple = "aarch64-w64-windows-gnu" define i1 @barney(i8 %arg, i8 %arg9) { bb: %tmp = sext i8 %arg to i32 %tmp10 = icmp slt i32 %tmp, 0 %tmp11 = sub nsw i32 0, %tmp %tmp12 = select i1 %tmp10, i32 %tmp11, i32 %tmp %tmp13 = shl nuw nsw i32 %tmp12, 1 %tmp14 = zext i8 %arg9 to i32 %tmp15 = add nuw nsw i32 %tmp13, %tmp14 %tmp16 = icmp slt i32 %tmp15, 2 ret i1 %tmp16 } declare i32 @eggs()
which used to get optimized into
; ModuleID = '<stdin>' source_filename = "vc1_block-aarch64.c" target datalayout = "e-m:w-p:64:64-i32:32-i64:64-i128:128-n32:64-S128" target triple = "aarch64-w64-windows-gnu" define i1 @barney(i8 %arg, i8 %arg9) { bb: %tmp = sext i8 %arg to i32 %tmp10 = icmp slt i32 %tmp, 0 %tmp11 = sub nsw i32 0, %tmp %tmp12 = select i1 %tmp10, i32 %tmp11, i32 %tmp %tmp13 = shl nuw nsw i32 %tmp12, 1 %tmp14 = zext i8 %arg9 to i32 %tmp15 = add nuw nsw i32 %tmp13, %tmp14 %tmp16 = icmp slt i32 %tmp15, 2 ret i1 %tmp16 } declare i32 @eggs()
and is now optimized into
; ModuleID = '<stdin>' source_filename = "vc1_block-aarch64.c" target datalayout = "e-m:w-p:64:64-i32:32-i64:64-i128:128-n32:64-S128" target triple = "aarch64-w64-windows-gnu" define i1 @barney(i8 %arg, i8 %arg9) { bb: %tmp = sext i8 %arg to i32 %0 = call i32 @llvm.abs.i32(i32 %tmp, i1 true) %tmp13 = shl nuw nsw i32 %0, 1 %tmp14 = zext i8 %arg9 to i32 %tmp15 = add nuw nsw i32 %tmp13, %tmp14 %tmp16 = icmp ult i32 %tmp15, 2 ret i1 %tmp16 } declare i32 @eggs() ; Function Attrs: nofree nosync nounwind readnone speculatable willreturn declare i32 @llvm.abs.i32(i32, i1 immarg) #0 attributes #0 = { nofree nosync nounwind readnone speculatable willreturn }
which is not a miscompile:
$ /repositories/alive2/build-Clang-release/alive-tv old.ll new.ll ---------------------------------------- define i1 @barney(i8 %arg, i8 %arg9) { %bb: %tmp = sext i8 %arg to i32 %tmp10 = icmp slt i32 %tmp, 0 %tmp11 = sub nsw i32 0, %tmp %tmp12 = select i1 %tmp10, i32 %tmp11, i32 %tmp %tmp13 = shl nsw nuw i32 %tmp12, 1 %tmp14 = zext i8 %arg9 to i32 %tmp15 = add nsw nuw i32 %tmp13, %tmp14 %tmp16 = icmp slt i32 %tmp15, 2 ret i1 %tmp16 } => define i1 @barney(i8 %arg, i8 %arg9) { %bb: %tmp = sext i8 %arg to i32 %0 = abs i32 %tmp, 1 %tmp13 = shl nsw nuw i32 %0, 1 %tmp14 = zext i8 %arg9 to i32 %tmp15 = add nsw nuw i32 %tmp13, %tmp14 %tmp16 = icmp ult i32 %tmp15, 2 ret i1 %tmp16 } Transformation seems to be correct! Summary: 1 correct transformations 0 incorrect transformations 0 Alive2 errors
The original assembly was:
.text .file "vc1_block-aarch64.c" .def barney; .scl 2; .type 32; .endef .globl barney // -- Begin function barney .p2align 2 barney: // @barney // %bb.0: // %bb sxtb w8, w0 cmp w8, #0 // =0 cneg w8, w8, mi lsl w8, w8, #1 add w8, w8, w1, uxtb cmp w8, #2 // =2 cset w0, lt ret // -- End function
and new one is
.text .file "vc1_block-aarch64.c" .def barney; .scl 2; .type 32; .endef .globl barney // -- Begin function barney .p2align 2 barney: // @barney // %bb.0: // %bb sxtb w8, w0 cmp w8, #0 // =0 cneg w8, w8, mi lsl w8, w8, #1 add w8, w8, w1, uxtb cmp w8, #2 // =2 cset w0, lo ret // -- End function
with diff being
$ diff old.s new.s 17c17 < cset w0, lt --- > cset w0, lo
No, it's video/audio decoding tests, and after that change specific reference input samples produce a different output hash than before.
I can retry the change (I see that the patch is recently rebased and hopefully still should apply on the latest main version) soon and see if I can pinpoint what's going wrong any closer than before.
Thanks for reducing!
If I'm seeing it correctly, the codegen looks fine - the difference is in the IR icmp predicate changing from slt to ult, and that diff appears to be correct independent of whether or not we use the abs intrinsic:
https://alive2.llvm.org/ce/z/VDDqBj
So either the source/test has an incorrect expectation and/or something outside of this function is wrong?
It's also correct without the nsw/nuw flags since the inputs are i8 and extended to i32 it can't see i32 INT_MIN.
Just to be sure, i just run the entire compilation of vc1_block-aarch64.c through alive2, and as far as i can tell, it did not report any IR-level miscompilations.
So either the original code has bugs/UB's/whatever, or this is an AArch64 backend bug (cc @t.p.northover), or the IR problem is in a place alive2 can't find.
FWIW, the testcase does run without any remarks in ubsan at least...
I retried the issue now, and I've reduced it down to one standalone function that causes the errors. https://martin.st/temp/vc1_block-p-block2.c is this function separated into a standalone translation unit - that I can run in the test framework. Unfortunately, the rather small difference further up in the optimization chain cascades into quite a lot of differences at the end, and I don't know really where to go from there to pinpoint the issue.
FWIW, the issue is reproducible for an aarch64-linux-gnu target as well. If it would help someone, I could make a prepackaged build for that target, with object files + this single source function that breaks it + sample input files to run it with.
(Also, for studying the effect of this one, on an IR level, the difference is already present in the output of -S -emit-llvm from clang, unless using -Xclang -disable-llvm-passes, but I'm sure you already took that into account.)
@mstorsjo i've bothered @dmajor to give this a try on some of mozilla's CI, and unfortunately it came back green..
I don't recall if i can reproduce it locally (on vanilla test suite, i'll recheck),
but right now it would seem that you are the only one who can reproduce this :/
Were you able to make any progress on a somewhat more actionable reproducer?
What would be really helpful is a standalone source that can be compiled and run (here by be locally, on linux x86_64)
that produces different result with/without this change.
IIRC I'm only able to reproduce this for aarch64 (and possibly arm) targets unfortunately. I can make a kind of standalone testcase with one single file built from source, linked and executed along with a bigger piece of code (either from source or prebuilt object files), but it'd be for aarch64-linux (or aarch64-darwin if that helps - I haven't tried reproducing it there but I'd presume it behaves the same as I noticed the bug first on aarch64 windows).
Ok, here's a reproduction environment, requiring aarch64 linux (tested on ubuntu 18.04, but anything similar or newer should be fine, and AWS and other cloud providers makes it fairly easy to spin up such an instance) - with https://martin.st/temp/ffmpeg-repro.zip downloaded and unzipped:
$ gcc -c vc1_block-debug.c -o vc1_block-debug.o $ gcc *.o -Wl,--start-group *.a -Wl,--end-group -o ffmpeg -lm -lpthread -lz $ ./ffmpeg -v quiet -i SA00040.vc1 -f framecrc - #software: Lavf58.65.100 #tb 0: 1/25 #media_type 0: video #codec_id 0: rawvideo #dimensions 0: 176x144 #sar 0: 1/1 0, 0, 0, 1, 38016, 0xa6f15db5 0, 1, 1, 1, 38016, 0xa6f15db5 0, 2, 2, 1, 38016, 0xa6f15db5 0, 4, 4, 1, 38016, 0x5c4ef0e7 0, 5, 5, 1, 38016, 0x53a42d1d 0, 6, 6, 1, 38016, 0x68f7d89e 0, 7, 7, 1, 38016, 0xc15f4368 0, 8, 8, 1, 38016, 0xc15f4368 0, 9, 9, 1, 38016, 0xd1bd47a8 0, 10, 10, 1, 38016, 0xd1bd47a8 0, 11, 11, 1, 38016, 0xe1e821ca 0, 12, 12, 1, 38016, 0xe1e821ca 0, 13, 13, 1, 38016, 0xe1e821ca 0, 14, 14, 1, 38016, 0xe1e821ca 0, 15, 15, 1, 38016, 0xe1e821ca $ clang -c vc1_block-debug.c -o vc1_block-debug.o -O2 # Recompile the problematic function with clang with this patch included $ gcc *.o -Wl,--start-group *.a -Wl,--end-group -o ffmpeg -lm -lpthread -lz $ ./ffmpeg -v quiet -i SA00040.vc1 -f framecrc - #software: Lavf58.65.100 #tb 0: 1/25 #media_type 0: video #codec_id 0: rawvideo #dimensions 0: 176x144 #sar 0: 1/1 0, 0, 0, 1, 38016, 0xa6f15db5 0, 1, 1, 1, 38016, 0xa6f15db5 0, 2, 2, 1, 38016, 0xa6f15db5 0, 4, 4, 1, 38016, 0x5c4ef0e7 0, 5, 5, 1, 38016, 0x53a42d1d 0, 6, 6, 1, 38016, 0x68f7d89e 0, 7, 7, 1, 38016, 0xc15f4368 0, 8, 8, 1, 38016, 0xc15f4368 0, 9, 9, 1, 38016, 0xd1bd47a8 0, 10, 10, 1, 38016, 0xd1bd47a8 0, 11, 11, 1, 38016, 0x3bf00b9a # The hashes from here onwards differ 0, 12, 12, 1, 38016, 0x3bf00b9a 0, 13, 13, 1, 38016, 0x3bf00b9a 0, 14, 14, 1, 38016, 0x3bf00b9a 0, 15, 15, 1, 38016, 0x3bf00b9a
Alternative repro, fully from source:
$ git clone https://github.com/mstorsjo/ffmpeg.git $ cd ffmpeg $ git checkout origin/vc1-debug $ mkdir ../ffmpeg-build; cd ../ffmpeg-build $ ../ffmpeg/configure --disable-everything --enable-decoder=vc1 --enable-demuxer=vc1 --enable-encoder=rawvideo --enable-muxer=framecrc --enable-protocol=file --enable-protocol=pipe --enable-parser=vc1 --enable-bsf=extract_extradata --samples=$(pwd)/../samples $ make -j$(nproc) $ make fate-rsync # Fetch reference test samples $ make fate-vc1 # Succeeds $ clang -c ../ffmpeg/libavcodec/vc1_block-debug.c -o libavcodec/vc1_block-debug.o -O3 -I../ffmpeg -I. # Recompile the problematic function with clang with this patch included $ make fate-vc1 # Fails TEST vc1_sa00040 --- /home/ubuntu/code/ffmpeg/tests/ref/fate/vc1_sa00040 2020-12-18 12:04:27.216500325 +0000 +++ tests/data/fate/vc1_sa00040 2020-12-18 12:10:33.352940826 +0000 @@ -13,8 +13,8 @@ 0, 8, 8, 1, 38016, 0xc15f4368 0, 9, 9, 1, 38016, 0xd1bd47a8 0, 10, 10, 1, 38016, 0xd1bd47a8 -0, 11, 11, 1, 38016, 0xe1e821ca -0, 12, 12, 1, 38016, 0xe1e821ca -0, 13, 13, 1, 38016, 0xe1e821ca -0, 14, 14, 1, 38016, 0xe1e821ca -0, 15, 15, 1, 38016, 0xe1e821ca +0, 11, 11, 1, 38016, 0x3bf00b9a +0, 12, 12, 1, 38016, 0x3bf00b9a +0, 13, 13, 1, 38016, 0x3bf00b9a +0, 14, 14, 1, 38016, 0x3bf00b9a +0, 15, 15, 1, 38016, 0x3bf00b9a Test vc1_sa00040 failed. Look at tests/data/fate/vc1_sa00040.err for details. /home/ubuntu/code/ffmpeg/tests/Makefile:255: recipe for target 'fate-vc1_sa00040' failed make: *** [fate-vc1_sa00040] Error 1
Thanks for the reproducer. I verified that it does indeed fail with this patch.
It seems to be doing this as a knock-on effect: https://godbolt.org/z/Y4z3je, which does not verify: https://alive2.llvm.org/ce/z/PN7Rv5 ?
Yeah. The reproducer seems to work OK with a patch something like this:
diff --git a/llvm/lib/Analysis/InstructionSimplify.cpp b/llvm/lib/Analysis/InstructionSimplify.cpp index 35c21a0..c517286 100644 --- a/llvm/lib/Analysis/InstructionSimplify.cpp +++ b/llvm/lib/Analysis/InstructionSimplify.cpp @@ -4020,11 +4020,11 @@ static Value *simplifySelectWithICmpCond(Value *CondVal, Value *TrueVal, // X == 0 ? abs(X) : -abs(X) --> -abs(X) // X == 0 ? -abs(X) : abs(X) --> abs(X) - if (match(TrueVal, m_Intrinsic<Intrinsic::abs>(m_Value(X))) && - match(FalseVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(X))))) + if (match(TrueVal, m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS))) && + match(FalseVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS))))) return FalseVal; - if (match(TrueVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Value(X)))) && - match(FalseVal, m_Intrinsic<Intrinsic::abs>(m_Specific(X)))) + if (match(TrueVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS)))) && + match(FalseVal, m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS)))) return FalseVal; }
(I must admit, was fully expecting this to be something wrong in the AArch64 backend. I'll leave that fix to you if you are happy to add tests and whatnot.)
Indeed, that is what i have already committed locally and will push in a sec.
(I must admit, was fully expecting this to be something wrong in the AArch64 backend. I'll leave that fix to you if you are happy to add tests and whatnot.)
Yeah, i also was almost convinced it was :)
Awesome, thanks guys!
I'll let you know tomorrow in the unlikely case if this still caused some other regressions as well (or possibly later, if my nightly build ends up broken for other reasons).
My tests ran successfully with a version with this applied, so the case can be considered closed now.
This patch regressed the following test of Transforms/InstCombine/abs-1.ll:
(need to drop NSW in this case).
define i8 @nabs_canonical_3(i8 %x) { %0: %cmp = icmp slt i8 %x, 0 %neg = sub nsw i8 0, %x %abs = select i1 %cmp, i8 %x, i8 %neg ret i8 %abs } => define i8 @nabs_canonical_3(i8 %x) { %0: %1 = abs i8 %x, 1 %abs = sub nsw i8 0, %1 ret i8 %abs } Transformation doesn't verify! ERROR: Target is more poisonous than source Example: i8 %x = #x80 (128, -128) Source: i1 %cmp = #x1 (1) i8 %neg = poison i8 %abs = #x80 (128, -128) Target: i8 %1 = poison i8 %abs = poison Source value: #x80 (128, -128) Target value: poison
(same bug occurs with @nabs_nabs_x01 in Transforms/InstCombine/abs_abs.ll)
Heads up: Breaks a test for us: https://bugs.chromium.org/p/chromium/issues/detail?id=1161542
(No reduced repro yet, might be UB, just fyi at this point.)
Thanks for headsup. For now i'll deal with the problem @nlopes pointed out above in a bit..
Just to follow up, this ended up being UB on our end (fix: https://android-review.googlesource.com/c/platform/external/perfetto/+/1535483)
I'm wondering what to do about this one use check. Should we canonicalize even if the compare has other uses?
We are folding cmp+sub+select into abs or abs+sub. However, both cmp and sub can potentially have extra uses.