Page MenuHomePhabricator

[InstCombine] Canonicalize SPF to abs intrinc
ClosedPublic

Authored by lebedev.ri on Sep 5 2020, 6:40 AM.

Details

Summary

This patch enables canonicalization of SPF_ABS and SPF_ABS to the abs intrinsic. I'll comment in-line for remaining issues...

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
nikic added inline comments.Sep 5 2020, 6:53 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1070

I'm wondering what to do about this one use check. Should we canonicalize even if the compare has other uses?

We are folding cmp+sub+select into abs or abs+sub. However, both cmp and sub can potentially have extra uses.

llvm/test/Transforms/InstCombine/abs_abs.ll
94

This is the main remaining problem: What happens here is that we see abs(x) > 0, convert this to abs(x) != 0 (based on abs range) and then to x != 0. The last step happens due to the fold I added in rGada8a17d945c17c5603e24824f642ca199412adf, previously this worked. At that point we are left with something like x != 0 ? abs(x) : -abs(x), which is no longer a proper abs.

What do you think the best way to handle this would be? Should this pattern be added to the SPF_ABS recognition? Just explicitly match it in InstCombine?

nikic updated this revision to Diff 290127.Sep 6 2020, 12:47 AM

Rebase over D87197, update affected clang tests.

I would expect the tests in llvm to only be subset of the possible second order effects that can come up from changing this. Perhaps abs is simple enough, but you might want to run the llvm test suite to see if there are any other interesting binary differences.

clang/test/CodeGen/arm-mve-intrinsics/vmaxaq.c
3

Hmm. These needn't be running -O3. I'll see to removing that. Looks OK though, from what I understand about how these get lowered.

nikic added a comment.Sep 6 2020, 11:58 AM

I've looked at some of the diffs on test suite. The main change I'm seeing is that loops containing abs now get unrolled more. I'm not sure whether that's expected/desired or not.

I've also spotted two possible folds (these aren't regressions though, they aren't recognized in select form either):

nikic added a comment.Sep 6 2020, 12:17 PM

To give an example of what I mean by more unrolling, https://editor.mergely.com/z5GcKJGU/ is the diff for Bitcode/simd_ops/CMakeFiles/simd_ops_test_op_pabsd_237. This is a loop with llvm.abs.v4i32 operations that now has double the unrolling factor.

nikic updated this revision to Diff 291407.Sep 12 2020, 11:17 AM
nikic retitled this revision from [InstCombine] Canonicalize SPF to abs intrinc (WIP) to [InstCombine] Canonicalize SPF to abs intrinc.

Rebase

nikic added inline comments.Sep 12 2020, 11:35 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1070

Here's how dropping the one use check would look like: https://gist.github.com/nikic/3dece504951cba82415fed2d6aa876bf

Nominally, we are only increasing instruction count for the combination of nabs with extra uses on both icmp and sub.

spatel accepted this revision.Sep 16 2020, 1:48 PM

LGTM - I think we should give this a try as-is (with the one-use check still there), see if anything regresses, then ease/remove the use check as a follow-on.
As noted, we may need to adjust cost models to account for the size/speed difference that's showing up in unrolling/inlining. That's probably because we assume that an intrinsic is expanded to a single instruction vs. the current cmp+sub+select being 3 instructions?

This revision is now accepted and ready to land.Sep 16 2020, 1:48 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptSep 17 2020, 1:29 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
nikic added a comment.Sep 17 2020, 1:38 PM

LGTM - I think we should give this a try as-is (with the one-use check still there), see if anything regresses, then ease/remove the use check as a follow-on.

Okay, let's do that.

As noted, we may need to adjust cost models to account for the size/speed difference that's showing up in unrolling/inlining. That's probably because we assume that an intrinsic is expanded to a single instruction vs. the current cmp+sub+select being 3 instructions?

The default abs cost model already uses icmp+select+sub. Though X86 has custom cost model for the vector variants, which is generally cheaper, so it makes sense that more unrolling is seen there. I expect that this part of the change already happened when the vector intrinsics were switched over recently.

This broke a few tests for me (generating code that now gives the fail result at runtime).

I'm not entirely sure which bit is the culprit, but the difference in output (that breaks tests) for one object file is available at https://martin.st/temp/vc1_block-diff-aarch64.txt, and https://martin.st/temp/vc1_block-diff-armv7.txt. For the aarch64 version, it looks like some conditionals are inverted, like these changes:

-	b.ge	.LBB1_124
+	b.hs	.LBB1_124

and

-	csel	w14, w15, w14, gt
+	csel	w14, w15, w14, hi

(with no seemingly related changes that would change the roles of the registers).

The input files for reproducing the issues are available at https://martin.st/temp/vc1_block-aarch64.c and https://martin.st/temp/vc1_block-armv7.c, which can be compiled with clang -target {aarch64,armv7}-w64-mingw32 -c -O2 vc1_block-{aarch64,armv7}.c.

I know this has already been reverted but just FYI that I've bisected a ~2% regression in SPEC2017 x264_r on AArch64 to this commit. Presumably this is due to the extra unrolling / cost modelling issue already mentioned?

I know this has already been reverted but just FYI that I've bisected a ~2% regression in SPEC2017 x264_r on AArch64 to this commit. Presumably this is due to the extra unrolling / cost modelling issue already mentioned?

That would be my guess. I'm not familiar with the unroller or inliner, but I don't even see them using the cost model at 1st glance. Are they purely counting IR instructions?

If I missed the cost model calls, the cost model is still not correct for code-size:

$ cat cost.ll
declare i32        @llvm.abs.i32(i32, i1)
define i32 @abs(i32 %x, i32 %y) {
  %I32 = call i32 @llvm.abs.i32(i32 %x, i1 false)
  %negx = sub i32 0, %x
  %cmp = icmp sgt i32 %x, -1
  %sel = select i1 %cmp, i32 %x, i32 %negx 
  ret i32 %I32
}

$ opt  -cost-model -cost-kind=code-size -analyze -mtriple=aarch64--  cost.ll 
Printing analysis 'Cost Model Analysis' for function 'abs':
Cost Model: Found an estimated cost of 1 for instruction:   %I32 = call i32 @llvm.abs.i32(i32 %x, i1 false)
Cost Model: Found an estimated cost of 1 for instruction:   %negx = sub i32 0, %x
Cost Model: Found an estimated cost of 1 for instruction:   %cmp = icmp sgt i32 %x, -1
Cost Model: Found an estimated cost of 1 for instruction:   %sel = select i1 %cmp, i32 %x, i32 %negx

So we're assuming the size cost is either 1 or 3 depending on how we encoded abs() in IR, but neither is correct for AArch64 IIUC:

$ llc -o - -mtriple=aarch64-- cost.ll 
	cmp	w0, #0                          // =0
	cneg	w0, w0, mi
	ret

This broke a few tests for me (generating code that now gives the fail result at runtime).

I'm not entirely sure which bit is the culprit, but the difference in output (that breaks tests) for one object file is available at https://martin.st/temp/vc1_block-diff-aarch64.txt, and https://martin.st/temp/vc1_block-diff-armv7.txt. For the aarch64 version, it looks like some conditionals are inverted, like these changes:

-	b.ge	.LBB1_124
+	b.hs	.LBB1_124

and

-	csel	w14, w15, w14, gt
+	csel	w14, w15, w14, hi

(with no seemingly related changes that would change the roles of the registers).

The input files for reproducing the issues are available at https://martin.st/temp/vc1_block-aarch64.c and https://martin.st/temp/vc1_block-armv7.c, which can be compiled with clang -target {aarch64,armv7}-w64-mingw32 -c -O2 vc1_block-{aarch64,armv7}.c.

Based on an initial look, the changes in comparison predicates here are probably a red herring. If I understand right, those predicates are switching from signed to unsigned comparison (e.g. gt to hi). I also see similar changes in IR. However, all the cases I looked at are actually correct (e.g. abs(x) > abs(y) can be compared signed or unsigned for poisoning abs). Assuming that this code is clean under ubsan, the problem is likely something else.

Based on an initial look, the changes in comparison predicates here are probably a red herring. If I understand right, those predicates are switching from signed to unsigned comparison (e.g. gt to hi). I also see similar changes in IR. However, all the cases I looked at are actually correct (e.g. abs(x) > abs(y) can be compared signed or unsigned for poisoning abs). Assuming that this code is clean under ubsan, the problem is likely something else.

I just checked, and the code is indeed clean under ubsan.

The changes in the asm is more than just a few places, so it's quite a bit of work to dissect which one of them (if the changes are localized into smaller individual changes) is the one that's causing the error...

nikic reopened this revision.Oct 22 2020, 1:20 PM

Reopening this so we don't forget...

I believe @spatel is working on the cost modelling. I did not have much luck tracking down the miscompile, at least did not spot anything incriminating in the llvm-diff.

This revision is now accepted and ready to land.Oct 22 2020, 1:20 PM

Reopening this so we don't forget...

I believe @spatel is working on the cost modelling. I did not have much luck tracking down the miscompile, at least did not spot anything incriminating in the llvm-diff.

Yes, I'm working through the mess of the TTI cost model with things like:
c963bde0152a
It's a slog...

Reopening this so we don't forget...

I believe @spatel is working on the cost modelling. I did not have much luck tracking down the miscompile, at least did not spot anything incriminating in the llvm-diff.

Yes, I'm working through the mess of the TTI cost model with things like:
c963bde0152a
It's a slog...

D90554 / f7eac51b9b
I think that change makes this patch ready to try again. If we do see regressions, then it should now be easy to adjust target-specific cost modeling of abs() intrinsics to fix those.

lebedev.ri added inline comments.Nov 10 2020, 6:38 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1070

I think we shouldn't have one-use checks here, yes.

D90554 / f7eac51b9b
I think that change makes this patch ready to try again. If we do see regressions, then it should now be easy to adjust target-specific cost modeling of abs() intrinsics to fix those.

We still need to track down the miscompile that @mstorsjo reported before reapplying this patch.

nikic abandoned this revision.Dec 6 2020, 9:40 AM

This needs someone with access to AArch64 hardware to look into https://reviews.llvm.org/D87188#2281093 to make progress. I don't have any AArch64 hardware, and judging by the time it takes to build cmake on an AArch64 machine in the GCC compile farm, there's no way I'm going to be doing LLVM builds there.

lebedev.ri commandeered this revision.Dec 10 2020, 4:37 AM
lebedev.ri edited reviewers, added: nikic; removed: lebedev.ri.

@mstorsjo please can you be more specific what kind of tests are starting to fail?
Do those tests just check that the code compiled into an identical assembly?

I reduced vc1_block-aarch64.c, and got:

; ModuleID = 'input.ll'
source_filename = "vc1_block-aarch64.c"
target datalayout = "e-m:w-p:64:64-i32:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-w64-windows-gnu"

define i1 @barney(i8 %arg, i8 %arg9) {
bb:
  %tmp = sext i8 %arg to i32
  %tmp10 = icmp slt i32 %tmp, 0
  %tmp11 = sub nsw i32 0, %tmp
  %tmp12 = select i1 %tmp10, i32 %tmp11, i32 %tmp
  %tmp13 = shl nuw nsw i32 %tmp12, 1
  %tmp14 = zext i8 %arg9 to i32
  %tmp15 = add nuw nsw i32 %tmp13, %tmp14
  %tmp16 = icmp slt i32 %tmp15, 2
  ret i1 %tmp16
}

declare i32 @eggs()

which used to get optimized into

; ModuleID = '<stdin>'
source_filename = "vc1_block-aarch64.c"
target datalayout = "e-m:w-p:64:64-i32:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-w64-windows-gnu"

define i1 @barney(i8 %arg, i8 %arg9) {
bb:
  %tmp = sext i8 %arg to i32
  %tmp10 = icmp slt i32 %tmp, 0
  %tmp11 = sub nsw i32 0, %tmp
  %tmp12 = select i1 %tmp10, i32 %tmp11, i32 %tmp
  %tmp13 = shl nuw nsw i32 %tmp12, 1
  %tmp14 = zext i8 %arg9 to i32
  %tmp15 = add nuw nsw i32 %tmp13, %tmp14
  %tmp16 = icmp slt i32 %tmp15, 2
  ret i1 %tmp16
}

declare i32 @eggs()

and is now optimized into

; ModuleID = '<stdin>'
source_filename = "vc1_block-aarch64.c"
target datalayout = "e-m:w-p:64:64-i32:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-w64-windows-gnu"

define i1 @barney(i8 %arg, i8 %arg9) {
bb:
  %tmp = sext i8 %arg to i32
  %0 = call i32 @llvm.abs.i32(i32 %tmp, i1 true)
  %tmp13 = shl nuw nsw i32 %0, 1
  %tmp14 = zext i8 %arg9 to i32
  %tmp15 = add nuw nsw i32 %tmp13, %tmp14
  %tmp16 = icmp ult i32 %tmp15, 2
  ret i1 %tmp16
}

declare i32 @eggs()

; Function Attrs: nofree nosync nounwind readnone speculatable willreturn
declare i32 @llvm.abs.i32(i32, i1 immarg) #0

attributes #0 = { nofree nosync nounwind readnone speculatable willreturn }

which is not a miscompile:

$ /repositories/alive2/build-Clang-release/alive-tv old.ll new.ll

----------------------------------------
define i1 @barney(i8 %arg, i8 %arg9) {
%bb:
  %tmp = sext i8 %arg to i32
  %tmp10 = icmp slt i32 %tmp, 0
  %tmp11 = sub nsw i32 0, %tmp
  %tmp12 = select i1 %tmp10, i32 %tmp11, i32 %tmp
  %tmp13 = shl nsw nuw i32 %tmp12, 1
  %tmp14 = zext i8 %arg9 to i32
  %tmp15 = add nsw nuw i32 %tmp13, %tmp14
  %tmp16 = icmp slt i32 %tmp15, 2
  ret i1 %tmp16
}
=>
define i1 @barney(i8 %arg, i8 %arg9) {
%bb:
  %tmp = sext i8 %arg to i32
  %0 = abs i32 %tmp, 1
  %tmp13 = shl nsw nuw i32 %0, 1
  %tmp14 = zext i8 %arg9 to i32
  %tmp15 = add nsw nuw i32 %tmp13, %tmp14
  %tmp16 = icmp ult i32 %tmp15, 2
  ret i1 %tmp16
}
Transformation seems to be correct!

Summary:
  1 correct transformations
  0 incorrect transformations
  0 Alive2 errors

The original assembly was:

        .text
        .file   "vc1_block-aarch64.c"
        .def     barney;
        .scl    2;
        .type   32;
        .endef
        .globl  barney                          // -- Begin function barney
        .p2align        2
barney:                                 // @barney
// %bb.0:                               // %bb
        sxtb    w8, w0
        cmp     w8, #0                          // =0
        cneg    w8, w8, mi
        lsl     w8, w8, #1
        add     w8, w8, w1, uxtb
        cmp     w8, #2                          // =2
        cset    w0, lt
        ret
                                        // -- End function

and new one is

        .text
        .file   "vc1_block-aarch64.c"
        .def     barney;
        .scl    2;
        .type   32;
        .endef
        .globl  barney                          // -- Begin function barney
        .p2align        2
barney:                                 // @barney
// %bb.0:                               // %bb
        sxtb    w8, w0
        cmp     w8, #0                          // =0
        cneg    w8, w8, mi
        lsl     w8, w8, #1
        add     w8, w8, w1, uxtb
        cmp     w8, #2                          // =2
        cset    w0, lo
        ret
                                        // -- End function

with diff being

$ diff old.s new.s 
17c17
<       cset    w0, lt
---
>       cset    w0, lo

@mstorsjo please can you be more specific what kind of tests are starting to fail?
Do those tests just check that the code compiled into an identical assembly?

No, it's video/audio decoding tests, and after that change specific reference input samples produce a different output hash than before.

I can retry the change (I see that the patch is recently rebased and hopefully still should apply on the latest main version) soon and see if I can pinpoint what's going wrong any closer than before.

Partial rebase (without updating test coverage)

This revision is now accepted and ready to land.Dec 10 2020, 5:02 AM

Partial rebase (without updating test coverage)

Thanks for reducing!
If I'm seeing it correctly, the codegen looks fine - the difference is in the IR icmp predicate changing from slt to ult, and that diff appears to be correct independent of whether or not we use the abs intrinsic:
https://alive2.llvm.org/ce/z/VDDqBj

So either the source/test has an incorrect expectation and/or something outside of this function is wrong?

Partial rebase (without updating test coverage)

Thanks for reducing!
If I'm seeing it correctly, the codegen looks fine - the difference is in the IR icmp predicate changing from slt to ult, and that diff appears to be correct independent of whether or not we use the abs intrinsic:
https://alive2.llvm.org/ce/z/VDDqBj

So either the source/test has an incorrect expectation and/or something outside of this function is wrong?

It's also correct without the nsw/nuw flags since the inputs are i8 and extended to i32 it can't see i32 INT_MIN.

Partial rebase (without updating test coverage)

Thanks for reducing!
If I'm seeing it correctly, the codegen looks fine - the difference is in the IR icmp predicate changing from slt to ult, and that diff appears to be correct independent of whether or not we use the abs intrinsic:
https://alive2.llvm.org/ce/z/VDDqBj

So either the source/test has an incorrect expectation and/or something outside of this function is wrong?

It's also correct without the nsw/nuw flags since the inputs are i8 and extended to i32 it can't see i32 INT_MIN.

Just to be sure, i just run the entire compilation of vc1_block-aarch64.c through alive2, and as far as i can tell, it did not report any IR-level miscompilations.
So either the original code has bugs/UB's/whatever, or this is an AArch64 backend bug (cc @t.p.northover), or the IR problem is in a place alive2 can't find.

Just to be sure, i just run the entire compilation of vc1_block-aarch64.c through alive2, and as far as i can tell, it did not report any IR-level miscompilations.
So either the original code has bugs/UB's/whatever, or this is an AArch64 backend bug (cc @t.p.northover), or the IR problem is in a place alive2 can't find.

FWIW, the testcase does run without any remarks in ubsan at least...

I retried the issue now, and I've reduced it down to one standalone function that causes the errors. https://martin.st/temp/vc1_block-p-block2.c is this function separated into a standalone translation unit - that I can run in the test framework. Unfortunately, the rather small difference further up in the optimization chain cascades into quite a lot of differences at the end, and I don't know really where to go from there to pinpoint the issue.

FWIW, the issue is reproducible for an aarch64-linux-gnu target as well. If it would help someone, I could make a prepackaged build for that target, with object files + this single source function that breaks it + sample input files to run it with.

(Also, for studying the effect of this one, on an IR level, the difference is already present in the output of -S -emit-llvm from clang, unless using -Xclang -disable-llvm-passes, but I'm sure you already took that into account.)

@mstorsjo i've bothered @dmajor to give this a try on some of mozilla's CI, and unfortunately it came back green..
I don't recall if i can reproduce it locally (on vanilla test suite, i'll recheck),
but right now it would seem that you are the only one who can reproduce this :/

Were you able to make any progress on a somewhat more actionable reproducer?

What would be really helpful is a standalone source that can be compiled and run (here by be locally, on linux x86_64)
that produces different result with/without this change.

@mstorsjo i've bothered @dmajor to give this a try on some of mozilla's CI, and unfortunately it came back green..
I don't recall if i can reproduce it locally (on vanilla test suite, i'll recheck),
but right now it would seem that you are the only one who can reproduce this :/

Were you able to make any progress on a somewhat more actionable reproducer?

What would be really helpful is a standalone source that can be compiled and run (here by be locally, on linux x86_64)
that produces different result with/without this change.

IIRC I'm only able to reproduce this for aarch64 (and possibly arm) targets unfortunately. I can make a kind of standalone testcase with one single file built from source, linked and executed along with a bigger piece of code (either from source or prebuilt object files), but it'd be for aarch64-linux (or aarch64-darwin if that helps - I haven't tried reproducing it there but I'd presume it behaves the same as I noticed the bug first on aarch64 windows).

@mstorsjo i've bothered @dmajor to give this a try on some of mozilla's CI, and unfortunately it came back green..
I don't recall if i can reproduce it locally (on vanilla test suite, i'll recheck),
but right now it would seem that you are the only one who can reproduce this :/

Were you able to make any progress on a somewhat more actionable reproducer?

What would be really helpful is a standalone source that can be compiled and run (here by be locally, on linux x86_64)
that produces different result with/without this change.

IIRC I'm only able to reproduce this for aarch64 (and possibly arm) targets unfortunately. I can make a kind of standalone testcase with one single file built from source, linked and executed along with a bigger piece of code (either from source or prebuilt object files), but it'd be for aarch64-linux (or aarch64-darwin if that helps - I haven't tried reproducing it there but I'd presume it behaves the same as I noticed the bug first on aarch64 windows).

Ok, here's a reproduction environment, requiring aarch64 linux (tested on ubuntu 18.04, but anything similar or newer should be fine, and AWS and other cloud providers makes it fairly easy to spin up such an instance) - with https://martin.st/temp/ffmpeg-repro.zip downloaded and unzipped:

$ gcc -c vc1_block-debug.c -o vc1_block-debug.o
$ gcc *.o -Wl,--start-group *.a -Wl,--end-group -o ffmpeg -lm -lpthread -lz
$ ./ffmpeg -v quiet -i SA00040.vc1 -f framecrc -
#software: Lavf58.65.100
#tb 0: 1/25
#media_type 0: video
#codec_id 0: rawvideo
#dimensions 0: 176x144
#sar 0: 1/1
0,          0,          0,        1,    38016, 0xa6f15db5
0,          1,          1,        1,    38016, 0xa6f15db5
0,          2,          2,        1,    38016, 0xa6f15db5
0,          4,          4,        1,    38016, 0x5c4ef0e7
0,          5,          5,        1,    38016, 0x53a42d1d
0,          6,          6,        1,    38016, 0x68f7d89e
0,          7,          7,        1,    38016, 0xc15f4368
0,          8,          8,        1,    38016, 0xc15f4368
0,          9,          9,        1,    38016, 0xd1bd47a8
0,         10,         10,        1,    38016, 0xd1bd47a8
0,         11,         11,        1,    38016, 0xe1e821ca
0,         12,         12,        1,    38016, 0xe1e821ca
0,         13,         13,        1,    38016, 0xe1e821ca
0,         14,         14,        1,    38016, 0xe1e821ca
0,         15,         15,        1,    38016, 0xe1e821ca
$ clang -c vc1_block-debug.c -o vc1_block-debug.o -O2 # Recompile the problematic function with clang with this patch included
$ gcc *.o -Wl,--start-group *.a -Wl,--end-group -o ffmpeg -lm -lpthread -lz
$ ./ffmpeg -v quiet -i SA00040.vc1 -f framecrc -
#software: Lavf58.65.100
#tb 0: 1/25
#media_type 0: video
#codec_id 0: rawvideo
#dimensions 0: 176x144
#sar 0: 1/1
0,          0,          0,        1,    38016, 0xa6f15db5
0,          1,          1,        1,    38016, 0xa6f15db5
0,          2,          2,        1,    38016, 0xa6f15db5
0,          4,          4,        1,    38016, 0x5c4ef0e7
0,          5,          5,        1,    38016, 0x53a42d1d
0,          6,          6,        1,    38016, 0x68f7d89e
0,          7,          7,        1,    38016, 0xc15f4368
0,          8,          8,        1,    38016, 0xc15f4368
0,          9,          9,        1,    38016, 0xd1bd47a8
0,         10,         10,        1,    38016, 0xd1bd47a8
0,         11,         11,        1,    38016, 0x3bf00b9a # The hashes from here onwards differ
0,         12,         12,        1,    38016, 0x3bf00b9a
0,         13,         13,        1,    38016, 0x3bf00b9a
0,         14,         14,        1,    38016, 0x3bf00b9a
0,         15,         15,        1,    38016, 0x3bf00b9a

Alternative repro, fully from source:

$ git clone https://github.com/mstorsjo/ffmpeg.git
$ cd ffmpeg
$ git checkout origin/vc1-debug 
$ mkdir ../ffmpeg-build; cd ../ffmpeg-build
$ ../ffmpeg/configure --disable-everything --enable-decoder=vc1 --enable-demuxer=vc1 --enable-encoder=rawvideo --enable-muxer=framecrc --enable-protocol=file --enable-protocol=pipe --enable-parser=vc1 --enable-bsf=extract_extradata --samples=$(pwd)/../samples
$ make -j$(nproc)
$ make fate-rsync # Fetch reference test samples
$ make fate-vc1 # Succeeds
$ clang -c ../ffmpeg/libavcodec/vc1_block-debug.c -o libavcodec/vc1_block-debug.o -O3 -I../ffmpeg -I. # Recompile the problematic function with clang with this patch included
$ make fate-vc1 # Fails
TEST    vc1_sa00040
--- /home/ubuntu/code/ffmpeg/tests/ref/fate/vc1_sa00040	2020-12-18 12:04:27.216500325 +0000
+++ tests/data/fate/vc1_sa00040	2020-12-18 12:10:33.352940826 +0000
@@ -13,8 +13,8 @@
 0,          8,          8,        1,    38016, 0xc15f4368
 0,          9,          9,        1,    38016, 0xd1bd47a8
 0,         10,         10,        1,    38016, 0xd1bd47a8
-0,         11,         11,        1,    38016, 0xe1e821ca
-0,         12,         12,        1,    38016, 0xe1e821ca
-0,         13,         13,        1,    38016, 0xe1e821ca
-0,         14,         14,        1,    38016, 0xe1e821ca
-0,         15,         15,        1,    38016, 0xe1e821ca
+0,         11,         11,        1,    38016, 0x3bf00b9a
+0,         12,         12,        1,    38016, 0x3bf00b9a
+0,         13,         13,        1,    38016, 0x3bf00b9a
+0,         14,         14,        1,    38016, 0x3bf00b9a
+0,         15,         15,        1,    38016, 0x3bf00b9a
Test vc1_sa00040 failed. Look at tests/data/fate/vc1_sa00040.err for details.
/home/ubuntu/code/ffmpeg/tests/Makefile:255: recipe for target 'fate-vc1_sa00040' failed
make: *** [fate-vc1_sa00040] Error 1

Thanks for the reproducer. I verified that it does indeed fail with this patch.

It seems to be doing this as a knock-on effect: https://godbolt.org/z/Y4z3je, which does not verify: https://alive2.llvm.org/ce/z/PN7Rv5 ?

Thanks for the reproducer. I verified that it does indeed fail with this patch.

It seems to be doing this as a knock-on effect: https://godbolt.org/z/Y4z3je, which does not verify: https://alive2.llvm.org/ce/z/PN7Rv5 ?

@dmgreen THANK YOU!
That was sufficient information for me, the actual miscompiling fold is D87197.

Yeah. The reproducer seems to work OK with a patch something like this:

diff --git a/llvm/lib/Analysis/InstructionSimplify.cpp b/llvm/lib/Analysis/InstructionSimplify.cpp
index 35c21a0..c517286 100644                                                                     
--- a/llvm/lib/Analysis/InstructionSimplify.cpp                                                   
+++ b/llvm/lib/Analysis/InstructionSimplify.cpp                                                   
@@ -4020,11 +4020,11 @@ static Value *simplifySelectWithICmpCond(Value *CondVal, Value *TrueVal,  
                                                                                                  
     // X == 0 ? abs(X) : -abs(X) --> -abs(X)                                                     
     // X == 0 ? -abs(X) : abs(X) --> abs(X)                                                      
-    if (match(TrueVal, m_Intrinsic<Intrinsic::abs>(m_Value(X))) &&                               
-        match(FalseVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(X)))))                      
+    if (match(TrueVal, m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS))) &&                       
+        match(FalseVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS)))))                 
       return FalseVal;                                                                           
-    if (match(TrueVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Value(X)))) &&                        
-        match(FalseVal, m_Intrinsic<Intrinsic::abs>(m_Specific(X))))                             
+    if (match(TrueVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS)))) &&                
+        match(FalseVal, m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS))))                        
       return FalseVal;                                                                           
   }

(I must admit, was fully expecting this to be something wrong in the AArch64 backend. I'll leave that fix to you if you are happy to add tests and whatnot.)

Yeah. The reproducer seems to work OK with a patch something like this:

diff --git a/llvm/lib/Analysis/InstructionSimplify.cpp b/llvm/lib/Analysis/InstructionSimplify.cpp
index 35c21a0..c517286 100644                                                                     
--- a/llvm/lib/Analysis/InstructionSimplify.cpp                                                   
+++ b/llvm/lib/Analysis/InstructionSimplify.cpp                                                   
@@ -4020,11 +4020,11 @@ static Value *simplifySelectWithICmpCond(Value *CondVal, Value *TrueVal,  
                                                                                                  
     // X == 0 ? abs(X) : -abs(X) --> -abs(X)                                                     
     // X == 0 ? -abs(X) : abs(X) --> abs(X)                                                      
-    if (match(TrueVal, m_Intrinsic<Intrinsic::abs>(m_Value(X))) &&                               
-        match(FalseVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(X)))))                      
+    if (match(TrueVal, m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS))) &&                       
+        match(FalseVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS)))))                 
       return FalseVal;                                                                           
-    if (match(TrueVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Value(X)))) &&                        
-        match(FalseVal, m_Intrinsic<Intrinsic::abs>(m_Specific(X))))                             
+    if (match(TrueVal, m_Neg(m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS)))) &&                
+        match(FalseVal, m_Intrinsic<Intrinsic::abs>(m_Specific(CmpLHS))))                        
       return FalseVal;                                                                           
   }

Indeed, that is what i have already committed locally and will push in a sec.

(I must admit, was fully expecting this to be something wrong in the AArch64 backend. I'll leave that fix to you if you are happy to add tests and whatnot.)

Yeah, i also was almost convinced it was :)

Once again, thank you @mstorsjo and @dmgreen!

Once again, thank you @mstorsjo and @dmgreen!

Awesome, thanks guys!

I'll let you know tomorrow in the unlikely case if this still caused some other regressions as well (or possibly later, if my nightly build ends up broken for other reasons).

Once again, thank you @mstorsjo and @dmgreen!

Awesome, thanks guys!

I'll let you know tomorrow in the unlikely case if this still caused some other regressions as well (or possibly later, if my nightly build ends up broken for other reasons).

My tests ran successfully with a version with this applied, so the case can be considered closed now.

nlopes added a subscriber: nlopes.Dec 22 2020, 6:26 AM

This patch regressed the following test of Transforms/InstCombine/abs-1.ll:
(need to drop NSW in this case).

define i8 @nabs_canonical_3(i8 %x) {
%0:
  %cmp = icmp slt i8 %x, 0
  %neg = sub nsw i8 0, %x
  %abs = select i1 %cmp, i8 %x, i8 %neg
  ret i8 %abs
}
=>
define i8 @nabs_canonical_3(i8 %x) {
%0:
  %1 = abs i8 %x, 1
  %abs = sub nsw i8 0, %1
  ret i8 %abs
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source

Example:
i8 %x = #x80 (128, -128)

Source:
i1 %cmp = #x1 (1)
i8 %neg = poison
i8 %abs = #x80 (128, -128)

Target:
i8 %1 = poison
i8 %abs = poison
Source value: #x80 (128, -128)
Target value: poison

(same bug occurs with @nabs_nabs_x01 in Transforms/InstCombine/abs_abs.ll)

Heads up: Breaks a test for us: https://bugs.chromium.org/p/chromium/issues/detail?id=1161542

(No reduced repro yet, might be UB, just fyi at this point.)

Heads up: Breaks a test for us: https://bugs.chromium.org/p/chromium/issues/detail?id=1161542

(No reduced repro yet, might be UB, just fyi at this point.)

Thanks for headsup. For now i'll deal with the problem @nlopes pointed out above in a bit..

This patch regressed the following test of Transforms/InstCombine/abs-1.ll:
(need to drop NSW in this case).

define i8 @nabs_canonical_3(i8 %x) {
%0:
  %cmp = icmp slt i8 %x, 0
  %neg = sub nsw i8 0, %x
  %abs = select i1 %cmp, i8 %x, i8 %neg
  ret i8 %abs
}
=>
define i8 @nabs_canonical_3(i8 %x) {
%0:
  %1 = abs i8 %x, 1
  %abs = sub nsw i8 0, %1
  ret i8 %abs
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source

Example:
i8 %x = #x80 (128, -128)

Source:
i1 %cmp = #x1 (1)
i8 %neg = poison
i8 %abs = #x80 (128, -128)

Target:
i8 %1 = poison
i8 %abs = poison
Source value: #x80 (128, -128)
Target value: poison

(same bug occurs with @nabs_nabs_x01 in Transforms/InstCombine/abs_abs.ll)

Thanks, fixed in rGf8079355c604fead0f8538548bd7eb51fcc81e31.

Heads up: Breaks a test for us: https://bugs.chromium.org/p/chromium/issues/detail?id=1161542

(No reduced repro yet, might be UB, just fyi at this point.)

Thanks for headsup. For now i'll deal with the problem @nlopes pointed out above in a bit..

Just to follow up, this ended up being UB on our end (fix: https://android-review.googlesource.com/c/platform/external/perfetto/+/1535483)

nlopes added a comment.Jan 2 2021, 8:51 AM

Heads up: Breaks a test for us: https://bugs.chromium.org/p/chromium/issues/detail?id=1161542

(No reduced repro yet, might be UB, just fyi at this point.)

Thanks for headsup. For now i'll deal with the problem @nlopes pointed out above in a bit..

Just to follow up, this ended up being UB on our end (fix: https://android-review.googlesource.com/c/platform/external/perfetto/+/1535483)

Nice! 😀 Thanks for the update.