Page MenuHomePhabricator

[InstCombine] Canonicalize SPF to abs intrinc
AcceptedPublic

Authored by nikic on Sep 5 2020, 6:40 AM.

Details

Summary

This patch enables canonicalization of SPF_ABS and SPF_ABS to the abs intrinsic. I'll comment in-line for remaining issues...

Diff Detail

Event Timeline

nikic created this revision.Sep 5 2020, 6:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 5 2020, 6:40 AM
nikic requested review of this revision.Sep 5 2020, 6:40 AM
nikic added inline comments.Sep 5 2020, 6:53 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1052

I'm wondering what to do about this one use check. Should we canonicalize even if the compare has other uses?

We are folding cmp+sub+select into abs or abs+sub. However, both cmp and sub can potentially have extra uses.

llvm/test/Transforms/InstCombine/abs_abs.ll
94

This is the main remaining problem: What happens here is that we see abs(x) > 0, convert this to abs(x) != 0 (based on abs range) and then to x != 0. The last step happens due to the fold I added in rGada8a17d945c17c5603e24824f642ca199412adf, previously this worked. At that point we are left with something like x != 0 ? abs(x) : -abs(x), which is no longer a proper abs.

What do you think the best way to handle this would be? Should this pattern be added to the SPF_ABS recognition? Just explicitly match it in InstCombine?

nikic updated this revision to Diff 290127.Sep 6 2020, 12:47 AM

Rebase over D87197, update affected clang tests.

I would expect the tests in llvm to only be subset of the possible second order effects that can come up from changing this. Perhaps abs is simple enough, but you might want to run the llvm test suite to see if there are any other interesting binary differences.

clang/test/CodeGen/arm-mve-intrinsics/vmaxaq.c
3 ↗(On Diff #290127)

Hmm. These needn't be running -O3. I'll see to removing that. Looks OK though, from what I understand about how these get lowered.

nikic added a comment.Sep 6 2020, 11:58 AM

I've looked at some of the diffs on test suite. The main change I'm seeing is that loops containing abs now get unrolled more. I'm not sure whether that's expected/desired or not.

I've also spotted two possible folds (these aren't regressions though, they aren't recognized in select form either):

nikic added a comment.Sep 6 2020, 12:17 PM

To give an example of what I mean by more unrolling, https://editor.mergely.com/z5GcKJGU/ is the diff for Bitcode/simd_ops/CMakeFiles/simd_ops_test_op_pabsd_237. This is a loop with llvm.abs.v4i32 operations that now has double the unrolling factor.

nikic updated this revision to Diff 291407.Sep 12 2020, 11:17 AM
nikic retitled this revision from [InstCombine] Canonicalize SPF to abs intrinc (WIP) to [InstCombine] Canonicalize SPF to abs intrinc.

Rebase

nikic added inline comments.Sep 12 2020, 11:35 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1052

Here's how dropping the one use check would look like: https://gist.github.com/nikic/3dece504951cba82415fed2d6aa876bf

Nominally, we are only increasing instruction count for the combination of nabs with extra uses on both icmp and sub.

spatel accepted this revision.Sep 16 2020, 1:48 PM

LGTM - I think we should give this a try as-is (with the one-use check still there), see if anything regresses, then ease/remove the use check as a follow-on.
As noted, we may need to adjust cost models to account for the size/speed difference that's showing up in unrolling/inlining. That's probably because we assume that an intrinsic is expanded to a single instruction vs. the current cmp+sub+select being 3 instructions?

This revision is now accepted and ready to land.Sep 16 2020, 1:48 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptSep 17 2020, 1:29 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
nikic added a comment.Sep 17 2020, 1:38 PM

LGTM - I think we should give this a try as-is (with the one-use check still there), see if anything regresses, then ease/remove the use check as a follow-on.

Okay, let's do that.

As noted, we may need to adjust cost models to account for the size/speed difference that's showing up in unrolling/inlining. That's probably because we assume that an intrinsic is expanded to a single instruction vs. the current cmp+sub+select being 3 instructions?

The default abs cost model already uses icmp+select+sub. Though X86 has custom cost model for the vector variants, which is generally cheaper, so it makes sense that more unrolling is seen there. I expect that this part of the change already happened when the vector intrinsics were switched over recently.

This broke a few tests for me (generating code that now gives the fail result at runtime).

I'm not entirely sure which bit is the culprit, but the difference in output (that breaks tests) for one object file is available at https://martin.st/temp/vc1_block-diff-aarch64.txt, and https://martin.st/temp/vc1_block-diff-armv7.txt. For the aarch64 version, it looks like some conditionals are inverted, like these changes:

-	b.ge	.LBB1_124
+	b.hs	.LBB1_124

and

-	csel	w14, w15, w14, gt
+	csel	w14, w15, w14, hi

(with no seemingly related changes that would change the roles of the registers).

The input files for reproducing the issues are available at https://martin.st/temp/vc1_block-aarch64.c and https://martin.st/temp/vc1_block-armv7.c, which can be compiled with clang -target {aarch64,armv7}-w64-mingw32 -c -O2 vc1_block-{aarch64,armv7}.c.

I know this has already been reverted but just FYI that I've bisected a ~2% regression in SPEC2017 x264_r on AArch64 to this commit. Presumably this is due to the extra unrolling / cost modelling issue already mentioned?

I know this has already been reverted but just FYI that I've bisected a ~2% regression in SPEC2017 x264_r on AArch64 to this commit. Presumably this is due to the extra unrolling / cost modelling issue already mentioned?

That would be my guess. I'm not familiar with the unroller or inliner, but I don't even see them using the cost model at 1st glance. Are they purely counting IR instructions?

If I missed the cost model calls, the cost model is still not correct for code-size:

$ cat cost.ll
declare i32        @llvm.abs.i32(i32, i1)
define i32 @abs(i32 %x, i32 %y) {
  %I32 = call i32 @llvm.abs.i32(i32 %x, i1 false)
  %negx = sub i32 0, %x
  %cmp = icmp sgt i32 %x, -1
  %sel = select i1 %cmp, i32 %x, i32 %negx 
  ret i32 %I32
}

$ opt  -cost-model -cost-kind=code-size -analyze -mtriple=aarch64--  cost.ll 
Printing analysis 'Cost Model Analysis' for function 'abs':
Cost Model: Found an estimated cost of 1 for instruction:   %I32 = call i32 @llvm.abs.i32(i32 %x, i1 false)
Cost Model: Found an estimated cost of 1 for instruction:   %negx = sub i32 0, %x
Cost Model: Found an estimated cost of 1 for instruction:   %cmp = icmp sgt i32 %x, -1
Cost Model: Found an estimated cost of 1 for instruction:   %sel = select i1 %cmp, i32 %x, i32 %negx

So we're assuming the size cost is either 1 or 3 depending on how we encoded abs() in IR, but neither is correct for AArch64 IIUC:

$ llc -o - -mtriple=aarch64-- cost.ll 
	cmp	w0, #0                          // =0
	cneg	w0, w0, mi
	ret

This broke a few tests for me (generating code that now gives the fail result at runtime).

I'm not entirely sure which bit is the culprit, but the difference in output (that breaks tests) for one object file is available at https://martin.st/temp/vc1_block-diff-aarch64.txt, and https://martin.st/temp/vc1_block-diff-armv7.txt. For the aarch64 version, it looks like some conditionals are inverted, like these changes:

-	b.ge	.LBB1_124
+	b.hs	.LBB1_124

and

-	csel	w14, w15, w14, gt
+	csel	w14, w15, w14, hi

(with no seemingly related changes that would change the roles of the registers).

The input files for reproducing the issues are available at https://martin.st/temp/vc1_block-aarch64.c and https://martin.st/temp/vc1_block-armv7.c, which can be compiled with clang -target {aarch64,armv7}-w64-mingw32 -c -O2 vc1_block-{aarch64,armv7}.c.

Based on an initial look, the changes in comparison predicates here are probably a red herring. If I understand right, those predicates are switching from signed to unsigned comparison (e.g. gt to hi). I also see similar changes in IR. However, all the cases I looked at are actually correct (e.g. abs(x) > abs(y) can be compared signed or unsigned for poisoning abs). Assuming that this code is clean under ubsan, the problem is likely something else.

Based on an initial look, the changes in comparison predicates here are probably a red herring. If I understand right, those predicates are switching from signed to unsigned comparison (e.g. gt to hi). I also see similar changes in IR. However, all the cases I looked at are actually correct (e.g. abs(x) > abs(y) can be compared signed or unsigned for poisoning abs). Assuming that this code is clean under ubsan, the problem is likely something else.

I just checked, and the code is indeed clean under ubsan.

The changes in the asm is more than just a few places, so it's quite a bit of work to dissect which one of them (if the changes are localized into smaller individual changes) is the one that's causing the error...

nikic reopened this revision.Oct 22 2020, 1:20 PM

Reopening this so we don't forget...

I believe @spatel is working on the cost modelling. I did not have much luck tracking down the miscompile, at least did not spot anything incriminating in the llvm-diff.

This revision is now accepted and ready to land.Oct 22 2020, 1:20 PM

Reopening this so we don't forget...

I believe @spatel is working on the cost modelling. I did not have much luck tracking down the miscompile, at least did not spot anything incriminating in the llvm-diff.

Yes, I'm working through the mess of the TTI cost model with things like:
c963bde0152a
It's a slog...

Reopening this so we don't forget...

I believe @spatel is working on the cost modelling. I did not have much luck tracking down the miscompile, at least did not spot anything incriminating in the llvm-diff.

Yes, I'm working through the mess of the TTI cost model with things like:
c963bde0152a
It's a slog...

D90554 / f7eac51b9b
I think that change makes this patch ready to try again. If we do see regressions, then it should now be easy to adjust target-specific cost modeling of abs() intrinsics to fix those.

lebedev.ri added inline comments.Tue, Nov 10, 6:38 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1052

I think we shouldn't have one-use checks here, yes.

D90554 / f7eac51b9b
I think that change makes this patch ready to try again. If we do see regressions, then it should now be easy to adjust target-specific cost modeling of abs() intrinsics to fix those.

We still need to track down the miscompile that @mstorsjo reported before reapplying this patch.