
[CVP] Teach CorrelatedValuePropagation to reduce the width of lshr instruction.
Abandoned · Public

Authored by lebedev.ri on May 19 2018, 4:45 PM.

Details

Summary

Counter-proposal to D46760.
I suppose it is a continuation of D44102.

If the second operand of udiv/urem is a power of two,
instcombine will transform it into lshr/and,
and CVP does not handle those.
https://godbolt.org/g/hhT9bc
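For context, the canonicalization referred to here can be modeled outside of LLVM; the helper names below are illustrative, not LLVM APIs:

```python
# Model of the instcombine canonicalization mentioned above: for a
# power-of-two divisor 2**k, unsigned division is a right shift and
# unsigned remainder is a bitmask.

def udiv_pow2(x: int, k: int) -> int:
    # udiv x, 2**k  ->  lshr x, k
    return x >> k

def urem_pow2(x: int, k: int) -> int:
    # urem x, 2**k  ->  and x, 2**k - 1
    return x & ((1 << k) - 1)

for x in range(0, 5000, 37):
    assert udiv_pow2(x, 6) == x // 64
    assert urem_pow2(x, 6) == x % 64
```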

Do note that teaching CVP to handle only the lshr width
reduction is already sufficient to replace D46760,
since it reduces the use count of the zext,
and thus instcombine is able to propagate it.

I have looked into teaching CVP to handle and as well,
and that will be more complicated.

https://rise4fun.com/Alive/zfP
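The Alive proof above covers the IR-level correctness; as a rough sketch (Python, illustrative names only — the real pass queries LazyValueInfo for the operand ranges), the width reduction amounts to:

```python
# Rough model of the proposed lshr width reduction: if CVP can prove
# the shifted value fits in a narrower power-of-two width, the lshr
# can be performed in that width and the result zero-extended back.
# This is a sketch, not the actual pass logic.

def narrow_lshr(x: int, shamt: int, new_bits: int) -> int:
    assert x < (1 << new_bits)             # the range fact CVP must prove
    assert shamt < new_bits                # shift must stay in range
    lhs_trunc = x & ((1 << new_bits) - 1)  # trunc iN -> iM
    narrow = lhs_trunc >> shamt            # lshr iM
    return narrow                          # zext iM -> iN preserves the value

for x in range(0, 1 << 16, 997):
    assert narrow_lshr(x, 6, 16) == x >> 6
```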

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision. May 19 2018, 4:45 PM

Rebased on top of parent differentials.

lebedev.ri edited the summary of this revision. May 20 2018, 2:54 AM

Add one more RUN line to the phase ordering test to demonstrate
how instcombine is able to clean up after CVP.


Seems reasonable to me. I dunno if this solves @bixia's problem or not, but even if it doesn't, seems reasonable...

Would like a CVP person to approve, though.

lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
604

getConstantRange doesn't do the right thing when Value is a constant?

613

I don't think this is sufficient to ensure that RHS.getZExtValue() below doesn't assert? (For example, RHS could be an int128, as I read the langref.)

Fix handling of large shifts.

@jlebar thank you for looking at this!

lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
604

Hm, it seems it does. I suppose this was a premature optimization :)

613

Nice catch!

lebedev.ri marked 4 inline comments as done.

Actually attach the updated diff this time :)

InstCombiner::visitLShr can perform the same transformation for the cases where correlated-value propagation is not needed to discover the range of the values.
However, unlike the transformation here, InstCombiner::visitLShr carefully makes sure that the transformation won't increase the total number of ZExt/Trunc instructions (the Op1.hasOneUse check). Why doesn't the transformation here need a similar check? Is it safe to remove such a check from InstCombiner::visitLShr?

> Why doesn't the transformation here need a similar check?

I would like that design criterion to be documented somewhere, too.

> Is it safe to remove such a check from InstCombiner::visitLShr?

No:

> Re: instcombine - if some transform is increasing instruction count by not checking for uses, that's a bug. I think we established that conclusively in D44266.

bixia added a comment. May 21 2018, 8:41 AM

InstCombiner::visitShl performs similar narrowing without checking the user count and can increase the total number of ZExt instructions.

> InstCombiner::visitShl performs similar narrowing without checking the user count and can increase the total number of ZExt instructions.

Sounds like a bug then, per @spatel's comment quoted in https://reviews.llvm.org/D47113#1106261?

I think this patch requires discussion on llvm-dev so everyone is clear on the direction and outcome:

  1. We're saying that CVP (a target-independent IR canonicalization pass) will always try to narrow the width of binops.
  2. It doesn't matter if that means increasing cast instruction count.
  3. It doesn't matter if the narrow type is not legal in the target's datalayout (as long as we have a power-of-2).

The fact that we're dealing with 1 opcode at a time is only because we're not considering the general pattern and consequences.
We overlooked these questions in the specific case of div/rem (D44102) because we assumed that narrower div/rem are always better for analysis and codegen, but I doubt that we can extend that reasoning to all binops on all targets. This will fight with transforms in instcombine (see canEvaluateSExtd / canEvaluateZExtd).

Minor code/style comments only. Leaving the broader discussion to engaged parties.

lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
603

minor: second operand

648

You're repeating a pattern which is already there, but we should probably introduce an iteration outside the processX functions for this.

> I think this patch requires discussion on llvm-dev so everyone is clear on the direction and outcome:

http://lists.llvm.org/pipermail/llvm-dev/2018-May/123534.html

So yes, I ran some quick benchmarks and I believe this will cause regressions in some circumstances. In one case I looked at (which is running under our special LTO pipeline and may be a little difficult to replicate), we start off with this:

%shr = lshr i32 %sub, 6
%arrayidx = getelementptr inbounds i16, i16* %AllocationMap, i32 %shr

This is turned into:

%shr.lhs.trunc = trunc i32 %sub to i16
%shr.rhs.trunc = trunc i32 6 to i16
%shr = lshr i16 %shr.lhs.trunc, %shr.rhs.trunc
%shr.zext = zext i16 %shr to i32
%arrayidx = getelementptr inbounds i16, i16* %AllocationMap, i32 %shr.zext

Which gets turned right back into:

%shr = lshr i32 %sub, 6
%shr.zext = and i32 %shr, 1023
%arrayidx11 = getelementptr inbounds i16, i16* %AllocationMap, i32 %shr.zext

I think the extra And node will, under most circumstances, be removed during isel. But here this is part of a loop, and the extra cost causes us to go over the loop unroll threshold, so the loop is no longer fully unrolled.
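The refold is a pure bit identity: truncating to i16, shifting by 6, and zero-extending keeps exactly bits 6..15, which is the same as shifting the i32 and masking with 1023. A quick check (Python, illustrative):

```python
# Why instcombine can fold the narrowed sequence right back: the
# trunc/lshr/zext chain over i16 equals the original i32 lshr
# followed by a mask of the surviving 10 bits.

def narrowed(sub: int) -> int:
    return (sub & 0xFFFF) >> 6    # trunc to i16, lshr by 6, zext

def refolded(sub: int) -> int:
    return (sub >> 6) & 1023      # lshr i32, and i32 1023

for sub in range(0, 1 << 20, 1234):
    assert narrowed(sub) == refolded(sub)
```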

Another case on v6m (thumb1only) looks more like a simple extra instruction in the final assembly. In either case the extra And 1023 seems to only be causing trouble.

I'm running some more benchmarks and will see what happens on other cores/benchmarks.

> So yes, I ran some quick benchmarks

Thanks!

> and I believe this will cause regressions in some circumstances. In one case I looked at (which is running under our special LTO pipeline and may be a little difficult to replicate), we start off with this:
>
> %shr = lshr i32 %sub, 6
> %arrayidx = getelementptr inbounds i16, i16* %AllocationMap, i32 %shr
>
> This is turned into:
>
> %shr.lhs.trunc = trunc i32 %sub to i16
> %shr.rhs.trunc = trunc i32 6 to i16
> %shr = lshr i16 %shr.lhs.trunc, %shr.rhs.trunc
>
> %shr.zext = zext i16 %shr to i32
> %arrayidx = getelementptr inbounds i16, i16* %AllocationMap, i32 %shr.zext

Hmm, I think here we could avoid the zext.

> Which gets turned right back into:
> %shr = lshr i32 %sub, 6
> %shr.zext = and i32 %shr, 1023
> %arrayidx11 = getelementptr inbounds i16, i16* %AllocationMap, i32 %shr.zext
> I think the extra And node will, under most circumstances, be removed during isel. But here this is part of a loop, and the extra cost causes us to go over the loop unroll threshold, so the loop is no longer fully unrolled.
>
> Another case on v6m (thumb1only) looks more like a simple extra instruction in the final assembly. In either case the extra And 1023 seems to only be causing trouble.

Thank you!
So to not much surprise, @spatel was right in https://reviews.llvm.org/D47113#1106601

This all makes me think we should look in the third direction:

> You want to narrow multi-use sequences of code because instcombine can't do that profitably using minimal peepholes:
> ...
> The original motivation for -aggressive-instcombine (TruncInstCombine) was something almost like that - see D38313. Can you extend that? Note that the general problem isn't about udiv/urem, lshr, or any particular binop. It's about narrowing a sequence of arbitrary binops (and maybe even more than binops).

> I'm running some more benchmarks and will see what happens on other cores/benchmarks.

Once done, could you please post something to the thread, so it too would contain the knowledge?

> %shr.zext = zext i16 %shr to i32
> %arrayidx = getelementptr inbounds i16, i16* %AllocationMap, i32 %shr.zext
>
> Hmm, I think here we could avoid the zext.

Actually, uhm, https://godbolt.org/g/jtNCSp, is the idx always canonicalized to i64?

Sorry, I forgot to mention the important fact that those results were Arm, specifically thumbv8m.baseline on a cortex-m23 (where pointers are 32-bit and only i32 is a legal type). Try this data layout for Arm code:
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"

This should show the extra And node being added (although in this case I believe it won't lead to different assembly, just larger IR; isel can remove the And).
https://godbolt.org/g/xjmSPn

The tests I ran above were embedded benchmarks on microcontroller CPUs. They are quick and easy to run as they tend to contain no noise. The ones that went down were mostly Thumb-1-only cores, but the cortex-m7 also showed decreases on the same test. The extra tests I've run now are a more standard set of Linux benchmarks on cortex-a cores for both Arm and AArch64 (SPEC, the LLVM test-suite, etc.). The only thing that didn't look like noise was drop3 from BitBench, which got better. It's doing some odd-looking bit manipulation in a loop. Playing around a bit, the score seems to go up or down depending on which parts of the loop are enabled.

As Sanjay found in http://lists.llvm.org/pipermail/llvm-dev/2018-January/120522.html, converting to illegal types can be beneficial if it leads to extra folds, but it is difficult to tell when exactly it will make things better or worse. Reducing the type width from an i64 to an i32 is almost certainly a good thing to do on Arm, but reducing an i32 to an i8 isn't so cut-and-dried. The safe option here would be to only convert to legal types (or perhaps also to reduce to types larger than the largest legal type; I'm not sure what that would do in the motivating case).

reames requested changes to this revision. Feb 4 2019, 3:56 PM

Marking as requires-changes just to remove this from the review queue, since the discussion appears to have stalled.

This revision now requires changes to proceed. Feb 4 2019, 3:56 PM
Herald added a project: Restricted Project. Feb 4 2019, 3:56 PM
lebedev.ri abandoned this revision. Jun 21 2019, 8:52 AM