This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Combine OR as ADD when no common bits are set
ClosedPublic

Authored by bjope on Mar 25 2019, 2:45 AM.

Details

Summary

The DAGCombiner is rewriting (canonicalizing) an ISD::ADD
with no common bits set in the operands as an ISD::OR node.

This could sometimes result in "missing out" on some
combines that normally are performed for ADD. To be more
specific this could happen if we already have rewritten an
ADD into OR, and later (after legalizations or combines)
we expose patterns that could have been optimized if we
had seen the OR as an ADD (e.g. reassociations based on ADD).

To make the DAG combiner less sensitive to if ADD or OR is
used for these "no common bits set" ADD/OR operations we
now apply most of the ADD combines also to an OR operation,
when value tracking indicates that the operands have no
common bits set.
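The premise behind the canonicalization can be sanity-checked with plain integers; this is an illustrative sketch, not LLVM code:

```python
# When two values share no set bits, OR and ADD compute the same result:
# no bit position receives input from both operands, so no carries occur.
def no_common_bits(a: int, b: int) -> bool:
    return (a & b) == 0

a, b = 0b1010, 0b0101
assert no_common_bits(a, b)
assert (a | b) == (a + b) == 0b1111
```

This is why value tracking (known-bits analysis) is enough to treat such an OR as an ADD for combining purposes.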

Diff Detail

Repository
rL LLVM

Event Timeline

bjope created this revision.Mar 25 2019, 2:45 AM
bjope added a comment.Mar 25 2019, 2:51 AM

Hello reviewers! Do you think this is a good idea?

If you agree with the idea presented here, then I'll probably need some help from Hexagon regarding the llvm/test/CodeGen/Hexagon/subi-asl.ll test case (which no longer triggers "subi-asl"). Should the test case be updated? Should we still get subi-asl here?

I've mostly seen improvements for our OOT target when doing this, but for example llvm/test/CodeGen/X86/split-store.ll also exposes a case when we trigger a rewrite into using SUB.

Hello reviewers! Do you think this is a good idea?

It's an interesting idea. :)

I've mostly seen improvements for our OOT target when doing this, but for example llvm/test/CodeGen/X86/split-store.ll also exposes a case when we trigger a rewrite into using SUB.

Yes, we'd classify that as a slight regression for x86.

Do you have a sense of how many different folds we're missing in the tests where you show improvements? If it's a small number, we're probably better off just duplicating that code inside 'visitOR', so we don't have to deal with the regressions.

bjope added a comment.Mar 26 2019, 8:31 AM

Hello reviewers! Do you think this is a good idea?

It's an interesting idea. :)

I've mostly seen improvements for our OOT target when doing this, but for example llvm/test/CodeGen/X86/split-store.ll also exposes a case when we trigger a rewrite into using SUB.

Yes, we'd classify that as a slight regression for x86.

Isn't split-store.ll showing an improvement (we get one subb instead of andb+orb)?

However, signbit-shift.ll might show a regression (since we get one more instruction and use an extra register). That said, I'm not familiar enough with the vector instructions to know whether it really is a regression (maybe those movdqa instructions are easier to schedule, or have shorter latency, or something).

Do you have a sense of how many different folds we're missing in the tests where you show improvements? If it's a small number, we're probably better off just duplicating that code inside 'visitOR', so we don't have to deal with the regressions.

We have lots of selection patterns using "add", for example patterns involving addressing modes, multiply-and-add, etc. Currently our OOT target uses some hooks to rewrite ADD->OR in the first rounds of the DAG combiner, and then in the last DAG combiner run we instead rewrite OR->ADD. That is a little bit hacky, so I'm trying to avoid it and either duplicate patterns (detecting or-with-no-common-bits in tablegen patterns) or rewrite to "add" in the PreprocessISelDAG hook. While working on that I noticed lots of regressions caused by no longer doing the DAG combines that would have triggered had we used ADD instead of OR in the last DAG combine run.

Examples:

  • not triggering the reassociation like ((X + C) + Y) => ((X + Y) + C) resulted in some regressions
  • not triggering the combines involving SUB such as ((0 - A) + B) => (B - A) resulted in some regressions
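Both folds in the list above are plain integer identities, so they remain valid when an OR with no common bits stands in for the ADD; a small sketch with hand-picked disjoint bit patterns (illustrative values, not from the patch):

```python
# Reassociation ((X + C) + Y) -> ((X + Y) + C), where the outer node may
# have been canonicalized to an OR: with pairwise-disjoint bits the OR,
# the ADD, and the reassociated form all agree.
X, C, Y = 0b1000, 0b0001, 0b0100
lhs = X + C                     # may appear as (X | C) in the DAG
assert (lhs & Y) == 0           # no common bits with Y
assert (lhs | Y) == (lhs + Y) == ((X + Y) + C)

# The SUB combine ((0 - A) + B) -> (B - A) is ordinary algebra.
A, B = 3, 10
assert ((0 - A) + B) == (B - A)
```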

I can try to make it more selective (duplicating some specific folds). I actually started out that way, but then I figured that it might be better to do all possible folds from visitADD.

Normally we try to do most folds in visitADD before we rewrite into OR (so that should be the normal case when we have ADD in the input). But if, for example, opt has already rewritten ADD->OR so that we have OR in the input, we never even try those folds. This makes the DAG combiner sensitive to the input form, IMHO. This patch could amend that, but if I start to be more selective we will most likely still end up with different results for semantically equivalent input.
Note however that I'm not only focusing on being less sensitive to whether ADD or OR has been used in the input to SelectionDAG. Some of the problems I've seen are related to the order of combines/lowering, such as doing the ADD->OR rewrite before other simplifications that would have triggered the folds in visitADD, for example if we had delayed the ADD->OR rewrite until after legalization.

Hello reviewers! Do you think this is a good idea?

It's an interesting idea. :)

I've mostly seen improvements for our OOT target when doing this, but for example llvm/test/CodeGen/X86/split-store.ll also exposes a case when we trigger a rewrite into using SUB.

Yes, we'd classify that as a slight regression for x86.

Isn't split-store.ll showing an improvement (we get one subb instead of andb+orb)?

Yes - sorry, I reversed that with signbit-shift.ll. So we would call signbit-shift.ll a slight regression because of the extra mov instruction. We are probably missing a generic combine. Might be similar to the Hexagon diff?

However, signbit-shift.ll might show a regression (since we get one more instruction and use an extra register). However, I'm not that familiar with the vector instructions to understand if it really is a regression (maybe those movdqa instructions are easier to schedule, or having shorter latency or something).

The Hexagon testcase can be fixed; it's probably just a matter of changing the selection pattern for the instruction we're checking.

Hello reviewers! Do you think this is a good idea?

It's an interesting idea. :)

I've mostly seen improvements for our OOT target when doing this, but for example llvm/test/CodeGen/X86/split-store.ll also exposes a case when we trigger a rewrite into using SUB.

Yes, we'd classify that as a slight regression for x86.

Isn't split-store.ll showing an improvement (we get one subb instead of andb+orb)?

Yes - sorry, I reversed that with signbit-shift.ll. So we would call signbit-shift.ll a slight regression because of the extra mov instruction. We are probably missing a generic combine. Might be similar to the Hexagon diff?

It seems to be foldAddSubMasked1 that now does the fold

(or (and (AssertSext X, i1), 1), C) --> (sub C, X)

which looks pretty nice in the DAG, but at least in this case (X86 + vector ops) the load of the splat vector C from the constant pool can't be folded into the PSUBD, so we get a separate MOVAPS for that:

  t24: v4i32,ch = MOVAPSrm<Mem:(load 16 from constant-pool)> Register:i64 $rip, TargetConstant:i8<1>, Register:i64 $noreg, TargetConstantPool:i32<<4 x i32> <i32 42, i32 42, i32 42, i32 42>> 0, Register:i32 $noreg, t0
  t21: v4i32 = PCMPGTDrr t2, V_SETALLONES:v4i32
t20: v4i32 = PSUBDrr t24, t21

instead of this output from isel, where the load from the constant pool is folded into the POR:

    t20: v4i32 = PCMPGTDrr t2, V_SETALLONES:v4i32
  t22: v4i32 = PSRLDri t20, TargetConstant:i8<31>
t15: v4i32,ch = PORrm<Mem:(load 16 from constant-pool)> t22, Register:i64 $rip, TargetConstant:i8<1>, Register:i64 $noreg, TargetConstantPool:i32<<4 x i32> <i32 42, i32 42, i32 42, i32 42>> 0, Register:i32 $noreg, t0

AFAICT, the code after ISel looks better when we do the rewrite into using sub (the height of the DAG is reduced by one). However, since both PCMPGTD and PSUBD have tied operands we lose out during register allocation, and we have to insert an extra move to pass the result in xmm0. I do not think that we can detect that in the generic DAGCombiner. And if the test case had just been a little bit different (receiving %x in xmm1, if that is possible), then there wouldn't have been an extra MOVDQA.

So is this regression anything to bother about? (I can add another test case where %x isn't taken from the first argument.)
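For reference, the arithmetic behind the foldAddSubMasked1 rewrite above checks out with plain integers; a sketch using the splat constant 42 from the test, not LLVM code:

```python
# foldAddSubMasked1: (or (and (AssertSext X, i1), 1), C) --> (sub C, X).
# A sign-extended i1 X is either 0 or -1, so (X & 1) is 0 or 1, and when
# C has its low bit clear the OR cannot carry:
#   X =  0: (0 & 1)  | C == C     == C - 0
#   X = -1: (-1 & 1) | C == C | 1 == C + 1 == C - (-1)
C = 42                      # low bit clear, as in the splat <42,42,42,42>
for X in (0, -1):
    assert ((X & 1) | C) == C - X
```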

bjope added a comment.Mar 28 2019, 6:14 AM

The Hexagon testcase can be fixed; it's probably just a matter of changing the selection pattern for the instruction we're checking.

I've done some more analysis for llvm/test/CodeGen/Hexagon/subi-asl.ll.

The first DAG combine that makes a difference is that with this patch we fold

    t30: i32 = shl t11, Constant:i32<1>
  t31: i32 = sub Constant:i32<0>, t30
t28: i32 = or t9, t31

into

  t30: i32 = shl t11, Constant:i32<1>
t32: i32 = sub t9, t30

IMO that looks like a nice fold.
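The fold is sound because t31 = (sub 0, t30) has no bits in common with t9 here, so the OR is really an ADD and t9 + (0 - t30) == t9 - t30. A sketch with i32 two's-complement arithmetic, assuming for illustration that t9 is confined to the low bit (t30 = shl t11, 1 is always even, so its negation has the low bit clear):

```python
MASK = 0xFFFFFFFF                        # model i32 wrap-around
for t11 in range(16):
    t30 = (t11 << 1) & MASK              # shl t11, 1 -> even
    t31 = (0 - t30) & MASK               # sub 0, t30 -> also even
    for t9 in (0, 1):                    # illustrative low-bit value
        assert (t9 & t31) == 0           # no common bits set
        assert (t9 | t31) == (t9 - t30) & MASK   # or == sub
```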

A few combines later we end up with the following before selection:

Address tree balanced selection DAG:SelectionDAG has 27 nodes:
  t0: ch = EntryToken
    t44: i32 = HexagonISD::CONST32_GP TargetGlobalAddress:i32<i32* @this_insn_number> 0
  t11: i32,ch = load<(dereferenceable load 4 from @this_insn_number)> t0, t44, undef:i32
    t2: i32,ch = CopyFromReg t0, Register:i32 %1
  t41: i32,ch = load<(load 2 from %ir.cgep59, align 4), sext from i16> t0, t2, undef:i32
  t30: i32 = shl t11, Constant:i32<1>
    t18: ch = TokenFactor t41:1, t11:1
    t17: i32,ch = CopyFromReg t0, Register:i32 %0
  t23: ch,glue = CopyToReg t18, Register:i32 $r0, t17
        t37: i1 = setcc t41, Constant:i32<56>, seteq:ch
        t48: i32 = sub Constant:i32<1>, t30
        t47: i32 = sub Constant:i32<0>, t30
      t49: i32 = select t37, t48, t47
    t25: ch,glue = CopyToReg t23, Register:i32 $r1, t49, t23:1
  t27: ch,glue = HexagonISD::TC_RETURN t25, TargetGlobalAddress:i32<void (%struct.rtx_def*, i32)* @reg_is_born> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped

and since there are now two uses of t30, the patterns for selecting subi_asl won't trigger (they check that there is only one use of the shl).

Without the patch we instead would get

Address tree balanced selection DAG:SelectionDAG has 27 nodes:
  t0: ch = EntryToken
    t40: i32 = HexagonISD::CONST32_GP TargetGlobalAddress:i32<i32* @this_insn_number> 0
  t11: i32,ch = load<(dereferenceable load 4 from @this_insn_number)> t0, t40, undef:i32
    t2: i32,ch = CopyFromReg t0, Register:i32 %1
  t38: i32,ch = load<(load 2 from %ir.cgep59, align 4), sext from i16> t0, t2, undef:i32
    t30: i32 = shl t11, Constant:i32<1>
  t31: i32 = sub Constant:i32<0>, t30
    t18: ch = TokenFactor t38:1, t11:1
    t17: i32,ch = CopyFromReg t0, Register:i32 %0
  t23: ch,glue = CopyToReg t18, Register:i32 $r0, t17
        t35: i1 = setcc t38, Constant:i32<56>, seteq:ch
        t41: i32 = or t31, Constant:i32<1>
      t42: i32 = select t35, t41, t31
    t25: ch,glue = CopyToReg t23, Register:i32 $r1, t42, t23:1
  t27: ch,glue = HexagonISD::TC_RETURN t25, TargetGlobalAddress:i32<void (%struct.rtx_def*, i32)* @reg_is_born> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped

So the height of the DAG (looking at the operands of the select) seems to be one less with this patch: (select ... (sub (shl)) ...) instead of (select ... (or (sub (shl ..))) ...).
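The two DAGs compute the same select operands: t30 = shl t11, 1 is always even, so t31 = (sub 0, t30) has its low bit clear, and OR-ing in 1 is the same as adding 1. A sketch in i32 two's-complement arithmetic:

```python
MASK = 0xFFFFFFFF                        # model i32 wrap-around
for t11 in range(64):
    t30 = (t11 << 1) & MASK              # shl t11, 1 -> even
    t47 = (0 - t30) & MASK               # sub 0, t30 (false arm, both DAGs)
    t48 = (1 - t30) & MASK               # sub 1, t30 (true arm, with patch)
    t41 = t47 | 1                        # or t31, 1  (true arm, without patch)
    assert t41 == t48                    # the select arms agree
```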

If you think that the old codegen (using subi_asl) actually is superior here, then we might need a pattern doing

(sub 1, (shl x, c)) => (setbit_i (subi_asl_ri 0, x, c), 0)

But I guess with a predicate to only do it if the shl has multiple uses, because otherwise

(sub 1, (shl x, c)) => (subi_asl_ri 1, x, c)

would be better.
I actually think that we need even more predicates to distinguish when the setbit_i solution is better, because it does not look optimal (at least not with my limited knowledge about Hexagon).

bjope updated this revision to Diff 193124.Apr 1 2019, 10:45 AM

Added an extra test in test/CodeGen/X86/signbit-shift.ll to show that add_zext_ifpos_vec_splat only turns up as a regression due to register constraints.

The new add_zext_ifpos_vec_splat2 shows an improvement, since we get

pcmpeqd %xmm0, %xmm0
pcmpgtd %xmm0, %xmm1
movdqa .LCPI3_0(%rip), %xmm0 # xmm0 = [42,42,42,42]
psubd %xmm1, %xmm0

instead of

movdqa %xmm1, %xmm0
pcmpeqd %xmm1, %xmm1
pcmpgtd %xmm1, %xmm0
psrld $31, %xmm0
por .LCPI3_0(%rip), %xmm0

spatel added a comment.Apr 2 2019, 6:22 AM

Thanks for expanding on the x86 example. I agree now that it's a good idea to try these optimizations.
@RKSimon may know from looking, but this might mean we can remove the more specific fold from rL357351 ?

Thanks for expanding on the x86 example. I agree now that it's a good idea to try these optimizations.
@RKSimon may know from looking, but this might mean we can remove the more specific fold from rL357351 ?

IIRC that transform was being done pre-DAG - so the OR was already stuck on the wrong side of the zext - would be interested to see though (we still don't match PAVGB if it uses OR after ZEXT).

Something I didn't do but I think would be useful is to add a 'bool SelectionDAG::isAddLike(SDValue N, SDValue &Op0, SDValue &Op1)' helper that will match against ADD/OR/SHL etc. that all perform some form of addition.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2230 ↗(On Diff #193124)

I realise it reduces the patch - but would it be better to move this in the new visitAdd?

2236 ↗(On Diff #193124)

move this in the new visitAdd?

bjope updated this revision to Diff 193514.Apr 3 2019, 8:56 AM

Updated according to suggestion from @RKSimon, i.e. moving the combines only
done for ADD into the visitADD function.

I also renamed the old visitADDLike method to visitADDLikeCommutative, and the
new method visitADDorNoCommonBitsOR (from earlier version of this patch) is
now called visitADDLike.

bjope added a comment.Apr 9 2019, 6:29 AM

Friendly ping!

lebedev.ri added inline comments.
llvm/test/CodeGen/X86/signbit-shift.ll
31–58 ↗(On Diff #193514)

Aren't these two tests identical?

bjope added inline comments.Apr 9 2019, 7:59 AM
llvm/test/CodeGen/X86/signbit-shift.ll
31–58 ↗(On Diff #193514)

add_zext_ifpos_vec_splat2 is receiving %x in the second argument (so I guess it will be mapped to xmm1).

That second test was added to show that what seems to be a degradation in add_zext_ifpos_vec_splat happens due to register constraints (resulting in an extra movdqa at the end). Maybe I should omit the new test before commit. After all, it does not serve any extra purpose when it comes to testing "signbit-shift".

lebedev.ri added inline comments.Apr 9 2019, 8:33 AM
llvm/test/CodeGen/X86/signbit-shift.ll
31–58 ↗(On Diff #193514)

That extra move may be a regalloc problem (or maybe that is already optimal).
The 'real' problem here is that we now have that extra movdqa {{.*#+}} xmm1 = [42,42,42,42],
I guess because psubd can only take a memory operand in the other operand.
I'd drop the extra test.

bjope updated this revision to Diff 194645.Apr 11 2019, 1:28 AM

Removed the add_zext_ifpos_vec_splat2 test from test/CodeGen/X86/signbit-shift.ll again (as suggested by @lebedev.ri). That test was only added to demonstrate why add_zext_ifpos_vec_splat gets an extra movdqa with this patch (due to unfortunate reg constraints), but it did not contribute anything new when it comes to testing "signbit-shift".

If there are still worries about this, then maybe I can do the visitADD refactoring in a separate patch? And then I can limit this patch to the part where we start to use visitADDLike from visitOR.

spatel accepted this revision.Apr 18 2019, 9:03 AM
spatel added subscribers: rampitec, arsenm.

LGTM (see inline for a couple of nits) - but I'd prefer that someone with AMDGPU knowledge (@arsenm @nhaehnle @rampitec ?) confirm those diffs too.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2106 ↗(On Diff #194645)

typo: necessarily

2107 ↗(On Diff #194645)

typo: known -> know

This revision is now accepted and ready to land.Apr 18 2019, 9:03 AM
bjope added a comment.Apr 19 2019, 3:06 AM

LGTM (see inline for a couple of nits) - but I'd prefer that someone with AMDGPU knowledge (@arsenm @nhaehnle @rampitec ?) confirm those diffs too.

Thanks! I won't push this until next week anyway.

LGTM for amdgpu test changes.

This revision was automatically updated to reflect the committed changes.