This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] scalarize binop followed by extractelement
ClosedPublic

Authored by spatel on Dec 14 2018, 2:39 PM.

Details

Summary

As noted in PR39973 and D55558:
https://bugs.llvm.org/show_bug.cgi?id=39973
...this is a partial implementation of a fold that we do as an IR canonicalization in instcombine:

// extelt (binop X, Y), Index --> binop (extelt X, Index), (extelt Y, Index)

We want to have this in the DAG too because as we can see in some of the test diffs (reductions), the pattern may not be visible in IR.
Given that this is already an IR canonicalization, any backend that would prefer a vector op over a scalar op is expected to already have the reverse transform in DAG lowering (not sure if that's a realistic expectation though).

There's an ARM test diff that shows that we also have a reverse transform in IR in CGP. I'm not sure what to make of that. Isn't the updated asm better than the existing code for that test?

Note that I changed some existing regression tests leading up to this patch trying to preserve their intent. Let me know if we need to do any more of that for the remaining diffs here. These were the preliminary commits (and so those tests are not affected by this patch):
rL349160
rL349163
rL349164
rL349166
rL349176
rL349177

Diff Detail

Event Timeline

spatel created this revision.Dec 14 2018, 2:39 PM
RKSimon added inline comments.Dec 15 2018, 5:00 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15578

I wonder whether we need a oneuse check on the constant as well? Some of the cases below suggest that a vector constant load might be happening whatever.

test/CodeGen/SystemZ/knownbits.ll
51

This will need rebasing after rL349264

test/CodeGen/X86/and-load-fold.ll
11

X86ISelLowering's combineExtractWithShuffle is supposed to handle this.....

spatel marked an inline comment as done.Dec 15 2018, 6:57 AM
spatel added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15578

If we put a one-use check on the constant, the effect will be to remove all of the horizontal-reduce-* and vector-reduce-* diffs from this patch.

I thought those were good changes though - better to get the data out of the vector unit as soon as possible since we know we're going to end up in a scalar?

spatel updated this revision to Diff 178377.Dec 15 2018, 2:43 PM

Patch updated:
No code changes, but corrected code comment and rebased tests.

uweigand added a subscriber: jonpa.Dec 17 2018, 4:11 AM

Just looking at the SystemZ test cases:

  • The knownbits.ll case shows what might be an unintended consequence. We have a vector "not" followed by a vector "and", which are merged into a single vector "and complement" operation, and this is then followed by an element extraction. Your patch moves the "and" to the scalar side, but leaves the "not" a vector operation. This means that we can now no longer combine the two into an "and complement". Would it maybe be better to do the same transformation also for unary operations like "not"?
  • The vec-trunc-to-i1.ll case is supposed to verify that a particular vector operation doesn't crash (see PR 32275). With your patch, we no longer have that vector operation. While this leads to better code and thus is certainly a useful transformation, it means that we should probably modify the test case so it again tests the original problem. @jonpa, can you have a look?

Just looking at the SystemZ test cases:

  • The knownbits.ll case shows what might be an unintended consequence. We have a vector "not" followed by a vector "and", which are merged into a single vector "and complement" operation, and this is then followed by an element extraction. Your patch moves the "and" to the scalar side, but leaves the "not" a vector operation. This means that we can now no longer combine the two into an "and complement". Would it maybe be better to do the same transformation also for unary operations like "not"?

Thanks for explaining that output! Yes, I think we do want to extend this for unary ops like ISD::FNEG, but there's is no generic unary ISD::NOT node. We just do 'xor X, -1'. The same code example compiled for x86 behaves like we would expect - both logic ops are scalarized. So I looked at the debug output for this test with SystemZ target:

    t15: v4i32 = setcc t2, t4, setne:ch
  t17: i32 = extract_vector_elt t15, Constant:i32<3>
t18: i32 = and t17, Constant:i32<1>

And then as legalization progresses:

        t21: v16i8 = SystemZISD::BYTE_MASK Constant:i32<65535>
      t22: v4i32 = bitcast t21
    t23: v4i32 = xor t19, t22
  t17: i32 = extract_vector_elt t23, Constant:i32<3>
t18: i32 = and t17, Constant:i32<1>

So we do not recognize the t23 xor node as a 'not' because the constant value is hidden behind a bitcast of a target-specific constant. We can't handle that case here in generic combining without some kind of callback to know what to do with that constant node.

This does make me worry that I need to add an OpaqueConstant check into this somehow. Assuming that it is possible to have a build_vector that contains opaque constants?

Ah, right, I missed that.

Now I'm wondering if we maybe introduce the BYTE_MASK too early. It looks like X86 (and other targets) leave an all-ones constant vector as a BUILD_VECTOR until the very end of instruction selection; maybe we should do the same on SystemZ to allow common optimizations to still recognize the value ...

spatel updated this revision to Diff 178478.Dec 17 2018, 9:16 AM

Patch updated:
Avoid transforming with integers if the build vector contains any opaque constants. I'm not sure how that can happen, so there are no test diffs here vs. the last rev. There's apparently no opaque constant equivalent for FP constants.

jonpa added a comment.Dec 18 2018, 1:05 AM

The vec-trunc-to-i1.ll case is supposed to verify that a particular vector operation doesn't crash (see PR 32275). With your patch, we no longer have that vector operation. While this leads to better code and thus is certainly a useful transformation, it means that we should probably modify the test case so it again tests the original problem. @jonpa, can you have a look?

I tried reverting my patch for the DAG type legalizer, and found that vec-trunc-to-i1.ll still fails with this new patch applied. It fails during type legalization, while this patch at least in this case gets enabled only after type legalization. So I think the test still serves its purpose.

I tried reverting my patch for the DAG type legalizer, and found that vec-trunc-to-i1.ll still fails with this new patch applied. It fails during type legalization, while this patch at least in this case gets enabled only after type legalization. So I think the test still serves its purpose.

Ah, good. Thanks for checking!

@efriedma / @dmgreen (or anyone else with ARM experience) - any thoughts about the ARM diffs?

efriedma added inline comments.Dec 20 2018, 11:22 AM
test/CodeGen/ARM/vector-promotion.ll
16

This is destroying the whole point of the test: we want to avoid the expensive float->int register transfer.

87–89

Same here; this is destroying the point of the test.

394

This transform is okay, but I'd like to see a new test replacing the load with an intrinsic that produces a vector, to preserve the original intent.

spatel marked an inline comment as done.Dec 21 2018, 7:25 AM
spatel added inline comments.
test/CodeGen/ARM/vector-promotion.ll
16

So this raises what appears to be a show-stopper for this patch. We have this reverse transform in IR, so we'd have to redo that as a late DAG transform (we'd have 2 reversals to get back to the code we started with). I don't think that I can cleanly use the TLI hook that was added for the CGP transform because it takes IR type params.

But this ties into something I noticed in InstCombine/DAGCombine. We have a DAG function called scalarizeExtractedVectorLoad() (except there seems to be a bug in how it is called - it's not used with a constant extract index).

I was wondering why we don't do that as an IR canonicalization. That would appear to make both this test and the next diff moot - we would remove all vector ops right from the start. Do you see a problem with that?

To make that clearer, I think the ideal ARM asm for this test has no vector ops:
ldr r0, [r0, #4]
orr r0, r0, #1
str r0, [r1]
bx lr

efriedma added inline comments.Dec 21 2018, 12:35 PM
test/CodeGen/ARM/vector-promotion.ll
16

I guess we could canonicalize load+op+store patterns in IR? It might have unexpected results in some cases, though... e.g. we currently emit different code for add i64 vs. add <1 x i64>, and the clang NEON intrinsics depend on that to some extent. Maybe we could discuss it.

I think that's really besides the point here, though... the "load" here is really just supposed to be a placeholder for "some operation that produces a vector".

spatel updated this revision to Diff 179926.Jan 2 2019, 12:54 PM

Patch updated:
Avoid controversy by adding a default-off TLI hook for this transform. Enable it for x86, so those diffs are similar to the previous rev of the patch. No ARM changes now, and also no SystemZ changes (probably better to fix the BYTE_MASK transform first?).

RKSimon accepted this revision.Jan 3 2019, 6:41 AM

LGTM - thanks

This revision is now accepted and ready to land.Jan 3 2019, 6:41 AM

The X86 changes look good to me.

andreadb accepted this revision.Jan 3 2019, 6:43 AM
This revision was automatically updated to reflect the committed changes.