This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] use narrow vector ops to eliminate concat/extract (PR32790)
ClosedPublic

Authored by spatel on May 12 2017, 9:56 AM.

Details

Summary

This patch started off a lot smaller, but then I discovered that bitcasts seem to always complicate the relatively simple pattern of:

// extract (binop (concat X1, X2), (concat Y1, Y2)), N --> binop XN, YN

...and that made the code grow.

Hopefully, I've added enough comments to avoid too much confusion. If there's any way to simplify this, I'd be happy to hear it.
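For orientation, here is a minimal C++ sketch of the core fold in the simple case, before bitcasts get involved. The function name, the restriction to two-operand concats, and the bitwise-logic-only opcode check are simplifications for illustration; the actual patch also peeks through bitcasts and handles more shapes.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Sketch: extract (binop (concat X1, X2), (concat Y1, Y2)), N --> binop XN, YN
static SDValue narrowExtractedVectorBinOp(SDNode *Extract, SelectionDAG &DAG) {
  SDValue BinOp = Extract->getOperand(0);
  unsigned Opc = BinOp.getOpcode();
  // Bitwise logic only for now; see the TODO discussion below.
  if (Opc != ISD::AND && Opc != ISD::OR && Opc != ISD::XOR)
    return SDValue();

  SDValue LHS = BinOp.getOperand(0), RHS = BinOp.getOperand(1);
  if (LHS.getOpcode() != ISD::CONCAT_VECTORS ||
      RHS.getOpcode() != ISD::CONCAT_VECTORS ||
      LHS.getNumOperands() != 2 || RHS.getNumOperands() != 2)
    return SDValue();

  // The extract index is in elements: 0 selects the low half of the wide
  // vector; the narrow type's element count selects the high half.
  EVT NarrowVT = Extract->getValueType(0);
  uint64_t ExtIdx = Extract->getConstantOperandVal(1);
  if (NarrowVT != LHS.getOperand(0).getValueType() ||
      (ExtIdx != 0 && ExtIdx != NarrowVT.getVectorNumElements()))
    return SDValue();

  // Bypass the wide op entirely: operate on the matching narrow halves.
  unsigned HalfIdx = ExtIdx == 0 ? 0 : 1;
  return DAG.getNode(Opc, SDLoc(Extract), NarrowVT,
                     LHS.getOperand(HalfIdx), RHS.getOperand(HalfIdx));
}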

The TODO about extending this beyond bitwise logic is there because extending it now really would regress several x86 tests, including madd, psad, and even a plain integer multiply-by-2 or shift-left-by-1. I don't think there's anything fundamentally wrong with this patch that would cause those regressions; those folds are just missing or brittle.

If we extend this to more binops, I found that the patch fires on at least one non-x86 regression test: an ARM NEON test in test/CodeGen/ARM/coalesce-subregs.ll with a pattern like:

            t5: v2f32 = vector_shuffle<0,3> t2, t4
          t6: v1i64 = bitcast t5
          t8: v1i64 = BUILD_VECTOR Constant:i64<0>
        t9: v2i64 = concat_vectors t6, t8
      t10: v4f32 = bitcast t9
    t12: v4f32 = fmul t11, t10
  t13: v2i64 = bitcast t12
t16: v1i64 = extract_subvector t13, Constant:i32<0>

From what I could see, though, there was no functional change in the codegen from this transform.

For the x86 test changes:

  1. PR32790() is the closest call. We don't reduce the AVX1 instruction count in that case, but we improve throughput. Also, on a core like Jaguar that double-pumps 256-bit ops, there's an unseen win because two 128-bit ops have the same cost as the wider 256-bit op. SSE/AVX2/AVX512 are not affected, which is expected because only AVX1 has the extract/concat ops to match the pattern.
  2. do_not_use_256bit_op() is the best case. Everyone wins by avoiding the concat/extract. Related bug for IR filed as: https://bugs.llvm.org/show_bug.cgi?id=33026
  3. The SSE diffs in vector-trunc-math.ll are just scheduling/RA, so nothing real AFAICT.
  4. The AVX1 diffs in vector-tzcnt-256.ll are all the same pattern: we reduced the instruction count by one in each case by eliminating two insert/extract while adding one narrower logic op.

Diff Detail

Event Timeline

spatel created this revision. May 12 2017, 9:56 AM
RKSimon added inline comments. May 12 2017, 11:13 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14470

Could you use isExtractSubvectorCheap instead here?

We might even need an isExtractSubvectorFree option as well...
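For reference, a hedged sketch of what using that hook could look like inside the combine (NarrowVT and ExtIdx as in the sketch above). The exact signature of isExtractSubvectorCheap has varied across LLVM versions, so the arguments here are an assumption:

// Hypothetical guard using the TargetLowering hook suggested above; the
// 2017-era hook took the result type and the extract's starting element index.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (!TLI.isExtractSubvectorCheap(NarrowVT, ExtIdx))
  return SDValue();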

14476

I'm starting to think that X86ISelLowering's peekThroughBitcasts and peekThroughOneUseBitcasts helpers should be exposed globally.
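Roughly, the X86 helpers being discussed have this shape: they peek through a chain of bitcasts to find the underlying value, with the one-use variant stopping when an intermediate value has other users.

static SDValue peekThroughBitcasts(SDValue V) {
  // Walk through any number of bitcasts to the underlying value.
  while (V.getOpcode() == ISD::BITCAST)
    V = V.getOperand(0);
  return V;
}

static SDValue peekThroughOneUseBitcasts(SDValue V) {
  // Same, but only cross a bitcast whose input has a single user.
  while (V.getOpcode() == ISD::BITCAST && V.getOperand(0).hasOneUse())
    V = V.getOperand(0);
  return V;
}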

spatel added inline comments. May 12 2017, 12:38 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14470

I might have missed the intent of the question, but I don't think we want to limit the transform based on whether the extract is cheap or not. If this succeeds, then we're going to bypass the extract completely. The reason for limiting to half-size extracts is that, for example, a 1/4-size extract sequence like this:

W = wideop (concat X1, X2, X3, X4), Y
N1 = extract W, 1 
N2 = extract W, 2 
N3 = extract W, 3 
N4 = extract W, 4

...might become:

N1 = binop X1, (extract Y, 1)
N2 = binop X2, (extract Y, 2)
N3 = binop X3, (extract Y, 3)
N4 = binop X4, (extract Y, 4)

In that case, we may have increased the instruction count: the wide form needs one concat, one wide op, and four extracts (six nodes), while the narrow form needs four extracts of Y plus four binops (eight nodes). So we'd need some kind of cost calculation to know if it's worthwhile. It would be a good transform for the case where we know that both X and Y are formed from concats, but I figured we should leave that for later... because it makes the code even more complicated!
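A minimal sketch of the half-size restriction being described, with BinOp/NarrowVT as in the earlier sketch (the real patch expresses this through the bitcast-adjusted types):

// Only fire when the extracted type is exactly half of the wide binop type;
// quarter-size or smaller extracts would need the cost calculation above.
EVT WideVT = BinOp.getValueType();
if (NarrowVT.getSizeInBits() * 2 != WideVT.getSizeInBits())
  return SDValue();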

14476

Probably a good idea. Although we always fold a bitcast-of-a-bitcast, right? I've never seen a need for those to loop.

RKSimon added inline comments. May 25 2017, 7:03 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14476

Interestingly, there are a load of codegen regressions if you tweak them to just peek through the first bitcast...

RKSimon edited edge metadata. May 25 2017, 7:56 AM

Couple of minor thoughts but nothing critical.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14462

visitEXTRACT_SUBVECTOR is doing quite a bit of similar stuff already (constant index, peeling through bitcasts of the source vector, etc.). Is it worth reusing some of that?

14516

x86 has a tendency to convert concat_vectors to chains of insert_subvectors quite early - have you noticed whether we're missing anything because of this?
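For illustration (types and node numbers hypothetical), the concat in the pattern can instead reach the combine looking like this, which a strict ISD::CONCAT_VECTORS match would not fire on:

  t1: v8i32 = insert_subvector undef:v8i32, t2, Constant:i32<0>
  t3: v8i32 = insert_subvector t1, t4, Constant:i32<4>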

spatel added inline comments. May 25 2017, 11:17 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14476

Update on this one (and I suppose it should be obvious in hindsight): we're missing basic folds in the cases I've looked at so far.

I.e., given that we have a generic DAGCombine rule to fold bitcast(bitcast(x)), there shouldn't be any reason that we would do worse if that rule is applied before some other fold.

It's independent of this patch, but I'll start trying to solve those regressions, so we can get rid of those peekThroughBitcast() loops.
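The generic rule in question is essentially this one from DAGCombiner::visitBITCAST (paraphrased; N0 is the bitcast's operand and VT its result type):

// fold (bitcast (bitcast x)) -> (bitcast x)
if (N0.getOpcode() == ISD::BITCAST)
  return DAG.getBitcast(VT, N0.getOperand(0));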

spatel added inline comments. May 25 2017, 1:25 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14462

Yes, there's some refactoring potential here. Now that I'm looking at the problems revealed by those peekThroughBitcast() cases, I strongly suspect that I'm going to be making another change here in the near future. OK if I add a TODO for this patch, and then I'll try to make this a bit cleaner in a follow-up?

14516

Honestly, I haven't looked at more than PR32790 and the tests that changed here. But I assume you're right, so there's probably more to come. :)

RKSimon accepted this revision. May 25 2017, 2:33 PM

LGTM - adding TODOs based on both my comments is fine.

This revision is now accepted and ready to land. May 25 2017, 2:33 PM
This revision was automatically updated to reflect the committed changes.