This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] sink target-supported cast op after concat vectors
ClosedPublic

Authored by spatel on May 4 2020, 12:46 PM.

Details

Summary

Try to combine N short vector cast ops into 1 wide vector cast op:
concat (cast X), (cast Y)... -> cast (concat X, Y...)

This is part of solving PR45794:
https://bugs.llvm.org/show_bug.cgi?id=45794

As noted in the code comment, this is uglier than I was hoping because the opcode determines whether we pass the source or destination type to isOperationLegalOrCustom(). Also IIUC, there's no way to validate what the other (dest or src) type is. Without the extra legality check on that, there's an ARM regression test in test/CodeGen/ARM/isel-v8i32-crash.ll that will crash trying to lower an unsupported v8f32 to v8i16.

Diff Detail

Event Timeline

spatel created this revision.May 4 2020, 12:46 PM
craig.topper added inline comments.May 4 2020, 1:16 PM
llvm/test/CodeGen/X86/avx-shift.ll
177

Do we do the best thing if the shl is used by another operation that needs to be split? Do we keep the vcvttps2dq split?

spatel updated this revision to Diff 261932.May 4 2020, 2:33 PM

Patch updated:
Added another AVX split/concat test - no diff here.

spatel marked an inline comment as done.May 4 2020, 2:36 PM
spatel added inline comments.
llvm/test/CodeGen/X86/avx-shift.ll
177

Does the next test (vshift08_add) cover the scenario you're thinking of? There's no difference on that one because the concat isn't directly after the cast.

craig.topper added inline comments.May 4 2020, 3:53 PM
llvm/test/CodeGen/X86/avx-shift.ll
177

I think it does.

Let me see if this works how I think it does. The shift will be legalized by LegalizeVectorOps first because we run that stage by legalizing operands before users. So the shift gets lowered first. When the shift gets lowered, it should split and produce a concat. Then each part of the split should get legalized. Then the add gets legalized which produces another split. getNode for the extracts for that split will look through the concat produced by the shift? Leaving that concat dead. Then a new concat will be produced for the add split?

spatel marked an inline comment as done.May 5 2020, 5:23 AM
spatel added inline comments.
llvm/test/CodeGen/X86/avx-shift.ll
177

I couldn't visualize it without looking at debug output, but that looks about right to me:

Legalizing vector op: t7: v8i32 = shl t6, t2
-->
...
Creating new node: t18: v4i32 = shl t13, t15
Creating new node: t19: v4i32 = shl t13, t17
Creating new node: t20: v8i32 = concat_vectors t18, t19

Legalizing vector op: t18: v4i32 = shl t13, t15
--> 
...
Creating new node: t28: v4i32 = fp_to_sint t27
Creating new node: t29: v4i32 = mul t13, t28

Legalizing vector op: t8: v8i32 = add t20, t4
-->
...
Creating new node: t40: v4i32 = add t29, t38
Creating new node: t41: v4i32 = add t36, t39
Creating new node: t42: v8i32 = concat_vectors t40, t41

So the add is already directly using the 128-bit "t29" mul node. And we only show the final concat here - "t20" is gone:

Vector-legalized selection DAG: %bb.0 'vshift08:'
SelectionDAG has 33 nodes:
  t0: ch = EntryToken
  t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
  t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
                  t15: v4i32 = extract_subvector t2, Constant:i64<0>
                t31: v4i32 = X86ISD::VSHLI t15, TargetConstant:i8<23>
              t26: v4i32 = add t31, t25
            t27: v4f32 = bitcast t26
          t28: v4i32 = fp_to_sint t27
        t29: v4i32 = mul t13, t28
        t38: v4i32 = extract_subvector t4, Constant:i64<0>
      t40: v4i32 = add t29, t38
                  t17: v4i32 = extract_subvector t2, Constant:i64<4>
                t37: v4i32 = X86ISD::VSHLI t17, TargetConstant:i8<23>
              t33: v4i32 = add t37, t25
            t34: v4f32 = bitcast t33
          t35: v4i32 = fp_to_sint t34
        t36: v4i32 = mul t13, t35
        t39: v4i32 = extract_subvector t4, Constant:i64<4>
      t41: v4i32 = add t36, t39
    t42: v8i32 = concat_vectors t40, t41
  t11: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t42
  t13: v4i32 = BUILD_VECTOR Constant:i32<1>, Constant:i32<1>, Constant:i32<1>, Constant:i32<1>
  t25: v4i32 = BUILD_VECTOR Constant:i32<1065353216>, Constant:i32<1065353216>, Constant:i32<1065353216>, Constant:i32<1065353216>
  t12: ch = X86ISD::RET_FLAG t11, TargetConstant:i32<0>, Register:v8i32 $ymm0, t11:1
craig.topper accepted this revision.May 5 2020, 1:36 PM

Thanks for checking. LGTM

This revision is now accepted and ready to land.May 5 2020, 1:36 PM
This revision was automatically updated to reflect the committed changes.