Page MenuHomePhabricator

[DAGCombine] Simplify (truncate (build_pair x, y)) -> (truncate x) or x
Needs RevisionPublic

Authored by kparzysz on Oct 11 2022, 10:57 AM.

Details

Summary

In the attached Hexagon test, the vector type <32 x i64> gets scalarized into a number of extracts and build_pairs. It is subsequently not optimized further, and instead of a single HVX instruction, we end up with a lot of scalar code.

Without this patch, before type legalization we have:

Optimized lowered selection DAG: %bb.0 'f0:b0'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
              t2: v32i32,ch = CopyFromReg t0, Register:v32i32 %0
              t19: v32i32 = splat_vector Constant:i32<1>
            t18: v32i32 = shl t2, t19
          t7: v64i32 = concat_vectors t18, undef:v32i32
        t9: v64i32 = vector_shuffle<0,u,1,u,2,u,3,u,4,u,5,u,6,u,7,u,8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u,16,u,17,u,18,u,19,u,20,u,21,u,22,u,23,u,24,u,25,u,26,u,27,u,28,u,29,u,30,u,31,u> t7, undef:v64i32
      t10: v32i64 = bitcast t9
    t11: v32i32 = truncate t10
  t13: ch,glue = CopyToReg t0, Register:v32i32 $v0, t11
  t14: ch = HexagonISD::RET_FLAG t13, Register:v32i32 $v0, t13:1

After

Type-legalized selection DAG: %bb.0 'f0:b0'
SelectionDAG has 205 nodes:
  t0: ch = EntryToken
        t2: v32i32,ch = CopyFromReg t0, Register:v32i32 %0
        t19: v32i32 = splat_vector Constant:i32<1>
      t18: v32i32 = shl t2, t19
    t7: v64i32 = concat_vectors t18, undef:v32i32
  t9: v64i32 = vector_shuffle<0,u,1,u,2,u,3,u,4,u,5,u,6,u,7,u,8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u,16,u,17,u,18,u,19,u,20,u,21,u,22,u,23,u,24,u,25,u,26,u,27,u,28,u,29,u,30,u,31,u> t7, undef:v64i32
          t31: i32 = extract_vector_elt t9, Constant:i32<0>
          t32: i32 = extract_vector_elt t9, Constant:i32<1>
        t157: i64 = build_pair t31, t32
      t227: i32 = truncate t157
          t34: i32 = extract_vector_elt t9, Constant:i32<2>
          t36: i32 = extract_vector_elt t9, Constant:i32<3>
        t158: i64 = build_pair t34, t36
      t229: i32 = truncate t158
          t38: i32 = extract_vector_elt t9, Constant:i32<4>
          t40: i32 = extract_vector_elt t9, Constant:i32<5>
        t159: i64 = build_pair t38, t40
      t231: i32 = truncate t159
          t42: i32 = extract_vector_elt t9, Constant:i32<6>
          t44: i32 = extract_vector_elt t9, Constant:i32<7>
        t160: i64 = build_pair t42, t44
      t233: i32 = truncate t160
[...]

This should be a harmless change, but there is one degradation on x86: llvm/test/CodeGen/X86/test-shrink.ll

The DAG before instruction selection:

SelectionDAG has 14 nodes:
  t0: ch = EntryToken
            t41: i32,ch = load<(load (s16) from %fixed-stack.1, align 4), zext from i16> t0, FrameIndex:i32<-1>, undef:i32
          t40: i32 = and t41, Constant:i32<32768>
        t38: i16 = truncate t40
      t32: i32 = X86ISD::CMP t38, Constant:i16<0>
    t34: ch = X86ISD::BRCOND t0, BasicBlock:ch<no 0x82ec648>, TargetConstant:i8<8>, t32
  t18: ch = br t34, BasicBlock:ch<yes 0x82ec550>

The i32 = load (s16) does not match the pattern that generates AND32rm, and we generate two instructions instead of one.

Diff Detail

Event Timeline

kparzysz created this revision.Oct 11 2022, 10:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2022, 10:57 AM
kparzysz requested review of this revision.Oct 11 2022, 10:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2022, 10:57 AM
Herald added a subscriber: wdng. · View Herald Transcript
craig.topper added inline comments.Oct 11 2022, 3:57 PM
llvm/test/CodeGen/X86/test-shrink.ll
870

The IR here isn't canonical according to InstCombine so it's hard to say if this is a real regression or not.

pengfei added inline comments.Oct 11 2022, 11:31 PM
llvm/test/CodeGen/X86/test-shrink.ll
870

I see the tests are introduced for shrinking code fold:
https://github.com/llvm/llvm-project/commit/42cd8cd8626a7f5eb14b0b43b866dd90bd33277b
Does the code not work for i16 anymore, or we can use another test case?

pengfei added inline comments.Oct 12 2022, 12:31 AM
llvm/test/CodeGen/X86/test-shrink.ll
870

It reduces one uop for non-minsize case, but increases 2 bytes for minsize case, which supposes to be degradation? https://godbolt.org/z/4v8z1sh1x

Maybe better to initially make this Hexagon only?

Maybe better to initially make this Hexagon only?

Yes, I have this as a target-specific DAG combine in a local branch, so I'm not depending on this patch.

Maybe better to initially make this Hexagon only?

Yes, I have this as a target-specific DAG combine in a local branch, so I'm not depending on this patch.

What do you want to do next - merge your local branch Hexagon patch into trunk?

What do you want to do next - merge your local branch Hexagon patch into trunk?

Sorry, missed this comment. The Hexagon-specific code is already merged. As for this, I was planning to fix the x86 codegen to get this case, but I haven't had time to do it yet. I believe this only applies to extending loads, where the consumer accesses the extension bits, and only when the consumer can be modified to access the memory directly. Only existing patterns/selections would need to be updated, because without such explicit change from using register to memory this patch wouldn't have any effect on the code size.

arsenm added inline comments.Nov 16 2022, 3:00 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13508

Capitalize

13511

Capitalize

13513

Can just unconditionally truncate, it folds out the equality case for you

arsenm requested changes to this revision.Nov 16 2022, 3:01 PM

LGTM besides the use getNode truncate type check. I'm pretty sure I've tried to do this one before

This revision now requires changes to proceed.Nov 16 2022, 3:01 PM