This is an archive of the discontinued LLVM Phabricator instance.

[DAG] SelectionDAG::getNode(N1,N2) - detect N2 constant vector splats as well as scalars
ClosedPublic

Authored by RKSimon on Jan 26 2022, 9:23 AM.

Details

Summary

We already perform some basic folds (add/sub with zero etc.) on scalar types, this patch adds some basic support for constant splats as well in a few cases (we can add more with future test coverage).

In the cases I've enabled, we can handle buildvector implicit truncation as we're not creating new constant nodes from the vector types - we're just returning existing nodes. This allows us to get a number of extra cases in the aarch64 tests.

I haven't enabled support for undefs in buildvector splats, as we're often checking for zero/allones patterns that return the original constant and we shouldn't be returning undef elements in some of these cases - we can enable this later if we're OK with creating new constants.

Diff Detail

Event Timeline

RKSimon created this revision.Jan 26 2022, 9:23 AM
RKSimon requested review of this revision.Jan 26 2022, 9:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2022, 9:23 AM
craig.topper added inline comments.Jan 26 2022, 11:43 AM
llvm/test/CodeGen/AArch64/neon-bitwise-instructions.ll
825

Checking my understanding. This happened before because vselect expansion created twos Ands and an Or. One of the Ands constant folded immediately. Then custom legalization for Or created a special AArch64ISD::ORRi. This all occurred inside of legalize vector ops.

There's no DAG combine for AArch64ISD::ORRi to catch this after legalize vector ops.

Your patch fixes this because the Or will be simplified when it is created.

RKSimon added inline comments.Jan 26 2022, 11:55 AM
llvm/test/CodeGen/AArch64/neon-bitwise-instructions.ll
825

That matches what I think is going on as well - in all the cases here, by avoiding creating an unnecessary node, we're changing the order of node evaluation just enough to stop getting caught in a deadend.

This is a preliminary fix necessary to improve PTEST codgen where we're failing to fold the PCMPEQ(X,0) -> SUB(X,0) -> X in time.

This revision is now accepted and ready to land.Jan 26 2022, 1:20 PM
david-arm added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5634

I wonder if this is also not canonicalising constant vector splats to the RHS as well? Not saying you should fix this in your patch though - its just an observation!

craig.topper added inline comments.Jan 27 2022, 12:45 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5634

Despite it's name it does handle ISD::SPLAT_VECTOR. Is that what you meant?

david-arm added inline comments.Jan 27 2022, 12:47 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5634

Yeah. Oh well that's a bit confusing then. :(

This revision was landed with ongoing or failed builds.Jan 27 2022, 2:59 AM
This revision was automatically updated to reflect the committed changes.
vitalybuka added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5649

@RKSimon After this patch we see something that appears false msan report

cc @eugenis

We have a basic block

182:                                              ; preds = %82
  %183 = call <32 x i16> @llvm.x86.avx512.packusdw.512(<16 x i32> %176, <16 x i32> poison) #28, !dbg !276
  %184 = and <32 x i16> %183, zeroinitializer, !dbg !277
  %185 = or <32 x i16> zeroinitializer, %184, !dbg !277
  %186 = or <32 x i16> %185, <i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison, i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison, i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison, i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison>, !dbg !277
  %187 = and <32 x i16> %183, <i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison, i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison, i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison, i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison>, !dbg !277
  %188 = bitcast <32 x i16> %186 to i512, !dbg !277
  %_mscmp34 = icmp ne i512 %188, 0, !dbg !277
  br i1 %_mscmp34, label %189, label %190, !dbg !277, !prof !27

before the patch

bb.4 (%ir-block.182):
; predecessors: %bb.2
  successors: %bb.5, %bb.6

  %291:vr512 = IMPLICIT_DEF
  %290:vr512 = VPACKUSDWZrr %1:vr512, killed %291:vr512, debug-location !276; ./third_party/highway/hwy/ops/x86_512-inl.h:3056:30 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  %292:vr512 = VMOVDQA64Zrm $rip, 1, $noreg, %const.1, $noreg, debug-location !277 :: (load (s512) from constant-pool); ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  %2:vr512 = VPANDQZrr killed %290:vr512, killed %292:vr512, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  %293:gr8 = MOV8ri 1
  TEST8rr %293:gr8, %293:gr8, implicit-def $eflags, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  JCC_1 %bb.6, 5, implicit $eflags, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  JMP_1 %bb.5, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]

now it's

bb.4 (%ir-block.182):
; predecessors: %bb.2
  successors: %bb.5, %bb.6

  %291:vr512 = IMPLICIT_DEF
  %290:vr512 = VPACKUSDWZrr %1:vr512, killed %291:vr512, debug-location !276; ./third_party/highway/hwy/ops/x86_512-inl.h:3056:30 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  %292:vr512 = VMOVDQA64Zrm $rip, 1, $noreg, %const.1, $noreg, debug-location !277 :: (load (s512) from constant-pool); ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  %2:vr512 = VPANDQZrr killed %290:vr512, killed %292:vr512, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  %293:gr32 = MOV32r0 implicit-def dead $eflags
  %294:gr8 = COPY %293.sub_8bit:gr32
  TEST8rr %294:gr8, %294:gr8, implicit-def $eflags, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  JCC_1 %bb.6, 5, implicit $eflags, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]
  JMP_1 %bb.5, debug-location !277; ./third_party/highway/hwy/ops/x86_512-inl.h:3066:29 @[ ./third_party/highway/hwy/examples/skeleton.cc:62:18 ]

@vitalybuka Do you have a better repro? I can't tell from your post whether its something to do with this patch or its just unearthed another existing issue

What we might have done is uncovered an existing inconsistency with or with undef folding: https://gcc.godbolt.org/z/e6aPfe3Ye which effects later folds, such as the bitcasting (bitcast <32 x i16> to i512).

define i64 @xx(<2 x i64> %a0) {
  %s = sub <2 x i64> %a0, %a0
  %r = or <2 x i64> <i64 0, i64 undef>, %s
  %x = extractelement <2 x i64> %r, i32 1
  ret i64 %x
}

instcombine:
define i64 @xx(<2 x i64> %a0) {
  ret i64 undef
}

llc:
xx:
        movq    $-1, %rax
        retq

@vitalybuka Any luck with a repro?

@vitalybuka Any luck with a repro?

Sorry, busy with other stuff. I'll reply today.
I assume IR is needed, not source c++?

@vitalybuka Any luck with a repro?

Sorry, busy with other stuff. I'll reply today.
I assume IR is needed, not source c++?

I can investigate the change in folding in the DAG, but what I'm particularly interested in is how we ended up with %235 - the vector OR with a constant-with-poison - that then gets bitcast to a i512 scalar for a comparison. I'm assuming a SimplifyDemandedVectorElts call or something?

	231:                                              ; preds = %115
	  %232 = call <32 x i16> @llvm.x86.avx512.packusdw.512(<16 x i32> %223, <16 x i32> poison) #14
	  %233 = and <32 x i16> %232, zeroinitializer
	  %234 = or <32 x i16> zeroinitializer, %233
	  %235 = or <32 x i16> %234, <i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison, i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison, i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison, i16 0, i16 0, i16 0, i16 0, i16 poison, i16 poison, i16 poison, i16 poison>
	  %236 = and <32 x i16> %232, <i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison, i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison, i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison, i16 32767, i16 32767, i16 32767, i16 32767, i16 poison, i16 poison, i16 poison, i16 poison>
	  %237 = bitcast <32 x i16> %235 to i512
	  %238 = icmp ne i512 %237, 0
	  br i1 %238, label %239, label %240, !prof !10