This is an archive of the discontinued LLVM Phabricator instance.

[DAG] SimplifyDemandedVectorElts Bug fix for rG7cb5a51f386d
AbandonedPublic

Authored by yubing on May 15 2020, 2:09 AM.


Details

Reviewers

RKSimon
craig.topper
LuoYuanke
pengfei

Summary

In X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode,
before we call SimplifyDemandedVectorElts on OpInputs[Src],
we check whether OpInputs[Src] has users other than Op.getNode().
If it does, we assume that all of its elements are needed, i.e. we call SrcElts.setAllBits().

For example:
t1317: v8i32 = insert_subvector undef:v8i32, t1414, Constant:i64<0>
t1315: v8i32 = X86ISD::BLENDI t380, t1317, TargetConstant:i8<2>
t1414: v4i32 = insert_vector_elt t679, t677, Constant:i64<2>
t1416: v8i32 = X86ISD::VBROADCAST t1414

When getTargetShuffleInputs(...) processed t1416, it created a new node:
  v8i32 = insert_subvector undef:v8i32, t1414, Constant:i64<0>
which is identical to t1317.

So getTargetShuffleInputs(...) set
 OpInputs[0] = t1317, which is also used by t1315.

Because OpInputs[0] is also used by t1315, we assume that all of its elements
are needed (SrcElts.setAllBits()) before SimplifyDemandedVectorElts processes it.

Diff Detail

Event Timeline

yubing created this revision. May 15 2020, 2:09 AM

Herald added a project: Restricted Project. May 15 2020, 2:09 AM

Herald added subscribers: llvm-commits, hiraditya.

Harbormaster completed remote builds in B56831: Diff 264183. May 15 2020, 4:30 AM

Really we need to stop creating nodes inside getFauxShuffle - I'm going to see if we can do this without too many regressions.

llvm/test/CodeGen/X86/simplifydemandedvectorselts-broadcast.ll
Line 21: These bitcasts can probably be removed? Or at the very least the addrspace?

yubing updated this revision to Diff 264269. May 15 2020, 9:56 AM

In D79987#2038738, @RKSimon wrote:

Really we need to stop creating nodes inside getFauxShuffle - I'm going to see if we can do this without too many regressions.

Hi, Simon. I am wondering what benefit we get from getFauxShuffle. Is there a good example of it?
As far as I am aware, SimplifyDemandedVectorElts(...) is used to simplify a real SDNode using the DemandedVectorElts information, so it seems odd to simplify a faux shuffle.

PING @RKSimon

I'm still looking at fixing getFauxShuffleMask (PR45974) but that might take a while, so this sort of approach is probably necessary.

Did you investigate replacing getTargetShuffleInputs with getTargetShuffleAndZeroables in the SimplifyDemandedBitsForTargetNode/SimplifyDemandedVectorEltsForTargetNode?

llvm/test/CodeGen/X86/simplifydemandedvectorselts-broadcast.ll
Line 2: Use -mattr=+avx2 instead of -mcpu=core-avx2.

yubing updated this revision to Diff 265967. May 24 2020, 9:54 PM

In D79987#2041642, @RKSimon wrote:

I'm still looking at fixing getFauxShuffleMask (PR45974) but that might take a while, so this sort of approach is probably necessary.

Did you investigate replacing getTargetShuffleInputs with getTargetShuffleAndZeroables in the SimplifyDemandedBitsForTargetNode/SimplifyDemandedVectorEltsForTargetNode?

Hi, Simon. I think I have figured out why we call getFauxShuffleMask(...) for t1416: v8i32 = X86ISD::VBROADCAST t1414.
Is it because, by simplifying the faux shuffle created by getFauxShuffleMask(...), we are able to recursively simplify t1414 (the real source of t1416)?

Investigating further, I think adding a SimplifyMultipleUseDemandedBits call in the X86ISD::VBROADCAST case at the beginning of SimplifyDemandedVectorEltsForTargetNode would be a better solution going forward, and will help with the codegen in your test case. See the X86ISD::PACKS case for an example.

RKSimon mentioned this in rG45ebe38ffc40: [X86][AVX] Pad small shuffle inputs in combineX86ShufflesRecursively. May 31 2020, 4:47 AM

RKSimon mentioned this in rG15b281d7805d: [X86][AVX] Add test case described in D79987. May 31 2020, 6:22 AM

RKSimon mentioned this in rG4a2673d79fdb: [X86][AVX] Add SimplifyMultipleUseDemandedBits VBROADCAST handling to…. May 31 2020, 6:54 AM

@yubing I think my fixes for PR45974 have addressed this now - please can you confirm?

In D79987#2065821, @RKSimon wrote:

@yubing I think my fixes for PR45974 have addressed this now - please can you confirm?

Yeah, I've seen that bugfix and it works for my testcase. Thanks, Simon~
Also, I am wondering why we call SimplifyMultipleUseDemandedBits after SimplifyDemandedVectorElts for X86ISD::VBROADCAST in SimplifyDemandedVectorEltsForTargetNode. Would it also be correct to call SimplifyMultipleUseDemandedBits before SimplifyDemandedVectorElts?

yubing abandoned this revision. Jun 1 2020, 1:29 AM

In D79987#2065834, @yubing wrote:

In D79987#2065821, @RKSimon wrote:

@yubing I think my fixes for PR45974 have addressed this now - please can you confirm?

Yeah, I've seen that bugfix and it works for my testcase. Thanks, Simon~
Also, I am wondering why we call SimplifyMultipleUseDemandedBits after SimplifyDemandedVectorElts for X86ISD::VBROADCAST in SimplifyDemandedVectorEltsForTargetNode. Would it also be correct to call SimplifyMultipleUseDemandedBits before SimplifyDemandedVectorElts?

SimplifyMultipleUseDemandedBits is very limited, as it only deals with bypassing the immediate SDValue (despite its name, it handles both bits and elts demand masks). SimplifyDemandedBitsForTargetNode/SimplifyDemandedVectorElts are much more capable, as they can keep recursing up the DAG looking for a simplification until they reach the maximum depth limit, so they are much more likely to find one. SimplifyMultipleUseDemandedBits tends to be used only for cleanup (as it can ignore most OneUse limitations).
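
The contrast drawn in this reply, a single-step bypass versus a search that keeps recursing until a depth limit, can be sketched with a toy model. This is a loose illustration only and not LLVM's implementation: the Node type, bypassOnce, and simplifyRecursive are hypothetical names, each non-leaf node simply overwrites one element of its base vector, and the real functions differ in many respects (for example, in how they treat multi-use nodes and in whether they modify the DAG).

```cpp
#include <cstdint>

// Hypothetical toy node: overwrites element InsertIdx of the vector Base.
// A leaf has Base == nullptr.
struct Node {
  const Node *Base;
  int InsertIdx;
};

// Single step: if the element written by N is not demanded, N can be
// bypassed in favour of its base. Looks through one node only.
const Node *bypassOnce(const Node *N, uint64_t Demanded) {
  if (N->Base && !((Demanded >> N->InsertIdx) & 1))
    return N->Base;
  return N;
}

// Recursive variant: keep peeling non-demanded writes until nothing more
// can be bypassed or the depth limit is reached.
const Node *simplifyRecursive(const Node *N, uint64_t Demanded,
                              unsigned Depth, unsigned MaxDepth) {
  if (Depth >= MaxDepth)
    return N;
  const Node *Next = bypassOnce(N, Demanded);
  if (Next == N)
    return N;
  return simplifyRecursive(Next, Demanded, Depth + 1, MaxDepth);
}
```

For a chain of two writes to non-demanded elements, the single step stops at the intermediate node, while the recursive variant reaches the leaf as long as the depth limit allows it.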

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.cpp	16 lines
llvm/test/CodeGen/X86/simplifydemandedvectorselts-broadcast.ll	19 lines

Diff 264269

llvm/lib/Target/X86/X86ISelLowering.cpp


(earlier lines not shown; enclosing context: for (int Src = 0; Src != NumSrcs; ++Src) {)

      APInt SrcElts = APInt::getNullValue(NumElts);
      for (int i = 0; i != NumElts; ++i)
        if (DemandedElts[i]) {
          int M = OpMask[i] - Lo;
          if (0 <= M && M < NumElts)
            SrcElts.setBit(M);
        }

+     // As for OpInputs[Src] which has users excluding Op.getNode(),
+     // we assume that all elements are needed, i.e, set SrcElts.setAllBits()
+     // For example:
+     // t1317: v8i32 = insert_subvector undef:v8i32, t1414, Constant:i64<0>
+     // t1315: v8i32 = X86ISD::BLENDI t380, t1317, TargetConstant:i8<2>
+     // t1414: v4i32 = insert_vector_elt t679, t677, Constant:i64<2>
+     // t1416: v8i32 = X86ISD::VBROADCAST t1414
+     // When getTargetShuffleInputs(...) processed t1416, it created
+     // NewNode: v8i32 = insert_subvector undef:v8i32, t1414, Constant:i64<0>
+     // which is the same with t1317.
+     // So getTargetShuffleInputs(...) set
+     // OpInputs[0] = t1317 which is used by t1315
+     // Before SimplifyDemandedVectorElts processes OpInputs[0] which is used by
+     // t1315, we assume that all elements are needed, i.e. SrcElts.setAllBits()
+     if (!OpInputs[Src].isOperandOf(Op.getNode()) && !OpInputs[Src].use_empty())
+       SrcElts.setAllBits();
      // TODO - Propagate input undef/zero elts.
      APInt SrcUndef, SrcZero;
      if (SimplifyDemandedVectorElts(OpInputs[Src], SrcElts, SrcUndef, SrcZero,
                                     TLO, Depth + 1))
        return true;
    }

    // If we don't demand all elements, then attempt to combine to a simpler

(later lines not shown)

llvm/test/CodeGen/X86/simplifydemandedvectorselts-broadcast.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=core-avx2 | FileCheck %s

Inline comment (RKSimon): Use -mattr=+avx2 instead of -mcpu=core-avx2.

; Function Attrs: noinline nounwind optnone uwtable
define <16 x i32> @main(<3 x i32>* %ptr) {
; CHECK-LABEL: main:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; CHECK-NEXT: vpinsrd $2, 8(%rdi), %xmm0, %xmm1
; CHECK-NEXT: vpxor %xmm0, %xmm0, %xmm0
; CHECK-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]
; CHECK-NEXT: vpshufb {{.*#+}} ymm1 = zero,zero,zero,zero,zero,zero,zero,zero,ymm1[0,1,2,3],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero
; CHECK-NEXT: retq
entry:
  %int3 = load <3 x i32>, <3 x i32>* %ptr, align 1
  %0 = shufflevector <3 x i32> %int3, <3 x i32> undef, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %1 = shufflevector <16 x i32> <i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0>, <16 x i32> %0, <16 x i32> <i32 0, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 11, i32 12, i32 13, i32 14, i32 15>
  ret <16 x i32 > %1
}

Inline comment (RKSimon): These bitcasts can probably be removed? Or at the very least the addrspace?