In X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode, before we try to do SimplifyDemandedVectorElts for OpInputs[Src], if OpInputs[Src] which has users excluding Op.getNode(), we assume that all elements are needed, i.e, set SrcElts.setAllBits() For example: t1317: v8i32 = insert_subvector undef:v8i32, t1414, Constant:i64<0> t1315: v8i32 = X86ISD::BLENDI t380, t1317, TargetConstant:i8<2> t1414: v4i32 = insert_vector_elt t679, t677, Constant:i64<2> t1416: v8i32 = X86ISD::VBROADCAST t1414 When getTargetShuffleInputs(...) processed t1416, it created NewNode: v8i32 = insert_subvector undef:v8i32, t1414, Constant:i64<0> which is the same with t1317. So getTargetShuffleInputs(...) set OpInputs[0] = t1317 which is used by t1315 Before SimplifyDemandedVectorElts processes OpInputs[0] which is used by t1315, we assume that all elements are needed, i.e. SrcElts.setAllBits()
Details
- Reviewers
RKSimon craig.topper LuoYuanke pengfei
Diff Detail
Event Timeline
Really we need to stop creating nodes inside getFauxShuffle - I'm going to see if we can do this without too many regressions.
llvm/test/CodeGen/X86/simplifydemandedvectorselts-broadcast.ll | ||
---|---|---|
21 | these bitcasts can probably be removed ? or at very least the addrspace ? |
Hi, Simon. I am wondering what benefits we can get from getFauxShuffle. Do we have a good example for it?
As I aware of, SimplifyDemandedVectorElts(...) is used to simplify a real SDNode using the information of DemandedVectorElts but It seems weird to simplify a Fauxshuffle.
I'm still looking at fixing getFauxShuffleMask (PR45974) but that might take a while, so this sort of approach is probably necessary.
Did you investigate replacing getTargetShuffleInputs with getTargetShuffleAndZeroables in the SimplifyDemandedBitsForTargetNode/SimplifyDemandedVectorEltsForTargetNode?
llvm/test/CodeGen/X86/simplifydemandedvectorselts-broadcast.ll | ||
---|---|---|
2 | -use -mattr=+avx2 instead of -mcpu=core-avx2 |
Hi, Simon. It seems I figure out why we do getFauxShuffleMask(...) for t1416: v8i32 = X86ISD::VBROADCAST t1414.
Is that because we are able to recursively simpify t1414(the real SRC of t1416) if we are simplifying the fauxshuffle created by getFauxShuffleMask(...)?
Investigating further, I think adding a SimplifyMultipleUseDemandedBits call in the X86ISD::VBROADCAST case at the beginning of SimplifyDemandedVectorEltsForTargetNode would be a better solution going forward, and will help with the codegen in your test case. See the X86ISD::PACKS case for an example.
@yubing I think my fixes for PR45974 have addressed this now - please can you confirm?
Yeah, I've seen that bugfix and it works for my testcase. Thanks, Simon~
Besides, I am wondering in why we call SimplifyMultipleUseDemandedBits behind SimplifyDemandedVectorElts for X86ISD::VBROADCAST in SimplifyDemandedVectorEltsForTargetNode. Is that also correct calling SimplifyMultipleUseDemandedBits before SimplifyDemandedVectorElts ?
SimplifyMultipleUseDemandedBits is very limited as it only deals with bypassing the immediate SDValue (despite its name it handles bits + elts demand masks). SimplifyDemandedBitsForTargetNode/SimplifyDemandedVectorElts are much more capable as they can keep recursing up the DAG looking for a simplification until they reach the maximum depth limit, so it much more likely it will find a simplification and SimplifyMultipleUseDemandedBits tends to be only for cleanup (as it can ignore most OneUse limitations).
-use -mattr=+avx2 instead of -mcpu=core-avx2