Implement the opportunity to merge stores from extracts before type is legalized on PPC.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This seems like it should be folded into the already existing checks. Do you know why NumStoresToMerge was not being before. I expect it's the requirement of a legal types in pre-legal merges or PPC's check for allowed misaligned accesses missing some cases. If it's the former I suspect we can disable the legality requirement prelegaltypes for non-truncated stores (replace TLI.isTypeLegal(ty) with isTypeLegal(ty)).
Thanks for responding, @niravd. I currently have no idea of why NumStoresToMerge was not being before. I might have a look at patches related to this portion of code. I notice that consecutive stores of vector elements was first introduced by https://reviews.llvm.org/rL224611 in which a legal type was a requirement already. I'll have a try of what you said.
Hi, @niravd, after investigate code carefully, I think this check might not be redundant. Considering the case, we have 3 i32 values extracted from vectors, both isTypeLegal and TIL.isTypeLegal see a v3i32 illegal. However, MergeStoresOfConstantsOrVecElts which is called later by MergeConsecutiveStores doesn't require type legality and builds a BUILD_VECTOR node whose elements are 3 EXTRACT_VECTOR_ELT values. PowerPC's vector type legalizer can handle such cases, so it can benefit from getNumStoresOfVectorElementsToMergePreLegalize. I know it's quite weird such check added within a context where type legality check is around. I once wanna implement it in PPCTargetLowering::PerformDAGCombine, however it might make code duplicated.
Hi, @niravd, after investigate code carefully, I think this check might not be redundant. Considering the case, we have 3 i32 values extracted from vectors, both isTypeLegal and TIL.isTypeLegal see a v3i32 illegal.
Are you certain isTypeLegal was returning false in prelegalization? isTypeLegal(x) should be "!LegalTypes || TLI.isTypeLegal(x)". I was expecting that if you swapped out TLI.isTypeLegal for isTypeLegal where it was failing, we could generate an invalid node, but it sounds like that's not the case. If so, I think we should double check.
However, MergeStoresOfConstantsOrVecElts which is called later by MergeConsecutiveStores doesn't require type legality and builds a BUILD_VECTOR node whose elements are 3 EXTRACT_VECTOR_ELT values. PowerPC's vector type legalizer can handle such cases, so it can benefit from getNumStoresOfVectorElementsToMergePreLegalize. I know it's quite weird such check added within a context where type legality check is around. I once wanna implement it in PPCTargetLowering::PerformDAGCombine, however it might make code duplicated.
@niravd Thanks for pointing out my mistake. I tried swapping out TIL.isTypeLegal with isTypeLegal, code generated for PowerPC is not what I expect, cuz TIL.allowsMemoryAccess will return false if Ty is something like v3i32. Also this change will break some regression tests of SystemZ and X86. And I don't quite understand the meaning of 'double check' here, could you please explain more?
By double check, I just meant to look at the results again with isTypeLegal checked, which is where we are?
FWIW, It's probably fine to do something like isTypeLegal with allowsMemoryAccess, though there will likely be more changes in other backends,. I expect most to be mundane. It may be worth it to update and see if others are motivated to look into real regressions.
Looks like it's mostly an improvement though there are some potential regressions around vector shuffles. I'll leave it to the others if it's acceptable to land.
Also, what happened to the PPC test changes?
Also, what happened to the PPC test changes?
Cuz PPC's allowsMemoryAccess lacks information about which CombineLevel is at, it fails the check(Only a few vector types are allowed to have misaligned access). As a result, currently no changes happen in PPC's code. I might try to solve this problem with another patch.
Cuz PPC's allowsMemoryAccess lacks information about which CombineLevel is at, it fails the check(Only a few vector types are allowed to have misaligned access). As a result, currently no changes happen in PPC's code. I might try to solve this problem with another patch.
Ah. Can you solve this by doing the analog to isTypeLegal (vs. isTypeLegal) and assume it's true prelegalization? If it's that small, you should (but don't feel like you must) fold it into this patch.
At a first glance this seems to be generating worse code now since we need to do all those permute-type instructions in order to get the bytes into the correct order in a single register to store ... I'll clarify with the hardware folks which of the sequences would actually be preferable.
If it turns out that there are instances where *not* merging stores (even when it would be *possible*) is not preferred from a performance perspective, should common code use some cost function here?
Considering suggestions of @niravd and @uweigand , is it proper to have an implementation like
bool DAGCombiner::isAbleToMergeConsecutiveStoresPreLegalize(ArrayRef<SNode*> elements, .../* Params related to align, addrspace and etc.*/) { if (LegalTypes) return false; // Let target decides cost considering elements to be stored. return TLI.canMergeConsecutiveStoresOfVectorElements(elements, ...); }
And new check is
if (isTypeLegal(Ty) && TLI.canMergeStoresTo(FirstStoreAS, Ty, DAG) && ((TLI.allowsMemoryAccess(Context, DL, Ty, *FirstInChain->getMemOperand(), &IsFast) && IsFast) || isAbleToMergeConsecutiveStoresPreLegalize(elements, ...))
It's reasonable assuming we want to prohibit the SystemZ cases. Note that TLI.canMergeConsecutiveStoresOfVectorElements and TLI.canMergeStoresTo are both only called here and should be merged into a single method.
That said, the permute expression seems like a sign that there are some permutation peepholes may be worthwhile. We could simplify the permute into a BUILD_VECTOR of a vector stores which should be close enough to the original element-wise stores here.
Hi @RKSimon , currently I'm not working on it. I should have abandoned this patch. I might have another patch which also works for PowerPC.
Since this patch has long been not updated, I'll close it and plan another patch that also works for PowerPC.
With this change, the SystemZ case is not touched. Two x86 cases still need to confirm.
According to previous comments, it may be better to have some cost function to decide if merge is preferred or not. I'm not sure how to implement the cost function, one reason is that the final code sequence generated may depends on target specific flag etc., and backend target at this stage may not have enough information to decide. For example results in extract-and-store.ll show dependence on -ppc-disable-perfect-shuffle=false.
Instead of a cost function, could we use (possibly tweaked) isMultiStoresCheaperThanBitsMerge?
llvm/test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
533 | is this really an issue with the store merging or the ppc shuffle combines have gotten messed up? |
Thank you for pointing out. I will try to come up with something to guard against cases that got degenerated.
llvm/test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
533 | Had a quick check, this is the case PPC::LowerVECTOR_SHUFFLE does not have efficient solution, so it turned into VPERM as last resort. Probably the cost function should avoid this kind of situation. |
llvm/test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
533 | I'd be nervous about a cost function as that is likely to be very difficult to keep balanced. I'd probably recommend just overriding canMergeStoresTo or isMultiStoresCheaperThanBitsMerge for PPC |
Added cost function for target as suggested. The function on PPC tries to avoid some patterns that lead to TOC accesses.
llvm/test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
533 | Thank you for the advice. I created a new function, hope that is fine. |
This code seems quite unnecessarily complex. I can achieve essentially the same results with something like this:
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h index 36973f5bddb0..984e84ba6fdc 100644 --- a/llvm/include/llvm/CodeGen/TargetLowering.h +++ b/llvm/include/llvm/CodeGen/TargetLowering.h @@ -939,6 +939,9 @@ public: (unsigned)VT.getSimpleVT().SimpleTy < array_lengthof(RegClassForVT)); return VT.isSimple() && RegClassForVT[VT.getSimpleVT().SimpleTy] != nullptr; } + virtual bool isTypeLegalForMemAccess(EVT VT) const { + return isTypeLegal(VT); + } class ValueTypeActionImpl { /// ValueTypeActions - For each value type, keep a LegalizeTypeAction enum diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp index 5e77317572af..6acde2a5ae91 100644 --- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp @@ -18486,7 +18486,7 @@ bool DAGCombiner::tryStoreMergeOfExtracts( if (Ty.getSizeInBits() > MaximumLegalStoreInBits) break; - if (TLI.isTypeLegal(Ty) && + if (TLI.isTypeLegalForMemAccess(Ty) && TLI.canMergeStoresTo(FirstStoreAS, Ty, DAG.getMachineFunction()) && TLI.allowsMemoryAccess(Context, DL, Ty, *FirstInChain->getMemOperand(), &IsFast) && diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp index 862d2ebc75a6..ceed4c1ffc91 100644 --- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp +++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp @@ -17569,7 +17569,8 @@ bool PPCTargetLowering::allowsMisalignedMemoryAccesses(EVT VT, return true; if (Subtarget.hasVSX()) { if (VT != MVT::v2f64 && VT != MVT::v2i64 && - VT != MVT::v4f32 && VT != MVT::v4i32) + VT != MVT::v4f32 && VT != MVT::v4i32 && + VT != MVT::v2f32 && VT != MVT::v2i32) return false; } else { return false; diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.h b/llvm/lib/Target/PowerPC/PPCISelLowering.h index 2fa6d45bfe1a..1f0051f8d273 100644 --- a/llvm/lib/Target/PowerPC/PPCISelLowering.h +++ b/llvm/lib/Target/PowerPC/PPCISelLowering.h @@ -1101,6 +1101,11 @@ namespace llvm { EVT getOptimalMemOpType(const MemOp &Op, const AttributeList &FuncAttributes) const override; + bool isTypeLegalForMemAccess(EVT VT) const override { + bool Ret = TargetLoweringBase::isTypeLegalForMemAccess(VT) || VT == MVT::v2i32 || VT == MVT::v2f32; + return Ret; + } + /// Is unaligned memory access allowed for the given type, and is it fast /// relative to software emulation. bool allowsMisalignedMemoryAccesses(
Sure, it produces some vperm's with this test case, but I don't see an issue with that - in most cases that matter, the constant pool loads aren't likely to lead to a lot of cache misses.
The original approach is too complex, and the same effect can be achieved more simply as Nemanja pointed out.
I'm adopting the whole approach, and added a guard to make sure this is applied only before type is legalized.
llvm/include/llvm/CodeGen/TargetLowering.h | ||
---|---|---|
945 | please add doxygen descriptions of the params (B in particular....) |
please add doxygen descriptions of the params (B in particular....)