Here's a patch to match VLS all-1s masks to PTRUE. There were a few places, before and after legalization, where this could be done, but I think SETCC_MERGE_ZERO combines are the best fit.
I may be misunderstanding, but this pattern is just a legalized truncate, e.g. ({1,1,1,1} & splat(1)) != splat(0). The VLS->VLA transition is creating a bunch of extra nodes that need to be matched.
Fix formatting for the Lint bots.
Notice the "sign_extend" -> "sext" change to fit in 80 columns. This no longer matches the ISD node naming scheme, so it's a little weird. I didn't see a better fix for it though.
Hi @cameron.mcinally, I really like what you're trying to do in this patch and the codegen indeed looks a lot better! I just had some suggestions about a possibly simpler, and more comprehensive approach that might give us more overall benefit.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

Line 17030: I think you also need to check for `&& !Negated` here. Alternatively, I think you could just do:

    APInt SplatVal;
    if (isAllActivePredicate(DAG, Pred) && LHS.getOpcode() == ISD::AND &&
        ISD::isConstantSplatVector(LHS.getOperand(1), SplatVal) && SplatVal == 1) {

The only additional value that isPow2Splat adds here is that it also checks for AArch64ISD::DUP nodes, but I imagine at the point we're doing the DAG combines here we haven't generated an AArch64 ISD node yet? (A fleshed-out sketch of this check follows these comments.)

Line 17036: It feels like we should have a more basic DAG combine here, i.e. something in either DAGCombiner::visitINSERT_SUBVECTOR or in performInsertSubvectorCombine, that basically combines

    <vscale x M x iXY> insert_subvector <vscale x M x iXY> undef, <N x iXY> <splat of iXY A>

into

    <vscale x M x iXY> <splat of iXY A>

when we know that vscale x M == N. If you implement such a DAG combine it might benefit other parts of the code too. It would then mean here you should only have to check if Trunc is a splat of 1, which may also catch more cases.

Line 17038: Again, here I think you need to check `&& !TruncNegated`.
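To make the suggestion at line 17030 concrete, here is a minimal sketch of what such a check could look like inside a SETCC_MERGE_ZERO combine. It assumes the usual AArch64ISelLowering.cpp context, including the existing isAllActivePredicate and getPTrue static helpers; the function name is made up, and this is an illustration of the idea rather than the code in this diff:

    // Sketch only: fold
    //   setcc_merge_zero(all-active, (and splat(1), splat(1)), splat(0), setne)
    // into an all-active PTRUE.
    static SDValue combineAllOnesMaskToPTrue(SDNode *N, SelectionDAG &DAG) {
      assert(N->getOpcode() == AArch64ISD::SETCC_MERGE_ZERO);
      SDValue Pred = N->getOperand(0);
      SDValue LHS = N->getOperand(1);
      SDValue RHS = N->getOperand(2);
      ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(3))->get();
      EVT VT = N->getValueType(0);
      SDLoc DL(N);

      APInt SplatVal;
      if (CC == ISD::SETNE && isAllActivePredicate(DAG, Pred) &&
          ISD::isConstantSplatVectorAllZeros(RHS.getNode()) &&
          LHS.getOpcode() == ISD::AND &&
          ISD::isConstantSplatVector(LHS.getOperand(1).getNode(), SplatVal) &&
          SplatVal == 1) {
        // If the value being masked is itself a splat of 1 (the all-ones VLS
        // mask after an insert_subvector->splat fold), every lane compares
        // not-equal to zero, so the whole node is just an all-active ptrue.
        APInt MaskVal;
        if (ISD::isConstantSplatVector(LHS.getOperand(0).getNode(), MaskVal) &&
            MaskVal == 1)
          return getPTrue(DAG, DL, VT, AArch64SVEPredPattern::all);
      }
      return SDValue();
    }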
Hi @cameron.mcinally, sorry I mentioned @craig.topper in my previous comment, but I meant you. It's because I've also just reviewed one of Craig's patches so I got mixed up. :)
Does D120328 achieve the effect you're after @cameron.mcinally? I need to pull out and extend the LowerSPLAT_VECTOR related change but figured I'd push up my current work in case it helps.
Updated patch based on @david-arm's review.
@paulwalker-arm, D120328 looks good too. I'm happy to go with that one. But I notice that it will define bits that may otherwise be undef, e.g. when we're inserting a 1/4-width vector into a full-width vector, or when Idx != 0.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

Line 14457: I'm not sure if getVScaleForTuning is the right way to go here, but it seemed like the cleanest solution. I also wonder if all insert_subvector(undef, splat(X), 0) -> splat(X) should be canonicalized here. I don't have a strong opinion on it though. (A sketch of this combine follows below.)
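For reference, a rough sketch of the insert_subvector(undef, splat(X), 0) -> splat(X) fold being discussed, assuming it sits alongside performInsertSubvectorCombine and that getVScaleForTuning accurately describes the runtime vector length (the caveat raised in the reply below); the function name is invented for illustration:

    // Sketch only: turn an insert of a fixed-length splat into an undef
    // scalable vector at index 0 into a scalable splat, provided the fixed
    // vector is believed to cover every lane of the scalable type.
    static SDValue combineInsertSubvectorOfSplat(SDNode *N, SelectionDAG &DAG,
                                                 const AArch64Subtarget *Subtarget) {
      SDValue Vec = N->getOperand(0);
      SDValue SubVec = N->getOperand(1);
      uint64_t Idx = N->getConstantOperandVal(2);
      EVT VT = N->getValueType(0);
      EVT SubVT = SubVec.getValueType();

      if (!VT.isScalableVector() || !SubVT.isFixedLengthVector() ||
          !Vec.isUndef() || Idx != 0)
        return SDValue();

      // Only sound if vscale * MinNumElts really equals the fixed element
      // count; getVScaleForTuning is just a tuning hint, which is the concern
      // raised in this thread.
      if (VT.getVectorMinNumElements() * Subtarget->getVScaleForTuning() !=
          SubVT.getVectorNumElements())
        return SDValue();

      if (SDValue SplatVal = DAG.getSplatValue(SubVec))
        return DAG.getSplatVector(VT, SDLoc(N), SplatVal);

      return SDValue();
    }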
I also wondered about defining bits that may otherwise be undef, but am unsure how much it really matters. I'll see if I can limit the combine (restrict when it fires, or perhaps just handle the constant case) to reduce any potential downsides and report back. I'll also note that I believe this patch suffers from the same problem, because getVScaleForTuning() is only a hint. You can use getMinSVEVectorSizeInBits and getMaxSVEVectorSizeInBits to see if the true size is known, but the downside is that you'll then only be optimising the cases where the fixed-length vector is the same size as its scalable equivalent, which means not all vectorised loops will see the benefit.
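A hedged sketch of that alternative guard, with a made-up helper name purely for illustration:

    // Hypothetical helper: the fold is only clearly safe when the exact SVE
    // register size is known (min == max, e.g. -msve-vector-bits=N) and the
    // fixed-length vector spans the whole register, so no undef lanes remain.
    static bool fixedVectorFillsSVERegister(const AArch64Subtarget *Subtarget,
                                            EVT FixedVT) {
      unsigned MinSVESize = Subtarget->getMinSVEVectorSizeInBits();
      unsigned MaxSVESize = Subtarget->getMaxSVEVectorSizeInBits();
      return MinSVESize != 0 && MinSVESize == MaxSVESize &&
             FixedVT.getFixedSizeInBits() == MinSVESize;
    }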
Good point. Replacing the lowered truncates with ptrue sounds like a win in the general case. Abandoning this Diff.