Possibly vectorizable operations are widened to the next power-of-2 number of elements with UndefValues, which allows using regular vector operations.
Repository: rG LLVM Github Monorepo
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:199–200
Yeah, will fix it.
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll:15
These "feel" like regressions to me - any idea what's going on?
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll:15
The cost model problem, if I recall correctly. I investigated it before and found out that the cost model for AArch64 is not defined for long vectors in some cases, and we fall back to the generic cost model evaluation, which is not quite correct in many cases. Need to tweak the cost model for AArch64.
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll:15
Any instruction cost type (extract/shuffle/store?) in particular that needs better costs? It'd be good to at least raise a specific bug report to the aarch64 team.
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll:15
I don't remember anymore, need some time to investigate it again. Hope to do it by the end of this week.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7568
Is it necessary to copy these? If so, it would be better to name this function something like "getCopyOfExtraArgValues" to make that explicit. If not, we can just make this a standard 'get' method:
const MapVector<Instruction *, Value *> &getExtraArgs() const { return ExtraArgs; }
And then access the 'second' data in the user code?
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7568
We don't need to expose the first element of the MapVector here; it is not good from the general design point of view. I'll rename the member function.
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll:15
Found the reason. It is the cost of a shuffle of TTI::SK_PermuteSingleSrc kind. Before this patch, the test operated with the vector <2 x i16>, which is transformed to llvm::MVT::v2i32 by the type legalization function, and the cost of this shuffle is tweaked to be 1 (see llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, AArch64TTIImpl::getShuffleCost). The cost of this operation is 1, per the table. With this patch, the original vector type is <4 x i16>, which is transformed to llvm::MVT::v4i16; there is no optimized value for TTI::SK_PermuteSingleSrc in the table for this type, so the function falls back to the pessimistic cost model and returns 18. There are several TODOs in the file already about fixing the cost model for different shuffle operations.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209: Use isValidElementType() or check for undef directly?
3902: Are we always creating a masked load for a vector with 2 elements? This logic needs a code comment to explain the cases.
4868: Please add code comment/example to explain what the difference is between these 2 clauses.
4907: Is Passthrough a full vector of undef elements? If so, it should be created/named that way (or directly in the call to CreateMaskedLoad()) rather than in the loop.
4973–4974: Similar to above (so can we add a helper function to avoid duplicating the code?)
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209: I was just trying to protect the code and support it only for simple types at first. There are some doubts that the cost model for masked loads/stores is complete, so I protected it to make it work only for simple types. I can remove this check if the cost model for masked ops is good enough.
3902: No, no need to do it for 2 elements, removed it.
4868: Fixed it, thanks.
4907: Fixed.
4973–4974: Fixed, thanks!
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209: Masked load/store costs for constant masks should be good enough now (getScalarizationOverhead should now provide us with a reasonable fallback).
All of my comments were addressed, so LGTM. But please wait for an official 'accept' from at least 1 other reviewer.
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll:278
There isn't an ANY check-prefix atm (it was cleaned out in rG119e4550ddedc75e4 as part of the unused prefix cleanup) - please can you review?
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll:278
Yes, I think we need to remove it. Most probably it was caused by a not quite clean merge.
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll:26 (On Diff #306735)
Regression on AVX?
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll:26 (On Diff #306735)
Yes, it looks like an issue with the cost of @llvm.masked.gather for a masked gather with some undefs in the mask.
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll:26 (On Diff #306735)
Gather is slow on CPUs prior to AVX512, and its cost is proportional to the number of elements. I don't think the value of the mask should be a factor.
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll:26 (On Diff #306735)
True, but in some cases it can be optimized into a gather + shuffle instead of a wide gather, if there are undefs in the mask.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2601: Could it actually happen that there is no Instruction in UseEntry, or should this be an assert?
2605: "Lane 0" seems outdated here, but not sure about a better description.
2739: assert(NumberOfInstructions != 0 && "...") and if (NumberOfInstructions == 1)?
3484–3487: Comment typo: aggrgate.
4899–4900: emplace_back()
6313: typo: "indeces"
Btw, I've observed a significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for the awesome service). This could be justified by comparable performance improvements, but have you done any benchmarking?
I did some benchmarking a while back with SPECINT 2006 and, as I remember, the results were good, but I am not sure I could find those now. Yes, for me, having this new functionality with the presented compile-time regression looks ok.
I don't think (geomean) 0.20% is a significant compile-time problem. TBH, I expected bigger CT regressions - up to 0.5% is fine IMHO.
AFAICT the only outstanding question is whether the compile time increase is acceptable?
I'd agree that a geomean of 0.2% is acceptable for a change with such awesome performance impact; I just noted that the compile-time change is significant in comparison with other changes. Generally it looks good to me apart from one minor unaddressed comment.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3213
Well, an "unsortable" or "unprocessable" term would be more precise. But why did we change if (sortPtrAccesses)... to the opposite condition? This change just duplicates the debug output, since we don't differentiate it. Also I'd prefer to see the same if-else structure as for the load case.
Could the summary of the revision be updated with the performance data? The thread is very long and I didn't spot where the measurements are, so it's hard to say what we're trading off here...
Generally though, this level of geomean regression is fine; the issue is usually in the outliers. For example, I spotted this 4-5% regression:
CMakeFiles/lencod.dir/transform8x8.c.o 3053M 3188M (+4.41%)
It may be worthwhile to briefly check it in case something can be improved.
Will try to run the benchmarks and get fresh data.
I'll check what can be improved in terms of compile time. Not sure that I will be able to improve it significantly, since the patch itself does not add extensive analysis/transformations, just one extra iteration for the wider vector analysis. But I'll check it anyway and will try to improve things where possible.
While reviewing the latest update, I think I spotted an SLP compile-time failure in SingleSource/Benchmarks/Misc/oourafft.c; here is the reduced testcase to reproduce:
source_filename = "/home/dtemirbulatov/llvm/test-suite/SingleSource/Benchmarks/Misc/oourafft.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
define dso_local fastcc void @cft1st(double* %a) unnamed_addr #0 {
entry:
  %0 = or i64 16, 2
  %arrayidx107 = getelementptr inbounds double, double* %a, i64 %0
  %1 = or i64 16, 3
  %arrayidx114 = getelementptr inbounds double, double* %a, i64 %1
  %2 = or i64 16, 4
  %arrayidx131 = getelementptr inbounds double, double* %a, i64 %2
  %3 = or i64 16, 6
  %arrayidx134 = getelementptr inbounds double, double* %a, i64 %3
  %4 = load double, double* %arrayidx134, align 8
  %5 = or i64 16, 5
  %arrayidx138 = getelementptr inbounds double, double* %a, i64 %5
  %6 = or i64 16, 7
  %arrayidx141 = getelementptr inbounds double, double* %a, i64 %6
  %7 = load double, double* %arrayidx141, align 8
  %sub149 = fsub double undef, %4
  %sub156 = fsub double undef, %7
  store double undef, double* %arrayidx131, align 8
  store double undef, double* %arrayidx138, align 8
  %sub178 = fsub double undef, %sub156
  %add179 = fadd double undef, %sub149
  %mul180 = fmul double undef, %sub178
  %sub182 = fsub double %mul180, undef
  store double %sub182, double* %arrayidx107, align 8
  %mul186 = fmul double undef, %add179
  %add188 = fadd double %mul186, undef
  store double %add188, double* %arrayidx114, align 8
  unreachable
}
attributes #0 = { "target-features"="+avx,+avx2,+bmi,+bmi2,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" }
!llvm.ident = !{!0}
!0 = !{!"clang version 12.0.0 (https://github.com/llvm/llvm-project.git aaa925795f93c389a96ee01bab73bc2b6b771cbb)"}