
[SLP] Initial support for the vectorization of the non-power-of-2 vectors.
Needs ReviewPublic

Authored by ABataev on Jan 22 2019, 8:30 AM.

Details

Summary

Candidate operations whose count is not a power of two are padded to the next power-of-2 width with UndefValues, allowing regular vector operations to be used.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev added inline comments.Sep 23 2020, 4:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
199–200

Yeah, will fix it.

ABataev updated this revision to Diff 293701.Sep 23 2020, 5:30 AM

Rebase + fix

RKSimon added inline comments.Sep 23 2020, 5:52 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
15

These "feel" like regressions to me - any idea what's going on?

ABataev added inline comments.Sep 23 2020, 5:55 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
15

A cost model problem, if I recall correctly. I investigated it before and found that the AArch64 cost model is not defined for long vectors in some cases, so we fall back to the generic cost model evaluation, which is not quite correct in many cases. The AArch64 cost model needs to be tweaked.

RKSimon added inline comments.Sep 23 2020, 6:15 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
15

Is there any instruction cost type (extract/shuffle/store?) in particular that needs better costs? It'd be good to at least raise a specific bug report with the AArch64 team.

ABataev added inline comments.Sep 23 2020, 6:31 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
15

I don't remember any more; I need some time to investigate it again. I hope to do it by the end of this week.
PS. There was already a question about this test.

spatel added inline comments.Sep 23 2020, 7:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7568

Is it necessary to copy these? If so, it would be better to name this function something like "getCopyOfExtraArgValues" to make that explicit.

If not, we can just make this a standard 'get' method:

const MapVector<Instruction *, Value *> &getExtraArgs() const {
  return ExtraArgs;
}

And then access the 'second' data in the user code?

ABataev added inline comments.Sep 24 2020, 6:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7568

We don't need to expose the first element of the MapVector here; doing so would be poor design in general. I'll rename the member function.

ABataev updated this revision to Diff 294040.Sep 24 2020, 6:33 AM

Rebase + rename

ABataev added inline comments.Sep 24 2020, 7:19 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
15

Found the reason. It is the cost of a shuffle of TTI::SK_PermuteSingleSrc kind. Before this patch, the test operated on the vector <2 x i16>, which the type legalization function transforms to llvm::MVT::v2i32, and the cost of this shuffle is tweaked to be 1 per the cost table (see llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, AArch64TTIImpl::getShuffleCost).

With this patch, the original vector type is <4 x i16>, which is transformed to llvm::MVT::v4i16; there is no optimized value for TTI::SK_PermuteSingleSrc in the table for this type, so the function falls back to the pessimistic cost model and returns 18.

There are already several TODOs in the file about fixing the cost model for different shuffle operations.

Does anyone have any more comments?

spatel added inline comments.Sep 30 2020, 6:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209

Use isValidElementType() or check for undef directly?
I still can't tell from the debug statement exactly what we are guarding against. Should the type check already be here even without this patch?

3902

Are we always creating a masked load for a vector with 2 elements? This logic needs a code comment to explain the cases.

4868

Please add code comment/example to explain what the difference is between these 2 clauses.

4907

Is Passthrough a full vector of undef elements? If so, it should be created/named that way (or directly in the call to CreateMaskedLoad()) rather than in the loop.

4973–4974

Similar to above (so can we add a helper function to avoid duplicating the code?):
Please add code comment/example to explain what the difference is between these 2 clauses.

reverse ping

reverse ping

Will update the patch as soon as I'm back to work, in 2-3 weeks.

reverse ping?

reverse ping?

Need some time to set up my dev environment; will update ASAP.

ABataev added inline comments.Nov 9 2020, 10:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209

I was just trying to be conservative and support only simple types at first. There were some doubts that the cost model for masked loads/stores is complete, so I restricted this to simple types. I can remove the check if the cost model for masked ops is good enough.

3902

No, there's no need to do it for 2 elements; removed.

4868

Fixed it, thanks.

4907

Fixed

4973–4974

Fixed, thanks!

RKSimon added inline comments.Nov 9 2020, 11:14 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209

masked load/store costs for constant masks should be good enough now (getScalarizationOverhead should now provide us with a reasonable fallback)

ABataev updated this revision to Diff 304196.Nov 10 2020, 8:11 AM

Rebase, updates and fixes

All of my comments were addressed, so LGTM. But please wait for an official 'accept' from at least 1 other reviewer.

RKSimon added inline comments.Nov 12 2020, 9:35 AM
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
278

There isn't an ANY check-prefix atm (it was cleaned out in rG119e4550ddedc75e4 as part of the unused prefix cleanup) - please can you review?

ABataev added inline comments.Nov 12 2020, 9:39 AM
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
278

Yes, I think it needs to be removed. Most probably it was caused by a not-quite-clean merge.

ABataev updated this revision to Diff 304968.Nov 12 2020, 1:49 PM

Rebase, test cleanup + small code improvements.

xbolva00 added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Regression on avx?

ABataev added inline comments.Nov 20 2020, 11:01 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Yes, it looks like an issue with the cost of @llvm.masked.gather when the mask contains some undefs.

craig.topper added inline comments.Nov 20 2020, 11:07 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Gather is slow on CPUs prior to AVX512. And its cost is proportional to the number of elements. I don't think the value of the mask should be a factor.

ABataev added inline comments.Nov 20 2020, 11:18 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

True, but in some cases it can be optimized into a narrower gather + shuffle instead of a wide gather, if there are undefs in the mask.

ABataev updated this revision to Diff 307098.Nov 23 2020, 9:09 AM

Rebase + improve handling of masked gathers.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2601

Could UseEntry actually contain no Instruction at all, or should this be an assert?

2605

"Lane 0" seems outdated here, but I'm not sure of a better description.

2739

assert(NumberOfInstructions != 0 && "...") and if (NumberOfInstructions == 1)?

3484–3487

Comment typo: aggrgate.

4899–4900

emplace_back()

6313

typo: "indeces"

ABataev updated this revision to Diff 307204.Nov 23 2020, 2:45 PM

Fixed according to comments

ABataev updated this revision to Diff 307205.Nov 23 2020, 3:02 PM

Fixed function name.

Btw, I've observed a significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for the awesome service). This could be justified by comparable performance improvements, but have you done any benchmarking?

dtemirbulatov added a comment.EditedDec 1 2020, 3:37 PM

> Btw, I've observed a significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for the awesome service). This could be justified by comparable performance improvements, but have you done any benchmarking?

I did some benchmarking a while back with SPEC INT 2006 and, as I remember, the results were good, but I am not sure I could find them now. Yes, to me, having this new functionality at the cost of the reported compile-time regression looks OK.

I don't think a 0.20% geomean is a significant compile-time problem. TBH, I expected bigger CT regressions - up to 0.5% is fine IMHO.

AFAICT the only outstanding question is whether the compile time increase is acceptable?

> AFAICT the only outstanding question is whether the compile time increase is acceptable?

I'd agree that a 0.2% geomean is acceptable for a change with such an awesome performance impact; I just noted that the compile-time change is significant in comparison with other changes. Generally it looks good to me, apart from one minor unaddressed comment.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3213

Well, "unsortable" or "unprocessable" would be a more precise term. But why did we change if (sortPtrAccesses)... to the opposite condition? This change just duplicates the debug output, since we don't differentiate it. Also, I'd prefer to see the same if-else structure as for the load case.

nikic added a comment.Dec 9 2020, 5:19 AM

> AFAICT the only outstanding question is whether the compile time increase is acceptable?
>
> I'd agree that a 0.2% geomean is acceptable for a change with such an awesome performance impact; I just noted that the compile-time change is significant in comparison with other changes. Generally it looks good to me, apart from one minor unaddressed comment.

Could the summary of the revision be updated with the performance data? The thread is very long and I didn't spot where the measurements are, so it's hard to say what we're trading off here...

Generally though this level of geomean regression is fine; the issue is usually in the outliers. For example, I spotted this 4-5% regression:

CMakeFiles/lencod.dir/transform8x8.c.o 	3053M 	3188M (+4.41%)

It may be worthwhile to briefly check it in case something can be improved.

> AFAICT the only outstanding question is whether the compile time increase is acceptable?
>
> I'd agree that a 0.2% geomean is acceptable for a change with such an awesome performance impact; I just noted that the compile-time change is significant in comparison with other changes. Generally it looks good to me, apart from one minor unaddressed comment.
>
> Could the summary of the revision be updated with the performance data? The thread is very long and I didn't spot where the measurements are, so it's hard to say what we're trading off here...

Will try to run the benchmarks and get fresh data.

> Generally though this level of geomean regression is fine; the issue is usually in the outliers. For example, I spotted this 4-5% regression:
>
> CMakeFiles/lencod.dir/transform8x8.c.o 	3053M 	3188M (+4.41%)
>
> It may be worthwhile to briefly check it in case something can be improved.

I'll check what can be improved in terms of compile time. I'm not sure I'll be able to improve it significantly, since the patch itself does not add extensive analysis/transformations, just one extra iteration for wider vector analysis. But I'll check anyway and try to improve things where possible.

While reviewing the latest update, I think I spotted an SLP compile-time failure in SingleSource/Benchmarks/Misc/oourafft.c; here is the reduced testcase to reproduce:
source_filename = "/home/dtemirbulatov/llvm/test-suite/SingleSource/Benchmarks/Misc/oourafft.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local fastcc void @cft1st(double* %a) unnamed_addr #0 {
entry:
  %0 = or i64 16, 2
  %arrayidx107 = getelementptr inbounds double, double* %a, i64 %0
  %1 = or i64 16, 3
  %arrayidx114 = getelementptr inbounds double, double* %a, i64 %1
  %2 = or i64 16, 4
  %arrayidx131 = getelementptr inbounds double, double* %a, i64 %2
  %3 = or i64 16, 6
  %arrayidx134 = getelementptr inbounds double, double* %a, i64 %3
  %4 = load double, double* %arrayidx134, align 8
  %5 = or i64 16, 5
  %arrayidx138 = getelementptr inbounds double, double* %a, i64 %5
  %6 = or i64 16, 7
  %arrayidx141 = getelementptr inbounds double, double* %a, i64 %6
  %7 = load double, double* %arrayidx141, align 8
  %sub149 = fsub double undef, %4
  %sub156 = fsub double undef, %7
  store double undef, double* %arrayidx131, align 8
  store double undef, double* %arrayidx138, align 8
  %sub178 = fsub double undef, %sub156
  %add179 = fadd double undef, %sub149
  %mul180 = fmul double undef, %sub178
  %sub182 = fsub double %mul180, undef
  store double %sub182, double* %arrayidx107, align 8
  %mul186 = fmul double undef, %add179
  %add188 = fadd double %mul186, undef
  store double %add188, double* %arrayidx114, align 8
  unreachable

}

attributes #0 = { "target-features"="+avx,+avx2,+bmi,+bmi2,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" }

!llvm.ident = !{!0}

!0 = !{!"clang version 12.0.0 (https://github.com/llvm/llvm-project.git aaa925795f93c389a96ee01bab73bc2b6b771cbb)"}

> While reviewing the latest update, I think I spotted an SLP compile-time failure in SingleSource/Benchmarks/Misc/oourafft.c (reduced testcase quoted above).

Do you mean a compile-time increase? With this patch?

> Do you mean a compile-time increase? With this patch?

No, just a compile-time error.

> Do you mean a compile-time increase? With this patch?
>
> No, just a compile-time error.

Crash or incorrect code?

> Do you mean a compile-time increase? With this patch?
>
> No, just a compile-time error.
>
> Crash or incorrect code?

Crash.

wxiao3 added a subscriber: wxiao3.Dec 14 2020, 7:09 AM