Currently we emit gathers for scalars being vectorized in the tree as
pairs of extractelement/insertelement instructions. Instead, we can try
to find all the required vectors and emit shufflevector instructions
directly, improving the generated code and reducing compile time.
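As a rough illustration of the idea in the summary, the sketch below models the decision in plain C++ rather than with the LLVM API. The names `GatherElt`, `ShufflePlan`, and `tryBuildShuffle` are hypothetical and do not come from SLPVectorizer.cpp. It assumes each gathered scalar is either an extract from a known source vector or an unrelated scalar, and that one shuffle can combine at most two source vectors of the same width.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical model: each scalar to gather is either a lane extracted from
// some source vector, or an unrelated scalar (SourceId < 0).
struct GatherElt {
  int SourceId; // index of the source vector, or -1 if not an extract
  int Lane;     // lane within that source vector
};

struct ShufflePlan {
  int FirstSource;       // first shuffle input, or -1 if there are no elements
  int SecondSource;      // second shuffle input, or -1 if only one is needed
  std::vector<int> Mask; // shufflevector-style mask over both inputs
};

// Returns a plan if every element comes from at most two source vectors of
// width VF. Returns std::nullopt otherwise, meaning the caller would fall
// back to per-element extract/insert pairs.
std::optional<ShufflePlan> tryBuildShuffle(const std::vector<GatherElt> &Elts,
                                           int VF) {
  assert(VF > 0 && "vector width must be positive");
  ShufflePlan Plan{-1, -1, {}};
  for (const GatherElt &E : Elts) {
    // Not an extract, or an out-of-range lane: no single shuffle covers it.
    if (E.SourceId < 0 || E.Lane < 0 || E.Lane >= VF)
      return std::nullopt;
    if (Plan.FirstSource == -1 || Plan.FirstSource == E.SourceId) {
      Plan.FirstSource = E.SourceId;
      Plan.Mask.push_back(E.Lane); // indices 0..VF-1 select the first input
    } else if (Plan.SecondSource == -1 || Plan.SecondSource == E.SourceId) {
      Plan.SecondSource = E.SourceId;
      Plan.Mask.push_back(VF + E.Lane); // indices VF..2*VF-1: second input
    } else {
      return std::nullopt; // more than two distinct sources
    }
  }
  return Plan;
}
```

In this model, four scalars taken from two 4-wide vectors become one two-input shuffle instead of four extract/insert pairs. Three distinct sources make the function give up.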
Repository: rG LLVM Github Monorepo
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
304 | Is it worth merging the isa<> and cast<> into a dyn_cast<>? | |
597 | return None instead to make it obvious it failed? Maybe do this as an early out instead of the much bigger if (Res.hasValue()) indented block? | |
6844 | What targets are we still missing support for? |
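The two style suggestions above (one checked cast instead of a type test plus a cast, and an early return of an empty optional instead of a large nested block) can be sketched in standalone C++. This is only an analogy. `dynamic_cast` stands in for LLVM's `dyn_cast<>`, `std::optional`/`std::nullopt` stand in for the optional type and `None` used in the patch, and `Value`, `ExtractElt`, and both functions are hypothetical names, not code from the patch.

```cpp
#include <optional>

// Hypothetical stand-ins for an IR class hierarchy.
struct Value {
  virtual ~Value() = default;
};
struct ExtractElt : Value {
  int Lane = 0;
};

// Before: a type test followed by a separate cast, with the real work nested
// inside the success branch.
std::optional<int> getLaneNested(const Value *V) {
  std::optional<int> Res;
  if (dynamic_cast<const ExtractElt *>(V) != nullptr) {
    const auto *EE = static_cast<const ExtractElt *>(V);
    Res = EE->Lane;
  }
  return Res;
}

// After: a single checked cast and an early out that makes failure explicit.
std::optional<int> getLaneEarlyOut(const Value *V) {
  const auto *EE = dynamic_cast<const ExtractElt *>(V);
  if (!EE)
    return std::nullopt;
  return EE->Lane;
}
```

Both functions return the same results. The second keeps the failure path at the top and the main logic unindented.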
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6844 | AArch64: in many cases it falls back to the default cost (a bunch of extracts plus a bunch of inserts). |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9808 | Please use PoisonValue whenever possible. It seems this is just a placeholder, so it can be switched. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9808 | Sure, thanks! |
Large update.
Includes:
- Unifies all shuffle builders and shuffle emission operands.
- Generalizes emission and cost model estimation of the buildvectors/gathers.
Will be split into several smaller patches eventually.
This is causing a performance regression.
@ABataev could you please take a look? Here is a reduced reproducer. It is getting vectorized without this patch, but is not getting vectorized with it.
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

%"classA" = type { %"vector", %"vector", %"complex" }
%"vector" = type { ptr, ptr, %"pair" }
%"pair" = type { %"pair_elem" }
%"pair_elem" = type { ptr }
%"complex" = type { double, double }

define void @foo() #0 {
  %1 = getelementptr %"classA", ptr null, i64 0, i32 2
  %2 = getelementptr %"classA", ptr null, i64 0, i32 2, i32 1
  br i1 false, label %10, label %3

3:                                                ; preds = %10, %0
  %4 = phi double [ 0.000000e+00, %0 ], [ %25, %10 ]
  %5 = phi double [ 0.000000e+00, %0 ], [ %24, %10 ]
  %6 = fmul double %5, %5
  %7 = fmul double %4, %4
  %8 = fadd double %7, %6
  %9 = fcmp ult double %8, 0.000000e+00
  ret void

10:                                               ; preds = %10, %0
  %11 = phi double [ %24, %10 ], [ 0.000000e+00, %0 ]
  %12 = phi double [ %25, %10 ], [ 0.000000e+00, %0 ]
  %13 = load double, ptr null, align 8
  %14 = load double, ptr null, align 8
  %15 = load double, ptr null, align 8
  %16 = getelementptr %"complex", ptr null, i64 0, i32 1
  %17 = load double, ptr %16, align 8
  %18 = fmul double %13, %15
  %19 = fmul double %14, %17
  %20 = fadd double %18, %19
  %21 = fmul double %14, %15
  %22 = fmul double %13, %17
  %23 = fsub double %21, %22
  %24 = fadd double %11, %20
  store double %11, ptr %1, align 8
  %25 = fadd double %12, %23
  store double %12, ptr %2, align 8
  br i1 false, label %3, label %10

; uselistorder directives
  uselistorder double %24, { 1, 0 }
  uselistorder double %25, { 1, 0 }
}

attributes #0 = { "target-features"="+aes,+cmov,+crc32,+cx16,+cx8,+fxsr,+mmx,+pclmul,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }
Thanks!