Currently we emit gathers for scalars being vectorized in the tree as
a pair of extractelement/insertelement instructions. Instead we can try
to find all required vectors and emit shuffle vector instructions
directly, improving the code and reducing compile time.
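To make the idea concrete, here is a minimal standalone sketch of how a gather whose scalars are all lanes extracted from existing vectors could be turned into a single shuffle mask. It is a simplified model, not the code in SLPVectorizer.cpp: `GatherElt`, `ShufflePlan`, and `planShuffle` are hypothetical names, source vectors are identified by plain integer ids, and all sources are assumed to have the same number of lanes `VF`. The mask follows the `shufflevector` convention, where indices in `[0, VF)` select from the first operand and indices in `[VF, 2*VF)` select from the second.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical model of one gathered scalar: a lane taken from a source
// vector, or a plain scalar with no vector source (SourceVec == -1).
struct GatherElt {
  int SourceVec; // id of the source vector, or -1 if not an extract
  int Lane;      // lane index within the source vector
};

// Result of the analysis: up to two source vectors plus a shuffle mask.
struct ShufflePlan {
  int First = -1;        // id of the first shuffle operand, -1 if unused
  int Second = -1;       // id of the second shuffle operand, -1 if unused
  std::vector<int> Mask; // one mask entry per gathered scalar
};

// Try to express the whole gather as one shuffle of at most two source
// vectors with VF lanes each. Returns std::nullopt when that is impossible,
// in which case a caller would fall back to extract/insert pairs.
std::optional<ShufflePlan> planShuffle(const std::vector<GatherElt> &Elts,
                                       int VF) {
  assert(VF > 0 && "vector factor must be positive");
  ShufflePlan Plan;
  for (const GatherElt &E : Elts) {
    // A scalar that is not a valid lane of some vector cannot be shuffled.
    if (E.SourceVec < 0 || E.Lane < 0 || E.Lane >= VF)
      return std::nullopt;
    if (Plan.First == -1 || Plan.First == E.SourceVec) {
      Plan.First = E.SourceVec;
      Plan.Mask.push_back(E.Lane); // selects from the first operand
    } else if (Plan.Second == -1 || Plan.Second == E.SourceVec) {
      Plan.Second = E.SourceVec;
      Plan.Mask.push_back(VF + E.Lane); // selects from the second operand
    } else {
      return std::nullopt; // a single shuffle takes at most two sources
    }
  }
  return Plan;
}
```

For example, with `VF = 4`, gathering lane 1 of vector 0, lane 0 of vector 1, lane 3 of vector 0, and lane 2 of vector 1 yields operands 0 and 1 with the mask `<1, 4, 3, 6>`, so four extract/insert pairs collapse into one shuffle in this model.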
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 303 | Is it worth merging the isa<> and cast<> into a dyn_cast<>? | |
| 596 | return None instead to make it obvious it failed? Maybe do this as an early out instead of the much bigger if (Res.hasValue()) indented block? | |
| 6558 | What targets are we still missing support for? | |
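The early-out shape suggested for line 596 can be sketched in isolation as follows. This is only an illustration of the pattern, not the patch's code: `std::optional` and `std::nullopt` stand in for LLVM's optional type and `None`, and `findFirstNegative` and `clearFirstNegative` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical lookup: index of the first negative entry, if there is one.
std::optional<int> findFirstNegative(const std::vector<int> &Values) {
  for (int I = 0, E = static_cast<int>(Values.size()); I != E; ++I)
    if (Values[I] < 0)
      return I;
  return std::nullopt; // explicit "not found", so the failure is obvious
}

// Early-out style: handle the failure first, then continue unindented
// instead of wrapping the main logic in a large if (Res.has_value()) block.
bool clearFirstNegative(std::vector<int> &Values) {
  std::optional<int> Res = findFirstNegative(Values);
  if (!Res)
    return false; // early out on failure
  assert(static_cast<std::size_t>(*Res) < Values.size() && "index in range");
  Values[*Res] = 0;
  return true;
}
```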
| llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 6558 | AArch64; in many cases it falls back to the default cost, i.e. a bunch of extracts plus a bunch of inserts. | |
| llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 10073 | Please use PoisonValue whenever possible. It seems this is just a placeholder, so it can be switched. | |
| llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 10073 | Sure, thanks! | |
Large update.
Includes:
- Unifies all shuffle builders and shuffle emission operands.
- Generalizes emission and cost model estimation of the buildvectors/gathers.
Will be split into several smaller patches eventually.
This is causing a performance regression.
@ABataev could you please take a look? Here is a reduced reproducer. It is getting vectorized without this patch, but is not getting vectorized with it.
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"
%"classA" = type { %"vector", %"vector", %"complex" }
%"vector" = type { ptr, ptr, %"pair" }
%"pair" = type { %"pair_elem" }
%"pair_elem" = type { ptr }
%"complex" = type { double, double }
define void @foo() #0 {
%1 = getelementptr %"classA", ptr null, i64 0, i32 2
%2 = getelementptr %"classA", ptr null, i64 0, i32 2, i32 1
br i1 false, label %10, label %3
3: ; preds = %10, %0
%4 = phi double [ 0.000000e+00, %0 ], [ %25, %10 ]
%5 = phi double [ 0.000000e+00, %0 ], [ %24, %10 ]
%6 = fmul double %5, %5
%7 = fmul double %4, %4
%8 = fadd double %7, %6
%9 = fcmp ult double %8, 0.000000e+00
ret void
10: ; preds = %10, %0
%11 = phi double [ %24, %10 ], [ 0.000000e+00, %0 ]
%12 = phi double [ %25, %10 ], [ 0.000000e+00, %0 ]
%13 = load double, ptr null, align 8
%14 = load double, ptr null, align 8
%15 = load double, ptr null, align 8
%16 = getelementptr %"complex", ptr null, i64 0, i32 1
%17 = load double, ptr %16, align 8
%18 = fmul double %13, %15
%19 = fmul double %14, %17
%20 = fadd double %18, %19
%21 = fmul double %14, %15
%22 = fmul double %13, %17
%23 = fsub double %21, %22
%24 = fadd double %11, %20
store double %11, ptr %1, align 8
%25 = fadd double %12, %23
store double %12, ptr %2, align 8
br i1 false, label %3, label %10
; uselistorder directives
uselistorder double %24, { 1, 0 }
uselistorder double %25, { 1, 0 }
}
attributes #0 = { "target-features"="+aes,+cmov,+crc32,+cx16,+cx8,+fxsr,+mmx,+pclmul,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }
Thanks!