This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve gathering of the scalars used in the graph.
ClosedPublic

Authored by ABataev on Oct 1 2021, 4:10 PM.

Details

Summary

Currently we emit gathers for scalars being vectorized in the tre as
a pair of extractelement/insertelement instructions. Instead we can try
to find all required vectors and emit shuffle vector instructions
directly, improving the code and reducing compile time.

Diff Detail

Event Timeline

ABataev created this revision.Oct 1 2021, 4:10 PM
ABataev requested review of this revision.Oct 1 2021, 4:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 1 2021, 4:10 PM
RKSimon retitled this revision from [SLP]Improve gathering of the scals used in the graph. to [SLP]Improve gathering of the scalars used in the graph..Oct 5 2021, 6:35 AM
ABataev updated this revision to Diff 386648.Nov 11 2021, 1:34 PM

Rebase + bug fixes

vporpo added a subscriber: vporpo.Nov 11 2021, 7:57 PM
RKSimon added inline comments.Nov 29 2021, 9:13 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
296

Is it worth merging the isa<> and cast<> into a dyn_cast<>?

494

return None instead to make it obvious it failed?

Maybe do this as an early out instead of the much bigger if (Res.hasValue()) indented block?

4572

What targets are we still missing support for?

ABataev added inline comments.Nov 29 2021, 9:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4572

AArch64, in many cases switches to the default cost bunch of extracts + bunch of inserts.

ABataev updated this revision to Diff 390398.Nov 29 2021, 10:06 AM

Rebase + address comments.

RKSimon added inline comments.Dec 14 2021, 8:04 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4536

Wshadow warning vs Idx @ Line 4688?

4569

Wshadow warning vs Idx @ Line 4688?

ABataev updated this revision to Diff 394269.Dec 14 2021, 8:33 AM

Address comments

Herald added a project: Restricted Project. · View Herald TranscriptAug 26 2022, 7:51 AM
nlopes added inline comments.Aug 26 2022, 7:54 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6572

Please use PoisonValue whenever possible. It seems this is just a placeholder, so it can be switched.
Thank you!

ABataev added inline comments.Aug 26 2022, 8:08 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6572

Sure, thanks!

ABataev updated this revision to Diff 455933.Aug 26 2022, 8:52 AM

Address comments

nhaehnle removed a subscriber: nhaehnle.Oct 19 2022, 2:00 AM
ABataev updated this revision to Diff 480583.Dec 6 2022, 12:45 PM

Large update.
Includes:

  1. Unifies all shuffle builders and shuffle demission operands.
  2. Generalizes emission and cost model estimation of the buildvectors/gathers.

Will be splitted into several smaller patches eventually.

ABataev updated this revision to Diff 482619.Dec 13 2022, 1:32 PM

Restore accidentally removed code.

khchen added a subscriber: khchen.Dec 22 2022, 8:35 AM
ABataev updated this revision to Diff 503510.Mar 8 2023, 1:56 PM

Restore deleted code/update test

RKSimon added inline comments.Mar 13 2023, 2:27 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4516

Any chance that we can use ShuffleVectorInst::isIdentityMask ?

4850

auto *

4852

auto *

ABataev added inline comments.Mar 13 2023, 2:42 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4516

Sure, will do it later

4852

Both these cases are the existing code, just the diff is not quite correct because of the big differences.

ABataev updated this revision to Diff 504861.Mar 13 2023, 3:10 PM

Restore accidentally removed lines, address comments

ABataev updated this revision to Diff 505467.Mar 15 2023, 6:16 AM

Restore some deleted code

ABataev updated this revision to Diff 519833.May 5 2023, 6:18 AM

Temp rebase, requires some extra work.

This revision is now accepted and ready to land.Thu, Nov 30, 7:34 AM
This revision was automatically updated to reflect the committed changes.
vporpo added a comment.Thu, Dec 7, 1:36 PM

This is causing a performance regression.

@ABataev could you please take a look? Here is a reduced reproducer. It is getting vectorized without this patch, but is not getting vectorized with it.

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

%"classA" = type { %"vector", %"vector", %"complex" }
%"vector" = type { ptr, ptr, %"pair" }
%"pair" = type { %"pair_elem" }
%"pair_elem" = type { ptr }
%"complex" = type { double, double }

define void @foo() #0 {
  %1 = getelementptr %"classA", ptr null, i64 0, i32 2
  %2 = getelementptr %"classA", ptr null, i64 0, i32 2, i32 1
  br i1 false, label %10, label %3

3:                                                ; preds = %10, %0                                                                                                                                                
  %4 = phi double [ 0.000000e+00, %0 ], [ %25, %10 ]
  %5 = phi double [ 0.000000e+00, %0 ], [ %24, %10 ]
  %6 = fmul double %5, %5
  %7 = fmul double %4, %4
  %8 = fadd double %7, %6
  %9 = fcmp ult double %8, 0.000000e+00
  ret void

10:                                               ; preds = %10, %0                                                                                                                                                
  %11 = phi double [ %24, %10 ], [ 0.000000e+00, %0 ]
  %12 = phi double [ %25, %10 ], [ 0.000000e+00, %0 ]
  %13 = load double, ptr null, align 8
  %14 = load double, ptr null, align 8
  %15 = load double, ptr null, align 8
  %16 = getelementptr %"complex", ptr null, i64 0, i32 1
  %17 = load double, ptr %16, align 8
  %18 = fmul double %13, %15
  %19 = fmul double %14, %17
  %20 = fadd double %18, %19
  %21 = fmul double %14, %15
  %22 = fmul double %13, %17
  %23 = fsub double %21, %22
  %24 = fadd double %11, %20
  store double %11, ptr %1, align 8
  %25 = fadd double %12, %23
  store double %12, ptr %2, align 8
  br i1 false, label %3, label %10

; uselistorder directives                                                                                                                                                                                          
  uselistorder double %24, { 1, 0 }
  uselistorder double %25, { 1, 0 }
}

attributes #0 = { "target-features"="+aes,+cmov,+crc32,+cx16,+cx8,+fxsr,+mmx,+pclmul,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }

Thanks!

bgraur added a subscriber: bgraur.Mon, Dec 11, 2:17 AM

This is causing a performance regression.

@ABataev could you please take a look? Here is a reduced reproducer. It is getting vectorized without this patch, but is not getting vectorized with it.

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

%"classA" = type { %"vector", %"vector", %"complex" }
%"vector" = type { ptr, ptr, %"pair" }
%"pair" = type { %"pair_elem" }
%"pair_elem" = type { ptr }
%"complex" = type { double, double }

define void @foo() #0 {
  %1 = getelementptr %"classA", ptr null, i64 0, i32 2
  %2 = getelementptr %"classA", ptr null, i64 0, i32 2, i32 1
  br i1 false, label %10, label %3

3:                                                ; preds = %10, %0                                                                                                                                                
  %4 = phi double [ 0.000000e+00, %0 ], [ %25, %10 ]
  %5 = phi double [ 0.000000e+00, %0 ], [ %24, %10 ]
  %6 = fmul double %5, %5
  %7 = fmul double %4, %4
  %8 = fadd double %7, %6
  %9 = fcmp ult double %8, 0.000000e+00
  ret void

10:                                               ; preds = %10, %0                                                                                                                                                
  %11 = phi double [ %24, %10 ], [ 0.000000e+00, %0 ]
  %12 = phi double [ %25, %10 ], [ 0.000000e+00, %0 ]
  %13 = load double, ptr null, align 8
  %14 = load double, ptr null, align 8
  %15 = load double, ptr null, align 8
  %16 = getelementptr %"complex", ptr null, i64 0, i32 1
  %17 = load double, ptr %16, align 8
  %18 = fmul double %13, %15
  %19 = fmul double %14, %17
  %20 = fadd double %18, %19
  %21 = fmul double %14, %15
  %22 = fmul double %13, %17
  %23 = fsub double %21, %22
  %24 = fadd double %11, %20
  store double %11, ptr %1, align 8
  %25 = fadd double %12, %23
  store double %12, ptr %2, align 8
  br i1 false, label %3, label %10

; uselistorder directives                                                                                                                                                                                          
  uselistorder double %24, { 1, 0 }
  uselistorder double %25, { 1, 0 }
}

attributes #0 = { "target-features"="+aes,+cmov,+crc32,+cx16,+cx8,+fxsr,+mmx,+pclmul,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }

Thanks!

Ping @ABataev ! This is blocking our internal release at Google!