This is an archive of the discontinued LLVM Phabricator instance.

Differential D106613

Bad SLPVectorization shufflevector replacement, resulting in write to wrong memory location
Closed, Public

Authored by vtjnash on Jul 22 2021, 4:48 PM.

Details

Reviewers

ABataev
spatel
vdmitrie
RKSimon

Commits

rGe27a6db5298f: Bad SLPVectorization shufflevector replacement, resulting in write to wrong…

Summary

I am not yet sure how to test this, since my example case (in the commit comment) stopped triggering after the cost model change in 0d74fd3fdf50. Reverting that commit makes the bad code transform I briefly described there happen again. Suggestions would be helpful!

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vtjnash created this revision. Jul 22 2021, 4:48 PM

Herald added a subscriber: hiraditya. Jul 22 2021, 4:48 PM

vtjnash requested review of this revision. Jul 22 2021, 4:48 PM

Herald added a project: Restricted Project. Jul 22 2021, 4:48 PM

Herald added a subscriber: llvm-commits.

vtjnash retitled this revision from "Bad SLPVectorization shufflevector replacement" to "Bad SLPVectorization shufflevector replacement, resulting in write to wrong memory location". Jul 22 2021, 4:56 PM

vtjnash edited the summary of this revision.

vtjnash added reviewers: ABataev, spatel, vdmitrie.

Harbormaster completed remote builds in B115733: Diff 361032. Jul 22 2021, 7:16 PM

I am not yet sure how to test it, since my example case (in the commit comment) was broken by the cost model change in 0d74fd3fdf50.

Can you actually link to it? I can't find it.

Revert that to see this bad code transform I described there in brief to happen. Suggestions would be helpful!

What does bad code transform mean?

vchuravy added a project: Restricted Project. Jul 23 2021, 2:53 AM

vchuravy added a subscriber: vchuravy.

RKSimon added a subscriber: RKSimon. Jul 23 2021, 5:02 AM

ABataev added inline comments. Jul 23 2021, 7:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6038–6042	This is definitely wrong, the third arg must be the lane id in the current entry node, not in the user node. But (not sure, looks like) it points to a problem with scatter vectorize pointers, which were vectorized previously. Will check it later.

vtjnash added a subscriber: Restricted Project. Jul 23 2021, 10:16 AM

vtjnash added inline comments. Jul 23 2021, 10:29 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6038–6042	I think that makes sense. I was hoping this would break a test, so I could see where this fix attempt was wrong, but apparently no tests currently depend on it. The example is confusing: the load was vectorized as a scatter-gather (GEP offsets 10 and 4), and the store was later vectorized using lane 1 of that combined GEP (intending to extract the pointer for GEP offsets 4 and 6). Because of the hard-coded 0 here, the store instead got back the GEP with offset 10 (and 12) when it tried to recover PO from the scatter-gather for insertion into that VecPtr user.

ABataev added inline comments. Jul 23 2021, 10:36 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6038–6042	I think here we need an extra check for ScatterVectorize nodes. For such nodes, we need to extract all used pointers, not only the first one. Going to try to look at it later.

RKSimon added a reviewer: RKSimon. Jul 31 2021, 7:01 AM

vtjnash added inline comments. Aug 19 2021, 11:30 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6038–6042	Bump? If this was fixed, could you let me know the commit #, so I could test with it. Thanks!

ABataev added inline comments. Aug 19 2021, 11:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6038–6042	Not yet, busy with some other patches, hope to investigate it soon.

ABataev added inline comments. Aug 27 2021, 12:07 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6038–6042	Hi, could you send your reproducer? Even if it is not triggered anymore?

The full reproducer is in the commit message on phabricator for this review, though I don't know how to link to the text of a commit specifically. Here's the text of that:

With opt -mcpu=haswell -slp-vectorizer, we see that it might currently produce:

%10 = getelementptr {}**, <2 x {}***> %9, <2 x i32> <i32 10, i32 4>
%11 = bitcast <2 x {}***> %10 to <2 x i64*>
...
%27 = extractelement <2 x i64*> %11, i32 0
%28 = bitcast i64* %27 to <2 x i64>*
store <2 x i64> %22, <2 x i64>* %28, align 4, !tbaa !2

Which is an out-of-bounds store (the extractelement got offset 10
instead of offset 4 as intended). With the fix, we correctly generate
extractelement for i32 1 and generate correct code.
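
The out-of-bounds claim can be checked with a little address arithmetic. The sketch below is only an illustration, not LLVM code: it assumes 4-byte pointers (per the `p:32:32` data layout and i686 triple in the reproducer that follows) and treats the four scalar i64 accesses at GEP indices 4, 6, 8 and 10 as the region the function is meant to touch. The names `gepByteOffset`, `ByteRange`, `vecStoreRange`, `ScalarRange` and `fitsIn` are hypothetical helpers made up for this example.

```cpp
#include <cstdint>

// Assumption: pointers are 4 bytes wide, so `getelementptr {}**, {}*** %p, i32 N`
// advances the address by N * 4 bytes.
constexpr std::uint32_t PtrSize = 4;
constexpr std::uint32_t I64Size = 8;

// Hypothetical helper: byte offset addressed by a GEP with index Idx.
constexpr std::uint32_t gepByteOffset(std::uint32_t Idx) { return Idx * PtrSize; }

// Half-open byte range [Begin, End).
struct ByteRange {
  std::uint32_t Begin;
  std::uint32_t End;
};

// Bytes written by a <2 x i64> store whose base pointer is the GEP with index Idx.
constexpr ByteRange vecStoreRange(std::uint32_t Idx) {
  return {gepByteOffset(Idx), gepByteOffset(Idx) + 2 * I64Size};
}

// Bytes touched by the scalar code: i64 slots at GEP indices 4, 6, 8 and 10.
constexpr ByteRange ScalarRange = {gepByteOffset(4), gepByteOffset(10) + I64Size};

constexpr bool fitsIn(ByteRange Inner, ByteRange Outer) {
  return Inner.Begin >= Outer.Begin && Inner.End <= Outer.End;
}

// Lane 1 of the vectorized GEP <10, 4> holds index 4: the store stays in range.
static_assert(fitsIn(vecStoreRange(4), ScalarRange), "intended store is in range");
// Lane 0 holds index 10: the store runs 8 bytes past the last scalar slot.
static_assert(!fitsIn(vecStoreRange(10), ScalarRange), "lane-0 store overruns");
```

Under these assumptions the intended store covers bytes [16, 32), while the store built from lane 0 covers [40, 56), which extends past the last scalar slot ending at byte 48.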

; ModuleID = 'rand3.ll'
source_filename = "rand"
target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128-ni:10:11:12:13"
target triple = "i686-unknown-linux-gnu"

@llvm.compiler.used = appending global [3 x i8*] [i8* bitcast (void ({} addrspace(10)*)* @jl_gc_queue_root to i8*), i8* bitcast ({} addrspace(10)* (i8*, i32, i32)* @jl_gc_pool_alloc to i8*), i8* bitcast ({} addrspace(10)* (i8*, i32)* @jl_gc_big_alloc to i8*)], section "llvm.metadata"

; Function Attrs: sspstrong
define void @julia_rand_5(i64* noalias nocapture sret(i64) %0) #0 {
top:
%1 = call {}*** @julia.get_pgcstack()
%2 = getelementptr {}**, {}*** %1, i32 4
%3 = bitcast {}*** %2 to i64*
%4 = load i64, i64* %3, align 4, !tbaa !2
%5 = getelementptr {}**, {}*** %1, i32 6
%6 = bitcast {}*** %5 to i64*
%7 = load i64, i64* %6, align 4, !tbaa !2
%8 = getelementptr {}**, {}*** %1, i32 8
%9 = bitcast {}*** %8 to i64*
%10 = load i64, i64* %9, align 4, !tbaa !2
%11 = getelementptr {}**, {}*** %1, i32 10
%12 = bitcast {}*** %11 to i64*
%13 = load i64, i64* %12, align 4, !tbaa !2
%14 = add i64 %13, %4
%15 = call i64 @llvm.fshl.i64(i64 %14, i64 %14, i64 23)
%16 = shl i64 %7, 17
%17 = xor i64 %10, %4
%18 = xor i64 %13, %7
%19 = xor i64 %17, %7
%20 = xor i64 %18, %4
%21 = xor i64 %17, %16
%22 = call i64 @llvm.fshl.i64(i64 %18, i64 %18, i64 45)
store i64 %20, i64* %3, align 4, !tbaa !2
store i64 %19, i64* %6, align 4, !tbaa !2
store i64 %21, i64* %9, align 4, !tbaa !2
store i64 %22, i64* %12, align 4, !tbaa !2
store i64 %15, i64* %0, align 4
ret void
}

define nonnull {} addrspace(10)* @jfptr_rand_6({} addrspace(10)* %0, {} addrspace(10)** %1, i32 %2) #1 {
top:
%3 = call {}*** @julia.get_pgcstack()
%4 = alloca i64, align 8
call void @julia_rand_5(i64* noalias nocapture nonnull sret(i64) %4) #5
%5 = load i64, i64* %4, align 8, !tbaa !7
%6 = call nonnull {} addrspace(10)* @jl_box_uint64(i64 zeroext %5)
ret {} addrspace(10)* %6
}

declare {}*** @julia.get_pgcstack()

declare nonnull {} addrspace(10)* @jl_box_uint64(i64 zeroext)

; Function Attrs: inaccessiblemem_or_argmemonly
declare void @jl_gc_queue_root({} addrspace(10)*) #2

; Function Attrs: allocsize(1)
declare noalias nonnull {} addrspace(10)* @jl_gc_pool_alloc(i8*, i32, i32) #3

; Function Attrs: allocsize(1)
declare noalias nonnull {} addrspace(10)* @jl_gc_big_alloc(i8*, i32) #3

; Function Attrs: nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.fshl.i64(i64, i64, i64) #4

attributes #0 = { sspstrong "probe-stack"="inline-asm" }
attributes #1 = { "probe-stack"="inline-asm" "thunk" }
attributes #2 = { inaccessiblemem_or_argmemonly }
attributes #3 = { allocsize(1) }
attributes #4 = { nofree nosync nounwind readnone speculatable willreturn }
attributes #5 = { "probe-stack"="inline-asm" }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{!3, !3, i64 0}
!3 = !{!"jtbaa_value", !4, i64 0}
!4 = !{!"jtbaa_data", !5, i64 0}
!5 = !{!"jtbaa", !6, i64 0}
!6 = !{!"jtbaa"}
!7 = !{!8, !8, i64 0}
!8 = !{!"jtbaa_stack", !5, i64 0}

bump

In D106613#3000343, @vtjnash wrote:

bump

Hi, just returned back from vacation, will investigate it ASAP.

Unable to reproduce. This is what I get with trunk currently:

opt -S -mcpu=haswell -slp-vectorizer repro.ll -o -
; ModuleID = 'repro.ll'
source_filename = "rand"
target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128-ni:10:11:12:13"
target triple = "i686-unknown-linux-gnu"

@llvm.compiler.used = appending global [3 x i8*] [i8* bitcast (void ({} addrspace(10)*)* @jl_gc_queue_root to i8*), i8* bitcast ({} addrspace(10)* (i8*, i32, i32)* @jl_gc_pool_alloc to i8*), i8* bitcast ({} addrspace(10)* (i8*, i32)* @jl_gc_big_alloc to i8*)], section "llvm.metadata"

; Function Attrs: sspstrong
define void @julia_rand_5(i64* noalias nocapture sret(i64) %0) #0 {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = getelementptr {}**, {}*** %1, i32 4
  %3 = bitcast {}*** %2 to i64*
  %4 = load i64, i64* %3, align 4, !tbaa !2
  %5 = getelementptr {}**, {}*** %1, i32 6
  %6 = bitcast {}*** %5 to i64*
  %7 = getelementptr {}**, {}*** %1, i32 8
  %8 = bitcast {}*** %7 to i64*
  %9 = bitcast i64* %6 to <2 x i64>*
  %10 = load <2 x i64>, <2 x i64>* %9, align 4, !tbaa !2
  %11 = getelementptr {}**, {}*** %1, i32 10
  %12 = bitcast {}*** %11 to i64*
  %13 = load i64, i64* %12, align 4, !tbaa !2
  %14 = add i64 %13, %4
  %15 = call i64 @llvm.fshl.i64(i64 %14, i64 %14, i64 23)
  %16 = extractelement <2 x i64> %10, i32 0
  %17 = shl i64 %16, 17
  %18 = insertelement <2 x i64> poison, i64 %13, i32 0
  %19 = insertelement <2 x i64> %18, i64 %4, i32 1
  %20 = xor <2 x i64> %19, %10
  %21 = insertelement <2 x i64> poison, i64 %4, i32 0
  %22 = insertelement <2 x i64> %21, i64 %16, i32 1
  %23 = xor <2 x i64> %20, %22
  %24 = extractelement <2 x i64> %20, i32 1
  %25 = xor i64 %24, %17
  %26 = extractelement <2 x i64> %20, i32 0
  %27 = call i64 @llvm.fshl.i64(i64 %26, i64 %26, i64 45)
  %28 = bitcast i64* %3 to <2 x i64>*
  store <2 x i64> %23, <2 x i64>* %28, align 4, !tbaa !2
  store i64 %25, i64* %8, align 4, !tbaa !2
  store i64 %27, i64* %12, align 4, !tbaa !2
  store i64 %15, i64* %0, align 4
  ret void
}

define nonnull {} addrspace(10)* @jfptr_rand_6({} addrspace(10)* %0, {} addrspace(10)** %1, i32 %2) #1 {
top:
  %3 = call {}*** @julia.get_pgcstack()
  %4 = alloca i64, align 8
  call void @julia_rand_5(i64* noalias nocapture nonnull sret(i64) %4) #6
  %5 = load i64, i64* %4, align 8, !tbaa !7
  %6 = call nonnull {} addrspace(10)* @jl_box_uint64(i64 zeroext %5)
  ret {} addrspace(10)* %6
}

declare {}*** @julia.get_pgcstack() #2

declare nonnull {} addrspace(10)* @jl_box_uint64(i64 zeroext) #2

; Function Attrs: inaccessiblemem_or_argmemonly
declare void @jl_gc_queue_root({} addrspace(10)*) #3

; Function Attrs: allocsize(1)
declare noalias nonnull {} addrspace(10)* @jl_gc_pool_alloc(i8*, i32, i32) #4

; Function Attrs: allocsize(1)
declare noalias nonnull {} addrspace(10)* @jl_gc_big_alloc(i8*, i32) #4

; Function Attrs: nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.fshl.i64(i64, i64, i64) #5

attributes #0 = { sspstrong "probe-stack"="inline-asm" "target-cpu"="haswell" }
attributes #1 = { "probe-stack"="inline-asm" "target-cpu"="haswell" "thunk" }
attributes #2 = { "target-cpu"="haswell" }
attributes #3 = { inaccessiblemem_or_argmemonly "target-cpu"="haswell" }
attributes #4 = { allocsize(1) "target-cpu"="haswell" }
attributes #5 = { nofree nosync nounwind readnone speculatable willreturn "target-cpu"="haswell" }
attributes #6 = { "probe-stack"="inline-asm" }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{!3, !3, i64 0}
!3 = !{!"jtbaa_value", !4, i64 0}
!4 = !{!"jtbaa_data", !5, i64 0}
!5 = !{!"jtbaa", !6, i64 0}
!6 = !{!"jtbaa"}
!7 = !{!8, !8, i64 0}
!8 = !{!"jtbaa_stack", !5, i64 0}

The output looks to be missing a llvm.masked.gather.v2i64.v2p0i64 call. Did you remember to revert 0d74fd3fdf50? I can push a branch to github with this current commit, if that would be helpful.

In D106613#3013357, @vtjnash wrote:

The output looks to be missing a llvm.masked.gather.v2i64.v2p0i64 call. Did you remember to revert 0d74fd3fdf50? I can push a branch to github with this current commit, if that would be helpful.

Ok, will try the test case with the reverted 0d74fd3fdf50, going to look at it today and tomorrow.

Thanks!

In D106613#3013383, @vtjnash wrote:

Thanks!

Ok, looks like I was able to reproduce it, investigating.

Ok, investigated the problem, looks like my initial analysis was wrong. We really need to use findLaneForValue here. I added a test case that reveals the bug. You need to format your changes and update the test checks.
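
As a rough illustration of why the lane has to be looked up rather than hard-coded, here is a toy model of the bookkeeping. It is a sketch under simplifying assumptions, not the SLPVectorizer implementation: `ToyEntry`, `ToyExternalUse`, `recordUseOld` and `recordUseNew` are invented names, values are represented as strings, and the lookup ignores any reordering or reuse shuffles that the real `findLaneForValue` may have to account for.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a vectorized tree entry: the scalars bundled into one
// vector value, listed in lane order. This is NOT the real TreeEntry class.
struct ToyEntry {
  std::vector<std::string> Scalars;

  // Simplified lane lookup: the position of V among the bundled scalars.
  // Assumption: no reordering or reuse shuffles are applied to the entry.
  unsigned findLaneForValue(const std::string &V) const {
    auto It = std::find(Scalars.begin(), Scalars.end(), V);
    assert(It != Scalars.end() && "value is not part of this entry");
    return static_cast<unsigned>(It - Scalars.begin());
  }
};

// Toy stand-in for an ExternalUses record: which scalar must be extracted
// and from which lane of the vectorized value.
struct ToyExternalUse {
  std::string Scalar;
  unsigned Lane;
};

// Behaviour before the patch: the lane is always 0.
inline ToyExternalUse recordUseOld(const ToyEntry &, const std::string &Ptr) {
  return {Ptr, 0};
}

// Behaviour after the patch: the lane is looked up in the owning entry.
inline ToyExternalUse recordUseNew(const ToyEntry &E, const std::string &Ptr) {
  return {Ptr, E.findLaneForValue(Ptr)};
}
```

With an entry whose scalars are `{"gep10", "gep4"}`, mirroring the `<i32 10, i32 4>` GEP from the reproducer, `recordUseOld(E, "gep4")` records lane 0, which actually holds `gep10`, while `recordUseNew(E, "gep4")` records lane 1.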

@vtjnash https://bugs.llvm.org/show_bug.cgi?id=51957 might be due to the same issue

clang-format

In D106613#3022557, @vtjnash wrote:

clang-format

Need to update the tests too

I don't have a test, but it sounded like you did?

In D106613#3022574, @vtjnash wrote:

I don't have a test, but it sounded like you did?

It is already committed and should fail with your patch; you need to update the checks in this test (llvm/test/Transforms/SLPVectorizer/X86/extract_in_tree_user.ll).

Harbormaster completed remote builds in B125716: Diff 375054. Sep 25 2021, 11:34 AM

fix test

This revision is now accepted and ready to land. Sep 25 2021, 11:50 AM

Harbormaster completed remote builds in B125720: Diff 375058. Sep 25 2021, 12:02 PM

This revision was landed with ongoing or failed builds. Sep 27 2021, 11:09 AM

Closed by commit rGe27a6db5298f: Bad SLPVectorization shufflevector replacement, resulting in write to wrong… (authored by vtjnash).

This revision was automatically updated to reflect the committed changes.

vtjnash added a commit: rGe27a6db5298f: Bad SLPVectorization shufflevector replacement, resulting in write to wrong….

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	25 lines
llvm/test/Transforms/SLPVectorizer/X86/extract_in_tree_user.ll	3 lines

Diff 375346

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

...	case Instruction::Load: {
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (E->State == TreeEntry::Vectorize) {		if (E->State == TreeEntry::Vectorize) {

Value *VecPtr = Builder.CreateBitCast(PO, VecTy->getPointerTo(AS));		Value *VecPtr = Builder.CreateBitCast(PO, VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast		// The pointer operand uses an in-tree scalar so we add the new BitCast
// to ExternalUses list to make sure that an extract will be generated		// to ExternalUses list to make sure that an extract will be generated
// in the future.		// in the future.
if (getTreeEntry(PO))		if (TreeEntry *Entry = getTreeEntry(PO)) {
ExternalUses.emplace_back(PO, cast<User>(VecPtr), 0);		// Find which lane we need to extract.
		unsigned FoundLane = Entry->findLaneForValue(PO);
		ExternalUses.emplace_back(PO, cast<User>(VecPtr), FoundLane);
		}

NewLI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());		NewLI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());
} else {		} else {
assert(E->State == TreeEntry::ScatterVectorize && "Unhandled state");		assert(E->State == TreeEntry::ScatterVectorize && "Unhandled state");
Value *VecPtr = vectorizeTree(E->getOperand(0));		Value *VecPtr = vectorizeTree(E->getOperand(0));
// Use the minimum alignment of the gathered loads.		// Use the minimum alignment of the gathered loads.
Align CommonAlignment = LI->getAlign();		Align CommonAlignment = LI->getAlign();
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
...	case Instruction::Store: {
Value *VecPtr = Builder.CreateBitCast(		Value *VecPtr = Builder.CreateBitCast(
ScalarPtr, VecValue->getType()->getPointerTo(AS));		ScalarPtr, VecValue->getType()->getPointerTo(AS));
StoreInst *ST = Builder.CreateAlignedStore(VecValue, VecPtr,		StoreInst *ST = Builder.CreateAlignedStore(VecValue, VecPtr,
SI->getAlign());		SI->getAlign());

// The pointer operand uses an in-tree scalar, so add the new BitCast to		// The pointer operand uses an in-tree scalar, so add the new BitCast to
// ExternalUses to make sure that an extract will be generated in the		// ExternalUses to make sure that an extract will be generated in the
// future.		// future.
if (getTreeEntry(ScalarPtr))		if (TreeEntry *Entry = getTreeEntry(ScalarPtr)) {
ExternalUses.push_back(ExternalUser(ScalarPtr, cast<User>(VecPtr), 0));		// Find which lane we need to extract.
		unsigned FoundLane = Entry->findLaneForValue(ScalarPtr);
		ExternalUses.push_back(
		ExternalUser(ScalarPtr, cast<User>(VecPtr), FoundLane));
		}

Value *V = propagateMetadata(ST, E->Scalars);		Value *V = propagateMetadata(ST, E->Scalars);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
...	case Instruction::Call: {

SmallVector<OperandBundleDef, 1> OpBundles;		SmallVector<OperandBundleDef, 1> OpBundles;
CI->getOperandBundlesAsDefs(OpBundles);		CI->getOperandBundlesAsDefs(OpBundles);
Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);		Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);

// The scalar argument uses an in-tree scalar so we add the new vectorized		// The scalar argument uses an in-tree scalar so we add the new vectorized
// call to ExternalUses list to make sure that an extract will be		// call to ExternalUses list to make sure that an extract will be
// generated in the future.		// generated in the future.
if (ScalarArg && getTreeEntry(ScalarArg))		if (ScalarArg) {
ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));		if (TreeEntry *Entry = getTreeEntry(ScalarArg)) {
		// Find which lane we need to extract.
		unsigned FoundLane = Entry->findLaneForValue(ScalarArg);
		ExternalUses.push_back(
		ExternalUser(ScalarArg, cast<User>(V), FoundLane));
		}
		}

propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
ShuffleBuilder.addInversedMask(E->ReorderIndices);		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

llvm/test/Transforms/SLPVectorizer/X86/extract_in_tree_user.ll

	; CHECK-LABEL: @externally_used_ptrs(			; CHECK-LABEL: @externally_used_ptrs(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i64*, i64** @a, align 8			; CHECK-NEXT: [[TMP0:%.*]] = load i64*, i64** @a, align 8
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i64*> poison, i64* [[TMP0]], i32 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i64*> poison, i64* [[TMP0]], i32 0
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i64*> [[TMP1]], i64* [[TMP0]], i32 1			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i64*> [[TMP1]], i64* [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i64, <2 x i64*> [[TMP2]], <2 x i64> <i64 56, i64 11>			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i64, <2 x i64*> [[TMP2]], <2 x i64> <i64 56, i64 11>
	; CHECK-NEXT: [[TMP4:%.*]] = ptrtoint <2 x i64*> [[TMP3]] to <2 x i64>			; CHECK-NEXT: [[TMP4:%.*]] = ptrtoint <2 x i64*> [[TMP3]] to <2 x i64>
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i64, i64* [[TMP0]], i64 12			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i64, i64* [[TMP0]], i64 12
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i64*> [[TMP3]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i64*> [[TMP3]], i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64* [[TMP5]] to <2 x i64>*			; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64* [[TMP5]] to <2 x i64>*
	; CHECK-NEXT: [[TMP7:%.*]] = load <2 x i64>, <2 x i64>* [[TMP6]], align 8			; CHECK-NEXT: [[TMP7:%.*]] = load <2 x i64>, <2 x i64>* [[TMP6]], align 8
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i64*> [[TMP3]], i32 1
	; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP4]], [[TMP7]]			; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP10:%.*]] = bitcast i64* [[TMP5]] to <2 x i64>*			; CHECK-NEXT: [[TMP10:%.*]] = bitcast i64* [[TMP5]] to <2 x i64>*
	; CHECK-NEXT: store <2 x i64> [[TMP9]], <2 x i64>* [[TMP10]], align 8			; CHECK-NEXT: store <2 x i64> [[TMP9]], <2 x i64>* [[TMP10]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = load i64*, i64** @a, align 8			%0 = load i64*, i64** @a, align 8
	%add.ptr = getelementptr inbounds i64, i64* %0, i64 11			%add.ptr = getelementptr inbounds i64, i64* %0, i64 11