This is an archive of the discontinued LLVM Phabricator instance.

[SLP] respect target register width for GEP vectorization (PR43578)
Closed, Public

Authored by spatel on Oct 8 2019, 2:58 PM.

Details

Summary

We failed to account for the target register width (max vector factor) when vectorizing starting from GEPs. This causes vectorization to proceed to obviously illegal widths as in:
https://bugs.llvm.org/show_bug.cgi?id=43578

For x86, this also means that SLP can produce rogue AVX or AVX512 code even when the user specifies a narrower vector width.

The AArch64 test in ext-trunc.ll appears to be better off using the narrower width. I'm not exactly sure what getelementptr.ll is trying to do, but it's testing with "-slp-threshold=-18", so I'm not worried about those diffs. The x86 test is an over-reduction from SPEC h264; this patch appears to undo the perf loss caused by SLP when using -march=haswell.
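
As a rough standalone sketch of the clamping idea (MaxVecRegSize and getVectorElementSize are names that come up later in this review; the code below is a simplified model of the arithmetic, not the actual SLPVectorizer change):

```cpp
#include <algorithm>
#include <cstdio>

// Simplified model: how many lanes are worth forming for a bundle of scalars,
// given the target's widest vector register and the scalar element size.
// In the real pass these values would come from the target info and
// getVectorElementSize(); here they are plain integers for illustration.
unsigned maxVectorFactor(unsigned MaxVecRegSizeInBits, unsigned ElemSizeInBits) {
  // Never report fewer than 2 lanes; a 1-wide "vector" is just a scalar.
  return std::max(2u, MaxVecRegSizeInBits / ElemSizeInBits);
}

int main() {
  // A target limited to 128-bit vectors (SSE, or AVX-512 hardware built with
  // -mprefer-vector-width=128) should only get 2 lanes of 64-bit GEP indices,
  // not the wider bundles SLP could form before this fix.
  std::printf("128-bit regs, 64-bit elements: VF = %u\n", maxVectorFactor(128, 64));
  std::printf("256-bit regs, 64-bit elements: VF = %u\n", maxVectorFactor(256, 64));
  std::printf("128-bit regs, 32-bit elements: VF = %u\n", maxVectorFactor(128, 32));
  return 0;
}
```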

Diff Detail

Event Timeline

spatel created this revision. Oct 8 2019, 2:58 PM
Herald added a project: Restricted Project. · View Herald Transcript · Oct 8 2019, 2:58 PM

I can confirm that the h264 benchmark is now at least as good as plain -O3.

Generally, I think there are more bugs for -march=haswell. Only in rare cases is the performance of binaries built with -march=haswell better than plain -O3.
I tried this patch with zstd, but nothing improved.

Plain -O3
./zstd -b selesiafiles/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 182.0 MB/s , 923.2 MB/s

-O3 -march=haswell
./zstd -b selesiafiles/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 185.7 MB/s , 866.9 MB/s

-O3 -march=haswell -mprefer-vector-width=128
./zstd -b bench/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 188.5 MB/s , 806.8 MB/s

For example, gcc-10's results for -march=haswell:
./zstd -b bench/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 188.7 MB/s , 1032.8 MB/s

spatel added a comment. Edited Oct 9 2019, 4:18 AM

> Generally, I think there are more bugs for -march=haswell. Only in rare cases is the performance of binaries built with -march=haswell better than plain -O3.
> I tried this patch with zstd, but nothing improved.
>
> Plain -O3
> ./zstd -b selesiafiles/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 182.0 MB/s , 923.2 MB/s
>
> -O3 -march=haswell
> ./zstd -b selesiafiles/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 185.7 MB/s , 866.9 MB/s
>
> -O3 -march=haswell -mprefer-vector-width=128
> ./zstd -b bench/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 188.5 MB/s , 806.8 MB/s
>
> For example, gcc-10's results for -march=haswell:
> ./zstd -b bench/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 188.7 MB/s , 1032.8 MB/s

Thanks for testing! I suspect that this problem (ignoring the target-based register width) is more widespread than only the transform starting from GEP, but I want to make sure we have proper tests in place if we change the behavior in other places. Can you file another bug for "zstd"?

Yes, I will.

ABataev accepted this revision. Oct 9 2019, 7:58 AM

Looks good.

This revision is now accepted and ready to land. Oct 9 2019, 7:58 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn. Jun 22 2020, 9:40 AM

I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.

> I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.

Oh I now see what's going on. The actual compute is done on i64 x 4.

> I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.
>
> Oh I now see what's going on. The actual compute is done on i64 x 4.

But not in the getelementptr_4x32() test, right? Maybe we need to refine getVectorElementSize() in some way.

fhahn added a comment. Jun 23 2020, 3:47 PM

>> I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.
>>
>> Oh I now see what's going on. The actual compute is done on i64 x 4.
>
> But not in the getelementptr_4x32() test, right? Maybe we need to refine getVectorElementSize() in some way.

Initially I was looking at @test2, but the more interesting one is indeed getelementptr_4x32. I think the issue might be that we use the width of the pointer to limit the list size, rather than the width of the index computations. IIUC we only vectorize the index computations, so I think it would make sense to limit the width based on the GEP index width rather than the GEP itself. I put up D82418, which restores getelementptr_4x32 and also catches the important h264 pattern on AArch64, while not regressing the test case on X86.
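
To make the distinction concrete, here is a small model of the two choices (illustrative only; the real heuristic lives in getVectorElementSize() and the follow-up D82418, and the values below are assumptions for AArch64): sizing the bundle by the 64-bit pointer type allows only 2 lanes in a 128-bit NEON register, while sizing it by the 32-bit index type allows the 4 lanes that getelementptr_4x32 and the h264 pattern need.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative model of the width limit for a bundle of GEPs, depending on
// whether the element size is taken from the pointer type or from the index
// type that is actually vectorized.
unsigned lanesFor(unsigned VecRegBits, unsigned ElemBits) {
  return std::max(2u, VecRegBits / ElemBits);
}

int main() {
  const unsigned NeonBits = 128; // AArch64 NEON register width
  const unsigned PtrBits = 64;   // the GEP itself produces a 64-bit pointer
  const unsigned IdxBits = 32;   // but only the i32 index math gets vectorized

  // Limiting by the pointer width caps the bundle at 2 lanes.
  std::printf("limit by pointer width: %u lanes\n", lanesFor(NeonBits, PtrBits));
  // Limiting by the index width allows the 4-wide index computation, while the
  // register width still bounds it.
  std::printf("limit by index width:   %u lanes\n", lanesFor(NeonBits, IdxBits));
  return 0;
}
```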

fhahn added a comment. Jun 23 2020, 3:49 PM

(There are also a bunch of slightly odd cost-model decisions I ran into; e.g., all extracts on AArch64 for indices > 0 have a cost of 3 for some reason. I'll prepare patches for some of the issues I encountered when I have a bit more time.)