This is an archive of the discontinued LLVM Phabricator instance.

Differential D122148

[SLP] Peek into loads when hitting the RecursionMaxDepth
Closed · Public

Authored by dmgreen on Mar 21 2022, 8:28 AM.

Details

Reviewers

ABataev
RKSimon
vporpo
dtemirbulatov

Commits

rG2de05afc192d: [SLP] Peek into loads when hitting the RecursionMaxDepth

Summary

I have an issue where an SLP tree hits the RecursionMaxDepth limit in the SLP vectorizer and cannot be analyzed in full because of it. This patch effectively extends the limit slightly, but only when it hits a load (or a zext/sext of a load). This allows it to peek through in the places where it will be the most valuable, without ballooning out the O(..) by any 2^n factors.
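
To make the idea concrete, here is a minimal, self-contained sketch of a depth-limited tree walk that still accepts loads (and a zext of a load) once the limit is reached. It is a toy model under stated assumptions, not the SLPVectorizer code: `Node`, `Kind`, `Stats`, `buildTree` and the `make*` helpers are hypothetical names, and sext is left out for brevity.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins for SLP tree concepts. None of these names are LLVM APIs.
enum class Kind { Load, ZExt, Add };

struct Node {
  Kind K = Kind::Add;
  std::vector<std::unique_ptr<Node>> Operands;
};

struct Stats {
  int InTree = 0;   // nodes kept as tree entries
  int Gathered = 0; // nodes cut off and turned into gathers
};

// True for a load, or for a zext whose only operand is a load.
bool isLoadOrExtOfLoad(const Node &N) {
  if (N.K == Kind::Load)
    return true;
  return N.K == Kind::ZExt && N.Operands.size() == 1 &&
         N.Operands[0]->K == Kind::Load;
}

// Depth-limited walk. With Peek set, loads and zext(load) are still added to
// the tree at or beyond MaxDepth; every other node is gathered there.
void buildTree(const Node &N, int Depth, int MaxDepth, bool Peek, Stats &S) {
  assert(MaxDepth >= 0 && "depth limit must be non-negative");
  if (Depth >= MaxDepth && !(Peek && isLoadOrExtOfLoad(N))) {
    ++S.Gathered;
    return;
  }
  ++S.InTree;
  for (const auto &Op : N.Operands)
    buildTree(*Op, Depth + 1, MaxDepth, Peek, S);
}

std::unique_ptr<Node> makeNode(Kind K) {
  auto N = std::make_unique<Node>();
  N->K = K;
  return N;
}

// Builds a zext(load) leaf.
std::unique_ptr<Node> makeExtLoad() {
  auto Ext = makeNode(Kind::ZExt);
  Ext->Operands.push_back(makeNode(Kind::Load));
  return Ext;
}

// Builds a chain of `Adds` add nodes in which every leaf is a zext(load).
std::unique_ptr<Node> makeChain(int Adds) {
  if (Adds == 0)
    return makeExtLoad();
  auto Add = makeNode(Kind::Add);
  Add->Operands.push_back(makeChain(Adds - 1));
  Add->Operands.push_back(makeExtLoad());
  return Add;
}
```

With a depth limit of 2 on `makeChain(2)`, the plain walk keeps three nodes and gathers three, while the peeking walk keeps all eight. A load has no further operands in this model, so the peek adds only a bounded amount of work per leaf, which is the property the summary relies on.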

The compile-time changes I measured for this (and D122145 together, as percentage differences) were small:

Benchmark	Change (%)
ClamAV	0.086
7zip	-0.012
tramp3d-v4	0.098
kimwitu++	0.008
sqlite3	0.046
mafft	0.003
lencod	0.169
SPASS	0.006
consumer-typeset	-0.019
Bullet	0.011

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dmgreen created this revision. Mar 21 2022, 8:28 AM

Herald added a project: Restricted Project. Mar 21 2022, 8:28 AM

Herald added a subscriber: hiraditya.

dmgreen requested review of this revision. Mar 21 2022, 8:28 AM

Harbormaster completed remote builds in B155406: Diff 416955. Mar 21 2022, 8:28 AM

Fix some spelling

dmgreen added a parent revision: D122145: [SLP] Cluster ordering for loads. Mar 21 2022, 8:30 AM

Harbormaster completed remote builds in B155408: Diff 416965. Mar 21 2022, 8:30 AM

Have you checked the compile-time overhead of increasing the max depth by 1?

In D122148#3397828, @vporpo wrote:

Have you checked the compile-time overhead of increasing the max depth by 1?

Sure - I can check. But it would need to increase the max depth by 2.

OK - it didn't increase the time very much either. Without the influence of D122145, these are the compile times I measured. The first column is this patch (almost no change). The second is increasing the limit from 12 to 14.

			This (%)	14 limit (%)
ClamAV			0.006357042	0.132623139
7zip			0.01145899	0.003936277
tramp3d-v4		-0.00551849	0.009901878
kimwitu++		-0.000151601	0.003420248
sqlite3			-0.005542751	0.002919453
mafft			0.019282126	-0.006002062
lencod			-0.0100786	0.029042137
SPASS			0.006203808	0.014101154
consumer-typeset	0.015334486	0.031552953
Bullet			0.010231917	-0.017328356

The cost of increasing the limit is still very small, but a little bigger in places than this patch (and it may have worse performance in some degenerate cases).

Thank you for taking the time to measure the compilation time with an increased max-depth limit.

I am not sure what the max-depth limit was designed to be used for, probably for testing gathers in lit test? But I don't think that it is a very good way of limiting compilation time. A better option would be to limit the total number of nodes in the graph, by simply counting the entries in buildTree_rec(). So I would assume that increasing the max-depth limit should be relatively safe. If we run into compile-time issues, we can introduce the buildTree_rec() entry counter. @ABataev @RKSimon any thoughts on this?
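
The entry-counter idea suggested above could look roughly like the following sketch. It is only an illustration under assumptions: `BudgetedBuilder`, `Node` and `makeFullTree` are hypothetical names, and the real `buildTree_rec()` takes very different arguments. The point it shows is that a budget on entries bounds the total work directly, whereas a depth limit of d still admits on the order of 2^d nodes in a binary tree.

```cpp
#include <cassert>
#include <vector>

// Toy expression node; not an LLVM type.
struct Node {
  std::vector<Node> Operands;
};

// Hypothetical builder that caps the number of tree entries instead of the
// recursion depth.
struct BudgetedBuilder {
  explicit BudgetedBuilder(int MaxEntries) : MaxEntries(MaxEntries) {
    assert(MaxEntries >= 0 && "budget must be non-negative");
  }

  // Adds N to the tree while budget remains; otherwise records a gather and
  // does not look at N's operands.
  void buildTreeRec(const Node &N) {
    if (Entries >= MaxEntries) {
      ++Gathered;
      return;
    }
    ++Entries;
    for (const Node &Op : N.Operands)
      buildTreeRec(Op);
  }

  int MaxEntries;
  int Entries = 0;
  int Gathered = 0;
};

// Full binary tree with 2^(Depth+1) - 1 nodes.
Node makeFullTree(int Depth) {
  Node N;
  if (Depth > 0) {
    N.Operands.push_back(makeFullTree(Depth - 1));
    N.Operands.push_back(makeFullTree(Depth - 1));
  }
  return N;
}
```

For a full tree of depth 3 (15 nodes), a budget of 10 stops after exactly 10 entries regardless of the tree's shape, while a generous budget visits all 15.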

In D122148#3401136, @vporpo wrote:

Thank you for taking the time to measure the compilation time with an increased max-depth limit.

I am not sure what the max-depth limit was designed to be used for, probably for testing gathers in lit test? But I don't think that it is a very good way of limiting compilation time. A better option would be to limit the total number of nodes in the graph, by simply counting the entries in buildTree_rec(). So I would assume that increasing the max-depth limit should be relatively safe. If we run into compile-time issues, we can introduce the buildTree_rec() entry counter. @ABataev @RKSimon any thoughts on this?

I don't think it is a good idea to increase or completely remove the limit for tree height. I tried something similar recently and found lots of regressions. I think we can do this safely only after we have an implementation of partial subtree/subgraph vectorization; before that it is not profitable.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4600	It might not be profitable, e.g. if vector extension is not free, while scalar extension is free, and loads are not vectorizable. Also, what if you have an extension of a load and some other instructions, not loads/extractelements, etc.?

In D122148#3401136, @vporpo wrote:

Thank you for taking the time to measure the compilation time with an increased max-depth limit.

I am not sure what the max-depth limit was designed to be used for, probably for testing gathers in lit test? But I don't think that it is a very good way of limiting compilation time. A better option would be to limit the total number of nodes in the graph, by simply counting the entries in buildTree_rec(). So I would assume that increasing the max-depth limit should be relatively safe. If we run into compile-time issues, we can introduce the buildTree_rec() entry counter. @ABataev @RKSimon any thoughts on this?

I think I would still recommend this over increasing the max-depth limit. It is more targeted, and less likely to run into degenerate cases where the compile time does increase. It just increases the limit where we know it will be most valuable. They are not mutually exclusive of course - we can always increase the limit separately if we need to.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4600	I'm not sure I understand what you mean by unprofitable. This just stops zext(load) from being forced to be a gather if it hits the max depth. It should just mean that those nodes are either better (not gathers) or the same, and shouldn't lead to regressions. It was previously hitting an arbitrary limit - you could say the same where any arbitrary limit causes arbitrary problems. Giving the loads the ability to order nicely should be a bigger win. For the second part - do you mean an AltOp? If so then that makes sense, we can add a check for that, making sure it is the same as the MainOp.

Add a check that S.MainOp == S.AltOp

Harbormaster completed remote builds in B156019: Diff 417871. Mar 24 2022, 3:31 AM

ABataev added inline comments. Mar 24 2022, 5:51 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4600	The cost of vector sext/zext is larger than the cost of scalar sext/zext (which might be free in many cases). If S.MainOp is zext/sext(load), it does not mean that all values are zext/sext(load); they might be zext/sext(load,extract,binop,etc.), since you're checking only the MainOp.

ktkachov added a subscriber: ktkachov. Mar 24 2022, 6:28 AM

ktkachov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4597	typo nit: should be "peek"

dmgreen added inline comments. Mar 29 2022, 3:05 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4597	Oh right, yeah. Cheers. I'll fix that.
4600	Can't that be true for any limit though? We have an arbitrary limit of 12 at the moment. Decreasing the limit to 11 will mean some zext are treated like gathers, not vectorized, and the cost of zexting the loads may be cheaper for scalars than it is for vectors. The same would be true for decreasing the limit to 10 or 9. We would end up picking the limit where we most expect to find zext(load, which is probably very low. (Or just never vectorizing zext(load if the load is gather). But in general if the loads can be vectorized nicely (either continuously or in clusters) then it should be a gain. The better vectorization of the load would overcome the difference in cost between the scalar and vector zext. We should expect for all the code out there for this to improve performance more than it happens to decrease it. For the second point, do you have any suggestions? As a simple heuristic, this seemed like nice enough check to me, to balance the complexity of the check vs the expected outcome. Should I make it an all_of?

ABataev added inline comments. Mar 29 2022, 4:25 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4600	> Can't that be true for any limit though? We have an arbitrary limit of 12 at the moment. Decreasing the limit to 11 will mean some zext are treated like gathers, not vectorized, and the cost of zexting the loads may be cheaper for scalars than it is for vectors. The same would be true for decreasing the limit to 10 or 9. We would end up picking the limit where we most expect to find zext(load, which is probably very low. (Or just never vectorizing zext(load if the load is gather).

Yes, but we already have some optimizations/numbers/data for this limit. Yes, this limit is not optimal, but it is stable; lots of apps show that it is good enough. Any changes here are very sensitive.

> But in general if the loads can be vectorized nicely (either continuously or in clusters) then it should be a gain. The better vectorization of the load would overcome the difference in cost between the scalar and vector zext. We should expect for all the code out there for this to improve performance more than it happens to decrease it.

The problem here is that you're not checking for loads, you just check that you have a single load. It will work for just loads, but for zext/sext it will not.

> For the second point, do you have any suggestions? As a simple heuristic, this seemed like nice enough check to me, to balance the complexity of the check vs the expected outcome. Should I make it an all_of?

Yes, better to have all_of here, at least for zext/sext. For loads themselves it is enough just to check S.MainOp.
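
The difference between checking only the main operation and checking every lane can be illustrated with a small sketch. This is a simplified model, assuming a bundle is just a list of values that each record their own opcode and the opcode of their first operand; `Value`, `Opcode`, `mainOpOnly` and `allLanes` are hypothetical names and not the LLVM classes of the same spelling.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified model of one lane of a bundle; not llvm::Value.
enum class Opcode { Load, ZExt, SExt, Add, ExtractElement };

struct Value {
  Opcode Op;        // opcode of the value itself
  Opcode OperandOp; // opcode of its first operand (used for ZExt/SExt)
};

bool isExt(Opcode Op) { return Op == Opcode::ZExt || Op == Opcode::SExt; }

bool isExtOfLoad(const Value &V) {
  return isExt(V.Op) && V.OperandOp == Opcode::Load;
}

// Looks only at the first lane, standing in for a check on the main op.
bool mainOpOnly(const std::vector<Value> &VL) {
  assert(!VL.empty() && "bundle must have at least one lane");
  return isExtOfLoad(VL.front());
}

// Requires every lane to be an extension of a load.
bool allLanes(const std::vector<Value> &VL) {
  return std::all_of(VL.begin(), VL.end(), isExtOfLoad);
}
```

For a bundle such as zext(load), zext(add), zext(extractelement), zext(load), `mainOpOnly` accepts the bundle while `allLanes` rejects it, which is the case the reviewer is concerned about.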

Update with more precise zext/sext checks. The clang-formatting comes out a bit strange - let me know if this should be put into a lambda.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4600	I tried the llvm test suite - there were only 2 changes I saw with this patch (according to the file hashes). One was the same code in a different order and the other was essentially the same, just had slightly better reuse of values. Both had the same number of instructions. I don't think I agree that the old limit was particularly special. It was an arbitrary point, and it would seem that increasing it should be beneficial in general. A lot of graphs don't get that big, but the ones that do can benefit from including the loads. Can you give more details about where you expect this to cause problems? Is there a particular benchmark you are worried about?

Harbormaster completed remote builds in B156849: Diff 418902. Mar 29 2022, 3:21 PM

ABataev added inline comments. Mar 29 2022, 3:25 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4600	I already said that I tried to increase it and there were lots of regressions. Try your patch on SpecCPU or another benchmark test suite.

SPEC I've tried for AArch64 - that was one of the places that showed the need for something like this. For AArch64 along with D122145 this helps one of the routines in x264 to speed up the whole thing. The exact speedup is quite dependant on the ordering of shuffles chosen, and may need some more work to come out as good as it can. The introduction of select shuffles wasn't great for targets without them. We are talking about something like a 5% improvement.

I've been trying to run X86 too, but I don't have a great setup for it. On what I think is some sort of "Xeon 6148" thing, these were the performance scores I saw from running SPEC 2017 with -march=native (again with this and D122145):

500.perlbench_r	0.829412281
502.gcc_r	0.398122431
505.mcf_r	-0.352758469
520.omnetpp_r	0.256906516
523.xalancbmk_r	-0.789195872
525.x264_r	0.198126785
531.deepsjeng_r	-0.024245574
541.leela_r	0.003397322
557.xz_r	-0.722254612

They can be quite noisy though I'm afraid, even if those results are averaged between three runs. I wouldn't be surprised if all those changes were down to machine noise or knock-on alignment changes. We have a better setup for AArch64 than we do for X86, but there is always some noise.

I tried 2006 too on AArch64. It was only the perlbench binary that changed there, according to the file hashes. And the performance was the same.

In D122148#3426311, @dmgreen wrote:
SPEC I've tried for AArch64 - that was one of the places that showed the need for something like this. For AArch64 along with D122145 this helps one of the routines in x264 to speed up the whole thing. The exact speedup is quite dependant on the ordering of shuffles chosen, and may need some more work to come out as good as it can. The introduction of select shuffles wasn't great for targets without them. We are talking about something like a 5% improvement.

I've been trying to run X86 too, but I don't have a great setup for it. On what I think is some sort of "Xeon 6148" thing, these were the performance scores I saw from running SPEC 2017 with -march=native (again with this and D122145):
500.perlbench_r	0.829412281
502.gcc_r	0.398122431
505.mcf_r	-0.352758469
520.omnetpp_r	0.256906516
523.xalancbmk_r	-0.789195872
525.x264_r	0.198126785
531.deepsjeng_r	-0.024245574
541.leela_r	0.003397322
557.xz_r	-0.722254612
They can be quite noisy though I'm afraid, even if those results are averaged between three runs. I wouldn't be surprised if all those changes were down to machine noise or knock-on alignment changes. We have a better setup for AArch64 than we do for X86, but there is always some noise.

I tried 2006 too on AArch64. It was only the perlbench binary that changed there, according to the file hashes. And the performance was the same.

I believe that this patch and D122145 can be landed, just need to tweak them a bit.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4602–4608	I would also add a check that we have `VL.size() >= 4` values; for 2 elements it may cause regressions. Probably also check for single uses, to avoid extra extractelements.

In D122148#3426341, @ABataev wrote:

I believe that this patch and D122145 can be landed, just need to tweak them a bit.

Ah, great. Thanks. I guess I have to find some way to improve the shuffle order for AArch64 now... :)

Update with OneUse checks and a VL limit.

Harbormaster completed remote builds in B157922: Diff 420432. Apr 5 2022, 3:11 AM

ABataev added inline comments. Apr 5 2022, 3:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4604–4607	I would also check that the sext/zext have just one use too.

Add an extra oneuse check on the zext/sext

Harbormaster completed remote builds in B158030: Diff 420534. Apr 5 2022, 10:28 AM

ABataev added inline comments. Apr 5 2022, 10:36 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4602–4608	Can you merge these 2 checks into one with single `all_of` to avoid 2 similar checks?

Attempt to turn the two loops into one, by checking the opcode matches the MainOp.
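
As a rough, self-contained model of the merged check (main op equals alt op, at least four lanes, and either a load main op or a single `all_of` over one-use extensions of one-use loads whose opcode matches the main op), consider the sketch below. The `Value` struct, its `NumUses` field and `peekThroughDepthLimit` are hypothetical stand-ins for illustration; the actual patch expresses this with LLVM pattern matchers and negates it inside the gather condition.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified value model; not llvm::Value.
enum class Opcode { Load, ZExt, SExt, Add };

struct Value {
  Opcode Op;
  int NumUses;          // number of users of this value
  const Value *Operand; // first operand, or nullptr for leaves
};

// One-use zext/sext whose operand is a one-use load.
bool isOneUseExtOfOneUseLoad(const Value &V) {
  bool IsExt = V.Op == Opcode::ZExt || V.Op == Opcode::SExt;
  return IsExt && V.NumUses == 1 && V.Operand &&
         V.Operand->Op == Opcode::Load && V.Operand->NumUses == 1;
}

// Returns true when the bundle should still be added to the tree even though
// the depth limit has been reached.
bool peekThroughDepthLimit(const std::vector<const Value *> &VL,
                           const Value *MainOp, const Value *AltOp) {
  if (!MainOp || MainOp != AltOp || VL.size() < 4)
    return false;
  if (MainOp->Op == Opcode::Load)
    return true; // for loads, checking the main op is considered enough
  return std::all_of(VL.begin(), VL.end(), [MainOp](const Value *V) {
    assert(V && "null lane in bundle");
    return isOneUseExtOfOneUseLoad(*V) && V->Op == MainOp->Op;
  });
}
```

Folding the opcode comparison into the same lambda is what lets a single `all_of` reject both mixed zext/sext bundles and lanes with extra uses.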

Harbormaster completed remote builds in B158419: Diff 421115. Apr 7 2022, 12:53 AM

This revision is now accepted and ready to land. Apr 7 2022, 5:55 AM

OK I'm going to give this a go to get the better lane ordering. Please let me know if any problems arise.

This revision was landed with ongoing or failed builds. Jul 4 2022, 6:22 AM

Closed by commit rG2de05afc192d: [SLP] Peek into loads when hitting the RecursionMaxDepth (authored by dmgreen).

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG2de05afc192d: [SLP] Peek into loads when hitting the RecursionMaxDepth.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	14 lines
llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll	151 lines

Diff 442081

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

(4,586 lines not shown)
       if (NumUniqueScalarValues == VL.size()) {
         return false;
       }
       VL = UniqueValues;
     }
     return true;
   };
 
   InstructionsState S = getSameOpcode(VL);
-  if (Depth == RecursionMaxDepth) {
+  // Gather if we hit the RecursionMaxDepth, unless this is a load (or z/sext of
+  // a load), in which case peek through to include it in the tree, without
+  // ballooning over-budget.
+  if (Depth >= RecursionMaxDepth &&
+      !(S.MainOp && isa<Instruction>(S.MainOp) && S.MainOp == S.AltOp &&
+        VL.size() >= 4 &&
+        (match(S.MainOp, m_Load(m_Value())) || all_of(VL, [&S](const Value *I) {
+           return match(I,
+                        m_OneUse(m_ZExtOrSExt(m_OneUse(m_Load(m_Value()))))) &&
+                  cast<Instruction>(I)->getOpcode() ==
+                      cast<Instruction>(S.MainOp)->getOpcode();
+         })))) {
     LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
     if (TryToFindDuplicates(S))
       newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
                    ReuseShuffleIndicies);
     return;
   }
 
   // Don't handle scalable vectors
   if (S.getOpcode() == Instruction::ExtractElement &&
(7,852 lines not shown)

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll

	Show First 20 Lines • Show All 1,233 Lines • ▼ Show 20 Lines

	define dso_local i32 @full(i8* nocapture noundef readonly %p1, i32 noundef %st1, i8* nocapture noundef readonly %p2, i32 noundef %st2) {			define dso_local i32 @full(i8* nocapture noundef readonly %p1, i32 noundef %st1, i8* nocapture noundef readonly %p2, i32 noundef %st2) {
	; CHECK-LABEL: @full(			; CHECK-LABEL: @full(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[IDX_EXT:%.]] = sext i32 [[ST1:%.]] to i64			; CHECK-NEXT: [[IDX_EXT:%.]] = sext i32 [[ST1:%.]] to i64
	; CHECK-NEXT: [[IDX_EXT63:%.]] = sext i32 [[ST2:%.]] to i64			; CHECK-NEXT: [[IDX_EXT63:%.]] = sext i32 [[ST2:%.]] to i64
	; CHECK-NEXT: [[ARRAYIDX3:%.]] = getelementptr inbounds i8, i8 [[P1:%.*]], i64 4			; CHECK-NEXT: [[ARRAYIDX3:%.]] = getelementptr inbounds i8, i8 [[P1:%.*]], i64 4
	; CHECK-NEXT: [[ARRAYIDX5:%.]] = getelementptr inbounds i8, i8 [[P2:%.*]], i64 4			; CHECK-NEXT: [[ARRAYIDX5:%.]] = getelementptr inbounds i8, i8 [[P2:%.*]], i64 4
	; CHECK-NEXT: [[TMP0:%.]] = bitcast i8 [[P1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP1:%.]] = load <4 x i8>, <4 x i8> [[TMP0]], align 1
	; CHECK-NEXT: [[TMP2:%.]] = bitcast i8 [[P2]] to <4 x i8>*
	; CHECK-NEXT: [[TMP3:%.]] = load <4 x i8>, <4 x i8> [[TMP2]], align 1
	; CHECK-NEXT: [[ADD_PTR:%.]] = getelementptr inbounds i8, i8 [[P1]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR:%.]] = getelementptr inbounds i8, i8 [[P1]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[ADD_PTR64:%.]] = getelementptr inbounds i8, i8 [[P2]], i64 [[IDX_EXT63]]			; CHECK-NEXT: [[ADD_PTR64:%.]] = getelementptr inbounds i8, i8 [[P2]], i64 [[IDX_EXT63]]
	; CHECK-NEXT: [[ARRAYIDX3_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR]], i64 4			; CHECK-NEXT: [[ARRAYIDX3_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR]], i64 4
	; CHECK-NEXT: [[ARRAYIDX5_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64]], i64 4			; CHECK-NEXT: [[ARRAYIDX5_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64]], i64 4
	; CHECK-NEXT: [[TMP4:%.]] = bitcast i8 [[ADD_PTR]] to <4 x i8>*
	; CHECK-NEXT: [[TMP5:%.]] = load <4 x i8>, <4 x i8> [[TMP4]], align 1
	; CHECK-NEXT: [[TMP6:%.]] = bitcast i8 [[ADD_PTR64]] to <4 x i8>*
	; CHECK-NEXT: [[TMP7:%.]] = load <4 x i8>, <4 x i8> [[TMP6]], align 1
	; CHECK-NEXT: [[ADD_PTR_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[ADD_PTR64_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64]], i64 [[IDX_EXT63]]			; CHECK-NEXT: [[ADD_PTR64_1:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64]], i64 [[IDX_EXT63]]
	; CHECK-NEXT: [[ARRAYIDX3_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR_1]], i64 4			; CHECK-NEXT: [[ARRAYIDX3_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR_1]], i64 4
	; CHECK-NEXT: [[ARRAYIDX5_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64_1]], i64 4			; CHECK-NEXT: [[ARRAYIDX5_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64_1]], i64 4
	; CHECK-NEXT: [[TMP8:%.]] = bitcast i8 [[ADD_PTR_1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP9:%.]] = load <4 x i8>, <4 x i8> [[TMP8]], align 1
	; CHECK-NEXT: [[TMP10:%.]] = bitcast i8 [[ADD_PTR64_1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP11:%.]] = load <4 x i8>, <4 x i8> [[TMP10]], align 1
	; CHECK-NEXT: [[ADD_PTR_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR_1]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR_1]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[ADD_PTR64_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64_1]], i64 [[IDX_EXT63]]			; CHECK-NEXT: [[ADD_PTR64_2:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64_1]], i64 [[IDX_EXT63]]
	; CHECK-NEXT: [[ARRAYIDX3_3:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR_2]], i64 4			; CHECK-NEXT: [[ARRAYIDX3_3:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR_2]], i64 4
	; CHECK-NEXT: [[ARRAYIDX5_3:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64_2]], i64 4			; CHECK-NEXT: [[ARRAYIDX5_3:%.]] = getelementptr inbounds i8, i8 [[ADD_PTR64_2]], i64 4
	; CHECK-NEXT: [[TMP12:%.]] = bitcast i8 [[ADD_PTR_2]] to <4 x i8>*			; CHECK-NEXT: [[TMP0:%.]] = bitcast i8 [[P1]] to <4 x i8>*
				; CHECK-NEXT: [[TMP1:%.]] = load <4 x i8>, <4 x i8> [[TMP0]], align 1
				; CHECK-NEXT: [[TMP2:%.]] = bitcast i8 [[P2]] to <4 x i8>*
				; CHECK-NEXT: [[TMP3:%.]] = load <4 x i8>, <4 x i8> [[TMP2]], align 1
				; CHECK-NEXT: [[TMP4:%.]] = bitcast i8 [[ARRAYIDX3]] to <4 x i8>*
				; CHECK-NEXT: [[TMP5:%.]] = load <4 x i8>, <4 x i8> [[TMP4]], align 1
				; CHECK-NEXT: [[TMP6:%.]] = bitcast i8 [[ARRAYIDX5]] to <4 x i8>*
				; CHECK-NEXT: [[TMP7:%.]] = load <4 x i8>, <4 x i8> [[TMP6]], align 1
				; CHECK-NEXT: [[TMP8:%.]] = bitcast i8 [[ADD_PTR]] to <4 x i8>*
				; CHECK-NEXT: [[TMP9:%.]] = load <4 x i8>, <4 x i8> [[TMP8]], align 1
				; CHECK-NEXT: [[TMP10:%.]] = bitcast i8 [[ADD_PTR64]] to <4 x i8>*
				; CHECK-NEXT: [[TMP11:%.]] = load <4 x i8>, <4 x i8> [[TMP10]], align 1
				; CHECK-NEXT: [[TMP12:%.]] = bitcast i8 [[ARRAYIDX3_1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP13:%.]] = load <4 x i8>, <4 x i8> [[TMP12]], align 1			; CHECK-NEXT: [[TMP13:%.]] = load <4 x i8>, <4 x i8> [[TMP12]], align 1
	; CHECK-NEXT: [[TMP14:%.]] = bitcast i8 [[ADD_PTR64_2]] to <4 x i8>*			; CHECK-NEXT: [[TMP14:%.]] = bitcast i8 [[ARRAYIDX5_1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP15:%.]] = load <4 x i8>, <4 x i8> [[TMP14]], align 1			; CHECK-NEXT: [[TMP15:%.]] = load <4 x i8>, <4 x i8> [[TMP14]], align 1
	; CHECK-NEXT: [[TMP16:%.]] = bitcast i8 [[ARRAYIDX3]] to <4 x i8>*			; CHECK-NEXT: [[TMP16:%.]] = bitcast i8 [[ADD_PTR_1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP17:%.]] = load <4 x i8>, <4 x i8> [[TMP16]], align 1			; CHECK-NEXT: [[TMP17:%.]] = load <4 x i8>, <4 x i8> [[TMP16]], align 1
	; CHECK-NEXT: [[TMP18:%.]] = bitcast i8 [[ARRAYIDX3_1]] to <4 x i8>*			; CHECK-NEXT: [[TMP18:%.]] = bitcast i8 [[ADD_PTR64_1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP19:%.]] = load <4 x i8>, <4 x i8> [[TMP18]], align 1			; CHECK-NEXT: [[TMP19:%.]] = load <4 x i8>, <4 x i8> [[TMP18]], align 1
	; CHECK-NEXT: [[TMP20:%.]] = bitcast i8 [[ARRAYIDX3_2]] to <4 x i8>*			; CHECK-NEXT: [[TMP20:%.]] = bitcast i8 [[ARRAYIDX3_2]] to <4 x i8>*
	; CHECK-NEXT: [[TMP21:%.]] = load <4 x i8>, <4 x i8> [[TMP20]], align 1			; CHECK-NEXT: [[TMP21:%.]] = load <4 x i8>, <4 x i8> [[TMP20]], align 1
	; CHECK-NEXT: [[TMP22:%.]] = bitcast i8 [[ARRAYIDX3_3]] to <4 x i8>*			; CHECK-NEXT: [[TMP22:%.]] = bitcast i8 [[ARRAYIDX5_2]] to <4 x i8>*
	; CHECK-NEXT: [[TMP23:%.]] = load <4 x i8>, <4 x i8> [[TMP22]], align 1			; CHECK-NEXT: [[TMP23:%.]] = load <4 x i8>, <4 x i8> [[TMP22]], align 1
	; CHECK-NEXT: [[TMP24:%.*]] = shufflevector <4 x i8> [[TMP23]], <4 x i8> [[TMP21]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP24:%.]] = bitcast i8 [[ADD_PTR_2]] to <4 x i8>*
	; CHECK-NEXT: [[TMP25:%.*]] = shufflevector <4 x i8> [[TMP19]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP25:%.]] = load <4 x i8>, <4 x i8> [[TMP24]], align 1
	; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <16 x i8> [[TMP24]], <16 x i8> [[TMP25]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <4 x i8> [[TMP25]], <4 x i8> [[TMP17]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <4 x i8> [[TMP17]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <4 x i8> [[TMP9]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP28:%.*]] = shufflevector <16 x i8> [[TMP26]], <16 x i8> [[TMP27]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>			; CHECK-NEXT: [[TMP28:%.*]] = shufflevector <16 x i8> [[TMP26]], <16 x i8> [[TMP27]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP29:%.*]] = zext <16 x i8> [[TMP28]] to <16 x i32>			; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <4 x i8> [[TMP1]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP30:%.*]] = bitcast i8* [[ARRAYIDX5]] to <4 x i8>*			; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <16 x i8> [[TMP28]], <16 x i8> [[TMP29]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>
	; CHECK-NEXT: [[TMP31:%.*]] = load <4 x i8>, <4 x i8>* [[TMP30]], align 1			; CHECK-NEXT: [[TMP31:%.*]] = zext <16 x i8> [[TMP30]] to <16 x i32>
	; CHECK-NEXT: [[TMP32:%.*]] = bitcast i8* [[ARRAYIDX5_1]] to <4 x i8>*			; CHECK-NEXT: [[TMP32:%.*]] = bitcast i8* [[ADD_PTR64_2]] to <4 x i8>*
	; CHECK-NEXT: [[TMP33:%.*]] = load <4 x i8>, <4 x i8>* [[TMP32]], align 1			; CHECK-NEXT: [[TMP33:%.*]] = load <4 x i8>, <4 x i8>* [[TMP32]], align 1
	; CHECK-NEXT: [[TMP34:%.*]] = bitcast i8* [[ARRAYIDX5_2]] to <4 x i8>*			; CHECK-NEXT: [[TMP34:%.*]] = shufflevector <4 x i8> [[TMP33]], <4 x i8> [[TMP19]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP35:%.*]] = load <4 x i8>, <4 x i8>* [[TMP34]], align 1			; CHECK-NEXT: [[TMP35:%.*]] = shufflevector <4 x i8> [[TMP11]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP36:%.*]] = bitcast i8* [[ARRAYIDX5_3]] to <4 x i8>*			; CHECK-NEXT: [[TMP36:%.*]] = shufflevector <16 x i8> [[TMP34]], <16 x i8> [[TMP35]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP37:%.*]] = load <4 x i8>, <4 x i8>* [[TMP36]], align 1			; CHECK-NEXT: [[TMP37:%.*]] = shufflevector <4 x i8> [[TMP3]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP38:%.*]] = shufflevector <4 x i8> [[TMP37]], <4 x i8> [[TMP35]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP38:%.*]] = shufflevector <16 x i8> [[TMP36]], <16 x i8> [[TMP37]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>
	; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <4 x i8> [[TMP33]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP39:%.*]] = zext <16 x i8> [[TMP38]] to <16 x i32>
	; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <16 x i8> [[TMP38]], <16 x i8> [[TMP39]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP40:%.*]] = sub nsw <16 x i32> [[TMP31]], [[TMP39]]
	; CHECK-NEXT: [[TMP41:%.*]] = shufflevector <4 x i8> [[TMP31]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP41:%.*]] = bitcast i8* [[ARRAYIDX3_3]] to <4 x i8>*
	; CHECK-NEXT: [[TMP42:%.*]] = shufflevector <16 x i8> [[TMP40]], <16 x i8> [[TMP41]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>			; CHECK-NEXT: [[TMP42:%.*]] = load <4 x i8>, <4 x i8>* [[TMP41]], align 1
	; CHECK-NEXT: [[TMP43:%.*]] = zext <16 x i8> [[TMP42]] to <16 x i32>			; CHECK-NEXT: [[TMP43:%.*]] = shufflevector <4 x i8> [[TMP42]], <4 x i8> [[TMP21]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP44:%.*]] = shufflevector <4 x i8> [[TMP13]], <4 x i8> [[TMP9]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP44:%.*]] = shufflevector <4 x i8> [[TMP13]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP45:%.*]] = shufflevector <4 x i8> [[TMP5]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP45:%.*]] = shufflevector <16 x i8> [[TMP43]], <16 x i8> [[TMP44]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <16 x i8> [[TMP44]], <16 x i8> [[TMP45]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <4 x i8> [[TMP5]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <4 x i8> [[TMP1]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <16 x i8> [[TMP45]], <16 x i8> [[TMP46]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>
	; CHECK-NEXT: [[TMP48:%.*]] = shufflevector <16 x i8> [[TMP46]], <16 x i8> [[TMP47]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>			; CHECK-NEXT: [[TMP48:%.*]] = zext <16 x i8> [[TMP47]] to <16 x i32>
	; CHECK-NEXT: [[TMP49:%.*]] = zext <16 x i8> [[TMP48]] to <16 x i32>			; CHECK-NEXT: [[TMP49:%.*]] = bitcast i8* [[ARRAYIDX5_3]] to <4 x i8>*
	; CHECK-NEXT: [[TMP50:%.*]] = shufflevector <4 x i8> [[TMP15]], <4 x i8> [[TMP11]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP50:%.*]] = load <4 x i8>, <4 x i8>* [[TMP49]], align 1
	; CHECK-NEXT: [[TMP51:%.*]] = shufflevector <4 x i8> [[TMP7]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP51:%.*]] = shufflevector <4 x i8> [[TMP50]], <4 x i8> [[TMP23]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP52:%.*]] = shufflevector <16 x i8> [[TMP50]], <16 x i8> [[TMP51]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP52:%.*]] = shufflevector <4 x i8> [[TMP15]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP53:%.*]] = shufflevector <4 x i8> [[TMP3]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP53:%.*]] = shufflevector <16 x i8> [[TMP51]], <16 x i8> [[TMP52]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP54:%.*]] = shufflevector <16 x i8> [[TMP52]], <16 x i8> [[TMP53]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>			; CHECK-NEXT: [[TMP54:%.*]] = shufflevector <4 x i8> [[TMP7]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP55:%.*]] = zext <16 x i8> [[TMP54]] to <16 x i32>			; CHECK-NEXT: [[TMP55:%.*]] = shufflevector <16 x i8> [[TMP53]], <16 x i8> [[TMP54]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>
	; CHECK-NEXT: [[TMP56:%.*]] = sub nsw <16 x i32> [[TMP49]], [[TMP55]]			; CHECK-NEXT: [[TMP56:%.*]] = zext <16 x i8> [[TMP55]] to <16 x i32>
	; CHECK-NEXT: [[TMP57:%.*]] = sub nsw <16 x i32> [[TMP29]], [[TMP43]]			; CHECK-NEXT: [[TMP57:%.*]] = sub nsw <16 x i32> [[TMP48]], [[TMP56]]
	; CHECK-NEXT: [[TMP58:%.*]] = shl nsw <16 x i32> [[TMP57]], <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>			; CHECK-NEXT: [[TMP58:%.*]] = shl nsw <16 x i32> [[TMP57]], <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
	; CHECK-NEXT: [[TMP59:%.*]] = add nsw <16 x i32> [[TMP58]], [[TMP56]]			; CHECK-NEXT: [[TMP59:%.*]] = add nsw <16 x i32> [[TMP58]], [[TMP40]]
	; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <16 x i32> [[TMP59]], <16 x i32> poison, <16 x i32> <i32 3, i32 7, i32 11, i32 15, i32 6, i32 2, i32 10, i32 14, i32 5, i32 1, i32 9, i32 13, i32 4, i32 0, i32 8, i32 12>			; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <16 x i32> [[TMP59]], <16 x i32> poison, <16 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6, i32 9, i32 8, i32 11, i32 10, i32 13, i32 12, i32 15, i32 14>
	; CHECK-NEXT: [[TMP61:%.*]] = shufflevector <16 x i32> [[TMP60]], <16 x i32> poison, <16 x i32> <i32 5, i32 4, i32 6, i32 7, i32 1, i32 0, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11>			; CHECK-NEXT: [[TMP61:%.*]] = add nsw <16 x i32> [[TMP59]], [[TMP60]]
	; CHECK-NEXT: [[TMP62:%.*]] = add nsw <16 x i32> [[TMP60]], [[TMP61]]			; CHECK-NEXT: [[TMP62:%.*]] = sub nsw <16 x i32> [[TMP59]], [[TMP60]]
	; CHECK-NEXT: [[TMP63:%.*]] = sub nsw <16 x i32> [[TMP60]], [[TMP61]]			; CHECK-NEXT: [[TMP63:%.*]] = shufflevector <16 x i32> [[TMP61]], <16 x i32> [[TMP62]], <16 x i32> <i32 3, i32 7, i32 11, i32 15, i32 22, i32 18, i32 26, i32 30, i32 5, i32 1, i32 9, i32 13, i32 20, i32 16, i32 24, i32 28>
	; CHECK-NEXT: [[TMP64:%.*]] = shufflevector <16 x i32> [[TMP62]], <16 x i32> [[TMP63]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 20, i32 21, i32 22, i32 23, i32 8, i32 9, i32 10, i32 11, i32 28, i32 29, i32 30, i32 31>			; CHECK-NEXT: [[TMP64:%.*]] = shufflevector <16 x i32> [[TMP63]], <16 x i32> poison, <16 x i32> <i32 9, i32 8, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 1, i32 0, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	; CHECK-NEXT: [[TMP65:%.*]] = shufflevector <16 x i32> [[TMP64]], <16 x i32> poison, <16 x i32> <i32 9, i32 8, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 1, i32 0, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			; CHECK-NEXT: [[TMP65:%.*]] = add nsw <16 x i32> [[TMP63]], [[TMP64]]
	; CHECK-NEXT: [[TMP66:%.*]] = add nsw <16 x i32> [[TMP64]], [[TMP65]]			; CHECK-NEXT: [[TMP66:%.*]] = sub nsw <16 x i32> [[TMP63]], [[TMP64]]
	; CHECK-NEXT: [[TMP67:%.*]] = sub nsw <16 x i32> [[TMP64]], [[TMP65]]			; CHECK-NEXT: [[TMP67:%.*]] = shufflevector <16 x i32> [[TMP65]], <16 x i32> [[TMP66]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
	; CHECK-NEXT: [[TMP68:%.*]] = shufflevector <16 x i32> [[TMP66]], <16 x i32> [[TMP67]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>			; CHECK-NEXT: [[TMP68:%.*]] = shufflevector <16 x i32> [[TMP67]], <16 x i32> poison, <16 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6, i32 9, i32 8, i32 11, i32 10, i32 13, i32 12, i32 15, i32 14>
	; CHECK-NEXT: [[TMP69:%.*]] = shufflevector <16 x i32> [[TMP68]], <16 x i32> poison, <16 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6, i32 9, i32 8, i32 11, i32 10, i32 13, i32 12, i32 15, i32 14>			; CHECK-NEXT: [[TMP69:%.*]] = add nsw <16 x i32> [[TMP67]], [[TMP68]]
	; CHECK-NEXT: [[TMP70:%.*]] = add nsw <16 x i32> [[TMP68]], [[TMP69]]			; CHECK-NEXT: [[TMP70:%.*]] = sub nsw <16 x i32> [[TMP67]], [[TMP68]]
	; CHECK-NEXT: [[TMP71:%.*]] = sub nsw <16 x i32> [[TMP68]], [[TMP69]]			; CHECK-NEXT: [[TMP71:%.*]] = shufflevector <16 x i32> [[TMP69]], <16 x i32> [[TMP70]], <16 x i32> <i32 0, i32 17, i32 2, i32 19, i32 20, i32 5, i32 6, i32 23, i32 24, i32 9, i32 10, i32 27, i32 28, i32 13, i32 14, i32 31>
	; CHECK-NEXT: [[TMP72:%.*]] = shufflevector <16 x i32> [[TMP70]], <16 x i32> [[TMP71]], <16 x i32> <i32 0, i32 17, i32 2, i32 19, i32 20, i32 5, i32 6, i32 23, i32 24, i32 9, i32 10, i32 27, i32 28, i32 13, i32 14, i32 31>			; CHECK-NEXT: [[TMP72:%.*]] = shufflevector <16 x i32> [[TMP71]], <16 x i32> poison, <16 x i32> <i32 2, i32 3, i32 0, i32 1, i32 7, i32 6, i32 5, i32 4, i32 11, i32 10, i32 9, i32 8, i32 15, i32 14, i32 13, i32 12>
	; CHECK-NEXT: [[TMP73:%.*]] = shufflevector <16 x i32> [[TMP72]], <16 x i32> poison, <16 x i32> <i32 2, i32 3, i32 0, i32 1, i32 7, i32 6, i32 5, i32 4, i32 11, i32 10, i32 9, i32 8, i32 15, i32 14, i32 13, i32 12>			; CHECK-NEXT: [[TMP73:%.*]] = add nsw <16 x i32> [[TMP71]], [[TMP72]]
	; CHECK-NEXT: [[TMP74:%.*]] = add nsw <16 x i32> [[TMP72]], [[TMP73]]			; CHECK-NEXT: [[TMP74:%.*]] = sub nsw <16 x i32> [[TMP71]], [[TMP72]]
	; CHECK-NEXT: [[TMP75:%.*]] = sub nsw <16 x i32> [[TMP72]], [[TMP73]]			; CHECK-NEXT: [[TMP75:%.*]] = shufflevector <16 x i32> [[TMP73]], <16 x i32> [[TMP74]], <16 x i32> <i32 0, i32 1, i32 18, i32 19, i32 4, i32 5, i32 22, i32 23, i32 8, i32 9, i32 26, i32 27, i32 12, i32 13, i32 30, i32 31>
	; CHECK-NEXT: [[TMP76:%.*]] = shufflevector <16 x i32> [[TMP74]], <16 x i32> [[TMP75]], <16 x i32> <i32 0, i32 1, i32 18, i32 19, i32 4, i32 5, i32 22, i32 23, i32 8, i32 9, i32 26, i32 27, i32 12, i32 13, i32 30, i32 31>			; CHECK-NEXT: [[TMP76:%.*]] = lshr <16 x i32> [[TMP75]], <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15>
	; CHECK-NEXT: [[TMP77:%.*]] = lshr <16 x i32> [[TMP76]], <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15>			; CHECK-NEXT: [[TMP77:%.*]] = and <16 x i32> [[TMP76]], <i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537>
	; CHECK-NEXT: [[TMP78:%.*]] = and <16 x i32> [[TMP77]], <i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537>			; CHECK-NEXT: [[TMP78:%.*]] = mul nuw <16 x i32> [[TMP77]], <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>
	; CHECK-NEXT: [[TMP79:%.*]] = mul nuw <16 x i32> [[TMP78]], <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>			; CHECK-NEXT: [[TMP79:%.*]] = add <16 x i32> [[TMP78]], [[TMP75]]
	; CHECK-NEXT: [[TMP80:%.*]] = add <16 x i32> [[TMP79]], [[TMP76]]			; CHECK-NEXT: [[TMP80:%.*]] = xor <16 x i32> [[TMP79]], [[TMP78]]
	; CHECK-NEXT: [[TMP81:%.*]] = xor <16 x i32> [[TMP80]], [[TMP79]]			; CHECK-NEXT: [[TMP81:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP80]])
	; CHECK-NEXT: [[TMP82:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP81]])			; CHECK-NEXT: [[CONV118:%.*]] = and i32 [[TMP81]], 65535
	; CHECK-NEXT: [[CONV118:%.*]] = and i32 [[TMP82]], 65535			; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[TMP81]], 16
	; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[TMP82]], 16
	; CHECK-NEXT: [[ADD119:%.*]] = add nuw nsw i32 [[CONV118]], [[SHR]]			; CHECK-NEXT: [[ADD119:%.*]] = add nuw nsw i32 [[CONV118]], [[SHR]]
	; CHECK-NEXT: [[SHR120:%.*]] = lshr i32 [[ADD119]], 1			; CHECK-NEXT: [[SHR120:%.*]] = lshr i32 [[ADD119]], 1
	; CHECK-NEXT: ret i32 [[SHR120]]			; CHECK-NEXT: ret i32 [[SHR120]]
	;			;
	entry:			entry:
	%idx.ext = sext i32 %st1 to i64			%idx.ext = sext i32 %st1 to i64
	%idx.ext63 = sext i32 %st2 to i64			%idx.ext63 = sext i32 %st2 to i64
	%0 = load i8, i8* %p1, align 1			%0 = load i8, i8* %p1, align 1