This is an archive of the discontinued LLVM Phabricator instance.

Generalize strided store pattern in interleave access pass
ClosedPublic

Authored by asbirlea on Aug 17 2016, 11:46 PM.

Details

Summary

This patch generalizes the matching of strided store accesses to handle more general masks.
The more general rule is to have consecutive accesses based on the stride:
[x, y, ... z, x+1, y+1, ...z+1, x+2, y+2, ...z+2, ...]
and requires the start element of each stride (x, y, ..., z) to be aligned.
However, the elements in the mask need not form a contiguous space; there may be gaps.
As before, undefs are allowed and are filled in with adjacent element loads.
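
To make the rule concrete, here is a minimal sketch of the per-lane check described above (the helper name, the use of -1 for undef, and the omission of the pass's additional checks, such as requiring the inferred lane start to be non-negative, are assumptions for illustration; this is not the patch's actual code):

#include <vector>

// Sketch only: a mask of length Factor * LaneLen is accepted if, within
// every lane, the defined elements are consecutive. Lane starts may leave
// gaps between one another, and undef elements (-1) are skipped.
static bool isGeneralizedReInterleaveMask(const std::vector<int> &Mask,
                                          unsigned Factor, unsigned LaneLen) {
  if (Mask.size() != Factor * LaneLen)
    return false;
  for (unsigned Lane = 0; Lane < Factor; ++Lane) {
    int Start = -1; // inferred starting value of this lane
    for (unsigned J = 0; J < LaneLen; ++J) {
      int Elt = Mask[J * Factor + Lane];
      if (Elt < 0)
        continue;             // undef: filled in later from neighbors
      if (Start < 0)
        Start = Elt - (int)J; // infer the lane's start from this element
      else if (Elt != Start + (int)J)
        return false;         // not consecutive within the lane
    }
  }
  return true;
}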

Note this patch is not final, but I would like to get feedback on the approach.
At the very least, the pending TODOs still remain.

Event Timeline

asbirlea updated this revision to Diff 68484.Aug 17 2016, 11:46 PM
asbirlea retitled this revision from to Generalize strided store pattern in interleave access pass.
asbirlea updated this object.
asbirlea added reviewers: HaoLiu, mssimpso.
asbirlea added subscribers: llvm-commits, delena, mkuper.
mssimpso edited edge metadata.Aug 19 2016, 1:35 PM

Hi Alina,

I think I understand this, but I just want to be sure I get how this differs from what we currently have before going further. Currently, we only match [x, y, ..., z, x+1, y+1, z+1, ...] where each y-x and each z-y equals the number of sub-elements for the given factor. Or said another way, if I create a list of all the x's followed by all the y's and then all the z's, the entire list would be consecutive. With your patch, the only requirement is that each sub-list be consecutive. Is this right?

The current approach was designed to match the shuffle patterns produced by the loop vectorizer. I'm curious to know where we are generating these more general patterns. Have you run across some code examples?

Also, another high level comment before I start looking at the details: you'll want to include some IR test cases as well (to be run with opt instead of llc).

Matt.

Hi Matt,

Thanks for taking the time to review this. Please find my answers below.

> Hi Alina,
>
> I think I understand this, but I just want to be sure I get how this differs from what we currently have before going further. Currently, we only match [x, y, ..., z, x+1, y+1, z+1, ...] where each y-x and each z-y equals the number of sub-elements for the given factor. Or said another way, if I create a list of all the x's followed by all the y's and then all the z's, the entire list would be consecutive. With your patch, the only requirement is that each sub-list be consecutive. Is this right?

That's right. Also, from my understanding, x is always 0, so all elements form a consecutive sublist which always starts at 0.
My first approach was actually a smaller generalization: add a prefix to remove the "starts with 0" restriction and allow a more general stride with gaps. But this still didn't cover all the test cases I came across, such as the example I added in "store_general_mask_factor4" (see the illustrative masks sketched below).
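
For illustration only, two hypothetical factor-4 masks contrasting the two situations (these are not the masks from the patch's tests):

// Accepted before this patch: lanes start at 0, 4, 8, 12 and together
// cover the contiguous range 0..15, always beginning at 0.
int OldStyle[16] = {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15};
// Accepted with this patch: each lane is still internally consecutive,
// but the lane starts (4, 32, 16, 8) leave gaps and need not begin at 0.
int NewStyle[16] = {4, 32, 16, 8, 5, 33, 17, 9, 6, 34, 18, 10, 7, 35, 19, 11};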

To answer your question below, the use cases I'm looking at are generated by Halide (https://github.com/halide/Halide).
Halide generates LLVM IR and relies on LLVM's optimization pipeline and lowering, but it needs to generate explicit intrinsics (including strided loads and stores) for arm and aarch64, because its patterns are not lowered to intrinsics by LLVM.
Since this approach was taken before the interleaved-access pass was added, it's quite understandable; but LLVM is more capable now, and I'm trying to make use of that and, in the process, cover the cases missing in LLVM.
For example, for strided loads the interleaved-access pass already covers the code patterns generated by Halide, so the "custom" intrinsic code generation in Halide will soon be removed. My goal is to improve the pass to make this happen for stores as well.
The tests I will add are actually simplified versions of what Halide is generating.

> The current approach was designed to match the shuffle patterns produced by the loop vectorizer. I'm curious to know where we are generating these more general patterns. Have you run across some code examples?
>
> Also, another high level comment before I start looking at the details: you'll want to include some IR test cases as well (to be run with opt instead of llc).

Agreed, the plan is to add more tests, including IR tests.

> Matt.

mssimpso added inline comments.Aug 26 2016, 10:42 AM
lib/CodeGen/InterleavedAccessPass.cpp
158–169

You should probably update this to define the more general pattern.

184–226

This looks fairly reasonable to me, but the parts dealing with undef are pretty difficult to follow. I think some more high-level comments would help people better understand what's going on here.

asbirlea updated this revision to Diff 70479.Sep 6 2016, 3:33 PM
asbirlea edited edge metadata.

Address review comments regarding code comments.
Complete TODOs.
Requesting help on whether to include the TLI.misalignedAccess check and, if so, what the correct way to do it is.

asbirlea added inline comments.Sep 6 2016, 3:37 PM
lib/CodeGen/InterleavedAccessPass.cpp
413

This is a part that I'm not sure is needed, nor how to address it.

The goal was to check the alignment of each of the strides, i.e. BaseStoreAddress + StartingIncrementInStride, for each stride in [0, Factor).
The commented-out attempt has a series of problems and does not achieve this. Should this check exist, and what's the correct way to handle it?
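
As a rough sketch of what checking the alignment of BaseStoreAddress + StartingIncrementInStride for each stride could look like (the helper name and the gcd-based bound are illustrative assumptions, not the commented-out code):

#include <cstdint>
#include <numeric>

// Conservative alignment of the first store in a stride, given the base
// pointer alignment (a power of two) and the stride's starting element.
static uint64_t strideAlignment(uint64_t BaseAlign, uint64_t EltSizeBytes,
                                uint64_t StrideStartElt) {
  uint64_t OffsetBytes = StrideStartElt * EltSizeBytes;
  if (OffsetBytes == 0)
    return BaseAlign;
  // Base + Offset is aligned to at least gcd(BaseAlign, OffsetBytes).
  return std::gcd(BaseAlign, OffsetBytes);
}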

Hi Alina,

Sorry for the delay. I'm not quite sure I understand this patch anymore. I'm adding Tim Northover (ARM/AArch64 code owner) as a reviewer to hopefully get this unstuck.

SG, thank you!

I'm going back to look at the alignment check this afternoon (that's the big commented-out block).
I'd really like to understand why some basic alignment checks lead to the ARM tests failing but not their AArch64 counterparts.

mssimpso added inline comments.Sep 13 2016, 1:51 PM
lib/CodeGen/InterleavedAccessPass.cpp
413

I could be wrong about this, but I don't think you need to worry about alignment here. I'm not seeing how the memory behavior with this patch would be different than the current situation.

asbirlea added inline comments.Sep 13 2016, 4:09 PM
lib/CodeGen/InterleavedAccessPass.cpp
413

I agree with you that there should be no significant difference from the current situation. There is one small difference, though: before, there could be one misaligned access; now there may be Factor such accesses.
That's why I'd still like to understand the alignment issue - whether the check is needed or not, and in what form. Perhaps it would be better to handle it in a separate patch, though.

asbirlea updated this revision to Diff 71376.Sep 14 2016, 9:52 AM

Remove comment block checking for alignment. Will revisit in a future patch.

asbirlea updated this revision to Diff 73312.Oct 3 2016, 11:34 AM

Pinging patch.

Also, working around the case when masks are larger than 16 elements.
This can happen now if Halide takes advantage of this pass for strided stores.
This is not the only use-case of larger shuffle masks, but the topic is beyond the scope of this patch.

asbirlea updated this revision to Diff 73332.Oct 3 2016, 1:35 PM

Minor edit of temporary variables.

rengolin added inline comments.Oct 14 2016, 7:36 AM
lib/CodeGen/InterleavedAccessPass.cpp
193

Nit: ij is hard to follow. Try lane or something more expressive. (This is not a matrix. :)

204

This is really confusing. Can you factor the comparison elements out with expressive names, so the if becomes a comparison of obvious terms?

208

PreviousMask is always used in conjunction with PreviousPos, so you don't need the mask to be signed and you can compare the pos in the block above and get rid of the static casts. Or you could have an additional boolean flag and make them both unsigned.

lib/Target/AArch64/AArch64ISelLowering.cpp
7233

Nit: use braces here:

if (...) {
  ...
} else {
  ...
}
7237

I don't get the - j here.

lib/Target/ARM/ARMISelLowering.cpp
13181

Better to duplicate the comment, I think. These back-ends evolve at different paces.

test/CodeGen/ARM/arm-interleaved-accesses.ll
321

Is this really guaranteed to reproduce? They don't seem connected to the PCS directly...

asbirlea updated this revision to Diff 74721.Oct 14 2016, 11:27 AM

Address comments.

asbirlea added inline comments.Oct 14 2016, 11:34 AM
lib/CodeGen/InterleavedAccessPass.cpp
193

Renamed to Lane. I hope renaming NumSubElts to LaneLen makes it clearer too.
I'm inclined to change it in the lowering files as well, for consistency.

lib/Target/AArch64/AArch64ISelLowering.cpp
7237

Assuming the mask starts with a few undefs, this computes what the start of the mask would be, based on the first non-undef value.
The computation is done first in the pass to make sure the start is a non-negative value (hence the correctness comment below that "StartMask cannot be negative").
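
In other words, a sketch of the recovery step (hypothetical helper name, -1 standing in for undef; not the lowering code itself):

// Recover a lane's start value from its first defined element: if the
// first non-undef element sits at sub-position J with value V, the lane
// must have started at V - J. E.g. <undef, undef, 10, 11> gives 10 - 2 = 8.
static int inferStartMask(const int *LaneElts, unsigned LaneLen) {
  for (unsigned J = 0; J < LaneLen; ++J)
    if (LaneElts[J] >= 0)
      return LaneElts[J] - (int)J;
  return 0; // all-undef lane: any start value works
}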

test/CodeGen/ARM/arm-interleaved-accesses.ll
321

I'm not sure about this, TBH, and not sure how to verify it.
Should I replace it with a simple vst4.32 check?

rengolin edited edge metadata.Oct 14 2016, 12:37 PM

Hi Alina,

This is looking much better, thanks!

The code has a lot of undef handling, but not much in the way of testing it. I think we should have at least the following:

  • one and two undefs in the middle
  • one undef at the beginning and one at the end
  • all undefs in one lane
  • one undef in each lane, at different positions

Repeating the pattern of your current tests but adding undefs should be enough; for instance, something along the lines of the masks sketched below.
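
Purely illustrative shuffle masks for those four cases (assuming factor 4, lane length 4, and -1 for undef; not the masks from the actual tests):

// One or two undefs in the middle of a lane (lane 1 is missing j = 1 and j = 2).
int UndefMiddle[16]    = {0, 4, 8, 12,  1, -1, 9, 13,  2, -1, 10, 14,  3, 7, 11, 15};
// One undef at the beginning of the mask and one at the end.
int UndefEnds[16]      = {-1, 4, 8, 12,  1, 5, 9, 13,  2, 6, 10, 14,  3, 7, 11, -1};
// All undefs in one lane (lane 1).
int UndefWholeLane[16] = {0, -1, 8, 12,  1, -1, 9, 13,  2, -1, 10, 14,  3, -1, 11, 15};
// One undef in each lane, at different positions.
int UndefPerLane[16]   = {-1, 4, 8, 12,  1, -1, 9, 13,  2, 6, -1, 14,  3, 7, 11, -1};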

cheers,
--renato

lib/CodeGen/InterleavedAccessPass.cpp
188

Better to declare I and J inside the for declaration.

193

Nice, much better! I agree with renaming the lowering code, too.

200

A comment here would help...

// If both defined, only sequential values allowed
216

What about the case where the first lane is undef, but the others aren't?

232

Instead of using a SavedNonUndef above, you could save the last non-undef value and the number of undefs since that value. That'd make the next-value computation easier:

if (NextValue != SavedValue + NumUndefs)
  break;

and also help get the StartMask here, for free.

243

"Found an interleaved..."

lib/Target/AArch64/AArch64ISelLowering.cpp
7237

Right, I agree you could repeat the naming pattern above, here.

test/CodeGen/ARM/arm-interleaved-accesses.ll
321

Something like:

vst4.32 {d{{[0-9]+}}, d{{[0-9]+}}, d{{[0-9]+}}, d{{[0-9]+}}}, [r0]

would do. (I'm not sure of the triple brackets there...)

asbirlea updated this revision to Diff 74748.Oct 14 2016, 4:22 PM
asbirlea edited edge metadata.

Address comments. One pending.

Great point on the lack of testing. I wasn't happy with the coverage the vst4/st4 tests had, though.
I added a pattern for vst3/st3 that covers the undefs in the middle of a lane.

lib/CodeGen/InterleavedAccessPass.cpp
188

There's a check on I and J following each loop. I could add an additional flag to record that we broke out of the loop early, but that seemed like overkill when I and J can serve that purpose if declared outside the loop.

216

Nothing wrong with that (unless I'm missing something).
It'll check the correctness of the elements that follow, and the first one will receive a value based on the following values; that's the start mask value.

232

I'm still looking into this one.
I can do without SaveNonUndef and update the condition to something like "SavedLaneValue + SavedNoUndefs (+1)".
This needs an additional if clause in the loop to increment SavedNoUndefs, and at least one more check to help with computing the mask. The second check is needed because right now I only store SavedLaneValue when a value is followed by an undef, but at the end of the mask it would need updating too, to get the correct StartMask as something like SavedLaneValue + SavedNoUndefs - LaneLen (+/- 1).
Right now I find it easier to just compute the StartMask in the same j loop.
So, yeah, I'm still looking for the cleanest way to do this.
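
For reference, one possible shape of the bookkeeping being discussed, purely as a sketch (hypothetical helper, -1 for undef; not necessarily what the patch ends up doing):

// Walk one lane, remembering the last defined value and how many undefs
// have been seen since it, then recover the lane's StartMask at the end.
static bool laneIsConsecutive(const int *LaneElts, unsigned LaneLen,
                              int &StartMask) {
  int SavedValue = -1;      // last defined element seen
  int SavedPos = -1;        // its position within the lane
  unsigned SavedUndefs = 0; // undefs seen since SavedValue
  for (unsigned J = 0; J < LaneLen; ++J) {
    if (LaneElts[J] < 0) {  // undef
      ++SavedUndefs;
      continue;
    }
    if (SavedValue >= 0 && LaneElts[J] != SavedValue + (int)SavedUndefs + 1)
      return false;         // gap inside the lane: not consecutive
    SavedValue = LaneElts[J];
    SavedPos = (int)J;
    SavedUndefs = 0;
  }
  // Any defined element determines the start; an all-undef lane is free.
  StartMask = SavedPos >= 0 ? SavedValue - SavedPos : 0;
  return true;
}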

asbirlea updated this revision to Diff 75329.Oct 20 2016, 11:47 AM

Address remaining comment. Add additional testcases.

asbirlea updated this revision to Diff 75332.Oct 20 2016, 11:54 AM

[clang-format]

Gentle ping.

Re-pinging patch.

Pinging again.

rengolin accepted this revision.Dec 13 2016, 7:31 AM
rengolin edited edge metadata.

Hi,

Sorry to keep you waiting; this completely fell off my radar.

I think the code looks good now, just need to make sure the test is generic enough on the CHECK line (see inline comment).

cheers,
--renato

test/CodeGen/AArch64/aarch64-interleaved-accesses.ll
285

Please, also use:

st4.32 {d{{[0-9]+}}, d{{[0-9]+}}, d{{[0-9]+}}, d{{[0-9]+}}}, [x0]

here.

This revision is now accepted and ready to land.Dec 13 2016, 7:31 AM
asbirlea updated this revision to Diff 81255.Dec 13 2016, 10:25 AM
asbirlea edited edge metadata.

Update aarch64 test.

Thank you for the review, Renato!

This revision was automatically updated to reflect the committed changes.