This is an archive of the discontinued LLVM Phabricator instance.

Differential D22071

Correct ordering of loads/stores.
ClosedPublic

Authored by asbirlea on Jul 6 2016, 2:34 PM.

Details

Reviewers

• tstellarAMD
jlebar
llvm-commits
arsenm

Commits

rGcbc6ac2afd7b: Correct ordering of loads/stores.
rL275117: Correct ordering of loads/stores.

Summary

This patch aims to correct the ordering of loads/stores. It changes the insert point for loads to the position of the first load, and it adds a new ordering method that moves instructions before, rather than after, the load.
Updated testcases to reflect the changes.
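
As a rough illustration of "the position of the first load": the members of a load chain may not appear in program order, so the first and last members have to be located in the block before an insert point can be chosen. The sketch below does this for a toy block modeled as a vector of instruction names. The function `boundaryPositions` and the string-based model are hypothetical; they only loosely mirror what `getBoundaryInstrs` is documented to return in the diff, and are not the pass's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Returns the positions (in program order) of the first and last chain
// members found in Block. Assumes at least one chain member is present.
// Hypothetical toy model: instructions are just names in a vector.
std::pair<std::size_t, std::size_t>
boundaryPositions(const std::vector<std::string> &Block,
                  const std::set<std::string> &Chain) {
  std::size_t First = Block.size(), Last = 0;
  for (std::size_t Pos = 0; Pos < Block.size(); ++Pos) {
    if (!Chain.count(Block[Pos]))
      continue;
    if (First == Block.size())
      First = Pos; // Remember only the earliest chain member.
    Last = Pos;    // Keep updating so we end on the latest one.
  }
  assert(First != Block.size() && "chain has no member in this block");
  return {First, Last};
}
```

In this model, the change from `Builder.SetInsertPoint(&*Last)` to `Builder.SetInsertPoint(&*First)` in the diff corresponds to inserting at the first of the two returned positions instead of the second.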

Diff Detail

Repository: rL LLVM

Event Timeline

asbirlea updated this revision to Diff 62977. Jul 6 2016, 2:34 PM

asbirlea retitled this revision to "Correct ordering of loads/stores."

asbirlea updated this object.

asbirlea added reviewers: llvm-commits, jlebar, arsenm.

Herald added a reviewer: • tstellarAMD. Jul 6 2016, 2:34 PM

Herald added a subscriber: mzolotukhin.

asbirlea added a parent revision: D21935: Add TLI.allowsMisalignedMemoryAccesses to LoadStoreVectorizer. Jul 6 2016, 2:35 PM

Revert some test changes after correcting the condition testing for misaligned in D21935.

jlebar added inline comments. Jul 6 2016, 3:35 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
97 ↗	(On Diff #62977)	Where do we use reorderAfter? Should one of the calls to reorderBefore call reorderAfter? If so, why aren't some of the tests failing? :)
99 ↗	(On Diff #62977)	"&" should bind to the variable name. (It's helpful to me to run clang-format as part of "arc diff", so I don't ever forget this stuff. I have a wrapper script around arc that invokes git-clang-format before posting a change. https://github.com/jlebar/conf/blob/master/bin/arc In fact I call my wrapper "arc" so I don't even have to think about it. You'll need a file/symlink in the same directory as that script called arc-real, and you'll need git-clang-format on your path -- it's in the LLVM source tree at tools/clang/tools/clang-format/git-clang-format. I think you'll also need to add [clangformat] style = "file" to your .gitconfig. Easy, right? :)
342 ↗	(On Diff #62977)	Whitespace alignment here.
350 ↗	(On Diff #62977)	This is arbitrarily-deep recursion, which I think we tend to avoid, for fear of overflowing the stack given pessimal inputs. Can we write this with an explicit worklist instead?
356 ↗	(On Diff #62977)	Can we use llvm::SmallPtrSet? unordered_set is much slower and cache-inefficient.
359 ↗	(On Diff #62977)	Space around "=" (clang-format should take care of this, too). Here and elsewhere.
360 ↗	(On Diff #62977)	Do we know that I->getParent()->end() can't change when we insert new elements?
360 ↗	(On Diff #62977)	This only considers instructions in I's BB, but InstructionsToMove may contain other instructions, no? If so that may make this whole thing more complicated...
367 ↗	(On Diff #62977)	If you're going to do this, maybe we should assert that InstructionsToMove is empty after the loop?
367 ↗	(On Diff #62977)	Do we need the `&*`? I'd think it should work without that.
925 ↗	(On Diff #62977)	We use dyn_cast when the cast may fail and return null. But I think you don't want a null pointer inside InstrsToReorder, and we know that Bitcast is an Instruction, so I think you want plain cast<>.
test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
12 ↗	(On Diff #62986)	I don't quite get what the original code is testing here. Like, the adds are completely independent of the loads, right? If so, can we fix this test so it's not sensitive to implementation details?
test/Transforms/LoadStoreVectorizer/X86/correct-order.ll
16 ↗	(On Diff #62986)	Do all these tests need to be inside loops?
17 ↗	(On Diff #62986)	Do we have a test which checks that we don't reorder through phi nodes?
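
The request above to replace arbitrarily deep recursion with an explicit worklist can be sketched independently of LLVM. The following is a minimal illustration under stated assumptions: the graph is a toy adjacency list of integer node ids, and `collectTransitiveOperands` and `makeChain` are hypothetical names, not functions from the pass. The visited set here is a `std::set` only to keep the sketch free of LLVM headers.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy dependency graph: Operands[N] lists the nodes that node N uses.
// (Hypothetical stand-in for instruction operands; not an LLVM type.)
using Graph = std::vector<std::vector<int>>;

// Collects every node reachable from Root through operand edges, using an
// explicit worklist so that deep chains cannot overflow the call stack.
std::set<int> collectTransitiveOperands(const Graph &Operands, int Root) {
  std::set<int> Seen;
  std::vector<int> Worklist;
  Worklist.push_back(Root);
  while (!Worklist.empty()) {
    int N = Worklist.back();
    Worklist.pop_back();
    for (int Op : Operands[static_cast<std::size_t>(N)]) {
      // insert().second is false for nodes already visited, which also
      // guarantees termination if the graph contains a cycle.
      if (Seen.insert(Op).second)
        Worklist.push_back(Op);
    }
  }
  return Seen;
}

// Builds a chain in which node K uses node K-1, for K in [1, Depth).
Graph makeChain(int Depth) {
  Graph G(static_cast<std::size_t>(Depth));
  for (int K = 1; K < Depth; ++K)
    G[static_cast<std::size_t>(K)].push_back(K - 1);
  return G;
}
```

A recursive version of the same traversal would need one stack frame per link of the chain, which is the pessimal-input concern raised in the comment on line 350.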

Partially address comments.

Still looking into updating the tests and some of the comments I missed.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
97 ↗	(On Diff #62986)	Fair point. I need to figure out if there is a case when reorderAfter is needed for stores. In the meantime, removing it.
99 ↗	(On Diff #62986)	Yes. Postponing running clang-format until all other comments are addressed. I also have another patch on top of this which may lead to conflicts after formatting, in which case I will clang-format the next patch.
368 ↗	(On Diff #62986)	All this was removed, but I use the comments for reorderBefore.
928 ↗	(On Diff #62986)	Yes, updated.
test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
12 ↗	(On Diff #62986)	I'm not sure what the original purpose was. It looks to me it is intentionally testing an implementation detail ("insert_load_point").
test/Transforms/LoadStoreVectorizer/X86/correct-order.ll
17 ↗	(On Diff #62986)	I don't think so. Feel free to add one :).

OK, I'll have another look once you've figured the rest out. Just lmk.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
97 ↗	(On Diff #62998)	> in which case I will clang-format the next patch.
I can help you resolve the conflicts if it gets hairy (I have some ideas for rebase tricks), but it's a requirement each commit be properly clang-formatted. We can't commit improperly-formatted code.
test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
12 ↗	(On Diff #62998)	> It looks to me it is intentionally testing an implementation detail ("insert_load_point").
Looks like it to me, too. When we committed the original patches, we all agreed that we wouldn't act with a bias towards the existing code, since we committed with existing unresolved issues. I think this should count under that rubric. That is, can we fix the test so it no longer tests an implementation detail? I suppose you don't need to do that in this patch if you don't want.
test/Transforms/LoadStoreVectorizer/X86/correct-order.ll
18 ↗	(On Diff #62998)	You don't think it's worth adding a test as part of this patch? It seems relevant because we could otherwise get infinite recursion or something...

arsenm added inline comments. Jul 6 2016, 8:13 PM

test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
12 ↗	(On Diff #62998)	The pass should still have an expectation for where the instructions will be inserted relative to the originals. I think a test ensuring this is useful.

jlebar added inline comments. Jul 6 2016, 8:41 PM

test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
12 ↗	(On Diff #62998)	> The pass should still have an expectation for where the instructions will be inserted relative to the originals
I guess I am OK with this if we can articulate in the test file or the cpp file exactly what is the rule that we expect applies to our output. If we cannot articulate a rule, and instead we're just checking that the pass does what it currently does, I do not think that is a good test. The reason is that, without an articulation of the rule, if the test fails, we have no way to tell whether there's a bug or if the test just needs to be changed. (And if we can articulate a rule, it should go without saying that, inasmuch as reasonable, the test should check only for adherence to the rule, ignoring other ancillary properties of the output.)

arsenm added inline comments. Jul 6 2016, 11:55 PM

test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
12 ↗	(On Diff #62998)	Tests should generally have a comment explaining what they are testing anyway. This change was mostly why I added this test in the first place. If something is changing any behavior of the pass, a test should capture this. I don't understand the concern about wondering if it's a bug or the test needs an update; the point of having the test is that you have to look at the test output changes to verify that it is still correct.

jlebar added inline comments. Jul 7 2016, 8:09 AM

test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
12 ↗	(On Diff #62998)	> the point of having the test is you have to look at the test output changes to verify that it is still correct
My thesis is that this process is bug-prone. My evidence for this is that Alina found multiple tests in this test suite that had bugs -- tests that checked that some sequence of operations was vectorized when in fact it was not safe to vectorize it.
This motivates my suggestion, which is that, inasmuch as we can, we should avoid engaging in this process (by writing tests that are not fragile to uninteresting details), and, where we can't avoid the process entirely (maybe the details are interesting, and maybe that's the case here), we should write down explicitly what behavior we expect from the pass.
I don't mean to suggest that the bugs in the tests were the result of you being careless -- I didn't catch them either when I reviewed the patches. My point is just that every time humans have to look at new output and decide if it's correct, there is a chance that we'll overlook a bug. And based on this history, that chance is not negligible. Even if you disagree with my application of the evidence and think it's unlikely that the three of us would make such a bug, surely other maintainers may not be as scrupulous.
Thus my suggestion: If we can write down the behavior we expect from the pass -- "Vectorized loads should be inserted at the position of the first load, and instructions which were between the first and last load should be reordered preserving their relative order inasmuch as possible." (or whatever the actual rule is) -- then when the test fails, we can judge against that whether the test or the pass is broken. And if we have to update the test, we have some chance of creating the correct output ourselves, rather than just accepting the output created by the pass (which is more likely, in my judgement, to lead to us accepting buggy output, per above).
I think we're pretty close in what we want, honestly. The main difference, I think, is that I am saying that we should try not to test behavior for which we cannot articulate a rule. If the behavior is so incidental that we can't even say what it's supposed to do, I don't see why we'd want to enstone it as a test.
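
To make the "articulate a rule, then test only the rule" idea concrete, here is a small sketch of a checker for the "preserving their relative order" half of the rule quoted above. It works on a toy model in which an instruction stream is a vector of names; `preservesRelativeOrder` is a hypothetical helper, not part of the pass or of FileCheck. Entries of the output that are not tracked (for example, a newly inserted vector load) are ignored, so the check stays insensitive to ancillary details.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns true if every name in Tracked appears in Output, in the same
// relative order as in Tracked. Other entries of Output are ignored.
bool preservesRelativeOrder(const std::vector<std::string> &Tracked,
                            const std::vector<std::string> &Output) {
  auto Pos = Output.begin();
  for (const std::string &Name : Tracked) {
    // Search only from the previous match onward, so order is enforced.
    Pos = std::find(Pos, Output.end(), Name);
    if (Pos == Output.end())
      return false;
    ++Pos;
  }
  return true;
}
```

A sequence of plain `CHECK:` lines in a FileCheck test enforces a similar in-order property, which is why the comment stating the rule in the test file matters: it tells the reader which orderings are intended and which are incidental.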

Address comments.

I think most comments are addressed now. Let me know if I missed anything.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
97 ↗	(On Diff #63097)	Formatted.
343 ↗	(On Diff #63097)	Used a work-list for the other method.
test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
13 ↗	(On Diff #63097)	I tried to resolve this for now by adding the comment you suggested in both this and the other 2 tests checking the order is preserved.
test/Transforms/LoadStoreVectorizer/X86/correct-order.ll
17 ↗	(On Diff #63097)	Removed loops in all tests.
19 ↗	(On Diff #63097)	I'm not sure how to properly create one right now. I added a test that makes an attempt at that, but it in fact ensures there is no vectorization beyond basic blocks (and implicitly through a phi node).

Adding changes from dependent patches.

asbirlea added a child revision: D22119: Extended LoadStoreVectorizer to vectorize subchains. Jul 7 2016, 4:21 PM

Can you elaborate a bit more in the commit message exactly what bug we're fixing here? It's not immediately clear to me.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
95 ↗	(On Diff #63151)	Maybe "reorderUsers" or "moveUsers" would be a better name, especially since we no longer have reorderAfter? Or even just "reorder", I guess.
342 ↗	(On Diff #63151)	push_back
343 ↗	(On Diff #63151)	`!Worklist.empty()` (I think?)
347 ↗	(On Diff #63151)	`I = Worklist.pop_back_val();`
356 ↗	(On Diff #63151)	push_back
364 ↗	(On Diff #63151)	Since this is no longer recursive, we don't need the helper anymore? We can keep it if you think it's a useful way to break things down (I'm not sure it is), but then we should change the name so it's more descriptive and have it return the SmallPtrSet instead of take it by reference.
367 ↗	(On Diff #63151)	Could we call this iterator BBI, and then call the instruction to move IM? That would be consistent with the helper, and also would fix the problem of InstructionToMove and InstructionsToMove looking very similar.
374 ↗	(On Diff #63151)	Now that I think about this, could we just have a DEBUG() loop that checks that every element of InstructionsToMove is in the same BB as I? Then we don't need to erase from the set, which is overhead we don't need.
377 ↗	(On Diff #63151)	Nit, end sentences with periods. Also suggest swapping the order of the prepositional phrases; this order is awkward (although I can't articulate a rule :).
test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
7 ↗	(On Diff #63151)	Please reflow. Also thanks for adding this; I now understand why the test output is what it is. That makes me happy.

Address comments.

Mark as done.

jlebar added inline comments. Jul 8 2016, 3:14 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
354 ↗	(On Diff #63300)	Don't need DEBUG around the assert(). assert() is a macro and only evaluates its args in debug builds.
357 ↗	(On Diff #63300)	One of the things I'm bad at is evaluating the correctness of code that contains nits to be fixed. So...now that you've fixed the nits, I have a correctness concern, which I'm sorry I didn't see earlier. My concern is that we stop reordering when IM dominates IW. But it seems to me that we should stop reordering when IM dominates I, no? Because after we move IW up before I, it may no longer dominate its operands.

asbirlea added inline comments. Jul 8 2016, 5:03 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
357 ↗	(On Diff #63300)	Let me try to reason through this. The loop checks that every operand IM dominates IW. If that's not the case, IW should be moved before I. If that happens, IW is added to the worklist, so all its operands are checked and possibly moved before as well in the next iterations. Yes, all IM should implicitly dominate I as well, but that should be transitive through IW. Unless I missed something?

asbirlea added inline comments. Jul 8 2016, 5:15 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
357 ↗	(On Diff #63300)	What I missed is that instructions are not moved until after the loop. You're right.
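
The correctness concern in this exchange can be reproduced in a toy model. The sketch below assumes a single basic block in which instruction K sits at position K, so that "A dominates B" reduces to `A < B`; the real pass queries a DominatorTree instead. `collectToMove` is a hypothetical helper that runs the worklist with either reference point, so the two predicates can be compared side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy single-block model: instruction K sits at position K in the block, so
// "A dominates B" reduces to A < B. Ops[K] lists what instruction K uses.
// (Hypothetical model; the real pass queries a DominatorTree.)
using Operands = std::vector<std::vector<int>>;

// Collects the operands that must be hoisted above I. Nothing is moved while
// collecting, mirroring the fact that the pass moves instructions only after
// the worklist loop has finished.
std::set<int> collectToMove(const Operands &Ops, int I, bool CompareAgainstI) {
  std::set<int> ToMove;
  std::vector<int> Worklist{I};
  while (!Worklist.empty()) {
    int IW = Worklist.back();
    Worklist.pop_back();
    for (int IM : Ops[static_cast<std::size_t>(IW)]) {
      int Reference = CompareAgainstI ? I : IW;
      bool Dominates = IM < Reference;
      if (!Dominates && ToMove.insert(IM).second)
        Worklist.push_back(IM);
    }
  }
  return ToMove;
}
```

Consider a block where instruction 0 is I and uses instruction 2, and instruction 2 uses instruction 1. Comparing against IW selects only instruction 2, because instruction 1 already precedes instruction 2; once 2 is hoisted above I, its operand 1 is left behind it. Comparing against I selects both 1 and 2.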

Address comments.

jlebar added inline comments. Jul 11 2016, 9:46 AM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
357 ↗	(On Diff #63300)	Did you add a test for this? Phabricator isn't showing one, but I don't entirely trust it. If not, can you add one?

jlebar mentioned this in D22119: Extended LoadStoreVectorizer to vectorize subchains. Jul 11 2016, 10:45 AM

Add testcase.

asbirlea added inline comments. Jul 11 2016, 12:45 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
357–368 ↗	(On Diff #63555)	Added a testcase now. The test fails without the above change (replacing I with IW in the dominators check).

jlebar accepted this revision. Jul 11 2016, 12:53 PM

jlebar edited edge metadata.

This revision is now accepted and ready to land. Jul 11 2016, 12:53 PM

Update to latest.

Closed by commit rL275117: Correct ordering of loads/stores. (authored by asbirlea). Jul 11 2016, 3:41 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

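Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp	48 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll	6 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/correct-order.ll	26 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/preserve-order32.ll	7 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/preserve-order64.ll	56 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/subchain-interleaved.ll	91 lines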

Diff 63601

llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

[... 82 lines not shown ...]	unsigned getAlignment(StoreInst *SI) const {
if (Align != 0)		if (Align != 0)
return Align;		return Align;

return DL.getABITypeAlignment(SI->getValueOperand()->getType());		return DL.getABITypeAlignment(SI->getValueOperand()->getType());
}		}

bool isConsecutiveAccess(Value *A, Value *B);		bool isConsecutiveAccess(Value *A, Value *B);

/// Reorders the users of I after vectorization to ensure that I dominates its		/// After vectorization, reorder the instructions that I depends on
/// users.		/// (the instructions defining its operands), to ensure they dominate I.
void reorder(Instruction *I);		void reorder(Instruction *I);

/// Returns the first and the last instructions in Chain.		/// Returns the first and the last instructions in Chain.
std::pair<BasicBlock::iterator, BasicBlock::iterator>		std::pair<BasicBlock::iterator, BasicBlock::iterator>
getBoundaryInstrs(ArrayRef<Value *> Chain);		getBoundaryInstrs(ArrayRef<Value *> Chain);

/// Erases the original instructions after vectorizing.		/// Erases the original instructions after vectorizing.
void eraseInstructions(ArrayRef<Value *> Chain);		void eraseInstructions(ArrayRef<Value *> Chain);
[... 228 lines not shown ...]	bool Vectorizer::isConsecutiveAccess(Value *A, Value *B) {
const SCEV *OffsetSCEVA = SE.getSCEV(OpA);		const SCEV *OffsetSCEVA = SE.getSCEV(OpA);
const SCEV *OffsetSCEVB = SE.getSCEV(OpB);		const SCEV *OffsetSCEVB = SE.getSCEV(OpB);
const SCEV *One = SE.getConstant(APInt(BitWidth, 1));		const SCEV *One = SE.getConstant(APInt(BitWidth, 1));
const SCEV *X2 = SE.getAddExpr(OffsetSCEVA, One);		const SCEV *X2 = SE.getAddExpr(OffsetSCEVA, One);
return X2 == OffsetSCEVB;		return X2 == OffsetSCEVB;
}		}

void Vectorizer::reorder(Instruction *I) {		void Vectorizer::reorder(Instruction *I) {
Instruction *InsertAfter = I;		SmallPtrSet<Instruction *, 16> InstructionsToMove;
for (User *U : I->users()) {		SmallVector<Instruction *, 16> Worklist;
Instruction *User = dyn_cast<Instruction>(U);
if (!User || User->getOpcode() == Instruction::PHI)		Worklist.push_back(I);
		while (!Worklist.empty()) {
		Instruction *IW = Worklist.pop_back_val();
		int NumOperands = IW->getNumOperands();
		for (int i = 0; i < NumOperands; i++) {
		Instruction *IM = dyn_cast<Instruction>(IW->getOperand(i));
		if (!IM || IM->getOpcode() == Instruction::PHI)
continue;		continue;

if (!DT.dominates(I, User)) {		if (!DT.dominates(IM, I)) {
User->removeFromParent();		InstructionsToMove.insert(IM);
User->insertAfter(InsertAfter);		Worklist.push_back(IM);
InsertAfter = User;		assert(IM->getParent() == IW->getParent() &&
reorder(User);		"Instructions to move should be in the same basic block");
		}
		}
}		}

		// All instructions to move should follow I. Start from I, not from begin().
		for (auto BBI = I->getIterator(), E = I->getParent()->end(); BBI != E;
		++BBI) {
		if (!is_contained(InstructionsToMove, &*BBI))
		continue;
		Instruction *IM = &*BBI;
		--BBI;
		IM->removeFromParent();
		IM->insertBefore(I);
}		}
}		}

std::pair<BasicBlock::iterator, BasicBlock::iterator>		std::pair<BasicBlock::iterator, BasicBlock::iterator>
Vectorizer::getBoundaryInstrs(ArrayRef<Value *> Chain) {		Vectorizer::getBoundaryInstrs(ArrayRef<Value *> Chain) {
Instruction *C0 = cast<Instruction>(Chain[0]);		Instruction *C0 = cast<Instruction>(Chain[0]);
BasicBlock::iterator FirstInstr = C0->getIterator();		BasicBlock::iterator FirstInstr = C0->getIterator();
BasicBlock::iterator LastInstr = C0->getIterator();		BasicBlock::iterator LastInstr = C0->getIterator();
[... 489 lines not shown ...]	bool Vectorizer::vectorizeLoadChain(ArrayRef<Value *> Chain) {

BasicBlock::iterator First, Last;		BasicBlock::iterator First, Last;
std::tie(First, Last) = getBoundaryInstrs(Chain);		std::tie(First, Last) = getBoundaryInstrs(Chain);

if (!isVectorizable(Chain, First, Last))		if (!isVectorizable(Chain, First, Last))
return false;		return false;

// Set insert point.		// Set insert point.
Builder.SetInsertPoint(&*Last);		Builder.SetInsertPoint(&*First);

Value *Bitcast =		Value *Bitcast =
Builder.CreateBitCast(L0->getPointerOperand(), VecTy->getPointerTo(AS));		Builder.CreateBitCast(L0->getPointerOperand(), VecTy->getPointerTo(AS));

LoadInst *LI = cast<LoadInst>(Builder.CreateLoad(Bitcast));		LoadInst *LI = cast<LoadInst>(Builder.CreateLoad(Bitcast));
propagateMetadata(LI, Chain);		propagateMetadata(LI, Chain);
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);

if (VecLoadTy) {		if (VecLoadTy) {
SmallVector<Instruction *, 16> InstrsToErase;		SmallVector<Instruction *, 16> InstrsToErase;
SmallVector<Instruction *, 16> InstrsToReorder;		SmallVector<Instruction *, 16> InstrsToReorder;
		InstrsToReorder.push_back(cast<Instruction>(Bitcast));

unsigned VecWidth = VecLoadTy->getNumElements();		unsigned VecWidth = VecLoadTy->getNumElements();
for (unsigned I = 0, E = Chain.size(); I != E; ++I) {		for (unsigned I = 0, E = Chain.size(); I != E; ++I) {
for (auto Use : Chain[I]->users()) {		for (auto Use : Chain[I]->users()) {
Instruction *UI = cast<Instruction>(Use);		Instruction *UI = cast<Instruction>(Use);
unsigned Idx = cast<ConstantInt>(UI->getOperand(1))->getZExtValue();		unsigned Idx = cast<ConstantInt>(UI->getOperand(1))->getZExtValue();
unsigned NewIdx = Idx + I * VecWidth;		unsigned NewIdx = Idx + I * VecWidth;
Value *V = Builder.CreateExtractElement(LI, Builder.getInt32(NewIdx));		Value *V = Builder.CreateExtractElement(LI, Builder.getInt32(NewIdx));
Instruction *Extracted = cast<Instruction>(V);		Instruction *Extracted = cast<Instruction>(V);
if (Extracted->getType() != UI->getType())		if (Extracted->getType() != UI->getType())
Extracted = cast<Instruction>(		Extracted = cast<Instruction>(
Builder.CreateBitCast(Extracted, UI->getType()));		Builder.CreateBitCast(Extracted, UI->getType()));

// Replace the old instruction.		// Replace the old instruction.
UI->replaceAllUsesWith(Extracted);		UI->replaceAllUsesWith(Extracted);
InstrsToReorder.push_back(Extracted);
InstrsToErase.push_back(UI);		InstrsToErase.push_back(UI);
}		}
}		}

for (Instruction *ModUser : InstrsToReorder)		for (Instruction *ModUser : InstrsToReorder)
reorder(ModUser);		reorder(ModUser);

for (auto I : InstrsToErase)		for (auto I : InstrsToErase)
I->eraseFromParent();		I->eraseFromParent();
} else {		} else {
SmallVector<Instruction *, 16> InstrsToReorder;		SmallVector<Instruction *, 16> InstrsToReorder;
		InstrsToReorder.push_back(cast<Instruction>(Bitcast));

for (unsigned I = 0, E = Chain.size(); I != E; ++I) {		for (unsigned I = 0, E = Chain.size(); I != E; ++I) {
Value *V = Builder.CreateExtractElement(LI, Builder.getInt32(I));		Value *V = Builder.CreateExtractElement(LI, Builder.getInt32(I));
Instruction *Extracted = cast<Instruction>(V);		Instruction *Extracted = cast<Instruction>(V);
Instruction *UI = cast<Instruction>(Chain[I]);		Instruction *UI = cast<Instruction>(Chain[I]);
if (Extracted->getType() != UI->getType()) {		if (Extracted->getType() != UI->getType()) {
Extracted = cast<Instruction>(		Extracted = cast<Instruction>(
Builder.CreateBitOrPointerCast(Extracted, UI->getType()));		Builder.CreateBitOrPointerCast(Extracted, UI->getType()));
}		}

// Replace the old instruction.		// Replace the old instruction.
UI->replaceAllUsesWith(Extracted);		UI->replaceAllUsesWith(Extracted);
InstrsToReorder.push_back(Extracted);
}		}

for (Instruction *ModUser : InstrsToReorder)		for (Instruction *ModUser : InstrsToReorder)
reorder(ModUser);		reorder(ModUser);
}		}

eraseInstructions(Chain);		eraseInstructions(Chain);

[... 14 lines not shown ...]
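
The second loop of the new `reorder` (walk forward from I, hoist each collected instruction to just before I, and step the iterator back before moving) can be tried out on a toy block. This is a sketch under assumptions, not the pass itself: the block is a `std::list` of instruction names, and `moveBefore` is a hypothetical helper. `std::list` is chosen because splicing one element does not invalidate iterators to other elements or to `end()`, which is the property the review asked about for `I->getParent()->end()`.

```cpp
#include <cassert>
#include <list>
#include <set>
#include <string>

// Toy basic block: an ordered list of instruction names.
using Block = std::list<std::string>;

// Moves every element of ToMove that follows I to just before I, keeping the
// moved elements in their original relative order. Assumes I is a valid
// iterator into BB and that I itself is not in ToMove.
void moveBefore(Block &BB, Block::iterator I,
                const std::set<std::string> &ToMove) {
  assert(!ToMove.count(*I) && "I itself must not be in the set to move");
  for (auto BBI = I, E = BB.end(); BBI != E; ++BBI) {
    if (!ToMove.count(*BBI))
      continue;
    auto IM = BBI;
    --BBI; // Step back first so the loop's ++BBI lands on IM's old successor.
    BB.splice(I, BB, IM); // Relink IM immediately before I.
  }
}
```

Because each hoisted element is placed immediately before I, and elements are visited in program order, later hoisted elements end up after earlier ones, which preserves their relative order.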

llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll

	; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s			; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s

	target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"			target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"

	; Check relative position of the inserted vector load relative to the			; Check relative position of the inserted vector load relative to the existing
	; existing adds.			; adds. Vectorized loads should be inserted at the position of the first load.

	; CHECK-LABEL: @insert_load_point(			; CHECK-LABEL: @insert_load_point(
	; CHECK: %z = add i32 %x, 4			; CHECK: %z = add i32 %x, 4
	; CHECK: %w = add i32 %y, 9
	; CHECK: load <2 x float>			; CHECK: load <2 x float>
				; CHECK: %w = add i32 %y, 9
	; CHECK: %foo = add i32 %z, %w			; CHECK: %foo = add i32 %z, %w
	define void @insert_load_point(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b, float addrspace(1)* nocapture readonly %c, i64 %idx, i32 %x, i32 %y) #0 {			define void @insert_load_point(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b, float addrspace(1)* nocapture readonly %c, i64 %idx, i32 %x, i32 %y) #0 {
	entry:			entry:
	%a.idx.x = getelementptr inbounds float, float addrspace(1)* %a, i64 %idx			%a.idx.x = getelementptr inbounds float, float addrspace(1)* %a, i64 %idx
	%c.idx.x = getelementptr inbounds float, float addrspace(1)* %c, i64 %idx			%c.idx.x = getelementptr inbounds float, float addrspace(1)* %c, i64 %idx
	%a.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %a.idx.x, i64 1			%a.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %a.idx.x, i64 1
	%c.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %c.idx.x, i64 1			%c.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %c.idx.x, i64 1

	[... 43 lines not shown ...]

llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/correct-order.ll

				; RUN: opt -mtriple=x86-linux -load-store-vectorizer -S -o - %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; CHECK-LABEL: @correct_order(
				; CHECK: bitcast i32*
				; CHECK: load <2 x i32>
				; CHECK: load i32
				; CHECK: bitcast i32*
				; CHECK: store <2 x i32>
				; CHECK: load i32
				define void @correct_order(i32* noalias %ptr) {
				%next.gep = getelementptr i32, i32* %ptr, i64 0
				%next.gep1 = getelementptr i32, i32* %ptr, i64 1
				%next.gep2 = getelementptr i32, i32* %ptr, i64 2

				%l1 = load i32, i32* %next.gep1, align 4
				%l2 = load i32, i32* %next.gep, align 4
				store i32 0, i32* %next.gep1, align 4
				store i32 0, i32* %next.gep, align 4
				%l3 = load i32, i32* %next.gep1, align 4
				%l4 = load i32, i32* %next.gep2, align 4

				ret void
				}

llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/preserve-order32.ll

	; RUN: opt -mtriple=x86-linux -load-store-vectorizer -S -o - %s | FileCheck %s			; RUN: opt -mtriple=x86-linux -load-store-vectorizer -S -o - %s | FileCheck %s

	target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"			target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"

	%struct.buffer_t = type { i32, i8* }			%struct.buffer_t = type { i32, i8* }

	; Check an i32 and i8* get vectorized, and that			; Check an i32 and i8* get vectorized, and that the two accesses
	; the two accesses (load into buff.val and store to buff.p) preserve their order.			; (load into buff.val and store to buff.p) preserve their order.
				; Vectorized loads should be inserted at the position of the first load,
				; and instructions which were between the first and last load should be
				; reordered preserving their relative order inasmuch as possible.

	; CHECK-LABEL: @preserve_order_32(			; CHECK-LABEL: @preserve_order_32(
	; CHECK: load <2 x i32>			; CHECK: load <2 x i32>
	; CHECK: %buff.val = load i8			; CHECK: %buff.val = load i8
	; CHECK: store i8 0			; CHECK: store i8 0
	define void @preserve_order_32(%struct.buffer_t* noalias %buff) #0 {			define void @preserve_order_32(%struct.buffer_t* noalias %buff) #0 {
	entry:			entry:
	%tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i32 0, i32 1			%tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i32 0, i32 1
	[... 9 lines not shown ...]

llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/preserve-order64.ll

	; RUN: opt -mtriple=x86-linux -load-store-vectorizer -S -o - %s | FileCheck %s			; RUN: opt -mtriple=x86-linux -load-store-vectorizer -S -o - %s | FileCheck %s

	target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

	%struct.buffer_t = type { i64, i8* }			%struct.buffer_t = type { i64, i8* }
				%struct.nested.buffer = type { %struct.buffer_t, %struct.buffer_t }

	; Check an i64 and i8* get vectorized, and that			; Check an i64 and i8* get vectorized, and that the two accesses
	; the two accesses (load into buff.val and store to buff.p) preserve their order.			; (load into buff.val and store to buff.p) preserve their order.
				; Vectorized loads should be inserted at the position of the first load,
				; and instructions which were between the first and last load should be
				; reordered preserving their relative order inasmuch as possible.

	; CHECK-LABEL: @preserve_order_64(			; CHECK-LABEL: @preserve_order_64(
	; CHECK: load <2 x i64>			; CHECK: load <2 x i64>
	; CHECK: %buff.val = load i8			; CHECK: %buff.val = load i8
	; CHECK: store i8 0			; CHECK: store i8 0
	define void @preserve_order_64(%struct.buffer_t* noalias %buff) #0 {			define void @preserve_order_64(%struct.buffer_t* noalias %buff) #0 {
	entry:			entry:
	%tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1			%tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1
	%buff.p = load i8*, i8** %tmp1, align 8			%buff.p = load i8*, i8** %tmp1, align 8
	%buff.val = load i8, i8* %buff.p, align 8			%buff.val = load i8, i8* %buff.p, align 8
	store i8 0, i8* %buff.p, align 8			store i8 0, i8* %buff.p, align 8
	%tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0			%tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0
	%buff.int = load i64, i64* %tmp0, align 8			%buff.int = load i64, i64* %tmp0, align 8
	ret void			ret void
	}			}

				; Check reordering recurses correctly.

				; CHECK-LABEL: @transitive_reorder(
				; CHECK: load <2 x i64>
				; CHECK: %buff.val = load i8
				; CHECK: store i8 0
				define void @transitive_reorder(%struct.buffer_t* noalias %buff, %struct.nested.buffer* noalias %nest) #0 {
				entry:
				%nest0_0 = getelementptr inbounds %struct.nested.buffer, %struct.nested.buffer* %nest, i64 0, i32 0
				%tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %nest0_0, i64 0, i32 1
				%buff.p = load i8*, i8** %tmp1, align 8
				%buff.val = load i8, i8* %buff.p, align 8
				store i8 0, i8* %buff.p, align 8
				%nest1_0 = getelementptr inbounds %struct.nested.buffer, %struct.nested.buffer* %nest, i64 0, i32 0
				%tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %nest1_0, i64 0, i32 0
				%buff.int = load i64, i64* %tmp0, align 8
				ret void
				}

				; Check for no vectorization over phi node

				; CHECK-LABEL: @no_vect_phi(
				; CHECK: load i8*
				; CHECK: load i8
				; CHECK: store i8 0
				; CHECK: load i64
				define void @no_vect_phi(i32* noalias %ptr, %struct.buffer_t* noalias %buff) {
				entry:
				%tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1
				%buff.p = load i8*, i8** %tmp1, align 8
				%buff.val = load i8, i8* %buff.p, align 8
				store i8 0, i8* %buff.p, align 8
				br label %"for something"

				"for something":
				%index = phi i64 [ 0, %entry ], [ %index.next, %"for something" ]

				%tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0
				%buff.int = load i64, i64* %tmp0, align 8

				%index.next = add i64 %index, 8
				%cmp_res = icmp eq i64 %index.next, 8
				br i1 %cmp_res, label %ending, label %"for something"

				ending:
				ret void
				}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/subchain-interleaved.ll

				; RUN: opt -mtriple=x86-linux -load-store-vectorizer -S -o - %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; Vectorized subsets of the load/store chains in the presence of
				; interleaved loads/stores

				; CHECK-LABEL: @interleave_2L_2S(
				; CHECK: load <2 x i32>
				; CHECK: load i32
				; CHECK: store <2 x i32>
				; CHECK: load i32
				define void @interleave_2L_2S(i32* noalias %ptr) {
				%next.gep = getelementptr i32, i32* %ptr, i64 0
				%next.gep1 = getelementptr i32, i32* %ptr, i64 1
				%next.gep2 = getelementptr i32, i32* %ptr, i64 2

				%l1 = load i32, i32* %next.gep1, align 4
				%l2 = load i32, i32* %next.gep, align 4
				store i32 0, i32* %next.gep1, align 4
				store i32 0, i32* %next.gep, align 4
				%l3 = load i32, i32* %next.gep1, align 4
				%l4 = load i32, i32* %next.gep2, align 4

				ret void
				}

				; CHECK-LABEL: @interleave_3L_2S_1L(
				; CHECK: load <3 x i32>
				; CHECK: store <2 x i32>
				; CHECK: load i32

				define void @interleave_3L_2S_1L(i32* noalias %ptr) {
				%next.gep = getelementptr i32, i32* %ptr, i64 0
				%next.gep1 = getelementptr i32, i32* %ptr, i64 1
				%next.gep2 = getelementptr i32, i32* %ptr, i64 2

				%l2 = load i32, i32* %next.gep, align 4
				%l1 = load i32, i32* %next.gep1, align 4
				store i32 0, i32* %next.gep1, align 4
				store i32 0, i32* %next.gep, align 4
				%l3 = load i32, i32* %next.gep1, align 4
				%l4 = load i32, i32* %next.gep2, align 4

				ret void
				}

				; CHECK-LABEL: @chain_suffix(
				; CHECK: load i32
				; CHECK: store <2 x i32>
				; CHECK: load i32
				; CHECK: load i32
				define void @chain_suffix(i32* noalias %ptr) {
				%next.gep = getelementptr i32, i32* %ptr, i64 0
				%next.gep1 = getelementptr i32, i32* %ptr, i64 1
				%next.gep2 = getelementptr i32, i32* %ptr, i64 2

				%l2 = load i32, i32* %next.gep, align 4
				store i32 0, i32* %next.gep1, align 4
				store i32 0, i32* %next.gep, align 4
				%l3 = load i32, i32* %next.gep1, align 4
				%l4 = load i32, i32* %next.gep2, align 4

				ret void
				}


				; CHECK-LABEL: @chain_prefix_suffix(
				; CHECK: load i32
				; CHECK: load i32
				; CHECK: store <2 x i32>
				; CHECK: load i32
				; CHECK: load i32
				; CHECK: load i32
				define void @chain_prefix_suffix(i32* noalias %ptr) {
				%next.gep = getelementptr i32, i32* %ptr, i64 0
				%next.gep1 = getelementptr i32, i32* %ptr, i64 1
				%next.gep2 = getelementptr i32, i32* %ptr, i64 2
				%next.gep3 = getelementptr i32, i32* %ptr, i64 3

				%l1 = load i32, i32* %next.gep, align 4
				%l2 = load i32, i32* %next.gep1, align 4
				store i32 0, i32* %next.gep1, align 4
				store i32 0, i32* %next.gep2, align 4
				%l3 = load i32, i32* %next.gep1, align 4
				%l4 = load i32, i32* %next.gep2, align 4
				%l5 = load i32, i32* %next.gep3, align 4

				ret void
				}

Revision Contents

Diff 63601