This is an archive of the discontinued LLVM Phabricator instance.

Differential D49342

[LSV] Refactoring + supporting bitcasts to a type of different size
ClosedPublic

Authored by rtereshin on Jul 14 2018, 10:08 AM.

Details

Reviewers

volkan
arsenm
rampitec

Commits

rL337489: [LSV] Refactoring + supporting bitcasts to a type of different size

Summary

This is mostly preparatory work for adding limited support for select instructions.
That proved difficult because of the size and irregularity of Vectorizer::isConsecutiveAccess,
which I believe this patch fixes.

It also turned out that these changes make it simpler to finish one of the TODOs and fix a number of other small issues, namely:

1. Looking through bitcasts to a type of a different size (this requires careful tracking of the original load/store size and some math converting sizes in bytes to expected differences in GEP indices).
2. Reusing the partial analysis of pointers done by the first attempt at proving them consecutive instead of starting from scratch. This adds limited support for nested GEPs co-existing with difficult sext/zext instructions. It also requires careful handling of negative differences between the constant parts of offsets.
3. Handling the case where the first pointer index is not an add but something else (a function parameter, for instance).
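
The conversion from a size in bytes to an expected difference in GEP indices, including the negative case, can be sketched roughly as follows. This is only an illustration: `IndexDiff` and `indexDiffFromByteDelta` are hypothetical names that do not appear in the patch, and plain 64-bit integers stand in for the `APInt` values the vectorizer works with.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type, not part of LoadStoreVectorizer.cpp.
struct IndexDiff {
  uint64_t Diff; // expected difference between the last GEP indices
  bool Swapped;  // true if the roles of the two pointers were exchanged
};

// Converts a byte delta between two pointers into a difference of indices
// for elements of Stride bytes. Returns false if no such difference exists.
bool indexDiffFromByteDelta(int64_t PtrDelta, uint64_t Stride,
                            IndexDiff &Out) {
  assert(Stride != 0 && "element size must be non-zero");
  bool Swapped = false;
  if (PtrDelta < 0) {
    // The most negative value cannot be negated without overflow.
    if (PtrDelta == INT64_MIN)
      return false;
    PtrDelta = -PtrDelta;
    Swapped = true;
  }
  uint64_t Delta = static_cast<uint64_t>(PtrDelta);
  // A delta that is not a whole number of elements cannot be expressed
  // as a difference of indices.
  if (Delta % Stride != 0)
    return false;
  Out = IndexDiff{Delta / Stride, Swapped};
  return true;
}
```

For example, a delta of 8 bytes between pointers into an i16 array (stride 2) corresponds to an index difference of 4, a delta of -8 gives the same difference with the two pointers exchanged, and a delta of 3 is rejected.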

I observe an increased number of successful vectorizations on a large set of shader programs. Only a few shaders are affected, but those that are have more than 5% fewer loads and stores than before the patch.

The selects patch is coming soon.

This is related to but independent of https://reviews.llvm.org/D48853 ("[SCEV] Add zext(C + x + ...) -> D + zext(C-D + x + ...)<nuw> transform"), also improving LoadStoreVectorizer.

Diff Detail

Repository: rL LLVM

Event Timeline

rtereshin created this revision. Jul 14 2018, 10:08 AM

Herald added subscribers: javed.absar, nhaehnle, wdng. Jul 14 2018, 10:08 AM

rampitec added inline comments. Jul 16 2018, 10:32 AM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
393	I do not think this can happen. There is a convention to have constant operand always at the last position in a commutative operation. Do you see it in a real world example or just in the artificially crafted testcase?

rtereshin added inline comments. Jul 16 2018, 10:54 AM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
393	In the artificial one only. I will remove the case from the patch and the test.

Removing the "4. Handling a case where the sext/zext's operand is not add %x, C, but add C, %x, where C is a constant." case handling code and updating the test correspondingly.

LGTM

This revision is now accepted and ready to land. Jul 16 2018, 2:20 PM

arsenm added inline comments. Jul 16 2018, 2:24 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
339	Needs better name
test/Transforms/LoadStoreVectorizer/AMDGPU/gep-bitcast.ll
62	Why was the nuw dropped here?

rtereshin added inline comments. Jul 16 2018, 2:55 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
339	Any suggestions?
test/Transforms/LoadStoreVectorizer/AMDGPU/gep-bitcast.ll
62	We don't need to know that `add i32 %base, C0` doesn't wrap if we know that `add n[su]w i32 %base, C1` doesn't wrap and `0 <= C0 <= C1`. `Vectorizer::isConsecutiveAccess` is able to notice and exploit this fact in some cases (w/ and w/o this patch). I don't want to regress on it. Perhaps it's better if we also change the offsets in this test from 0, 4, 8, 12 to something non-zero based, like 4, 8, 12, 16.
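
The no-wrap argument in the comment above can be checked exhaustively on a small bit width. The sketch below covers only the unsigned (nuw) case and uses 8-bit values so that every combination can be tried; `addNoUnsignedWrap` and `countCounterexamples` are hypothetical helpers written for this illustration, not LLVM functions.

```cpp
#include <cassert>
#include <cstdint>

// True if Base + C does not wrap in 8-bit unsigned arithmetic, i.e. the
// property that an `add nuw i8 Base, C` would assert.
bool addNoUnsignedWrap(uint8_t Base, uint8_t C) {
  return static_cast<unsigned>(Base) + C <= UINT8_MAX;
}

// Tries every Base, C1 and C0 with 0 <= C0 <= C1 and counts the cases where
// Base + C1 does not wrap but Base + C0 does. The claim is that none exist.
unsigned countCounterexamples() {
  unsigned Bad = 0;
  for (unsigned Base = 0; Base <= UINT8_MAX; ++Base) {
    for (unsigned C1 = 0; C1 <= UINT8_MAX; ++C1) {
      if (!addNoUnsignedWrap(Base, C1))
        continue;
      for (unsigned C0 = 0; C0 <= C1; ++C0)
        if (!addNoUnsignedWrap(Base, C0))
          ++Bad;
    }
  }
  return Bad;
}
```

`countCounterexamples()` returns 0, which matches the reasoning that a no-wrap guarantee for the larger constant carries over to any smaller non-negative one.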

Renaming tryHarder to lookThroughComplexAddresses
Adding nuw flag back to the first pointer in the pre-existing test

I've noticed that one of the tests I've added partly tests what I wanted to test anyway, so this is fine.

rtereshin added a child revision: D49428: [LSV] Look through selects for consecutive addresses. Jul 17 2018, 9:24 AM

rtereshin closed this revision. Jul 19 2018, 2:19 PM

rtereshin added a commit: rL337489: [LSV] Refactoring + supporting bitcasts to a type of different size.

Revision Contents

Path	Size
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp	111 lines
test/CodeGen/X86/loadStore_vectorizer.ll	18 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/gep-bitcast.ll	23 lines

Diff 155558

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

[112 lines not shown]	public:
Vectorizer(Function &F, AliasAnalysis &AA, DominatorTree &DT,		Vectorizer(Function &F, AliasAnalysis &AA, DominatorTree &DT,
ScalarEvolution &SE, TargetTransformInfo &TTI)		ScalarEvolution &SE, TargetTransformInfo &TTI)
: F(F), AA(AA), DT(DT), SE(SE), TTI(TTI),		: F(F), AA(AA), DT(DT), SE(SE), TTI(TTI),
DL(F.getParent()->getDataLayout()), Builder(SE.getContext()) {}		DL(F.getParent()->getDataLayout()), Builder(SE.getContext()) {}

bool run();		bool run();

private:		private:
GetElementPtrInst *getSourceGEP(Value *Src) const;

unsigned getPointerAddressSpace(Value *I);		unsigned getPointerAddressSpace(Value *I);

unsigned getAlignment(LoadInst *LI) const {		unsigned getAlignment(LoadInst *LI) const {
unsigned Align = LI->getAlignment();		unsigned Align = LI->getAlignment();
if (Align != 0)		if (Align != 0)
return Align;		return Align;

return DL.getABITypeAlignment(LI->getType());		return DL.getABITypeAlignment(LI->getType());
}		}

unsigned getAlignment(StoreInst *SI) const {		unsigned getAlignment(StoreInst *SI) const {
unsigned Align = SI->getAlignment();		unsigned Align = SI->getAlignment();
if (Align != 0)		if (Align != 0)
return Align;		return Align;

return DL.getABITypeAlignment(SI->getValueOperand()->getType());		return DL.getABITypeAlignment(SI->getValueOperand()->getType());
}		}

bool isConsecutiveAccess(Value *A, Value *B);		bool isConsecutiveAccess(Value *A, Value *B);
		bool areConsecutivePointers(Value *PtrA, Value *PtrB, APInt Size);
		bool tryHarder(Value *PtrA, Value *PtrB, APInt PtrDelta);

/// After vectorization, reorder the instructions that I depends on		/// After vectorization, reorder the instructions that I depends on
/// (the instructions defining its operands), to ensure they dominate I.		/// (the instructions defining its operands), to ensure they dominate I.
void reorder(Instruction *I);		void reorder(Instruction *I);

/// Returns the first and the last instructions in Chain.		/// Returns the first and the last instructions in Chain.
std::pair<BasicBlock::iterator, BasicBlock::iterator>		std::pair<BasicBlock::iterator, BasicBlock::iterator>
getBoundaryInstrs(ArrayRef<Instruction *> Chain);		getBoundaryInstrs(ArrayRef<Instruction *> Chain);
[122 lines not shown]
unsigned Vectorizer::getPointerAddressSpace(Value *I) {		unsigned Vectorizer::getPointerAddressSpace(Value *I) {
if (LoadInst *L = dyn_cast<LoadInst>(I))		if (LoadInst *L = dyn_cast<LoadInst>(I))
return L->getPointerAddressSpace();		return L->getPointerAddressSpace();
if (StoreInst *S = dyn_cast<StoreInst>(I))		if (StoreInst *S = dyn_cast<StoreInst>(I))
return S->getPointerAddressSpace();		return S->getPointerAddressSpace();
return -1;		return -1;
}		}

GetElementPtrInst *Vectorizer::getSourceGEP(Value *Src) const {
// First strip pointer bitcasts. Make sure pointee size is the same with
// and without casts.
// TODO: a stride set by the add instruction below can match the difference
// in pointee type size here. Currently it will not be vectorized.
Value *SrcPtr = getLoadStorePointerOperand(Src);
Value *SrcBase = SrcPtr->stripPointerCasts();
Type *SrcPtrType = SrcPtr->getType()->getPointerElementType();
Type *SrcBaseType = SrcBase->getType()->getPointerElementType();
if (SrcPtrType->isSized() && SrcBaseType->isSized() &&
DL.getTypeStoreSize(SrcPtrType) == DL.getTypeStoreSize(SrcBaseType))
SrcPtr = SrcBase;
return dyn_cast<GetElementPtrInst>(SrcPtr);
}

// FIXME: Merge with llvm::isConsecutiveAccess		// FIXME: Merge with llvm::isConsecutiveAccess
bool Vectorizer::isConsecutiveAccess(Value *A, Value *B) {		bool Vectorizer::isConsecutiveAccess(Value *A, Value *B) {
Value *PtrA = getLoadStorePointerOperand(A);		Value *PtrA = getLoadStorePointerOperand(A);
Value *PtrB = getLoadStorePointerOperand(B);		Value *PtrB = getLoadStorePointerOperand(B);
unsigned ASA = getPointerAddressSpace(A);		unsigned ASA = getPointerAddressSpace(A);
unsigned ASB = getPointerAddressSpace(B);		unsigned ASB = getPointerAddressSpace(B);

// Check that the address spaces match and that the pointers are valid.		// Check that the address spaces match and that the pointers are valid.
if (!PtrA || !PtrB || (ASA != ASB))		if (!PtrA || !PtrB || (ASA != ASB))
return false;		return false;

// Make sure that A and B are different pointers of the same size type.		// Make sure that A and B are different pointers of the same size type.
unsigned PtrBitWidth = DL.getPointerSizeInBits(ASA);
Type *PtrATy = PtrA->getType()->getPointerElementType();		Type *PtrATy = PtrA->getType()->getPointerElementType();
Type *PtrBTy = PtrB->getType()->getPointerElementType();		Type *PtrBTy = PtrB->getType()->getPointerElementType();
if (PtrA == PtrB ||		if (PtrA == PtrB ||
PtrATy->isVectorTy() != PtrBTy->isVectorTy() ||		PtrATy->isVectorTy() != PtrBTy->isVectorTy() ||
DL.getTypeStoreSize(PtrATy) != DL.getTypeStoreSize(PtrBTy) ||		DL.getTypeStoreSize(PtrATy) != DL.getTypeStoreSize(PtrBTy) ||
DL.getTypeStoreSize(PtrATy->getScalarType()) !=		DL.getTypeStoreSize(PtrATy->getScalarType()) !=
DL.getTypeStoreSize(PtrBTy->getScalarType()))		DL.getTypeStoreSize(PtrBTy->getScalarType()))
return false;		return false;

		unsigned PtrBitWidth = DL.getPointerSizeInBits(ASA);
APInt Size(PtrBitWidth, DL.getTypeStoreSize(PtrATy));		APInt Size(PtrBitWidth, DL.getTypeStoreSize(PtrATy));

unsigned IdxWidth = DL.getIndexSizeInBits(ASA);		return areConsecutivePointers(PtrA, PtrB, Size);
APInt OffsetA(IdxWidth, 0), OffsetB(IdxWidth, 0);		}

		bool Vectorizer::areConsecutivePointers(Value *PtrA, Value *PtrB, APInt Size) {
		unsigned PtrBitWidth = DL.getPointerTypeSizeInBits(PtrA->getType());
		APInt OffsetA(PtrBitWidth, 0);
		APInt OffsetB(PtrBitWidth, 0);
PtrA = PtrA->stripAndAccumulateInBoundsConstantOffsets(DL, OffsetA);		PtrA = PtrA->stripAndAccumulateInBoundsConstantOffsets(DL, OffsetA);
PtrB = PtrB->stripAndAccumulateInBoundsConstantOffsets(DL, OffsetB);		PtrB = PtrB->stripAndAccumulateInBoundsConstantOffsets(DL, OffsetB);

APInt OffsetDelta = OffsetB - OffsetA;		APInt OffsetDelta = OffsetB - OffsetA;

// Check if they are based on the same pointer. That makes the offsets		// Check if they are based on the same pointer. That makes the offsets
// sufficient.		// sufficient.
if (PtrA == PtrB)		if (PtrA == PtrB)
[9 lines not shown]	bool Vectorizer::areConsecutivePointers(Value *PtrA, Value *PtrB, APInt Size) {
const SCEV *C = SE.getConstant(BaseDelta);		const SCEV *C = SE.getConstant(BaseDelta);
const SCEV *X = SE.getAddExpr(PtrSCEVA, C);		const SCEV *X = SE.getAddExpr(PtrSCEVA, C);
if (X == PtrSCEVB)		if (X == PtrSCEVB)
return true;		return true;

// Sometimes even this doesn't work, because SCEV can't always see through		// Sometimes even this doesn't work, because SCEV can't always see through
// patterns that look like (gep (ext (add (shl X, C1), C2))). Try checking		// patterns that look like (gep (ext (add (shl X, C1), C2))). Try checking
// things the hard way.		// things the hard way.
		return tryHarder(PtrA, PtrB, BaseDelta);
		}

		bool Vectorizer::tryHarder(Value *PtrA, Value *PtrB, APInt PtrDelta) {
		arsenm: Needs better name
		rtereshin: Any suggestions?
		auto *GEPA = dyn_cast<GetElementPtrInst>(PtrA);
		auto *GEPB = dyn_cast<GetElementPtrInst>(PtrB);
		if (!GEPA || !GEPB)
		return false;

// Look through GEPs after checking they're the same except for the last		// Look through GEPs after checking they're the same except for the last
// index.		// index.
GetElementPtrInst *GEPA = getSourceGEP(A);		if (GEPA->getNumOperands() != GEPB->getNumOperands() ||
GetElementPtrInst *GEPB = getSourceGEP(B);		GEPA->getPointerOperand() != GEPB->getPointerOperand())
if (!GEPA || !GEPB || GEPA->getNumOperands() != GEPB->getNumOperands())
return false;		return false;
unsigned FinalIndex = GEPA->getNumOperands() - 1;		gep_type_iterator GTIA = gep_type_begin(GEPA);
for (unsigned i = 0; i < FinalIndex; i++)		gep_type_iterator GTIB = gep_type_begin(GEPB);
if (GEPA->getOperand(i) != GEPB->getOperand(i))		for (unsigned I = 0, E = GEPA->getNumIndices() - 1; I < E; ++I) {
		if (GTIA.getOperand() != GTIB.getOperand())
return false;		return false;
		++GTIA;
		++GTIB;
		}

Instruction *OpA = dyn_cast<Instruction>(GEPA->getOperand(FinalIndex));		Instruction *OpA = dyn_cast<Instruction>(GTIA.getOperand());
Instruction *OpB = dyn_cast<Instruction>(GEPB->getOperand(FinalIndex));		Instruction *OpB = dyn_cast<Instruction>(GTIB.getOperand());
if (!OpA || !OpB || OpA->getOpcode() != OpB->getOpcode() ||		if (!OpA || !OpB || OpA->getOpcode() != OpB->getOpcode() ||
OpA->getType() != OpB->getType())		OpA->getType() != OpB->getType())
return false;		return false;

		if (PtrDelta.isNegative()) {
		if (PtrDelta.isMinSignedValue())
		return false;
		PtrDelta.negate();
		std::swap(OpA, OpB);
		}
		uint64_t Stride = DL.getTypeAllocSize(GTIA.getIndexedType());
		if (PtrDelta.urem(Stride) != 0)
		return false;
		unsigned IdxBitWidth = OpA->getType()->getScalarSizeInBits();
		APInt IdxDiff = PtrDelta.udiv(Stride).zextOrSelf(IdxBitWidth);

// Only look through a ZExt/SExt.		// Only look through a ZExt/SExt.
if (!isa<SExtInst>(OpA) && !isa<ZExtInst>(OpA))		if (!isa<SExtInst>(OpA) && !isa<ZExtInst>(OpA))
return false;		return false;

bool Signed = isa<SExtInst>(OpA);		bool Signed = isa<SExtInst>(OpA);

OpA = dyn_cast<Instruction>(OpA->getOperand(0));		// At this point A could be a function parameter, i.e. not an instruction
		Value *ValA = OpA->getOperand(0);
OpB = dyn_cast<Instruction>(OpB->getOperand(0));		OpB = dyn_cast<Instruction>(OpB->getOperand(0));
if (!OpA || !OpB || OpA->getType() != OpB->getType())		if (!OpB || ValA->getType() != OpB->getType())
return false;		return false;

// Now we need to prove that adding 1 to OpA won't overflow.		// Now we need to prove that adding IdxDiff to ValA won't overflow.
bool Safe = false;		bool Safe = false;
// First attempt: if OpB is an add with NSW/NUW, and OpB is 1 added to OpA,		// First attempt: if OpB is an add with NSW/NUW, and OpB is IdxDiff added to
// we're okay.		// ValA, we're okay.
if (OpB->getOpcode() == Instruction::Add &&		if (OpB->getOpcode() == Instruction::Add &&
		rampitec: I do not think this can happen. There is a convention to have constant operand always at the last position in a commutative operation. Do you see it in a real world example or just in the artificially crafted testcase?
		rtereshin: In the artificial one only. I will remove the case from the patch and the test.
isa<ConstantInt>(OpB->getOperand(1)) &&		((isa<ConstantInt>(OpB->getOperand(0)) &&
cast<ConstantInt>(OpB->getOperand(1))->getSExtValue() > 0) {		IdxDiff.sle(cast<ConstantInt>(OpB->getOperand(0))->getSExtValue())) ||
		(isa<ConstantInt>(OpB->getOperand(1)) &&
		IdxDiff.sle(cast<ConstantInt>(OpB->getOperand(1))->getSExtValue())))) {
if (Signed)		if (Signed)
Safe = cast<BinaryOperator>(OpB)->hasNoSignedWrap();		Safe = cast<BinaryOperator>(OpB)->hasNoSignedWrap();
else		else
Safe = cast<BinaryOperator>(OpB)->hasNoUnsignedWrap();		Safe = cast<BinaryOperator>(OpB)->hasNoUnsignedWrap();
}		}

unsigned BitWidth = OpA->getType()->getScalarSizeInBits();		unsigned BitWidth = ValA->getType()->getScalarSizeInBits();

// Second attempt:		// Second attempt:
// If any bits are known to be zero other than the sign bit in OpA, we can		// If all set bits of IdxDiff or any higher order bit other than the sign bit
// add 1 to it while guaranteeing no overflow of any sort.		// are known to be zero in ValA, we can add Diff to it while guaranteeing no
		// overflow of any sort.
if (!Safe) {		if (!Safe) {
		OpA = dyn_cast<Instruction>(ValA);
		if (!OpA)
		return false;
KnownBits Known(BitWidth);		KnownBits Known(BitWidth);
computeKnownBits(OpA, Known, DL, 0, nullptr, OpA, &DT);		computeKnownBits(OpA, Known, DL, 0, nullptr, OpA, &DT);
if (Known.countMaxTrailingOnes() < (BitWidth - 1))		if (Known.Zero.trunc(BitWidth - 1).zext(IdxBitWidth).ult(IdxDiff))
Safe = true;
}

if (!Safe)
return false;		return false;
		}

const SCEV *OffsetSCEVA = SE.getSCEV(OpA);		const SCEV *OffsetSCEVA = SE.getSCEV(ValA);
const SCEV *OffsetSCEVB = SE.getSCEV(OpB);		const SCEV *OffsetSCEVB = SE.getSCEV(OpB);
const SCEV *One = SE.getConstant(APInt(BitWidth, 1));		const SCEV *C = SE.getConstant(IdxDiff.trunc(BitWidth));
const SCEV *X2 = SE.getAddExpr(OffsetSCEVA, One);		const SCEV *X = SE.getAddExpr(OffsetSCEVA, C);
return X2 == OffsetSCEVB;		return X == OffsetSCEVB;
}		}

void Vectorizer::reorder(Instruction *I) {		void Vectorizer::reorder(Instruction *I) {
OrderedBasicBlock OBB(I->getParent());		OrderedBasicBlock OBB(I->getParent());
SmallPtrSet<Instruction *, 16> InstructionsToMove;		SmallPtrSet<Instruction *, 16> InstructionsToMove;
SmallVector<Instruction *, 16> Worklist;		SmallVector<Instruction *, 16> Worklist;

Worklist.push_back(I);		Worklist.push_back(I);
[742 lines not shown]

test/CodeGen/X86/loadStore_vectorizer.ll

	; RUN: opt -load-store-vectorizer < %s -S | FileCheck %s			; RUN: opt -mtriple x86_64-- -load-store-vectorizer < %s -S | FileCheck %s

	%struct_render_pipeline_state = type opaque			%struct_render_pipeline_state = type opaque

	define fastcc void @main(%struct_render_pipeline_state addrspace(1)* %pso) unnamed_addr {			define fastcc void @test1(%struct_render_pipeline_state addrspace(1)* %pso) unnamed_addr {
				; CHECK-LABEL: @test1
	; CHECK: load i16			; CHECK: load i16
	; CHECK: load i16			; CHECK: load i16
	entry:			entry:
	%tmp = bitcast %struct_render_pipeline_state addrspace(1)* %pso to i16 addrspace(1)*			%tmp = bitcast %struct_render_pipeline_state addrspace(1)* %pso to i16 addrspace(1)*
	%tmp1 = load i16, i16 addrspace(1)* %tmp, align 2			%tmp1 = load i16, i16 addrspace(1)* %tmp, align 2
	%tmp2 = bitcast %struct_render_pipeline_state addrspace(1)* %pso to i8 addrspace(1)*			%tmp2 = bitcast %struct_render_pipeline_state addrspace(1)* %pso to i8 addrspace(1)*
	%sunkaddr51 = getelementptr i8, i8 addrspace(1)* %tmp2, i64 6			%sunkaddr51 = getelementptr i8, i8 addrspace(1)* %tmp2, i64 6
	%tmp3 = bitcast i8 addrspace(1)* %sunkaddr51 to i16 addrspace(1)*			%tmp3 = bitcast i8 addrspace(1)* %sunkaddr51 to i16 addrspace(1)*
	%tmp4 = load i16, i16 addrspace(1)* %tmp3, align 2			%tmp4 = load i16, i16 addrspace(1)* %tmp3, align 2
	ret void			ret void
	}			}

				define fastcc void @test2(%struct_render_pipeline_state addrspace(1)* %pso) unnamed_addr {
				; CHECK-LABEL: @test2
				; CHECK: load <2 x i16>
				entry:
				%tmp = bitcast %struct_render_pipeline_state addrspace(1)* %pso to i16 addrspace(1)*
				%tmp1 = load i16, i16 addrspace(1)* %tmp, align 2
				%tmp2 = bitcast %struct_render_pipeline_state addrspace(1)* %pso to i8 addrspace(1)*
				%sunkaddr51 = getelementptr i8, i8 addrspace(1)* %tmp2, i64 2
				%tmp3 = bitcast i8 addrspace(1)* %sunkaddr51 to i16 addrspace(1)*
				%tmp4 = load i16, i16 addrspace(1)* %tmp3, align 2
				ret void
				}

test/Transforms/LoadStoreVectorizer/AMDGPU/gep-bitcast.ll

[50 lines not shown]	define void @vect_zext_bitcast_i8_st1_to_i32_idx(i8 addrspace(1)* %arg1, i32 %base) {
%add4 = add nuw i32 %base, 3		%add4 = add nuw i32 %base, 3
%zext4 = zext i32 %add4 to i64		%zext4 = zext i32 %add4 to i64
%gep4 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext4		%gep4 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext4
%f2i4 = bitcast i8 addrspace(1)* %gep4 to i32 addrspace(1)*		%f2i4 = bitcast i8 addrspace(1)* %gep4 to i32 addrspace(1)*
%load4 = load i32, i32 addrspace(1)* %f2i4, align 4		%load4 = load i32, i32 addrspace(1)* %f2i4, align 4
ret void		ret void
}		}

; TODO: This can be vectorized, but currently vectorizer unable to do it.
; CHECK-LABEL: @vect_zext_bitcast_i8_st4_to_i32_idx		; CHECK-LABEL: @vect_zext_bitcast_i8_st4_to_i32_idx
		; CHECK: load <4 x i32>
define void @vect_zext_bitcast_i8_st4_to_i32_idx(i8 addrspace(1)* %arg1, i32 %base) {		define void @vect_zext_bitcast_i8_st4_to_i32_idx(i8 addrspace(1)* %arg1, i32 %base) {
%add1 = add nuw i32 %base, 0		%add1 = add i32 %base, 0
		arsenm: Why was the nuw dropped here?
		rtereshin: We don't need to know that `add i32 %base, C0` doesn't wrap if we know that `add n[su]w i32 %base, C1` doesn't wrap and `0 <= C0 <= C1`. `Vectorizer::isConsecutiveAccess` is able to notice and exploit this fact in some cases (w/ and w/o this patch). I don't want to regress on it. Perhaps it's better if we also change the offsets in this test from 0, 4, 8, 12 to something non-zero based, like 4, 8, 12, 16.
%zext1 = zext i32 %add1 to i64		%zext1 = zext i32 %add1 to i64
%gep1 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext1		%gep1 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext1
%f2i1 = bitcast i8 addrspace(1)* %gep1 to i32 addrspace(1)*		%f2i1 = bitcast i8 addrspace(1)* %gep1 to i32 addrspace(1)*
%load1 = load i32, i32 addrspace(1)* %f2i1, align 4		%load1 = load i32, i32 addrspace(1)* %f2i1, align 4
%add2 = add nuw i32 %base, 4		%add2 = add nuw i32 %base, 4
%zext2 = zext i32 %add2 to i64		%zext2 = zext i32 %add2 to i64
%gep2 = getelementptr inbounds i8,i8 addrspace(1)* %arg1, i64 %zext2		%gep2 = getelementptr inbounds i8,i8 addrspace(1)* %arg1, i64 %zext2
%f2i2 = bitcast i8 addrspace(1)* %gep2 to i32 addrspace(1)*		%f2i2 = bitcast i8 addrspace(1)* %gep2 to i32 addrspace(1)*
%load2 = load i32, i32 addrspace(1)* %f2i2, align 4		%load2 = load i32, i32 addrspace(1)* %f2i2, align 4
%add3 = add nuw i32 %base, 8		%add3 = add nuw i32 %base, 8
%zext3 = zext i32 %add3 to i64		%zext3 = zext i32 %add3 to i64
%gep3 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext3		%gep3 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext3
%f2i3 = bitcast i8 addrspace(1)* %gep3 to i32 addrspace(1)*		%f2i3 = bitcast i8 addrspace(1)* %gep3 to i32 addrspace(1)*
%load3 = load i32, i32 addrspace(1)* %f2i3, align 4		%load3 = load i32, i32 addrspace(1)* %f2i3, align 4
%add4 = add nuw i32 %base, 16		%add4 = add nuw i32 %base, 12
%zext4 = zext i32 %add4 to i64		%zext4 = zext i32 %add4 to i64
%gep4 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext4		%gep4 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext4
%f2i4 = bitcast i8 addrspace(1)* %gep4 to i32 addrspace(1)*		%f2i4 = bitcast i8 addrspace(1)* %gep4 to i32 addrspace(1)*
%load4 = load i32, i32 addrspace(1)* %f2i4, align 4		%load4 = load i32, i32 addrspace(1)* %f2i4, align 4
ret void		ret void
}		}

		; CHECK-LABEL: @vect_zext_bitcast_negative_ptr_delta
		; CHECK: load <2 x i32>
		define void @vect_zext_bitcast_negative_ptr_delta(i32 addrspace(1)* %p, i32 %base) {
		%p.bitcasted = bitcast i32 addrspace(1)* %p to i16 addrspace(1)*
		%a.offset = add nuw i32 4, %base
		%t.offset.zexted = zext i32 %base to i64
		%a.offset.zexted = zext i32 %a.offset to i64
		%t.ptr = getelementptr inbounds i16, i16 addrspace(1)* %p.bitcasted, i64 %t.offset.zexted
		%a.ptr = getelementptr inbounds i16, i16 addrspace(1)* %p.bitcasted, i64 %a.offset.zexted
		%b.ptr = getelementptr inbounds i16, i16 addrspace(1)* %t.ptr, i64 6
		%a.ptr.bitcasted = bitcast i16 addrspace(1)* %a.ptr to i32 addrspace(1)*
		%b.ptr.bitcasted = bitcast i16 addrspace(1)* %b.ptr to i32 addrspace(1)*
		%a.val = load i32, i32 addrspace(1)* %a.ptr.bitcasted
		%b.val = load i32, i32 addrspace(1)* %b.ptr.bitcasted
		ret void
		}
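
The arithmetic behind @vect_zext_bitcast_negative_ptr_delta can be checked by hand. The sketch below models the two load addresses as byte offsets from %p using plain integers. The function names are hypothetical, and the model assumes the `nuw` on %a.offset holds, so `4 + %base` does not wrap before the zext.

```cpp
#include <cassert>
#include <cstdint>

// Both GEPs in the test index an i16 array, so one index step is 2 bytes.
constexpr uint64_t I16Size = 2;
constexpr uint64_t I32Size = 4;

// %a.ptr = getelementptr i16, %p.bitcasted, zext(4 + %base)
uint64_t byteOffsetA(uint32_t Base) {
  return I16Size * (static_cast<uint64_t>(Base) + 4);
}

// %b.ptr = getelementptr i16, %t.ptr, 6, where
// %t.ptr = getelementptr i16, %p.bitcasted, zext(%base)
uint64_t byteOffsetB(uint32_t Base) {
  return I16Size * static_cast<uint64_t>(Base) + I16Size * 6;
}

// True if the i32 load from %b.ptr starts exactly where the i32 load from
// %a.ptr ends.
bool loadsAreConsecutive(uint32_t Base) {
  return byteOffsetB(Base) - byteOffsetA(Base) == I32Size;
}
```

With %base equal to 0 the loads start at byte offsets 8 and 12, so they are adjacent, consistent with the expected `load <2 x i32>`. Once the constant 12-byte offset of %b.ptr is set aside, the remaining pointer %t.ptr lies 8 bytes below %a.ptr, which is the negative difference the test name refers to.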