This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrepare] Reverse LICM pass for shift and rotate patterns.
Needs Review · Public

Authored by tkrupa on Jun 27 2018, 11:40 PM.

Details

Reviewers

craig.topper
efriedma
spatel

Summary

This patch moves parts of shift and rotate patterns back into the loop after the LICM pass has hoisted them out; the hoisting split the patterns across blocks and prevented them from being combined. This resolves https://bugs.llvm.org/show_bug.cgi?id=37417 and enables D46946 and D47019 to proceed.

Diff Detail

Repository

rL LLVM

Build Status

Buildable 19811
Build 19811: arc lint + arc unit

Event Timeline

tkrupa created this revision. Jun 27 2018, 11:40 PM

Herald added a subscriber: llvm-commits. Jun 27 2018, 11:40 PM

craig.topper added reviewers: efriedma, spatel. Jun 27 2018, 11:42 PM

I'm not sure D46946 and D47019 are good ideas in the first place, particularly D47019. Expanding an intrinsic to an 8-instruction sequence is getting past the point where we're actually getting any benefit from transforming intrinsic to native IR. Emitting a complicated lowering like that, and trying to recover it in isel seems very tricky to get right, and as far as I can tell we don't get much benefit.

If we are going to expand out x86 shift and rotate intrinsics, we should probably consider pattern-matching on IR, rather than waiting for SelectionDAG. Trying to work around isel limitations in this fashion is fragile, and will probably have a wider effect than you want.

For rotates, there was a proposal to add a generic IR intrinsic for variable rotates on llvmdev, due to the complications involved in late pattern-matching.

In D48705#1147074, @efriedma wrote:

I'm not sure D46946 and D47019 are good ideas in the first place, particularly D47019. Expanding an intrinsic to an 8-instruction sequence is getting past the point where we're actually getting any benefit from transforming intrinsic to native IR. Emitting a complicated lowering like that, and trying to recover it in isel seems very tricky to get right, and as far as I can tell we don't get much benefit.

If we are going to expand out x86 shift and rotate intrinsics, we should probably consider pattern-matching on IR, rather than waiting for SelectionDAG. Trying to work around isel limitations in this fashion is fragile, and will probably have a wider effect than you want.

For rotates, there was a proposal to add a generic IR intrinsic for variable rotates on llvmdev, due to the complications involved in late pattern-matching.

For reference, that was here:
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123292.html

And the discussion stopped with no objections that I see, but there are some open questions about the semantics and form of any intrinsic. The proposal was made by Fabian Giessen. I don't find a Phab ID for Fabian, so maybe we need to ping the dev thread?

In D48705#1147074, @efriedma wrote:

I'm not sure D46946 and D47019 are good ideas in the first place, particularly D47019. Expanding an intrinsic to an 8-instruction sequence is getting past the point where we're actually getting any benefit from transforming intrinsic to native IR. Emitting a complicated lowering like that, and trying to recover it in isel seems very tricky to get right, and as far as I can tell we don't get much benefit.

If we are going to expand out x86 shift and rotate intrinsics, we should probably consider pattern-matching on IR, rather than waiting for SelectionDAG. Trying to work around isel limitations in this fashion is fragile, and will probably have a wider effect than you want.

For rotates, there was a proposal to add a generic IR intrinsic for variable rotates on llvmdev, due to the complications involved in late pattern-matching.

FWIW we don't do the whole pattern matching in lowering stage - most of the pattern disappears during the combining right after creating the DAG because the extra instructions are only needed in IR to avoid creating poison values. Because of this, we can't do pattern matching in IR. If it was possible I would just emit shortened IR - unless I'm missing something?

But all in all yeah, the whole thing is rather tricky.
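
To illustrate the poison-avoidance point above, here is a minimal scalar sketch of the guarded rotate-left sequence, modeled per lane on the `@rolv` test in this revision. It is not part of the patch; the function name `rotlGuarded` and its local names are hypothetical. The extra `select`s exist only so that no shift by the full bit width is ever executed.

```cpp
#include <cstdint>

// Hypothetical per-lane model of the and/sub/icmp/select/shl/select/lshr/or
// sequence. Each line notes the IR operation it stands in for.
static uint32_t rotlGuarded(uint32_t X, uint32_t B) {
  uint32_t Amt = B & 31;                // and  B, 31
  uint32_t Inv = 32 - Amt;              // sub  32, Amt (equals 32 when Amt == 0)
  bool InRange = Inv < 32;              // icmp ult Inv, 32
  uint32_t SafeInv = InRange ? Inv : 0; // select: never shift right by 32
  uint32_t Hi = X << Amt;               // shl
  uint32_t LoSrc = InRange ? X : 0;     // select: zero the value instead
  uint32_t Lo = LoSrc >> SafeInv;       // lshr
  return Hi | Lo;                       // or
}
```

With a rotate amount of 0 the right-shift half contributes nothing and the input is returned unchanged; for example, `rotlGuarded(0x12345678u, 8)` evaluates to `0x34567812`.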

In D48705#1147264, @tkrupa wrote:

FWIW we don't do the whole pattern matching in lowering stage - most of the pattern disappears during the combining right after creating the DAG because the extra instructions are only needed in IR to avoid creating poison values. Because of this, we can't do pattern matching in IR. If it was possible I would just emit shortened IR - unless I'm missing something?

But all in all yeah, the whole thing is rather tricky.

I've replied to the existing llvm-dev thread and referenced this patch. Let's see if we can find a good definition for target-independent intrinsic(s) there.

Revision Contents

Path                                              Size
lib/CodeGen/CodeGenPrepare.cpp                    90 lines
test/Transforms/CodeGenPrepare/reverse-licm.ll    523 lines

Diff 153260

lib/CodeGen/CodeGenPrepare.cpp

[...]
    void getAnalysisUsage(AnalysisUsage &AU) const override {
      AU.addRequired<ProfileSummaryInfoWrapperPass>();
      AU.addRequired<TargetLibraryInfoWrapperPass>();
      AU.addRequired<TargetTransformInfoWrapperPass>();
      AU.addRequired<LoopInfoWrapperPass>();
    }

  private:
    bool eliminateFallThrough(Function &F);
    bool moveHoistedInstrsBackIntoLoop(Function &F);
    bool eliminateMostlyEmptyBlocks(Function &F);
    BasicBlock *findDestBlockOfMergeableEmptyBlock(BasicBlock *BB);
    bool canMergeBlocks(const BasicBlock *BB, const BasicBlock *DestBB) const;
    void eliminateMostlyEmptyBlock(BasicBlock *BB);
    bool isMergingEmptyBlockProfitable(BasicBlock *BB, BasicBlock *DestBB,
                                       bool isPreheader);
    bool optimizeBlock(BasicBlock &BB, bool &ModifiedDT);
    bool optimizeInst(Instruction *I, bool &ModifiedDT);
[...]
    while (BB != nullptr) {
      // bypassSlowDivision may create new BBs, but we don't want to reapply the
      // optimization to those blocks.
      BasicBlock* Next = BB->getNextNode();
      EverMadeChange |= bypassSlowDivision(BB, BypassWidths);
      BB = Next;
    }
  }

  EverMadeChange |= moveHoistedInstrsBackIntoLoop(F);

  // Eliminate blocks that contain only PHI nodes and an
  // unconditional branch.
  EverMadeChange |= eliminateMostlyEmptyBlocks(F);

  // llvm.dbg.value is far away from the value then iSel may not be able
  // handle it properly. iSel will drop llvm.dbg.value if it can not
  // find a node corresponding to the value.
  EverMadeChange |= placeDbgValues(F);
[...]
  if (DestBB == BB)
    return nullptr;

  if (!canMergeBlocks(BB, DestBB))
    DestBB = nullptr;

  return DestBB;
}

// Go through all operands. If an operand is in BB, call this recursively,
// if it's in the preheader, add it to Worklist (if it's not already there)
// and add current instruction to Users (if it's not already there).
static void addPatternMembers(Instruction *I, BasicBlock *Preheader, BasicBlock* BB,
                              SmallVector<Instruction *, 16> &Users,
                              SmallVector<Instruction *, 16> &Worklist) {
  for (User::op_iterator II = I->op_begin(), IE = I->op_end(); II != IE; ++II) {
    auto *Op = dyn_cast<Instruction>(II);
    if (!Op || Op->getOpcode() == Instruction::PHI)
      continue;
    if (Op->getParent() == BB)
      addPatternMembers(Op, Preheader, BB, Users, Worklist);
    else if (Op->getParent() == Preheader) {
      if (std::find(Users.begin(), Users.end(), I) == Users.end())
        Users.push_back(I);
      if (std::find(Worklist.begin(), Worklist.end(), Op) == Worklist.end())
        Worklist.push_back(Op);
    }
  }
}

/// LICM pass sometimes hoists a part of a pattern out of a loop
/// while leaving the rest inside which prevents combining. This function
/// moves the hoisted instructions back if such a pattern is detected.
bool CodeGenPrepare::moveHoistedInstrsBackIntoLoop(Function &F) {
  bool MadeChange = false;
  SmallVector<Loop *, 16> LoopList(LI->begin(), LI->end());
  while (!LoopList.empty()) {
    Loop *L = LoopList.pop_back_val();
    BasicBlock *Preheader = L->getLoopPreheader();

    for (BasicBlock *BB : L->getBlocks()) {
      for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ++I) {
        // Only handle patterns starting with either shifts or or(shl, lshr).
        if (I->getOpcode() == Instruction::Or) {
          Instruction *Op0 = dyn_cast<Instruction>(I->getOperand(0));
          Instruction *Op1 = dyn_cast<Instruction>(I->getOperand(1));
          if (!Op0 || !Op1 || Op0->getOpcode() == Op1->getOpcode() ||
              !Op0->isLogicalShift() || !Op1->isLogicalShift())
            continue;
        }
        else if (!I->isShift())
          continue;

        // In Users we store all the instructions that are inside of the loop
        // and use instructions from the preheader.
        // In Worklist there are instructions residing in the preheader
        // used by instructions from Users - potential parts of a pattern
        // to be fetched back into the loop.
        SmallVector<Instruction *, 16> Users;
        SmallVector<Instruction *, 16> Worklist;
        addPatternMembers(&*I, Preheader, BB, Users, Worklist);
        SmallVector<Instruction *, 16>::iterator It, EIt;
        // Keep going until the worklist is empty or there were no fetches
        // in a whole iteration.
        Instruction *ToFetch;
        while (true) {
          for (It = Worklist.begin(), EIt = Worklist.end(); It != EIt; ++It) {
            ToFetch = *It;
            Value::user_iterator UI, EI;
            bool SafeToFetch = true;
            // If all users are inside of the loop and a part of the pattern,
            // fetch current instruction and start new iteration.
            for (UI = ToFetch->user_begin(), EI = ToFetch->user_end();
                 UI != EI; ++UI) {
              if (std::find(Users.begin(), Users.end(), *UI) == Users.end()) {
                SafeToFetch = false;
                break;
              }
            }
            if (SafeToFetch) {
              MadeChange = true;
              ToFetch->moveBefore(BB->getFirstNonPHI());
              Worklist.erase(It);
              addPatternMembers(ToFetch, Preheader, BB, Users, Worklist);
              break;
            }
          }
          if (It == EIt)
            break;
        }
      }
    }
  }
  return MadeChange;
}

/// Eliminate blocks that contain only PHI nodes, debug info directives, and an
/// unconditional branch. Passes before isel (e.g. LSR/loopsimplify) often split
/// edges in ways that are non-optimal for isel. Start by eliminating these
/// blocks so we can split them the way we want them.
bool CodeGenPrepare::eliminateMostlyEmptyBlocks(Function &F) {
  SmallPtrSet<BasicBlock *, 16> Preheaders;
  SmallVector<Loop *, 16> LoopList(LI->begin(), LI->end());
  while (!LoopList.empty()) {
[...]
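
The Users/Worklist fixed point used by `moveHoistedInstrsBackIntoLoop` can be sketched outside LLVM. The following is a toy model, not the patch itself: `Inst`, `Block`, `collect`, and `sinkPattern` are hypothetical names, instructions are plain indices, there are only two blocks, PHI nodes are not modeled (so the operand graph is assumed acyclic), and the position of a moved instruction inside the loop block is ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy stand-in for an IR instruction: the block it lives in and the
// indices of the instructions it uses as operands.
enum class Block { Preheader, Loop };
struct Inst {
  Block Parent;
  std::vector<std::size_t> Operands;
};

static bool contains(const std::vector<std::size_t> &V, std::size_t X) {
  return std::find(V.begin(), V.end(), X) != V.end();
}

// Operand walk: recurse through in-loop operands; record preheader
// operands in Worklist and the in-loop instructions using them in Users.
static void collect(const std::vector<Inst> &Prog, std::size_t I,
                    std::vector<std::size_t> &Users,
                    std::vector<std::size_t> &Worklist) {
  for (std::size_t Op : Prog[I].Operands) {
    if (Prog[Op].Parent == Block::Loop) {
      collect(Prog, Op, Users, Worklist);
    } else {
      if (!contains(Users, I))
        Users.push_back(I);
      if (!contains(Worklist, Op))
        Worklist.push_back(Op);
    }
  }
}

// Move a preheader instruction into the loop only when every one of its
// users is already a known pattern member; repeat until a full pass over
// the worklist moves nothing. Returns the number of moved instructions.
static int sinkPattern(std::vector<Inst> &Prog, std::size_t Root) {
  std::vector<std::size_t> Users, Worklist;
  collect(Prog, Root, Users, Worklist);
  int Moved = 0;
  bool Progress = true;
  while (Progress) {
    Progress = false;
    for (std::size_t W = 0; W < Worklist.size(); ++W) {
      std::size_t Cand = Worklist[W];
      bool Safe = true;
      for (std::size_t U = 0; U < Prog.size(); ++U)
        if (contains(Prog[U].Operands, Cand) && !contains(Users, U))
          Safe = false;
      if (!Safe)
        continue;
      Prog[Cand].Parent = Block::Loop;
      Worklist.erase(Worklist.begin() + W);
      collect(Prog, Cand, Users, Worklist);
      ++Moved;
      Progress = true;
      break; // Restart the scan, since Worklist was modified.
    }
  }
  return Moved;
}
```

In this model a candidate with any user outside the collected pattern stays in the preheader, which is the same safety condition the patch checks through `SafeToFetch`.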

test/Transforms/CodeGenPrepare/reverse-licm.ll

This file was added.

				; RUN: opt -codegenprepare -S < %s | FileCheck %s

				define void @rolv(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				; CHECK-LABEL: @rolv
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: and
				; CHECK-NEXT: sub
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: shl
				; CHECK-NEXT: select
				; CHECK-NEXT: lshr
				; CHECK-NEXT: or
				; CHECK-NEXT: extractelement
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = bitcast <2 x i64> %b to <4 x i32>
				%1 = and <4 x i32> %0, <i32 31, i32 31, i32 31, i32 31>
				%2 = sub nsw <4 x i32> <i32 32, i32 32, i32 32, i32 32>, %1
				%3 = icmp ult <4 x i32> %2, <i32 32, i32 32, i32 32, i32 32>
				%4 = select <4 x i1> %3, <4 x i32> %2, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%5 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%6 = load <4 x i32>, <4 x i32>* %5, align 16
				%7 = shl <4 x i32> %6, %1
				%8 = select <4 x i1> %3, <4 x i32> %6, <4 x i32> zeroinitializer
				%9 = lshr <4 x i32> %8, %4
				%10 = or <4 x i32> %9, %7
				%11 = extractelement <4 x i32> %10, i32 0
				%idxprom1 = sext i32 %11 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%12 = trunc i64 %indvars.iv to i32
				store i32 %12, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @rorv(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				; CHECK-LABEL: @rorv
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: and
				; CHECK-NEXT: sub
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: select
				; CHECK-NEXT: shl
				; CHECK-NEXT: lshr
				; CHECK-NEXT: or
				; CHECK-NEXT: extractelement
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = bitcast <2 x i64> %b to <4 x i32>
				%1 = and <4 x i32> %0, <i32 31, i32 31, i32 31, i32 31>
				%2 = sub nsw <4 x i32> <i32 32, i32 32, i32 32, i32 32>, %1
				%3 = icmp ult <4 x i32> %2, <i32 32, i32 32, i32 32, i32 32>
				%4 = select <4 x i1> %3, <4 x i32> %2, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%5 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%6 = load <4 x i32>, <4 x i32>* %5, align 16
				%7 = select <4 x i1> %3, <4 x i32> %6, <4 x i32> zeroinitializer
				%8 = shl <4 x i32> %7, %4
				%9 = lshr <4 x i32> %6, %1
				%10 = or <4 x i32> %8, %9
				%11 = extractelement <4 x i32> %10, i32 0
				%idxprom1 = sext i32 %11 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%12 = trunc i64 %indvars.iv to i32
				store i32 %12, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @sllv(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				entry:
				; CHECK-LABEL: @sllv
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: select
				; CHECK-NEXT: shl
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = bitcast <2 x i64> %b to <4 x i32>
				%1 = icmp ult <4 x i32> %0, <i32 32, i32 32, i32 32, i32 32>
				%2 = select <4 x i1> %1, <4 x i32> %0, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%3 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%4 = load <4 x i32>, <4 x i32>* %3, align 16
				%5 = select <4 x i1> %1, <4 x i32> %4, <4 x i32> zeroinitializer
				%6 = shl <4 x i32> %5, %2
				%7 = extractelement <4 x i32> %6, i32 0
				%idxprom1 = sext i32 %7 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%8 = trunc i64 %indvars.iv to i32
				store i32 %8, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @sll(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				; CHECK-LABEL: @sll
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: extractelement
				; CHECK-NEXT: trunc
				; CHECK-NEXT: insertelement
				; CHECK-NEXT: shufflevector
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: select
				; CHECK-NEXT: shl
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = extractelement <2 x i64> %b, i64 0
				%1 = icmp ult i64 %0, 32
				%2 = trunc i64 %0 to i32
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %2, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				%3 = select i1 %1, <4 x i32> %.splat.i, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%4 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%5 = load <4 x i32>, <4 x i32>* %4, align 16
				%6 = select i1 %1, <4 x i32> %5, <4 x i32> zeroinitializer
				%7 = shl <4 x i32> %6, %3
				%8 = extractelement <4 x i32> %7, i32 0
				%idxprom1 = sext i32 %8 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%9 = trunc i64 %indvars.iv to i32
				store i32 %9, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @slli(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, i32 %b) {
				; CHECK-LABEL: @slli
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: insertelement
				; CHECK-NEXT: shufflevector
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: select
				; CHECK-NEXT: shl
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = icmp ult i32 %b, 32
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %b, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = select i1 %0, <4 x i32> %.splat.i, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%2 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%3 = load <4 x i32>, <4 x i32>* %2, align 16
				%4 = select i1 %0, <4 x i32> %3, <4 x i32> zeroinitializer
				%5 = shl <4 x i32> %4, %1
				%6 = extractelement <4 x i32> %5, i32 0
				%idxprom1 = sext i32 %6 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%7 = trunc i64 %indvars.iv to i32
				store i32 %7, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @srlv(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				; CHECK-LABEL: @srlv
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: select
				; CHECK-NEXT: lshr
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = bitcast <2 x i64> %b to <4 x i32>
				%1 = icmp ult <4 x i32> %0, <i32 32, i32 32, i32 32, i32 32>
				%2 = select <4 x i1> %1, <4 x i32> %0, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%3 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%4 = load <4 x i32>, <4 x i32>* %3, align 16
				%5 = select <4 x i1> %1, <4 x i32> %4, <4 x i32> zeroinitializer
				%6 = lshr <4 x i32> %5, %2
				%7 = extractelement <4 x i32> %6, i32 0
				%idxprom1 = sext i32 %7 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%8 = trunc i64 %indvars.iv to i32
				store i32 %8, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @srl(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				; CHECK-LABEL: @srl
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: extractelement
				; CHECK-NEXT: trunc
				; CHECK-NEXT: insertelement
				; CHECK-NEXT: shufflevector
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: select
				; CHECK-NEXT: lshr
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = extractelement <2 x i64> %b, i64 0
				%1 = icmp ult i64 %0, 32
				%2 = trunc i64 %0 to i32
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %2, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				%3 = select i1 %1, <4 x i32> %.splat.i, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%4 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%5 = load <4 x i32>, <4 x i32>* %4, align 16
				%6 = select i1 %1, <4 x i32> %5, <4 x i32> zeroinitializer
				%7 = lshr <4 x i32> %6, %3
				%8 = extractelement <4 x i32> %7, i32 0
				%idxprom1 = sext i32 %8 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%9 = trunc i64 %indvars.iv to i32
				store i32 %9, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; Function Attrs: norecurse nounwind uwtable
				define void @srli(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, i32 %b) {
				; CHECK-LABEL: @srli
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: insertelement
				; CHECK-NEXT: shufflevector
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: select
				; CHECK-NEXT: lshr
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = icmp ult i32 %b, 32
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %b, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = select i1 %0, <4 x i32> %.splat.i, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%2 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%3 = load <4 x i32>, <4 x i32>* %2, align 16
				%4 = select i1 %0, <4 x i32> %3, <4 x i32> zeroinitializer
				%5 = lshr <4 x i32> %4, %1
				%6 = extractelement <4 x i32> %5, i32 0
				%idxprom1 = sext i32 %6 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%7 = trunc i64 %indvars.iv to i32
				store i32 %7, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @srav(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				; CHECK-LABEL: @srav
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: and
				; CHECK-NEXT: select
				; CHECK-NEXT: ashr
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = bitcast <2 x i64> %b to <4 x i32>
				%1 = icmp ult <4 x i32> %0, <i32 31, i32 31, i32 31, i32 31>
				%2 = select <4 x i1> %1, <4 x i32> %0, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%3 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%4 = load <4 x i32>, <4 x i32>* %3, align 16
				%5 = and <4 x i32> %4, <i32 -2147483648, i32 -2147483648, i32 -2147483648, i32 -2147483648>
				%6 = select <4 x i1> %1, <4 x i32> %4, <4 x i32> %5
				%7 = ashr <4 x i32> %6, %2
				%8 = extractelement <4 x i32> %7, i32 0
				%idxprom1 = sext i32 %8 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%9 = trunc i64 %indvars.iv to i32
				store i32 %9, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @sra(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, <2 x i64> %b) {
				; CHECK-LABEL: @sra
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: extractelement
				; CHECK-NEXT: trunc
				; CHECK-NEXT: insertelement
				; CHECK-NEXT: shufflevector
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: and
				; CHECK-NEXT: select
				; CHECK-NEXT: ashr
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = extractelement <2 x i64> %b, i64 0
				%1 = icmp ult i64 %0, 31
				%2 = trunc i64 %0 to i32
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %2, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				%3 = select i1 %1, <4 x i32> %.splat.i, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%4 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%5 = load <4 x i32>, <4 x i32>* %4, align 16
				%6 = and <4 x i32> %5, <i32 -2147483648, i32 -2147483648, i32 -2147483648, i32 -2147483648>
				%7 = select i1 %1, <4 x i32> %5, <4 x i32> %6
				%8 = ashr <4 x i32> %7, %3
				%9 = extractelement <4 x i32> %8, i32 0
				%idxprom1 = sext i32 %9 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%10 = trunc i64 %indvars.iv to i32
				store i32 %10, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				define void @srai(i32* nocapture %x, i32 %N, <2 x i64>* nocapture readonly %a, i32 %b) {
				; CHECK-LABEL: @srai
				; CHECK: for.body:
				; CHECK-NEXT: phi
				; CHECK-NEXT: insertelement
				; CHECK-NEXT: shufflevector
				; CHECK-NEXT: icmp
				; CHECK-NEXT: select
				; CHECK-NEXT: getelementptr
				; CHECK-NEXT: bitcast
				; CHECK-NEXT: load
				; CHECK-NEXT: and
				; CHECK-NEXT: select
				; CHECK-NEXT: ashr
				entry:
				%cmp7 = icmp eq i32 %N, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.lr.ph

				for.body.lr.ph: ; preds = %entry
				%0 = icmp ult i32 %b, 31
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %b, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = select i1 %0, <4 x i32> %.splat.i, <4 x i32> zeroinitializer
				%wide.trip.count = zext i32 %N to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds <2 x i64>, <2 x i64>* %a, i64 %indvars.iv
				%2 = bitcast <2 x i64>* %arrayidx to <4 x i32>*
				%3 = load <4 x i32>, <4 x i32>* %2, align 16
				%4 = and <4 x i32> %3, <i32 -2147483648, i32 -2147483648, i32 -2147483648, i32 -2147483648>
				%5 = select i1 %0, <4 x i32> %3, <4 x i32> %4
				%6 = ashr <4 x i32> %5, %1
				%7 = extractelement <4 x i32> %6, i32 0
				%idxprom1 = sext i32 %7 to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %x, i64 %idxprom1
				%8 = trunc i64 %indvars.iv to i32
				store i32 %8, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}
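
As a reading aid for the `@sllv` and `@srlv` tests above, the guarded logical shifts can be modeled per lane as follows. This is a sketch under the assumption that the `icmp`/`select` pairs are meant to yield zero for out-of-range shift amounts; the names `shlGuarded` and `lshrGuarded` are hypothetical and do not appear in the patch.

```cpp
#include <cstdint>

// Hypothetical per-lane model of @sllv: icmp ult B, 32, then one select
// for the amount and one for the value, then shl.
static uint32_t shlGuarded(uint32_t X, uint32_t B) {
  bool InRange = B < 32;           // icmp ult B, 32
  uint32_t Amt = InRange ? B : 0;  // select: clamp the shift amount
  uint32_t Val = InRange ? X : 0;  // select: zero the shifted value
  return Val << Amt;               // shl, never by 32 or more
}

// Same guard for @srlv, with lshr in place of shl.
static uint32_t lshrGuarded(uint32_t X, uint32_t B) {
  bool InRange = B < 32;
  uint32_t Amt = InRange ? B : 0;
  uint32_t Val = InRange ? X : 0;
  return Val >> Amt;
}
```

Because both the amount and the value depend on the same loop-invariant comparison, hoisting the comparison and the amount `select` into the preheader leaves only half of the guard next to the shift inside the loop.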