This is an archive of the discontinued LLVM Phabricator instance.

Differential D76727

[VectorCombine] transform bitcasted shuffle to narrower elements
Closed, Public

Authored by spatel on Mar 24 2020, 12:51 PM.

Details

Reviewers

lebedev.ri
efriedma
RKSimon
t.p.northover

Commits

rGb6050ca18168: [VectorCombine] transform bitcasted shuffle to narrower elements

Summary

bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'

We do not attempt this in InstCombine because we do not want to change types and create new shuffle ops that are potentially not lowered as well as the original code. Here, we can check the cost model to see if it is worthwhile.

I've aggressively enabled this transform even if the types are the same size and/or equal cost because moving the bitcast allows InstCombine to make further simplifications.

In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the redundant shuffles, but we probably want to extend VectorCombine to handle the inverse pattern (shuffle-of-bitcast) to get that simplification directly in IR.
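
To make MaskC' concrete, here is a minimal standalone sketch of the mask expansion. It assumes each wide element splits into Scale consecutive narrow elements and that negative (undef) mask elements are passed through unchanged. The name narrowMask is hypothetical; the patch itself calls scaleShuffleMask from VectorUtils.h, which is not reproduced here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the mask expansion step; this is an illustrative
// model, not LLVM's scaleShuffleMask.
// Each index into the wide-element vector becomes Scale consecutive indices
// into the narrow-element vector. Negative (undef) indices are replicated.
std::vector<int> narrowMask(unsigned Scale, const std::vector<int> &Mask) {
  assert(Scale != 0 && "expected a positive scale factor");
  std::vector<int> Narrow;
  Narrow.reserve(Mask.size() * Scale);
  for (int Elt : Mask)
    for (unsigned I = 0; I != Scale; ++I)
      Narrow.push_back(Elt < 0 ? Elt
                               : Elt * static_cast<int>(Scale) +
                                     static_cast<int>(I));
  return Narrow;
}
```

With Scale = 4, the <4 x i32> mask <3, 2, 1, 0> expands to the 16-element byte mask that appears in the AVX check lines of shuffle.ll.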

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Mar 24 2020, 12:51 PM

Herald added subscribers: hiraditya, mcrosier. Mar 24 2020, 12:51 PM

Do we need to count the cost of bitcasts too?

In D76727#1939961, @lebedev.ri wrote:

Do we need to count the cost of bitcasts too?

Hmm...is there a scenario where the old bitcast is different in cost than the new bitcast? That op is effectively getting hoisted in the test cases shown here, so it would cancel out if we put it on both sides of the cost comparison.
The TTI API that I think we'd use for this is: TTI.getCastInstrCost(Instruction::BitCast, DestTy, SrcTy).

From reading through the getCastInstrCost() implementations, I don't think any backend
currently models it, but there's this comment in AArch64ISelLowering.cpp:

namespace llvm {

namespace AArch64ISD {

enum NodeType : unsigned {
<...>
  /// Natural vector cast. ISD::BITCAST is not natural in the big-endian
  /// world w.r.t vectors; which causes additional REV instructions to be
  /// generated to compensate for the byte-swapping. But sometimes we do
  /// need to re-interpret the data in SIMD vector registers in big-endian
  /// mode without emitting such REV instructions.
  NVCAST,

which is consistent with https://reviews.llvm.org/D40633#inline-355090 by @efriedma:

On some targets, vector bitcasts aren't free (IIRC big-endian ARM is like this).

In D76727#1940722, @lebedev.ri wrote:

<...> which is consistent with https://reviews.llvm.org/D40633#inline-355090 by @efriedma:

On some targets, vector bitcasts aren't free (IIRC big-endian ARM is like this).

I agree that bitcasts may not be free, but I don't see how that affects the cost calc for this transform.

I'm open to ideas on how to improve this, but I'm not sure how to proceed without some concrete examples:

This transform is too narrow to effectively cost model in isolation? Ie, we need to pattern match something bigger than just cast+shuf.
Implement a generic DAGCombine version of x86's canWidenShuffleElements() to allow targets to reverse this?
Limit this transform to targets where the bitcast is free (and potentially improve the base cost model to account for big-endian)?

Eeeek.
I'm not sure what I was thinking; there is indeed no point in modelling the cost
of the bitcast here, because we have the exact same bitcast instruction in either case.
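
The cancellation can be spelled out with a small sketch. It assumes each side of the comparison is one shuffle plus one bitcast, and that the bitcast costs the same in both positions because its source and destination types are unchanged. The function names and the plain int cost type are hypothetical, not TTI's API.

```cpp
#include <cassert>

// Hypothetical model: reject the transform when the new sequence costs more
// than the old one, counting the bitcast on both sides.
bool rejectWithBitcast(int OldShufCost, int NewShufCost, int BitcastCost) {
  return NewShufCost + BitcastCost > OldShufCost + BitcastCost;
}

// The same decision with the bitcast term dropped from both sides.
bool rejectShuffleOnly(int OldShufCost, int NewShufCost) {
  return NewShufCost > OldShufCost;
}
```

Under that assumption both predicates always agree, so only the two shuffle costs need to be queried. If a target ever priced the two bitcasts differently, the assumption (and this simplification) would no longer hold.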

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
257	I'm not sure how the second part of the check can fail?

spatel marked 3 inline comments as done. Mar 25 2020, 7:21 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
257	That's one that I've tripped on a few times in the past (but failed to add a test for here): a shufflevector can change the vector length (the mask value can have more/less elements than the source values). I'll add negative tests for both of those clauses.
267–268	The types are reversed here. This is covered by the SSE2 run of the 1st test, but I still missed the bug.
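
To illustrate the length-changing case from the line 257 comment: the result of a shufflevector has as many elements as its mask, not as its source. The sketch below models a single-source shuffle on std::vector. shuffleOneSrc is a hypothetical helper, and filling undef lanes (-1) with 0 is an arbitrary choice of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a single-source shufflevector.
// The result length is Mask.size(), independent of Src.size().
std::vector<int> shuffleOneSrc(const std::vector<int> &Src,
                               const std::vector<int> &Mask) {
  std::vector<int> Result;
  Result.reserve(Mask.size());
  for (int Idx : Mask) {
    if (Idx < 0) {
      Result.push_back(0); // undef lane: any value is allowed, 0 chosen here
      continue;
    }
    assert(static_cast<std::size_t>(Idx) < Src.size() &&
           "mask index out of range");
    Result.push_back(Src[static_cast<std::size_t>(Idx)]);
  }
  return Result;
}
```

A two-element source with the mask <1, 0, 1, 0> yields four elements, which is the shape exercised by the bitcast_shuf_narrow_element_wrong_size negative test.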

Patch updated:

Fix bug in cost calc - types were inverted (the new shuffle is executed in the destination type).
Added code comment to explain the cost calc.
Added negative tests and test comments.
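
A sketch of the corrected comparison, assuming shuffle cost can be looked up by vector type alone. The cost numbers are made-up placeholders chosen only to mimic the SSE2 and AVX2 outcomes in the tests; they are not values returned by TTI. ShuffleCosts, shouldNarrow, and the two placeholder tables are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-type shuffle cost table (type name -> cost).
using ShuffleCosts = std::map<std::string, int>;

// The old shuffle runs in the source type; the new shuffle runs in the
// destination type. Equal cost is accepted, matching the summary's intent to
// enable the transform even when it is only cost-neutral.
bool shouldNarrow(const ShuffleCosts &Costs, const std::string &SrcTy,
                  const std::string &DestTy) {
  int OldCost = Costs.at(SrcTy);
  int NewCost = Costs.at(DestTy);
  return NewCost <= OldCost;
}

// Placeholder numbers only: an expensive byte shuffle without pshufb.
ShuffleCosts placeholderSSE2() {
  return {{"v4i32", 1}, {"v4f32", 1}, {"v16i8", 4}};
}

// Placeholder numbers only: byte shuffle as cheap as the dword shuffle.
ShuffleCosts placeholderAVX2() {
  return {{"v4i32", 1}, {"v4f32", 1}, {"v16i8", 1}};
}
```

With these placeholder tables, v4i32 to v16i8 is rejected for the SSE2 table and accepted for the AVX2 table, while v4i32 to v4f32 is accepted for both.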

spatel mentioned this in D76844: [InstCombine] try to reduce shuffle with bitcasted operand. Mar 26 2020, 6:37 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
248	It may be useful to prefix the function[s] with a short blurb explaining what happens here, what pattern we strive to replace with what pattern.
287–288	NewMaskC.reserve(NewMask.size());
for (int NewMaskElt : NewMask)
  NewMaskC.push_back(Builder.getInt32(NewMaskElt));

This revision is now accepted and ready to land. Mar 26 2020, 12:11 PM

spatel added a parent revision: D72467: Remove "mask" operand from shufflevector. Mar 31 2020, 6:41 AM

D72467 is not actually a parent, but adding that here because this would conflict (force another rebase of that patch).

spatel marked 3 inline comments as done. Apr 2 2020, 10:17 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
287–288	Obsoleted by D72467.

Patch updated - no logic diffs from before but:

Rebased/simplified with new mask-operand-less version of shuffle (D72467, D77183).
Added function comment to explain the transform/motivation.

Closed by commit rGb6050ca18168: [VectorCombine] transform bitcasted shuffle to narrower elements (authored by spatel). Apr 2 2020, 10:50 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Apr 2 2020, 10:50 AM

spatel mentioned this in rGf4448063ccf1: [InstCombine] try to reduce shuffle with bitcasted operand. Apr 2 2020, 10:51 AM

spatel mentioned this in rG389704cc601b: [PhaseOrdering] add shuffle tests based on D40633; NFC. Apr 3 2020, 10:15 AM

spatel mentioned this in D77881: [VectorUtils] add IR-level analysis for widening of shuffle mask. Apr 10 2020, 8:51 AM

spatel mentioned this in rGc23cbefd9d73: [VectorUtils] add IR-level analysis for widening of shuffle mask. Apr 12 2020, 7:28 AM

spatel mentioned this in D78371: [VectorCombine] transform bitcasted shuffle to wider elements. Apr 17 2020, 7:43 AM

spatel mentioned this in rGbef6e67e95fb: [VectorCombine] transform bitcasted shuffle to wider elements. Apr 19 2020, 5:52 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	45 lines
llvm/test/Transforms/VectorCombine/X86/shuffle.ll	87 lines

Diff 254566

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

 <...>
 //
 //===----------------------------------------------------------------------===//

 #include "llvm/Transforms/Vectorize/VectorCombine.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/Analysis/GlobalsModRef.h"
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Analysis/ValueTracking.h"
+#include "llvm/Analysis/VectorUtils.h"
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/Pass.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Transforms/Vectorize.h"
 <...>
 static bool foldExtractExtract(Instruction &I, const TargetTransformInfo &TTI) {
   <...>
   if (Pred != CmpInst::BAD_ICMP_PREDICATE)
     foldExtExtCmp(Ext0, Ext1, I, TTI);
   else
     foldExtExtBinop(Ext0, Ext1, I, TTI);

   return true;
 }

+/// If this is a bitcast to narrow elements from a shuffle of wider elements,
+/// try to bitcast the source vector to the narrow type followed by shuffle.
+/// This can enable further transforms by moving bitcasts or shuffles together.
+static bool foldBitcastShuf(Instruction &I, const TargetTransformInfo &TTI) {
+  Value *V;
+  ArrayRef<int> Mask;
+  if (!match(&I, m_BitCast(m_OneUse(m_ShuffleVector(m_Value(V), m_Undef(),
+                                                    m_Mask(Mask))))))
+    return false;
+
+  Type *DestTy = I.getType();
+  Type *SrcTy = V->getType();
+  if (!DestTy->isVectorTy() || I.getOperand(0)->getType() != SrcTy)
+    return false;
+
+  // TODO: Handle bitcast from narrow element type to wide element type.
+  assert(SrcTy->isVectorTy() && "Shuffle of non-vector type?");
+  unsigned DestNumElts = DestTy->getVectorNumElements();
+  unsigned SrcNumElts = SrcTy->getVectorNumElements();
+  if (SrcNumElts > DestNumElts)
+    return false;
+
+  // The new shuffle must not cost more than the old shuffle. The bitcast is
+  // moved ahead of the shuffle, so assume that it has the same cost as before.
+  if (TTI.getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, DestTy) >
+      TTI.getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, SrcTy))
+    return false;
+
+  // Bitcast the source vector and expand the shuffle mask to the equivalent for
+  // narrow elements.
+  // bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
+  IRBuilder<> Builder(&I);
+  Value *CastV = Builder.CreateBitCast(V, DestTy);
+  SmallVector<int, 16> NewMask;
+  assert(DestNumElts % SrcNumElts == 0 && "Unexpected shuffle mask");
+  unsigned ScaleFactor = DestNumElts / SrcNumElts;
+  scaleShuffleMask(ScaleFactor, Mask, NewMask);
+  Value *Shuf = Builder.CreateShuffleVector(CastV, UndefValue::get(DestTy),
+                                            NewMask);
+  I.replaceAllUsesWith(Shuf);
+  return true;
+}
+
 /// This is the entry point for all transforms. Pass manager differences are
 /// handled in the callers of this function.
 static bool runImpl(Function &F, const TargetTransformInfo &TTI,
                     const DominatorTree &DT) {
   if (DisableVectorCombine)
     return false;

   bool MadeChange = false;
   for (BasicBlock &BB : F) {
     // Ignore unreachable basic blocks.
     if (!DT.isReachableFromEntry(&BB))
       continue;
     // Do not delete instructions under here and invalidate the iterator.
     // Walk the block backwards for efficiency. We're matching a chain of
     // use->defs, so we're more likely to succeed by starting from the bottom.
     // TODO: It could be more efficient to remove dead instructions
     // iteratively in this loop rather than waiting until the end.
     for (Instruction &I : make_range(BB.rbegin(), BB.rend())) {
       if (isa<DbgInfoIntrinsic>(I))
         continue;
       MadeChange |= foldExtractExtract(I, TTI);
+      MadeChange |= foldBitcastShuf(I, TTI);
     }
   }

   // We're done with transforms, so remove dead instructions.
   if (MadeChange)
     for (BasicBlock &BB : F)
       SimplifyInstructionsInBlock(&BB);

 <...>
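
The legality checks in foldBitcastShuf can be summarized in a standalone model. This is a sketch under simplifying assumptions: a type is reduced to an element count and an element width, with a count of zero standing for a non-vector type, and VecTy and canNarrowShuffle are hypothetical names. Where the patch asserts that the element counts divide evenly, this sketch returns false instead; that is a choice of the sketch, not a claim about the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified view of a type: NumElts == 0 models a scalar.
struct VecTy {
  unsigned NumElts;
  unsigned EltBits;
};

// Src is the type of the shuffle's input V, Dest is the bitcast result type,
// and MaskLen is the number of elements in the shuffle mask.
bool canNarrowShuffle(VecTy Src, VecTy Dest, std::size_t MaskLen) {
  if (Src.NumElts == 0 || Dest.NumElts == 0)
    return false; // must cast between vector types
  if (MaskLen != Src.NumElts)
    return false; // length-changing shuffle
  if (Src.NumElts * Src.EltBits != Dest.NumElts * Dest.EltBits)
    return false; // a bitcast preserves the total bit width
  if (Src.NumElts > Dest.NumElts)
    return false; // casting to wider elements is not handled here
  return Dest.NumElts % Src.NumElts == 0;
}
```

For example, <4 x i32> to <16 x i8> with a four-element mask passes, while the <8 x i16> to <4 x i32> case and the i128 destination are both rejected, in line with the negative tests.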

llvm/test/Transforms/VectorCombine/X86/shuffle.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=SSE2 | FileCheck %s --check-prefixes=CHECK,SSE
 ; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=AVX2 | FileCheck %s --check-prefixes=CHECK,AVX

+; x86 does not have a cheap v16i8 shuffle until SSSE3 (pshufb)
+
 define <16 x i8> @bitcast_shuf_narrow_element(<4 x i32> %v) {
-; CHECK-LABEL: @bitcast_shuf_narrow_element(
-; CHECK-NEXT: [[SHUF:%.*]] = shufflevector <4 x i32> [[V:%.*]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
-; CHECK-NEXT: [[R:%.*]] = bitcast <4 x i32> [[SHUF]] to <16 x i8>
-; CHECK-NEXT: ret <16 x i8> [[R]]
+; SSE-LABEL: @bitcast_shuf_narrow_element(
+; SSE-NEXT: [[SHUF:%.*]] = shufflevector <4 x i32> [[V:%.*]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; SSE-NEXT: [[R:%.*]] = bitcast <4 x i32> [[SHUF]] to <16 x i8>
+; SSE-NEXT: ret <16 x i8> [[R]]
+;
+; AVX-LABEL: @bitcast_shuf_narrow_element(
+; AVX-NEXT: [[TMP1:%.*]] = bitcast <4 x i32> [[V:%.*]] to <16 x i8>
+; AVX-NEXT: [[TMP2:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> undef, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>
+; AVX-NEXT: ret <16 x i8> [[TMP2]]
 ;
   %shuf = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %r = bitcast <4 x i32> %shuf to <16 x i8>
   ret <16 x i8> %r
 }

+; v4f32 is the same cost as v4i32, so this always works
+
 define <4 x float> @bitcast_shuf_same_size(<4 x i32> %v) {
 ; CHECK-LABEL: @bitcast_shuf_same_size(
-; CHECK-NEXT: [[SHUF:%.*]] = shufflevector <4 x i32> [[V:%.*]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
-; CHECK-NEXT: [[R:%.*]] = bitcast <4 x i32> [[SHUF]] to <4 x float>
-; CHECK-NEXT: ret <4 x float> [[R]]
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast <4 x i32> [[V:%.*]] to <4 x float>
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x float> [[TMP1]], <4 x float> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT: ret <4 x float> [[TMP2]]
 ;
   %shuf = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %r = bitcast <4 x i32> %shuf to <4 x float>
   ret <4 x float> %r
 }

+; Negative test - length-changing shuffle
+
 define <16 x i8> @bitcast_shuf_narrow_element_wrong_size(<2 x i32> %v) {
 ; CHECK-LABEL: @bitcast_shuf_narrow_element_wrong_size(
 ; CHECK-NEXT: [[SHUF:%.*]] = shufflevector <2 x i32> [[V:%.*]], <2 x i32> undef, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
 ; CHECK-NEXT: [[R:%.*]] = bitcast <4 x i32> [[SHUF]] to <16 x i8>
 ; CHECK-NEXT: ret <16 x i8> [[R]]
 ;
   %shuf = shufflevector <2 x i32> %v, <2 x i32> undef, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
   %r = bitcast <4 x i32> %shuf to <16 x i8>
   ret <16 x i8> %r
 }

+; Negative test - must cast to vector type
+
 define i128 @bitcast_shuf_narrow_element_wrong_type(<4 x i32> %v) {
 ; CHECK-LABEL: @bitcast_shuf_narrow_element_wrong_type(
 ; CHECK-NEXT: [[SHUF:%.*]] = shufflevector <4 x i32> [[V:%.*]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
 ; CHECK-NEXT: [[R:%.*]] = bitcast <4 x i32> [[SHUF]] to i128
 ; CHECK-NEXT: ret i128 [[R]]
 ;
   %shuf = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %r = bitcast <4 x i32> %shuf to i128
   ret i128 %r
 }

+; Negative test - but might want to try this
+
 define <4 x i32> @bitcast_shuf_wide_element(<8 x i16> %v) {
 ; CHECK-LABEL: @bitcast_shuf_wide_element(
 ; CHECK-NEXT: [[SHUF:%.*]] = shufflevector <8 x i16> [[V:%.*]], <8 x i16> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 2, i32 3, i32 2, i32 3>
 ; CHECK-NEXT: [[R:%.*]] = bitcast <8 x i16> [[SHUF]] to <4 x i32>
 ; CHECK-NEXT: ret <4 x i32> [[R]]
 ;
   %shuf = shufflevector <8 x i16> %v, <8 x i16> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 2, i32 3, i32 2, i32 3>
   %r = bitcast <8 x i16> %shuf to <4 x i32>
   ret <4 x i32> %r
 }

 declare void @use(<4 x i32>)

+; Negative test - don't create an extra shuffle
+
 define <16 x i8> @bitcast_shuf_uses(<4 x i32> %v) {
 ; CHECK-LABEL: @bitcast_shuf_uses(
 ; CHECK-NEXT: [[SHUF:%.*]] = shufflevector <4 x i32> [[V:%.*]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
 ; CHECK-NEXT: call void @use(<4 x i32> [[SHUF]])
 ; CHECK-NEXT: [[R:%.*]] = bitcast <4 x i32> [[SHUF]] to <16 x i8>
 ; CHECK-NEXT: ret <16 x i8> [[R]]
 ;
   %shuf = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   call void @use(<4 x i32> %shuf)
   %r = bitcast <4 x i32> %shuf to <16 x i8>
   ret <16 x i8> %r
 }

 define <2 x i64> @PR35454_1(<2 x i64> %v) {
-; CHECK-LABEL: @PR35454_1(
-; CHECK-NEXT: [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; CHECK-NEXT: [[PERMIL:%.*]] = shufflevector <4 x i32> [[BC]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
-; CHECK-NEXT: [[BC1:%.*]] = bitcast <4 x i32> [[PERMIL]] to <16 x i8>
-; CHECK-NEXT: [[ADD:%.*]] = shl <16 x i8> [[BC1]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
-; CHECK-NEXT: [[BC2:%.*]] = bitcast <16 x i8> [[ADD]] to <4 x i32>
-; CHECK-NEXT: [[PERMIL1:%.*]] = shufflevector <4 x i32> [[BC2]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
-; CHECK-NEXT: [[BC3:%.*]] = bitcast <4 x i32> [[PERMIL1]] to <2 x i64>
-; CHECK-NEXT: ret <2 x i64> [[BC3]]
+; SSE-LABEL: @PR35454_1(
+; SSE-NEXT: [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
+; SSE-NEXT: [[PERMIL:%.*]] = shufflevector <4 x i32> [[BC]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; SSE-NEXT: [[BC1:%.*]] = bitcast <4 x i32> [[PERMIL]] to <16 x i8>
+; SSE-NEXT: [[ADD:%.*]] = shl <16 x i8> [[BC1]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
+; SSE-NEXT: [[BC2:%.*]] = bitcast <16 x i8> [[ADD]] to <4 x i32>
+; SSE-NEXT: [[PERMIL1:%.*]] = shufflevector <4 x i32> [[BC2]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; SSE-NEXT: [[BC3:%.*]] = bitcast <4 x i32> [[PERMIL1]] to <2 x i64>
+; SSE-NEXT: ret <2 x i64> [[BC3]]
+;
+; AVX-LABEL: @PR35454_1(
+; AVX-NEXT: [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
+; AVX-NEXT: [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <16 x i8>
+; AVX-NEXT: [[TMP2:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> undef, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>
+; AVX-NEXT: [[ADD:%.*]] = shl <16 x i8> [[TMP2]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
+; AVX-NEXT: [[BC2:%.*]] = bitcast <16 x i8> [[ADD]] to <4 x i32>
+; AVX-NEXT: [[PERMIL1:%.*]] = shufflevector <4 x i32> [[BC2]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; AVX-NEXT: [[BC3:%.*]] = bitcast <4 x i32> [[PERMIL1]] to <2 x i64>
+; AVX-NEXT: ret <2 x i64> [[BC3]]
 ;
   %bc = bitcast <2 x i64> %v to <4 x i32>
   %permil = shufflevector <4 x i32> %bc, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %bc1 = bitcast <4 x i32> %permil to <16 x i8>
   %add = shl <16 x i8> %bc1, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
   %bc2 = bitcast <16 x i8> %add to <4 x i32>
   %permil1 = shufflevector <4 x i32> %bc2, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %bc3 = bitcast <4 x i32> %permil1 to <2 x i64>
   ret <2 x i64> %bc3
 }

 define <2 x i64> @PR35454_2(<2 x i64> %v) {
-; CHECK-LABEL: @PR35454_2(
-; CHECK-NEXT: [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; CHECK-NEXT: [[PERMIL:%.*]] = shufflevector <4 x i32> [[BC]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
-; CHECK-NEXT: [[BC1:%.*]] = bitcast <4 x i32> [[PERMIL]] to <8 x i16>
-; CHECK-NEXT: [[ADD:%.*]] = shl <8 x i16> [[BC1]], <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
-; CHECK-NEXT: [[BC2:%.*]] = bitcast <8 x i16> [[ADD]] to <4 x i32>
-; CHECK-NEXT: [[PERMIL1:%.*]] = shufflevector <4 x i32> [[BC2]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
-; CHECK-NEXT: [[BC3:%.*]] = bitcast <4 x i32> [[PERMIL1]] to <2 x i64>
-; CHECK-NEXT: ret <2 x i64> [[BC3]]
+; SSE-LABEL: @PR35454_2(
+; SSE-NEXT: [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
+; SSE-NEXT: [[PERMIL:%.*]] = shufflevector <4 x i32> [[BC]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; SSE-NEXT: [[BC1:%.*]] = bitcast <4 x i32> [[PERMIL]] to <8 x i16>
+; SSE-NEXT: [[ADD:%.*]] = shl <8 x i16> [[BC1]], <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
+; SSE-NEXT: [[BC2:%.*]] = bitcast <8 x i16> [[ADD]] to <4 x i32>
+; SSE-NEXT: [[PERMIL1:%.*]] = shufflevector <4 x i32> [[BC2]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; SSE-NEXT: [[BC3:%.*]] = bitcast <4 x i32> [[PERMIL1]] to <2 x i64>
+; SSE-NEXT: ret <2 x i64> [[BC3]]
+;
+; AVX-LABEL: @PR35454_2(
+; AVX-NEXT: [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
+; AVX-NEXT: [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <8 x i16>
+; AVX-NEXT: [[TMP2:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> undef, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>
+; AVX-NEXT: [[ADD:%.*]] = shl <8 x i16> [[TMP2]], <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
+; AVX-NEXT: [[BC2:%.*]] = bitcast <8 x i16> [[ADD]] to <4 x i32>
+; AVX-NEXT: [[PERMIL1:%.*]] = shufflevector <4 x i32> [[BC2]], <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; AVX-NEXT: [[BC3:%.*]] = bitcast <4 x i32> [[PERMIL1]] to <2 x i64>
+; AVX-NEXT: ret <2 x i64> [[BC3]]
 ;
   %bc = bitcast <2 x i64> %v to <4 x i32>
   %permil = shufflevector <4 x i32> %bc, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %bc1 = bitcast <4 x i32> %permil to <8 x i16>
   %add = shl <8 x i16> %bc1, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
   %bc2 = bitcast <8 x i16> %add to <4 x i32>
   %permil1 = shufflevector <4 x i32> %bc2, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %bc3 = bitcast <4 x i32> %permil1 to <2 x i64>
   ret <2 x i64> %bc3
 }