This is an archive of the discontinued LLVM Phabricator instance.

Differential D87378

[SVE] Fix InstCombinerImpl::PromoteCastOfAllocation for scalable vectors
ClosedPublic

Authored by david-arm on Sep 9 2020, 6:51 AM.

Details

Reviewers

sdesmalen
ctetreau
c-rhodes
rengolin
efriedma

Commits

rG59c4d5aad060: [SVE] Fix InstCombinerImpl::PromoteCastOfAllocation for scalable vectors

Summary

In this patch I've fixed some warnings that arose from the implicit
cast of TypeSize -> uint64_t. I tried writing a variety of different
cases to show how this optimisation might work for scalable vectors
and found:

1. The optimisation does not work for cases where the cast type
   is scalable and the allocated type is not. This is because we need to
   know how many times the cast type fits into the allocated type.

2. If we pass all the various checks for the case when the allocated
   type is scalable and the cast type is not, then when creating the
   new alloca we have to take vscale into account. This leads to
   sub-optimal IR that is worse than the original IR.

3. For the remaining case when both the alloca and cast types are
   scalable it is hard to find examples where the optimisation would
   kick in, except for simple bitcasts, because we typically fail the
   ABI alignment checks.
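
The first point comes down to arithmetic: the size of a scalable type is its known minimum size multiplied by the runtime value vscale, so the number of times it fits into a fixed-size allocation is not a compile-time constant. A minimal sketch of that arithmetic follows. `ToySize`, `sizeInBytes` and `timesCastFits` are hypothetical names made up for illustration; they are not LLVM's `TypeSize` class and are not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a type size: KnownMin bytes, multiplied by the
// runtime value vscale when Scalable is true. This is not LLVM's TypeSize.
struct ToySize {
  uint64_t KnownMin;
  bool Scalable;
};

// Size in bytes once a concrete vscale is chosen (only known at run time).
uint64_t sizeInBytes(ToySize S, uint64_t VScale) {
  assert(VScale >= 1 && "vscale is a positive integer");
  return S.Scalable ? S.KnownMin * VScale : S.KnownMin;
}

// How many whole copies of Cast fit into Alloc for the given vscale.
uint64_t timesCastFits(ToySize Alloc, ToySize Cast, uint64_t VScale) {
  return sizeInBytes(Alloc, VScale) / sizeInBytes(Cast, VScale);
}
```

Taking a fixed 64-byte allocation such as `[16 x i32]` and a cast type with a known minimum of 16 bytes such as `<vscale x 4 x i32>`, `timesCastFits` gives 4 for vscale 1, 2 for vscale 2 and 0 for vscale 8. When both sizes are scalable the vscale factor cancels, which is why comparing known minimum sizes is only meaningful in the both-fixed and both-scalable cases.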

For now I've changed the code to bail out if only one of the alloca
and cast types is scalable. This means we continue to support the
existing cases where both types are fixed, and also the specific case
when both types are scalable with the same size and alignment, for
example a simple bitcast of an alloca to another type.
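
The bail-out rule described here can be sketched in isolation. The patch expresses it as `AllocIsScalable != CastIsScalable`; the enum and function names below are hypothetical and exist only for this illustration. The real function goes on to perform alignment and size checks that this sketch leaves out.

```cpp
#include <cassert>

// Hypothetical outcome of the scalability check alone; later checks in the
// real function can still reject the transform.
enum class ScalableCheck { Continue, BailOut };

// Mirrors the patch's `AllocIsScalable != CastIsScalable` test: the
// transform is only considered when both types are fixed or both scalable.
ScalableCheck checkScalability(bool AllocIsScalable, bool CastIsScalable) {
  return AllocIsScalable != CastIsScalable ? ScalableCheck::BailOut
                                           : ScalableCheck::Continue;
}
```

Using `!=` rather than `||` is what keeps the both-scalable case alive: `checkScalability(true, true)` continues, while either mixed combination bails out.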

I've added tests that show we don't attempt to promote the alloca,
except for simple bitcasts:

Transforms/InstCombine/AArch64/sve-cast-of-alloc.ll

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. Sep 9 2020, 6:51 AM

Herald added a reviewer: rengolin. Sep 9 2020, 6:51 AM

Herald added a reviewer: efriedma.

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, psnobl, hiraditya and 2 others.

david-arm requested review of this revision. Sep 9 2020, 6:51 AM

Harbormaster completed remote builds in B71088: Diff 290730. Sep 9 2020, 9:02 AM

sdesmalen added inline comments. Sep 9 2020, 9:37 AM

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
Line 103: This also seems to prevent the optimisation on bitcasts between scalable-vector-type T1 and an alloca with scalable-vector-type T2. You can make the expression use `!=` instead of `||`, but then the code below needs some more changes to work on TypeSize.

david-arm added inline comments. Sep 9 2020, 10:00 AM

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
Line 103: Yes, this was deliberate. I think I discussed this in the commit message. I was just trying to avoid making needless code changes, because casting an alloca to a scalable type still causes us to fail the alignment checks below. I simply couldn't write a test case that proved the additional work needed was correct. I thought it better to avoid changing the code path and then being unable to defend it. I still have the original patch that converted everything to using TypeSize, which I could make use of here if we are able to come up with a test that shows a before and after difference.

sdesmalen added inline comments. Sep 14 2020, 3:09 AM

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
Line 103: Okay, thanks for explaining! I guess an extra argument is that because scalable vectors are not allowed in arrays, there is little benefit of making the code below work for the scalable/scalable case even if the ABI alignment would match. If it is not too much effort, it would be nice if this function could handle:

    %tmp = alloca <vscale x 16 x i8>, align 16
    %cast = bitcast <vscale x 16 x i8>* %tmp to <vscale x 2 x i64>*
    store volatile <vscale x 2 x i64> zeroinitializer, <vscale x 2 x i64>* %cast, align 16
    %reload = load volatile <vscale x 2 x i64>, <vscale x 2 x i64>* %cast, align 16
    store <vscale x 2 x i64> %reload, <vscale x 2 x i64>* %out, align 16
    ret void

->

    %tmp = alloca <vscale x 2 x i64>, align 16
    store volatile <vscale x 2 x i64> zeroinitializer, <vscale x 2 x i64>* %tmp, align 16
    %reload = load volatile <vscale x 2 x i64>, <vscale x 2 x i64>* %tmp, align 16
    store <vscale x 2 x i64> %reload, <vscale x 2 x i64>* %out, align 16
    ret void

Even if that is only handled as a special case before bailing out. This case is currently supported albeit with the necessary warnings being emitted.

llvm/test/Transforms/InstCombine/AArch64/sve-cast-of-alloc.ll
Line 31: Not something to be fixed in this patch, but InstCombine changing the alignment to 64 seems wrong for scalable vectors.
Line 47: nit: This test does not actually test your change, because it bails out on the first alignment check. The same holds for the two functions below (`scalable16i32_to_fixed16i32` and `scalable32i32_to_scalable16i32`). I think it's fine to leave the tests in, because at least it guards against possible regressions in case the ABI alignment or checks ever change.
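
The alignment checks referred to in the nit are two early exits in `PromoteCastOfAllocation`: one rejects a cast type that is less aligned than the allocated type, the other rejects a multi-use alloca unless the alignment strictly increases. A minimal sketch of just those two conditions is below. The function name is hypothetical, and the alignments are plain byte counts supplied by the caller rather than values queried from a real `DataLayout`.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the two alignment-related early exits shown in the diff
// would let the transform continue. Alignments are in bytes and are
// illustrative inputs, not values taken from a real DataLayout.
bool passesAlignmentChecks(uint64_t AllocAlign, uint64_t CastAlign,
                           bool HasOneUse) {
  assert(AllocAlign != 0 && CastAlign != 0 && "alignments must be non-zero");
  // The cast type must not be less aligned than the allocated type.
  if (CastAlign < AllocAlign)
    return false;
  // With several uses, only a strict alignment increase is accepted.
  if (!HasOneUse && CastAlign == AllocAlign)
    return false;
  return true;
}
```

For example, `passesAlignmentChecks(16, 8, true)` fails on the first condition, and `passesAlignmentChecks(16, 16, false)` fails on the second. The second condition is consistent with the `scalable32i16_to_scalable16i32_multiuse` test keeping its bitcast, assuming the two types there share an ABI alignment.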

Added support for 'promotion' of the alloca type to the cast type when both types are scalable and have the same size and alignment.

david-arm marked an inline comment as done. Sep 16 2020, 1:42 AM

LGTM, thanks for the changes @david-arm!

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
Line 142: If arrays of scalable types are not supported, this should be an `assert`.

This revision is now accepted and ready to land. Sep 22 2020, 6:07 AM

Closed by commit rG59c4d5aad060: [SVE] Fix InstCombinerImpl::PromoteCastOfAllocation for scalable vectors (authored by david-arm). Sep 23 2020, 12:43 AM

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rG59c4d5aad060: [SVE] Fix InstCombinerImpl::PromoteCastOfAllocation for scalable vectors.

Revision Contents

Path                                                             Size
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp             24 lines
llvm/test/Transforms/InstCombine/AArch64/sve-cast-of-alloc.ll    142 lines

Diff 293664

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp

Instruction *InstCombinerImpl::PromoteCastOfAllocation(BitCastInst &CI,
   IRBuilderBase::InsertPointGuard Guard(Builder);
   Builder.SetInsertPoint(&AI);

   // Get the type really allocated and the type casted to.
   Type *AllocElTy = AI.getAllocatedType();
   Type *CastElTy = PTy->getElementType();
   if (!AllocElTy->isSized() || !CastElTy->isSized()) return nullptr;

+  // This optimisation does not work for cases where the cast type
+  // is scalable and the allocated type is not. This because we need to
+  // know how many times the casted type fits into the allocated type.
+  // For the opposite case where the allocated type is scalable and the
+  // cast type is not this leads to poor code quality due to the
+  // introduction of 'vscale' into the calculations. It seems better to
+  // bail out for this case too until we've done a proper cost-benefit
+  // analysis.
+  bool AllocIsScalable = isa<ScalableVectorType>(AllocElTy);
+  bool CastIsScalable = isa<ScalableVectorType>(CastElTy);
+  if (AllocIsScalable != CastIsScalable) return nullptr;
+
   Align AllocElTyAlign = DL.getABITypeAlign(AllocElTy);
   Align CastElTyAlign = DL.getABITypeAlign(CastElTy);
   if (CastElTyAlign < AllocElTyAlign) return nullptr;

   // If the allocation has multiple uses, only promote it if we are strictly
   // increasing the alignment of the resultant allocation. If we keep it the
   // same, we open the door to infinite loops of various kinds.
   if (!AI.hasOneUse() && CastElTyAlign == AllocElTyAlign) return nullptr;

-  uint64_t AllocElTySize = DL.getTypeAllocSize(AllocElTy);
-  uint64_t CastElTySize = DL.getTypeAllocSize(CastElTy);
+  // The alloc and cast types should be either both fixed or both scalable.
+  uint64_t AllocElTySize = DL.getTypeAllocSize(AllocElTy).getKnownMinSize();
+  uint64_t CastElTySize = DL.getTypeAllocSize(CastElTy).getKnownMinSize();
   if (CastElTySize == 0 || AllocElTySize == 0) return nullptr;

   // If the allocation has multiple uses, only promote it if we're not
   // shrinking the amount of memory being allocated.
-  uint64_t AllocElTyStoreSize = DL.getTypeStoreSize(AllocElTy);
-  uint64_t CastElTyStoreSize = DL.getTypeStoreSize(CastElTy);
+  uint64_t AllocElTyStoreSize = DL.getTypeStoreSize(AllocElTy).getKnownMinSize();
+  uint64_t CastElTyStoreSize = DL.getTypeStoreSize(CastElTy).getKnownMinSize();
   if (!AI.hasOneUse() && CastElTyStoreSize < AllocElTyStoreSize) return nullptr;

   // See if we can satisfy the modulus by pulling a scale out of the array
   // size argument.
   unsigned ArraySizeScale;
   uint64_t ArrayOffset;
   Value *NumElements = // See if the array size is a decomposable linear expr.
     decomposeSimpleLinearExpr(AI.getOperand(0), ArraySizeScale, ArrayOffset);

   // If we can now satisfy the modulus, by using a non-1 scale, we really can
   // do the xform.
   if ((AllocElTySize*ArraySizeScale) % CastElTySize != 0 ||
       (AllocElTySize*ArrayOffset ) % CastElTySize != 0) return nullptr;

+  // We don't currently support arrays of scalable types.
+  assert(!AllocIsScalable || (ArrayOffset == 1 && ArraySizeScale == 0));
+
   unsigned Scale = (AllocElTySize*ArraySizeScale)/CastElTySize;
   Value *Amt = nullptr;
   if (Scale == 1) {
     Amt = NumElements;
   } else {
     Amt = ConstantInt::get(AI.getArraySize()->getType(), Scale);
     // Insert before the alloca, not before the cast.
     Amt = Builder.CreateMul(Amt, NumElements);
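
The divisibility test and `Scale` computation in the hunk above can be tried out on plain integers. The sketch below is a standalone approximation under stated assumptions: `computeScale` is a hypothetical helper, all quantities are passed in as `uint64_t` (the original uses `unsigned` for `ArraySizeScale` and `Scale`), and returning `false` stands in for the original's `return nullptr`. How `ArrayOffset` is used after this point lies outside the lines shown and is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the size checks and Scale computation from the diff. Returns
// false where the original returns nullptr (no transform); on success the
// computed value is written to Scale.
bool computeScale(uint64_t AllocElTySize, uint64_t CastElTySize,
                  uint64_t ArraySizeScale, uint64_t ArrayOffset,
                  uint64_t &Scale) {
  // Zero-sized types are rejected before any division takes place.
  if (CastElTySize == 0 || AllocElTySize == 0)
    return false;
  // Both the scaled part and the offset part of the array size must cover
  // a whole number of cast-type elements.
  if ((AllocElTySize * ArraySizeScale) % CastElTySize != 0 ||
      (AllocElTySize * ArrayOffset) % CastElTySize != 0)
    return false;
  Scale = (AllocElTySize * ArraySizeScale) / CastElTySize;
  return true;
}
```

With a 4-byte allocated element, an array size scale of 2, an offset of 0 and an 8-byte cast type, the checks pass and `Scale` is 1. With a scale of 1 instead, `4 % 8` is non-zero and the transform is rejected. The values the new assertion requires for scalable types (`ArrayOffset == 1`, `ArraySizeScale == 0`) give a `Scale` of 0 when both known minimum sizes are equal.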

llvm/test/Transforms/InstCombine/AArch64/sve-cast-of-alloc.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -instcombine -mtriple aarch64-linux-gnu -mattr=+sve -S < %s 2>%t | FileCheck %s
				; RUN: FileCheck --check-prefix=WARN --allow-empty %s <%t

				; If this check fails please read test/CodeGen/AArch64/README for instructions on how to resolve it.
				; WARN-NOT: warning

				define void @fixed_array16i32_to_scalable4i32(<vscale x 4 x i32>* %out) {
				; CHECK-LABEL: @fixed_array16i32_to_scalable4i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP:%.*]] = alloca [16 x i32], align 16
				; CHECK-NEXT: [[CAST:%.*]] = bitcast [16 x i32]* [[TMP]] to <vscale x 4 x i32>*
				; CHECK-NEXT: store volatile <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32>* [[CAST]], align 16
				; CHECK-NEXT: [[RELOAD:%.*]] = load volatile <vscale x 4 x i32>, <vscale x 4 x i32>* [[CAST]], align 16
				; CHECK-NEXT: store <vscale x 4 x i32> [[RELOAD]], <vscale x 4 x i32>* [[OUT:%.*]], align 16
				; CHECK-NEXT: ret void
				;
				entry:
				%tmp = alloca [16 x i32], align 16
				%cast = bitcast [16 x i32]* %tmp to <vscale x 4 x i32>*
				store volatile <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32>* %cast, align 16
				%reload = load volatile <vscale x 4 x i32>, <vscale x 4 x i32>* %cast, align 16
				store <vscale x 4 x i32> %reload, <vscale x 4 x i32>* %out, align 16
				ret void
				}

				define void @scalable4i32_to_fixed16i32(<16 x i32>* %out) {
				; CHECK-LABEL: @scalable4i32_to_fixed16i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP:%.*]] = alloca <vscale x 4 x i32>, align 64
				; CHECK-NEXT: [[CAST:%.*]] = bitcast <vscale x 4 x i32>* [[TMP]] to <16 x i32>*
				; CHECK-NEXT: store <16 x i32> zeroinitializer, <16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: [[RELOAD:%.*]] = load volatile <16 x i32>, <16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: store <16 x i32> [[RELOAD]], <16 x i32>* [[OUT:%.*]], align 16
				; CHECK-NEXT: ret void
				;
				entry:
				%tmp = alloca <vscale x 4 x i32>, align 16
				%cast = bitcast <vscale x 4 x i32>* %tmp to <16 x i32>*
				store <16 x i32> zeroinitializer, <16 x i32>* %cast, align 16
				%reload = load volatile <16 x i32>, <16 x i32>* %cast, align 16
				store <16 x i32> %reload, <16 x i32>* %out, align 16
				ret void
				}

				define void @fixed16i32_to_scalable4i32(<vscale x 4 x i32>* %out) {
				; CHECK-LABEL: @fixed16i32_to_scalable4i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP:%.*]] = alloca <16 x i32>, align 16
				; CHECK-NEXT: [[CAST:%.*]] = bitcast <16 x i32>* [[TMP]] to <vscale x 4 x i32>*
				; CHECK-NEXT: store volatile <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32>* [[CAST]], align 16
				; CHECK-NEXT: [[RELOAD:%.*]] = load volatile <vscale x 4 x i32>, <vscale x 4 x i32>* [[CAST]], align 16
				; CHECK-NEXT: store <vscale x 4 x i32> [[RELOAD]], <vscale x 4 x i32>* [[OUT:%.*]], align 16
				; CHECK-NEXT: ret void
				;
				entry:
				%tmp = alloca <16 x i32>, align 16
				%cast = bitcast <16 x i32>* %tmp to <vscale x 4 x i32>*
				store volatile <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32>* %cast, align 16
				%reload = load volatile <vscale x 4 x i32>, <vscale x 4 x i32>* %cast, align 16
				store <vscale x 4 x i32> %reload, <vscale x 4 x i32>* %out, align 16
				ret void
				}

				define void @scalable16i32_to_fixed16i32(<16 x i32>* %out) {
				; CHECK-LABEL: @scalable16i32_to_fixed16i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP:%.*]] = alloca <vscale x 16 x i32>, align 64
				; CHECK-NEXT: [[CAST:%.*]] = bitcast <vscale x 16 x i32>* [[TMP]] to <16 x i32>*
				; CHECK-NEXT: store volatile <16 x i32> zeroinitializer, <16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: [[RELOAD:%.*]] = load volatile <16 x i32>, <16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: store <16 x i32> [[RELOAD]], <16 x i32>* [[OUT:%.*]], align 16
				; CHECK-NEXT: ret void
				;
				entry:
				%tmp = alloca <vscale x 16 x i32>, align 16
				%cast = bitcast <vscale x 16 x i32>* %tmp to <16 x i32>*
				store volatile <16 x i32> zeroinitializer, <16 x i32>* %cast, align 16
				%reload = load volatile <16 x i32>, <16 x i32>* %cast, align 16
				store <16 x i32> %reload, <16 x i32>* %out, align 16
				ret void
				}

				define void @scalable32i32_to_scalable16i32(<vscale x 16 x i32>* %out) {
				; CHECK-LABEL: @scalable32i32_to_scalable16i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP:%.*]] = alloca <vscale x 32 x i32>, align 64
				; CHECK-NEXT: [[CAST:%.*]] = bitcast <vscale x 32 x i32>* [[TMP]] to <vscale x 16 x i32>*
				; CHECK-NEXT: store volatile <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: [[RELOAD:%.*]] = load volatile <vscale x 16 x i32>, <vscale x 16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: store <vscale x 16 x i32> [[RELOAD]], <vscale x 16 x i32>* [[OUT:%.*]], align 16
				; CHECK-NEXT: ret void
				;
				entry:
				%tmp = alloca <vscale x 32 x i32>, align 16
				%cast = bitcast <vscale x 32 x i32>* %tmp to <vscale x 16 x i32>*
				store volatile <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32>* %cast, align 16
				%reload = load volatile <vscale x 16 x i32>, <vscale x 16 x i32>* %cast, align 16
				store <vscale x 16 x i32> %reload, <vscale x 16 x i32>* %out, align 16
				ret void
				}

				define void @scalable32i16_to_scalable16i32(<vscale x 16 x i32>* %out) {
				; CHECK-LABEL: @scalable32i16_to_scalable16i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP:%.*]] = alloca <vscale x 16 x i32>, align 64
				; CHECK-NEXT: store volatile <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32>* [[TMP]], align 64
				; CHECK-NEXT: [[RELOAD:%.*]] = load volatile <vscale x 16 x i32>, <vscale x 16 x i32>* [[TMP]], align 64
				; CHECK-NEXT: store <vscale x 16 x i32> [[RELOAD]], <vscale x 16 x i32>* [[OUT:%.*]], align 16
				; CHECK-NEXT: ret void
				;
				entry:
				%tmp = alloca <vscale x 32 x i16>, align 16
				%cast = bitcast <vscale x 32 x i16>* %tmp to <vscale x 16 x i32>*
				store volatile <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32>* %cast, align 16
				%reload = load volatile <vscale x 16 x i32>, <vscale x 16 x i32>* %cast, align 16
				store <vscale x 16 x i32> %reload, <vscale x 16 x i32>* %out, align 16
				ret void
				}

				define void @scalable32i16_to_scalable16i32_multiuse(<vscale x 16 x i32>* %out, <vscale x 32 x i16>* %out2) {
				; CHECK-LABEL: @scalable32i16_to_scalable16i32_multiuse(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP:%.*]] = alloca <vscale x 32 x i16>, align 64
				; CHECK-NEXT: [[CAST:%.*]] = bitcast <vscale x 32 x i16>* [[TMP]] to <vscale x 16 x i32>*
				; CHECK-NEXT: store volatile <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: [[RELOAD:%.*]] = load volatile <vscale x 16 x i32>, <vscale x 16 x i32>* [[CAST]], align 64
				; CHECK-NEXT: store <vscale x 16 x i32> [[RELOAD]], <vscale x 16 x i32>* [[OUT:%.*]], align 16
				; CHECK-NEXT: [[RELOAD2:%.*]] = load volatile <vscale x 32 x i16>, <vscale x 32 x i16>* [[TMP]], align 64
				; CHECK-NEXT: store <vscale x 32 x i16> [[RELOAD2]], <vscale x 32 x i16>* [[OUT2:%.*]], align 16
				; CHECK-NEXT: ret void
				;
				entry:
				%tmp = alloca <vscale x 32 x i16>, align 16
				%cast = bitcast <vscale x 32 x i16>* %tmp to <vscale x 16 x i32>*
				store volatile <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32>* %cast, align 16
				%reload = load volatile <vscale x 16 x i32>, <vscale x 16 x i32>* %cast, align 16
				store <vscale x 16 x i32> %reload, <vscale x 16 x i32>* %out, align 16
				%reload2 = load volatile <vscale x 32 x i16>, <vscale x 32 x i16>* %tmp, align 16
				store <vscale x 32 x i16> %reload2, <vscale x 32 x i16>* %out2, align 16
				ret void
				}