This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add bf16 storage support
ClosedPublic

Authored by Pierre-vh on Dec 6 2022, 12:24 AM.

Details

Reviewers

arsenm
foad
yaxunl

Commits

rG678d8946ba2b: [AMDGPU] Add bf16 storage support

Summary

[Clang] Declare AMDGPU target as supporting BF16 for storage-only purposes on amdgcn
- Add Sema & CodeGen test cases.
- Also add cases that D138651 would have covered as this patch replaces it.
[AMDGPU] Add BF16 storage-only support
- Support legalization/dealing with bf16 operations in DAGIsel.
- bf16 as a type remains illegal and is represented as i16 for storage purposes.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Pierre-vh created this revision. Dec 6 2022, 12:24 AM

Herald added a project: Restricted Project. Dec 6 2022, 12:24 AM

Herald added subscribers: kosarev, kerbowa, hiraditya and 4 others.

Pierre-vh requested review of this revision. Dec 6 2022, 12:24 AM

Herald added subscribers: llvm-commits, cfe-commits, wdng. Dec 6 2022, 12:24 AM

Herald added projects: Restricted Project, Restricted Project. Dec 6 2022, 12:24 AM

Pierre-vh mentioned this in D138651: [CUDA][HIP] Don't diagnose use for __bf16. Dec 6 2022, 12:25 AM

Only accept bf16 on AMDGCN; r600 doesn't support it (we could but it's not worth the effort I think; I'll look at it if we find out it's needed)
Remove bf16 types from a few register classes

arsenm requested changes to this revision. Dec 6 2022, 4:57 AM

arsenm added inline comments.

clang/lib/Basic/Targets/AMDGPU.h
119	Don't understand this mangling. What is u6?
clang/test/CodeGenCUDA/amdgpu-bf16.cu
31	should also test a load
clang/test/SemaCUDA/amdgpu-bf16.cu
44	check casts to different int and float types? Is construction of bf16 vectors allowed?
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	The generic legalizer should have handled this?
5147–5160	Ditto
llvm/test/CodeGen/AMDGPU/bf16-ops.ll
2–5	Drop -verify-machineinstrs
23	Use opaque pointers
llvm/test/CodeGen/AMDGPU/bf16.ll
433	Missing v3 test
900	Ret of vectors
954	Don't use anonymous values. Also use opaque pointers
957	Should also test call argument, call return, passed in byval, sret, implicit sret, and passed in argument in overflow stack area

This revision now requires changes to proceed. Dec 6 2022, 4:57 AM

arsenm added inline comments. Dec 6 2022, 5:00 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
147	Do you really need to add this to a register class? The only thing this would be useful for is for the calling convention contexts, which should promote to i16. If you do that you don't need most of the rest of this patch

Address comments: diff is lighter and added a lot more tests

Pierre-vh edited the summary of this revision. Dec 6 2022, 7:03 AM

Pierre-vh added inline comments.

clang/lib/Basic/Targets/AMDGPU.h
119	Not sure; for that one I just copy-pasted the implementation of other targets. All other targets use that mangling scheme
clang/test/SemaCUDA/amdgpu-bf16.cu
44	Added cast + vec sema tests and a vec assign codegen test too. No conversions are allowed apparently, but I don't think it matters for the initial patch; if needed we can always add them later.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	It looks like those operations are not implemented in the generic legalizer, e.g. I get "Do not know how to promote this operator's operand!"

arsenm added inline comments. Dec 6 2022, 7:28 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	Right, this is the code that would go there

Pierre-vh added inline comments. Dec 6 2022, 7:33 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	Do I just copy/paste this code in that PromoteInt function, and keep a copy here too in LowerOperation? (Not really a fan of copy-pasting code in different files; I'd rather keep it all here.) We need to have the lowering too AFAIK; it didn't go well when I tried to remove it.

arsenm added inline comments. Dec 6 2022, 8:15 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	I'm not following why you need to handle it here

Harbormaster completed remote builds in B201387: Diff 480476. Dec 6 2022, 11:53 AM

Pierre-vh added inline comments. Dec 7 2022, 12:07 AM

clang/lib/Basic/Targets/AMDGPU.h
119	Ah I remember now, it's just C++ mangling. I don't quite understand the lowercase "u" but a quick search in Clang tells me it's vendor-extended types. So it's just u6 -> vendor extended type, 6 characters following + __bf16 (name of the type).
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	IIRC: I need to handle FP_TO_BF16 in ReplaceNodeResult because that's what the Integer Legalizer calls (through CustomLowerNode). I need to handle both opcodes in LowerOperation because otherwise they'll fail selection. They can be left over from expanding/legalizing other operations.
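
The `u6__bf16` string discussed above can be reproduced with a few lines of C++. This is a minimal sketch, assuming the Itanium C++ ABI rule that a vendor extended type is encoded as `u` followed by a length-prefixed name. The helper names `mangleVendorExtendedType` and `mangleFreeFunction` are hypothetical and are not Clang APIs; substitutions (such as `S_`), namespaces, and templates are deliberately ignored.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Vendor extended type: 'u' followed by a <source-name>, i.e. the decimal
// length of the name and then the name itself.
std::string mangleVendorExtendedType(const std::string &name) {
  assert(!name.empty() && "a source-name cannot be empty");
  return "u" + std::to_string(name.size()) + name;
}

// Hypothetical helper: mangles a free function at global scope from parameter
// encodings that the caller has already produced. A pointer parameter is the
// pointee encoding prefixed with 'P'; an empty parameter list is encoded as 'v'.
std::string mangleFreeFunction(const std::string &name,
                               const std::vector<std::string> &paramEncodings) {
  std::string out = "_Z" + std::to_string(name.size()) + name;
  if (paramEncodings.empty())
    out += "v";
  for (const std::string &param : paramEncodings)
    out += param;
  return out;
}
```

Under these assumptions, `mangleFreeFunction("test_ret", {mangleVendorExtendedType("__bf16")})` yields `_Z8test_retu6__bf16`, which matches the CHECK-LABEL in clang/test/CodeGenCUDA/amdgpu-bf16.cu.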

arsenm added inline comments. Dec 7 2022, 8:19 AM

clang/lib/Basic/Targets/AMDGPU.h
119	Do we really need an override for this? I'd expect a reasonable default. Plus I think a virtual function for something that's only a parameterless, static string is a bit ridiculous
llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
49 ↗	(On Diff #480476)	Without being added to a register class, all the tablegen changes should not do anything

Pierre-vh added a child revision: D139608: [Clang][NFC] Add default `getBFloat16Mangling` impl. Dec 8 2022, 12:22 AM

Comments

clang/lib/Basic/Targets/AMDGPU.h
119	Default impl asserts if not implemented. I think it's to make sure targets are all aware of what it takes to support bfloat and they don't end up partially implementing it?
	  /// Return the mangled code of bfloat.
	  virtual const char *getBFloat16Mangling() const {
	    llvm_unreachable("bfloat not implemented on this target");
	  }
	I'd say let's stick to the current pattern in this diff; I created D139608 to change it.
llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
49 ↗	(On Diff #480476)	bf16 ones seem to not be needed but if I don't have the v2bf16 ones I get "cannot allocate arguments" in "test_arg_store_v2bf16"

arsenm added inline comments. Dec 8 2022, 7:43 AM

llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
49 ↗	(On Diff #480476)	Sounds like a type legality logic bug which I don't want to spend time fighting
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
911	If you wanted the promote to i32, you could have done it here instead of in the tablegen cc handling
4796–4808	But why are they custom? We don't have to handle FP16_TO_FP or FP_TO_FP16 there, and they aren't custom lowered. They have the same basic properties. We have this:
	  setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote);
	  AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32);
	  setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote);
	  AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32);
	I'd expect the same basic pattern.
5552	Op.getValueType()
5557	Should be specific cast, not FPExtOrRound. I don't think the FP_ROUND case would be correct
5563	Should use Op.getValueType() instead of Op->getValueType(0)
5567–5570	ExpandNode covers lowering BF16_TO_FP. It also has a shift by 16-bits into the high bits. Is this correct?

Harbormaster completed remote builds in B201891: Diff 481169. Dec 8 2022, 9:39 AM

Address some comments; I wasn't able to get rid of the custom legalization/CC tablegen.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
911	Do you mean somewhere else in that function? Changing v2bf16 to i32 here doesn't fix it. I also tried changing the function above, but I kept running into asserts, so I just left the TableGen CC for now.
4796–4808	PromoteIntegerOperand, PromoteFloatOperand and PromoteIntegerResult don't handle FP_TO_BF16 and BF16_TO_FP, and unless we put a Custom lowering mode it'll assert/unreachable. I tried to make it work (for a while) using the default expand but I can't quite get it to work. It feels like there is some legalizer work missing for handling BF16 like we want to. Even though it's not ideal I think the custom lowering is easiest
5557	But we need to do f32 -> f16, isn't FP_ROUND used for that? I thought it's what we needed
5567–5570	Ah I didn't know that, though as long as we use custom lowering, and our FP_TO_BF16/BF16_TO_FP methods are consistent, it should be fine, no?

Harbormaster completed remote builds in B202176: Diff 481557. Dec 9 2022, 2:42 AM

arsenm added inline comments. Dec 9 2022, 9:23 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
911	Yes, that should force the bitcast of the argument type
4796–4808	What about Expand? that's where the implemented part is
5567–5570	bfloat16 has the same number of exponent bits in the same high bits as f32; I kind of think the idea is you can just do a bitshift and then operate on f32? I think the fp_extend here is wrong

arsenm added inline comments. Dec 9 2022, 9:39 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5567–5570	The default legalization also looks wrong to me. I don't understand why it isn't shifting down the mantissa bit

Fix f32/bf16 conversions, remove CC tablegen

Pierre-vh edited the summary of this revision. Dec 12 2022, 12:51 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	Last I tried, Expand will emit a libcall in many cases that we don't handle
5567–5570	Indeed it was terribly wrong. I rewrote both legalizations following what I found online: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
	bf16 is designed to be very easily convertible from/to f32, save for some edge cases with denormalized numbers I think, thus:
	- bf16 -> f32 is just a left shift by 16, filling the least-significant bits with zeroes.
	- f32 -> bf16 is just cutting off the 16 least-significant bits.
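
The two conversions described in this comment can be sketched on the host in plain C++. This is an illustration only, assuming a 32-bit IEEE-754 `float` and treating bf16 as raw `uint16_t` storage bits; the function names `bf16BitsToFloat` and `floatToBF16BitsTruncate` are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit IEEE-754 float");

// bf16 -> f32: place the 16 storage bits in the high half of a 32-bit word
// and fill the low 16 bits with zeroes.
float bf16BitsToFloat(std::uint16_t bits) {
  std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
  float result;
  std::memcpy(&result, &wide, sizeof(result));
  return result;
}

// f32 -> bf16: cut off the 16 least-significant bits. This truncates rather
// than rounding to nearest.
std::uint16_t floatToBF16BitsTruncate(float value) {
  std::uint32_t wide;
  std::memcpy(&wide, &value, sizeof(wide));
  return static_cast<std::uint16_t>(wide >> 16);
}
```

For example, `1.0f` has the bit pattern `0x3F800000`, so it maps to the bf16 bits `0x3F80` and back without loss. Because the low bits are simply dropped, a NaN whose payload lives only in the low 16 bits would come out as an infinity under this scheme.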

Harbormaster completed remote builds in B202504: Diff 482017. Dec 12 2022, 1:38 AM

ye-luo added a subscriber: ye-luo. Dec 12 2022, 2:10 PM

arsenm added inline comments. Dec 12 2022, 2:51 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	Library call is supposed to be a distinct action now, the DAG only did about 5% of the work to migrate to using it. This code can go to the default expand action
5557	This is just this
5571	can just hardcode i32
5583	This is just this
5589	This can be any_extend and the combiner will probably turn it into one
5591	Can just hardcode i32
llvm/test/CodeGen/AMDGPU/bf16.ll
12	Doesn't cover the different f32/f64 conversions?
2954	Use poison instead of undef in tests

Comments + add (f32/f64) from/to bf16 conversion tests

Pierre-vh added inline comments. Dec 13 2022, 12:36 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808	Does it need to happen in this commit? I think it'll delay the review quite a bit if other people have to review it. If it needs to happen, then what do I need to do? Use the Expand action & fix the legalizer in places where it needs to be fixed? I feel like it might be better suited for a follow-up patch; I can create a task and pick it up when I come back from vacation if you want.

Harbormaster completed remote builds in B202774: Diff 482379. Dec 13 2022, 1:31 AM

Move to common DAG legalization code

LGTM with nits. The lowering could handle the vector case easily, but it didn't before either.

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2915	Braces
2943	Braces
2947	Don't see the point of this assert, should work fine for vectors

This revision is now accepted and ready to land. Dec 13 2022, 7:07 AM

Comments

This revision was landed with ongoing or failed builds. Dec 13 2022, 7:34 AM

Closed by commit rG678d8946ba2b: [AMDGPU] Add bf16 storage support (authored by Pierre-vh).

This revision was automatically updated to reflect the committed changes.

Pierre-vh added a commit: rG678d8946ba2b: [AMDGPU] Add bf16 storage support.

Harbormaster completed remote builds in B202845: Diff 482476. Dec 13 2022, 9:11 AM

Revision Contents

Path	Size
clang/lib/Basic/Targets/AMDGPU.h	3 lines
clang/lib/Basic/Targets/AMDGPU.cpp	6 lines
clang/test/CodeGenCUDA/amdgpu-bf16.cu	129 lines
clang/test/SemaCUDA/amdgpu-bf16.cu	99 lines
llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp	32 lines
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp	8 lines
llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h	2 lines
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp	4 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	19 lines
llvm/test/CodeGen/AMDGPU/bf16-ops.ll	32 lines
llvm/test/CodeGen/AMDGPU/bf16.ll	3189 lines

Diff 482477

clang/lib/Basic/Targets/AMDGPU.h

@@ public:
   uint64_t getPointerAlignV(LangAS AddrSpace) const override {
     return getPointerWidthV(AddrSpace);
   }

   uint64_t getMaxPointerWidth() const override {
     return getTriple().getArch() == llvm::Triple::amdgcn ? 64 : 32;
   }

+  bool hasBFloat16Type() const override { return isAMDGCN(getTriple()); }
+  const char *getBFloat16Mangling() const override { return "u6__bf16"; };
+
   const char *getClobbers() const override { return ""; }

   ArrayRef<const char *> getGCCRegNames() const override;

   ArrayRef<TargetInfo::GCCRegAlias> getGCCRegAliases() const override {
     return std::nullopt;
   }

clang/lib/Basic/Targets/AMDGPU.cpp

@@ : TargetInfo(Triple),
                   llvm::AMDGPU::getArchAttrR600(GPUKind)) {
   resetDataLayout(isAMDGCN(getTriple()) ? DataLayoutStringAMDGCN
                                         : DataLayoutStringR600);

   setAddressSpaceMap(Triple.getOS() == llvm::Triple::Mesa3D ||
                      !isAMDGCN(Triple));
   UseAddrSpaceMapMangling = true;

+  if (isAMDGCN(Triple)) {
+    // __bf16 is always available as a load/store only type on AMDGCN.
+    BFloat16Width = BFloat16Align = 16;
+    BFloat16Format = &llvm::APFloat::BFloat();
+  }
+
   HasLegalHalfType = true;
   HasFloat16 = true;
   WavefrontSize = GPUFeatures & llvm::AMDGPU::FEATURE_WAVE32 ? 32 : 64;
   AllowAMDGPUUnsafeFPAtomics = Opts.AllowAMDGPUUnsafeFPAtomics;

   // Set pointer width and alignment for the generic address space.
   PointerWidth = PointerAlign = getPointerWidthV(LangAS::Default);
   if (getMaxPointerWidth() == 64) {

clang/test/CodeGenCUDA/amdgpu-bf16.cu

This file was added.

				// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
				// REQUIRES: amdgpu-registered-target
				// REQUIRES: x86-registered-target

				// RUN: %clang_cc1 "-aux-triple" "x86_64-unknown-linux-gnu" "-triple" "amdgcn-amd-amdhsa" \
				// RUN: -fcuda-is-device "-aux-target-cpu" "x86-64" -emit-llvm -o - %s | FileCheck %s

				#include "Inputs/cuda.h"

				// CHECK-LABEL: @_Z8test_argPu6__bf16u6__bf16(
				// CHECK-NEXT: entry:
				// CHECK-NEXT: [[OUT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
				// CHECK-NEXT: [[IN_ADDR:%.*]] = alloca bfloat, align 2, addrspace(5)
				// CHECK-NEXT: [[BF16:%.*]] = alloca bfloat, align 2, addrspace(5)
				// CHECK-NEXT: [[OUT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[OUT_ADDR]] to ptr
				// CHECK-NEXT: [[IN_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[IN_ADDR]] to ptr
				// CHECK-NEXT: [[BF16_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[BF16]] to ptr
				// CHECK-NEXT: store ptr [[OUT:%.*]], ptr [[OUT_ADDR_ASCAST]], align 8
				// CHECK-NEXT: store bfloat [[IN:%.*]], ptr [[IN_ADDR_ASCAST]], align 2
				// CHECK-NEXT: [[TMP0:%.*]] = load bfloat, ptr [[IN_ADDR_ASCAST]], align 2
				// CHECK-NEXT: store bfloat [[TMP0]], ptr [[BF16_ASCAST]], align 2
				// CHECK-NEXT: [[TMP1:%.*]] = load bfloat, ptr [[BF16_ASCAST]], align 2
				// CHECK-NEXT: [[TMP2:%.*]] = load ptr, ptr [[OUT_ADDR_ASCAST]], align 8
				// CHECK-NEXT: store bfloat [[TMP1]], ptr [[TMP2]], align 2
				// CHECK-NEXT: ret void
				//
				__device__ void test_arg(__bf16 *out, __bf16 in) {
				__bf16 bf16 = in;
				*out = bf16;
				}

				// CHECK-LABEL: @_Z9test_loadPu6__bf16S_(
				// CHECK-NEXT: entry:
				// CHECK-NEXT: [[OUT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
				// CHECK-NEXT: [[IN_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
				// CHECK-NEXT: [[BF16:%.*]] = alloca bfloat, align 2, addrspace(5)
				// CHECK-NEXT: [[OUT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[OUT_ADDR]] to ptr
				// CHECK-NEXT: [[IN_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[IN_ADDR]] to ptr
				// CHECK-NEXT: [[BF16_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[BF16]] to ptr
				// CHECK-NEXT: store ptr [[OUT:%.*]], ptr [[OUT_ADDR_ASCAST]], align 8
				// CHECK-NEXT: store ptr [[IN:%.*]], ptr [[IN_ADDR_ASCAST]], align 8
				// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[IN_ADDR_ASCAST]], align 8
				// CHECK-NEXT: [[TMP1:%.*]] = load bfloat, ptr [[TMP0]], align 2
				// CHECK-NEXT: store bfloat [[TMP1]], ptr [[BF16_ASCAST]], align 2
				// CHECK-NEXT: [[TMP2:%.*]] = load bfloat, ptr [[BF16_ASCAST]], align 2
				// CHECK-NEXT: [[TMP3:%.*]] = load ptr, ptr [[OUT_ADDR_ASCAST]], align 8
				// CHECK-NEXT: store bfloat [[TMP2]], ptr [[TMP3]], align 2
				// CHECK-NEXT: ret void
				//
				__device__ void test_load(__bf16 *out, __bf16 *in) {
				__bf16 bf16 = *in;
				*out = bf16;
				}

				// CHECK-LABEL: @_Z8test_retu6__bf16(
				// CHECK-NEXT: entry:
				// CHECK-NEXT: [[RETVAL:%.*]] = alloca bfloat, align 2, addrspace(5)
				// CHECK-NEXT: [[IN_ADDR:%.*]] = alloca bfloat, align 2, addrspace(5)
				// CHECK-NEXT: [[RETVAL_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RETVAL]] to ptr
				// CHECK-NEXT: [[IN_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[IN_ADDR]] to ptr
				// CHECK-NEXT: store bfloat [[IN:%.*]], ptr [[IN_ADDR_ASCAST]], align 2
				// CHECK-NEXT: [[TMP0:%.*]] = load bfloat, ptr [[IN_ADDR_ASCAST]], align 2
				// CHECK-NEXT: ret bfloat [[TMP0]]
				//
				__device__ __bf16 test_ret( __bf16 in) {
				return in;
				}

				// CHECK-LABEL: @_Z9test_callu6__bf16(
				// CHECK-NEXT: entry:
				// CHECK-NEXT: [[RETVAL:%.*]] = alloca bfloat, align 2, addrspace(5)
				// CHECK-NEXT: [[IN_ADDR:%.*]] = alloca bfloat, align 2, addrspace(5)
				// CHECK-NEXT: [[RETVAL_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RETVAL]] to ptr
				// CHECK-NEXT: [[IN_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[IN_ADDR]] to ptr
				// CHECK-NEXT: store bfloat [[IN:%.*]], ptr [[IN_ADDR_ASCAST]], align 2
				// CHECK-NEXT: [[TMP0:%.*]] = load bfloat, ptr [[IN_ADDR_ASCAST]], align 2
				// CHECK-NEXT: [[CALL:%.*]] = call contract noundef bfloat @_Z8test_retu6__bf16(bfloat noundef [[TMP0]]) #[[ATTR1:[0-9]+]]
				// CHECK-NEXT: ret bfloat [[CALL]]
				//
				__device__ __bf16 test_call( __bf16 in) {
				return test_ret(in);
				}


				// CHECK-LABEL: @_Z15test_vec_assignv(
				// CHECK-NEXT: entry:
				// CHECK-NEXT: [[VEC2_A:%.*]] = alloca <2 x bfloat>, align 4, addrspace(5)
				// CHECK-NEXT: [[VEC2_B:%.*]] = alloca <2 x bfloat>, align 4, addrspace(5)
				// CHECK-NEXT: [[VEC4_A:%.*]] = alloca <4 x bfloat>, align 8, addrspace(5)
				// CHECK-NEXT: [[VEC4_B:%.*]] = alloca <4 x bfloat>, align 8, addrspace(5)
				// CHECK-NEXT: [[VEC8_A:%.*]] = alloca <8 x bfloat>, align 16, addrspace(5)
				// CHECK-NEXT: [[VEC8_B:%.*]] = alloca <8 x bfloat>, align 16, addrspace(5)
				// CHECK-NEXT: [[VEC16_A:%.*]] = alloca <16 x bfloat>, align 32, addrspace(5)
				// CHECK-NEXT: [[VEC16_B:%.*]] = alloca <16 x bfloat>, align 32, addrspace(5)
				// CHECK-NEXT: [[VEC2_A_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC2_A]] to ptr
				// CHECK-NEXT: [[VEC2_B_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC2_B]] to ptr
				// CHECK-NEXT: [[VEC4_A_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC4_A]] to ptr
				// CHECK-NEXT: [[VEC4_B_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC4_B]] to ptr
				// CHECK-NEXT: [[VEC8_A_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC8_A]] to ptr
				// CHECK-NEXT: [[VEC8_B_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC8_B]] to ptr
				// CHECK-NEXT: [[VEC16_A_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC16_A]] to ptr
				// CHECK-NEXT: [[VEC16_B_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VEC16_B]] to ptr
				// CHECK-NEXT: [[TMP0:%.*]] = load <2 x bfloat>, ptr [[VEC2_B_ASCAST]], align 4
				// CHECK-NEXT: store <2 x bfloat> [[TMP0]], ptr [[VEC2_A_ASCAST]], align 4
				// CHECK-NEXT: [[TMP1:%.*]] = load <4 x bfloat>, ptr [[VEC4_B_ASCAST]], align 8
				// CHECK-NEXT: store <4 x bfloat> [[TMP1]], ptr [[VEC4_A_ASCAST]], align 8
				// CHECK-NEXT: [[TMP2:%.*]] = load <8 x bfloat>, ptr [[VEC8_B_ASCAST]], align 16
				// CHECK-NEXT: store <8 x bfloat> [[TMP2]], ptr [[VEC8_A_ASCAST]], align 16
				// CHECK-NEXT: [[TMP3:%.*]] = load <16 x bfloat>, ptr [[VEC16_B_ASCAST]], align 32
				// CHECK-NEXT: store <16 x bfloat> [[TMP3]], ptr [[VEC16_A_ASCAST]], align 32
				// CHECK-NEXT: ret void
				//
				__device__ void test_vec_assign() {
				typedef __attribute__((ext_vector_type(2))) __bf16 bf16_x2;
				bf16_x2 vec2_a, vec2_b;
				vec2_a = vec2_b;

				typedef __attribute__((ext_vector_type(4))) __bf16 bf16_x4;
				bf16_x4 vec4_a, vec4_b;
				vec4_a = vec4_b;

				typedef __attribute__((ext_vector_type(8))) __bf16 bf16_x8;
				bf16_x8 vec8_a, vec8_b;
				vec8_a = vec8_b;

				typedef __attribute__((ext_vector_type(16))) __bf16 bf16_x16;
				bf16_x16 vec16_a, vec16_b;
				vec16_a = vec16_b;
				}

clang/test/SemaCUDA/amdgpu-bf16.cu

This file was added.

				// REQUIRES: amdgpu-registered-target
				// REQUIRES: x86-registered-target

				// RUN: %clang_cc1 "-triple" "x86_64-unknown-linux-gnu" "-aux-triple" "amdgcn-amd-amdhsa"\
				// RUN: "-target-cpu" "x86-64" -fsyntax-only -verify=amdgcn %s
				// RUN: %clang_cc1 "-aux-triple" "x86_64-unknown-linux-gnu" "-triple" "amdgcn-amd-amdhsa"\
				// RUN: -fcuda-is-device "-aux-target-cpu" "x86-64" -fsyntax-only -verify=amdgcn %s

				// RUN: %clang_cc1 "-aux-triple" "x86_64-unknown-linux-gnu" "-triple" "r600-unknown-unknown"\
				// RUN: -fcuda-is-device "-aux-target-cpu" "x86-64" -fsyntax-only -verify=amdgcn,r600 %s

				// AMDGCN has storage-only support for bf16. R600 does not support it should error out when
				// it's the main target.

				#include "Inputs/cuda.h"

				// There should be no errors on using the type itself, or when loading/storing values for amdgcn.
				// r600 should error on all uses of the type.

				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(2))) __bf16 bf16_x2;
				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(4))) __bf16 bf16_x4;
				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(8))) __bf16 bf16_x8;
				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(16))) __bf16 bf16_x16;

				// r600-error@+1 2 {{__bf16 is not supported on this target}}
				__device__ void test(bool b, __bf16 *out, __bf16 in) {
				__bf16 bf16 = in; // r600-error {{__bf16 is not supported on this target}}

				bf16 + bf16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__bf16')}}
				bf16 - bf16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__bf16')}}
				bf16 * bf16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__bf16')}}
				bf16 / bf16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__bf16')}}

				__fp16 fp16;

				bf16 + fp16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__fp16')}}
				fp16 + bf16; // amdgcn-error {{invalid operands to binary expression ('__fp16' and '__bf16')}}
				bf16 - fp16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__fp16')}}
				fp16 - bf16; // amdgcn-error {{invalid operands to binary expression ('__fp16' and '__bf16')}}
				bf16 * fp16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__fp16')}}
				fp16 * bf16; // amdgcn-error {{invalid operands to binary expression ('__fp16' and '__bf16')}}
				bf16 / fp16; // amdgcn-error {{invalid operands to binary expression ('__bf16' and '__fp16')}}
				fp16 / bf16; // amdgcn-error {{invalid operands to binary expression ('__fp16' and '__bf16')}}
				bf16 = fp16; // amdgcn-error {{assigning to '__bf16' from incompatible type '__fp16'}}
				fp16 = bf16; // amdgcn-error {{assigning to '__fp16' from incompatible type '__bf16'}}
				bf16 + (b ? fp16 : bf16); // amdgcn-error {{incompatible operand types ('__fp16' and '__bf16')}}
				*out = bf16;

				// amdgcn-error@+1 {{static_cast from '__bf16' to 'unsigned short' is not allowed}}
				unsigned short u16bf16 = static_cast<unsigned short>(bf16);
				// amdgcn-error@+2 {{C-style cast from 'unsigned short' to '__bf16' is not allowed}}
				// r600-error@+1 {{__bf16 is not supported on this target}}
				bf16 = (__bf16)u16bf16;

				// amdgcn-error@+1 {{static_cast from '__bf16' to 'float' is not allowed}}
				float f32bf16 = static_cast<float>(bf16);
				// amdgcn-error@+2 {{C-style cast from 'float' to '__bf16' is not allowed}}
				// r600-error@+1 {{__bf16 is not supported on this target}}
				bf16 = (__bf16)f32bf16;

				// amdgcn-error@+1 {{static_cast from '__bf16' to 'double' is not allowed}}
				double f64bf16 = static_cast<double>(bf16);
				// amdgcn-error@+2 {{C-style cast from 'double' to '__bf16' is not allowed}}
				// r600-error@+1 {{__bf16 is not supported on this target}}
				bf16 = (__bf16)f64bf16;

				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(2))) __bf16 bf16_x2;
				bf16_x2 vec2_a, vec2_b;
				vec2_a = vec2_b;

				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(4))) __bf16 bf16_x4;
				bf16_x4 vec4_a, vec4_b;
				vec4_a = vec4_b;

				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(8))) __bf16 bf16_x8;
				bf16_x8 vec8_a, vec8_b;
				vec8_a = vec8_b;

				// r600-error@+1 {{__bf16 is not supported on this target}}
				typedef __attribute__((ext_vector_type(16))) __bf16 bf16_x16;
				bf16_x16 vec16_a, vec16_b;
				vec16_a = vec16_b;
				}

				// r600-error@+1 2 {{__bf16 is not supported on this target}}
				__bf16 hostfn(__bf16 a) {
				return a;
				}

				// r600-error@+2 {{__bf16 is not supported on this target}}
				// r600-error@+1 {{vector size not an integral multiple of component size}}
				typedef __bf16 foo __attribute__((__vector_size__(16), __aligned__(16)));

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

Show First 20 Lines • Show All 2,902 Lines • ▼ Show 20 Lines	bool SelectionDAGLegalize::ExpandNode(SDNode *Node) {
case ISD::FP_EXTEND:		case ISD::FP_EXTEND:
if ((Tmp1 = EmitStackConvert(Node->getOperand(0),		if ((Tmp1 = EmitStackConvert(Node->getOperand(0),
Node->getOperand(0).getValueType(),		Node->getOperand(0).getValueType(),
Node->getValueType(0), dl)))		Node->getValueType(0), dl)))
Results.push_back(Tmp1);		Results.push_back(Tmp1);
break;		break;
case ISD::BF16_TO_FP: {		case ISD::BF16_TO_FP: {
// Always expand bf16 to f32 casts, they lower to ext + shift.		// Always expand bf16 to f32 casts, they lower to ext + shift.
SDValue Op = DAG.getNode(ISD::BITCAST, dl, MVT::i16, Node->getOperand(0));		//
Op = DAG.getNode(ISD::ANY_EXTEND, dl, MVT::i32, Op);		// Note that the operand of this code can be bf16 or an integer type in case
		// bf16 is not supported on the target and was softened.
		SDValue Op = Node->getOperand(0);
		if (Op.getValueType() == MVT::bf16) {
		arsenmUnsubmitted Done Reply Inline Actions Braces arsenm: Braces
		Op = DAG.getNode(ISD::ANY_EXTEND, dl, MVT::i32,
		DAG.getNode(ISD::BITCAST, dl, MVT::i16, Op));
		} else {
		Op = DAG.getAnyExtOrTrunc(Op, dl, MVT::i32);
		}
Op = DAG.getNode(		Op = DAG.getNode(
ISD::SHL, dl, MVT::i32, Op,		ISD::SHL, dl, MVT::i32, Op,
DAG.getConstant(16, dl,		DAG.getConstant(16, dl,
TLI.getShiftAmountTy(MVT::i32, DAG.getDataLayout())));		TLI.getShiftAmountTy(MVT::i32, DAG.getDataLayout())));
Op = DAG.getNode(ISD::BITCAST, dl, MVT::f32, Op);		Op = DAG.getNode(ISD::BITCAST, dl, MVT::f32, Op);
// Add fp_extend in case the output is bigger than f32.		// Add fp_extend in case the output is bigger than f32.
if (Node->getValueType(0) != MVT::f32)		if (Node->getValueType(0) != MVT::f32)
Op = DAG.getNode(ISD::FP_EXTEND, dl, Node->getValueType(0), Op);		Op = DAG.getNode(ISD::FP_EXTEND, dl, Node->getValueType(0), Op);
Results.push_back(Op);		Results.push_back(Op);
break;		break;
}		}
		case ISD::FP_TO_BF16: {
		SDValue Op = Node->getOperand(0);
		if (Op.getValueType() != MVT::f32)
		Op = DAG.getNode(ISD::FP_ROUND, dl, MVT::f32, Op,
		DAG.getIntPtrConstant(0, dl, /isTarget=/true));
		Op = DAG.getNode(
		ISD::SRL, dl, MVT::i32, DAG.getNode(ISD::BITCAST, dl, MVT::i32, Op),
		DAG.getConstant(16, dl,
		TLI.getShiftAmountTy(MVT::i32, DAG.getDataLayout())));
		// The result of this node can be bf16 or an integer type in case bf16 is
		// not supported on the target and was softened to i16 for storage.
		if (Node->getValueType(0) == MVT::bf16) {
		arsenm: Braces
		Op = DAG.getNode(ISD::BITCAST, dl, MVT::bf16,
		DAG.getNode(ISD::TRUNCATE, dl, MVT::i16, Op));
		} else {
		Op = DAG.getAnyExtOrTrunc(Op, dl, Node->getValueType(0));
		arsenm: Don't see the point of this assert, should work fine for vectors
		}
		Results.push_back(Op);
		break;
		}
case ISD::SIGN_EXTEND_INREG: {		case ISD::SIGN_EXTEND_INREG: {
EVT ExtraVT = cast<VTSDNode>(Node->getOperand(1))->getVT();		EVT ExtraVT = cast<VTSDNode>(Node->getOperand(1))->getVT();
EVT VT = Node->getValueType(0);		EVT VT = Node->getValueType(0);

// An in-register sign-extend of a boolean is a negation:		// An in-register sign-extend of a boolean is a negation:
// 'true' (1) sign-extended is -1.		// 'true' (1) sign-extended is -1.
// 'false' (0) sign-extended is 0.		// 'false' (0) sign-extended is 0.
// However, we must mask the high bits of the source operand because the		// However, we must mask the high bits of the source operand because the

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

Show First 20 Lines • Show All 142 Lines • ▼ Show 20 Lines	#endif
case ISD::STRICT_FP_TO_UINT:		case ISD::STRICT_FP_TO_UINT:
case ISD::FP_TO_SINT:		case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT: Res = PromoteIntRes_FP_TO_XINT(N); break;		case ISD::FP_TO_UINT: Res = PromoteIntRes_FP_TO_XINT(N); break;

case ISD::FP_TO_SINT_SAT:		case ISD::FP_TO_SINT_SAT:
case ISD::FP_TO_UINT_SAT:		case ISD::FP_TO_UINT_SAT:
Res = PromoteIntRes_FP_TO_XINT_SAT(N); break;		Res = PromoteIntRes_FP_TO_XINT_SAT(N); break;

case ISD::FP_TO_FP16: Res = PromoteIntRes_FP_TO_FP16(N); break;		case ISD::FP_TO_BF16:
		case ISD::FP_TO_FP16:
		Res = PromoteIntRes_FP_TO_FP16_BF16(N);
		break;

case ISD::FLT_ROUNDS_: Res = PromoteIntRes_FLT_ROUNDS(N); break;		case ISD::FLT_ROUNDS_: Res = PromoteIntRes_FLT_ROUNDS(N); break;

case ISD::AND:		case ISD::AND:
case ISD::OR:		case ISD::OR:
case ISD::XOR:		case ISD::XOR:
case ISD::ADD:		case ISD::ADD:
case ISD::SUB:		case ISD::SUB:
▲ Show 20 Lines • Show All 555 Lines • ▼ Show 20 Lines
SDValue DAGTypeLegalizer::PromoteIntRes_FP_TO_XINT_SAT(SDNode *N) {		SDValue DAGTypeLegalizer::PromoteIntRes_FP_TO_XINT_SAT(SDNode *N) {
// Promote the result type, while keeping the original width in Op1.		// Promote the result type, while keeping the original width in Op1.
EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));		EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDLoc dl(N);		SDLoc dl(N);
return DAG.getNode(N->getOpcode(), dl, NVT, N->getOperand(0),		return DAG.getNode(N->getOpcode(), dl, NVT, N->getOperand(0),
N->getOperand(1));		N->getOperand(1));
}		}

SDValue DAGTypeLegalizer::PromoteIntRes_FP_TO_FP16(SDNode *N) {		SDValue DAGTypeLegalizer::PromoteIntRes_FP_TO_FP16_BF16(SDNode *N) {
EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));		EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDLoc dl(N);		SDLoc dl(N);

return DAG.getNode(N->getOpcode(), dl, NVT, N->getOperand(0));		return DAG.getNode(N->getOpcode(), dl, NVT, N->getOperand(0));
}		}

SDValue DAGTypeLegalizer::PromoteIntRes_FLT_ROUNDS(SDNode *N) {		SDValue DAGTypeLegalizer::PromoteIntRes_FLT_ROUNDS(SDNode *N) {
EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));		EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
▲ Show 20 Lines • Show All 930 Lines • ▼ Show 20 Lines	bool DAGTypeLegalizer::PromoteIntegerOperand(SDNode *N, unsigned OpNo) {
case ISD::MLOAD: Res = PromoteIntOp_MLOAD(cast<MaskedLoadSDNode>(N),		case ISD::MLOAD: Res = PromoteIntOp_MLOAD(cast<MaskedLoadSDNode>(N),
OpNo); break;		OpNo); break;
case ISD::MGATHER: Res = PromoteIntOp_MGATHER(cast<MaskedGatherSDNode>(N),		case ISD::MGATHER: Res = PromoteIntOp_MGATHER(cast<MaskedGatherSDNode>(N),
OpNo); break;		OpNo); break;
case ISD::MSCATTER: Res = PromoteIntOp_MSCATTER(cast<MaskedScatterSDNode>(N),		case ISD::MSCATTER: Res = PromoteIntOp_MSCATTER(cast<MaskedScatterSDNode>(N),
OpNo); break;		OpNo); break;
case ISD::VP_TRUNCATE:		case ISD::VP_TRUNCATE:
case ISD::TRUNCATE: Res = PromoteIntOp_TRUNCATE(N); break;		case ISD::TRUNCATE: Res = PromoteIntOp_TRUNCATE(N); break;
		case ISD::BF16_TO_FP:
case ISD::FP16_TO_FP:		case ISD::FP16_TO_FP:
case ISD::VP_UINT_TO_FP:		case ISD::VP_UINT_TO_FP:
case ISD::UINT_TO_FP: Res = PromoteIntOp_UINT_TO_FP(N); break;		case ISD::UINT_TO_FP: Res = PromoteIntOp_UINT_TO_FP(N); break;
case ISD::STRICT_UINT_TO_FP: Res = PromoteIntOp_STRICT_UINT_TO_FP(N); break;		case ISD::STRICT_UINT_TO_FP: Res = PromoteIntOp_STRICT_UINT_TO_FP(N); break;
case ISD::ZERO_EXTEND: Res = PromoteIntOp_ZERO_EXTEND(N); break;		case ISD::ZERO_EXTEND: Res = PromoteIntOp_ZERO_EXTEND(N); break;
case ISD::EXTRACT_SUBVECTOR: Res = PromoteIntOp_EXTRACT_SUBVECTOR(N); break;		case ISD::EXTRACT_SUBVECTOR: Res = PromoteIntOp_EXTRACT_SUBVECTOR(N); break;
case ISD::INSERT_SUBVECTOR: Res = PromoteIntOp_INSERT_SUBVECTOR(N); break;		case ISD::INSERT_SUBVECTOR: Res = PromoteIntOp_INSERT_SUBVECTOR(N); break;


llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h

Show First 20 Lines • Show All 318 Lines • ▼ Show 20 Lines	private:
SDValue PromoteIntRes_BUILD_PAIR(SDNode *N);		SDValue PromoteIntRes_BUILD_PAIR(SDNode *N);
SDValue PromoteIntRes_Constant(SDNode *N);		SDValue PromoteIntRes_Constant(SDNode *N);
SDValue PromoteIntRes_CTLZ(SDNode *N);		SDValue PromoteIntRes_CTLZ(SDNode *N);
SDValue PromoteIntRes_CTPOP_PARITY(SDNode *N);		SDValue PromoteIntRes_CTPOP_PARITY(SDNode *N);
SDValue PromoteIntRes_CTTZ(SDNode *N);		SDValue PromoteIntRes_CTTZ(SDNode *N);
SDValue PromoteIntRes_EXTRACT_VECTOR_ELT(SDNode *N);		SDValue PromoteIntRes_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue PromoteIntRes_FP_TO_XINT(SDNode *N);		SDValue PromoteIntRes_FP_TO_XINT(SDNode *N);
SDValue PromoteIntRes_FP_TO_XINT_SAT(SDNode *N);		SDValue PromoteIntRes_FP_TO_XINT_SAT(SDNode *N);
SDValue PromoteIntRes_FP_TO_FP16(SDNode *N);		SDValue PromoteIntRes_FP_TO_FP16_BF16(SDNode *N);
SDValue PromoteIntRes_FREEZE(SDNode *N);		SDValue PromoteIntRes_FREEZE(SDNode *N);
SDValue PromoteIntRes_INT_EXTEND(SDNode *N);		SDValue PromoteIntRes_INT_EXTEND(SDNode *N);
SDValue PromoteIntRes_LOAD(LoadSDNode *N);		SDValue PromoteIntRes_LOAD(LoadSDNode *N);
SDValue PromoteIntRes_MLOAD(MaskedLoadSDNode *N);		SDValue PromoteIntRes_MLOAD(MaskedLoadSDNode *N);
SDValue PromoteIntRes_MGATHER(MaskedGatherSDNode *N);		SDValue PromoteIntRes_MGATHER(MaskedGatherSDNode *N);
SDValue PromoteIntRes_Overflow(SDNode *N);		SDValue PromoteIntRes_Overflow(SDNode *N);
SDValue PromoteIntRes_SADDSUBO(SDNode *N, unsigned ResNo);		SDValue PromoteIntRes_SADDSUBO(SDNode *N, unsigned ResNo);
SDValue PromoteIntRes_Select(SDNode *N);		SDValue PromoteIntRes_Select(SDNode *N);

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

Show First 20 Lines • Show All 157 Lines • ▼ Show 20 Lines	AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,

for (MVT VT : MVT::integer_fixedlen_vector_valuetypes())		for (MVT VT : MVT::integer_fixedlen_vector_valuetypes())
for (auto MemVT :		for (auto MemVT :
{MVT::v2i8, MVT::v4i8, MVT::v2i16, MVT::v3i16, MVT::v4i16})		{MVT::v2i8, MVT::v4i8, MVT::v2i16, MVT::v3i16, MVT::v4i16})
setLoadExtAction({ISD::SEXTLOAD, ISD::ZEXTLOAD, ISD::EXTLOAD}, VT, MemVT,		setLoadExtAction({ISD::SEXTLOAD, ISD::ZEXTLOAD, ISD::EXTLOAD}, VT, MemVT,
Expand);		Expand);

setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::f16, Expand);
		setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::bf16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v2f32, MVT::v2f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v2f32, MVT::v2f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v3f32, MVT::v3f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v3f32, MVT::v3f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v4f32, MVT::v4f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v4f32, MVT::v4f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v8f32, MVT::v8f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v8f32, MVT::v8f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v16f32, MVT::v16f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v16f32, MVT::v16f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v32f32, MVT::v32f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v32f32, MVT::v32f16, Expand);

setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f32, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f32, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f32, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f32, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v3f64, MVT::v3f32, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v3f64, MVT::v3f32, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f32, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f32, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8f32, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8f32, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v16f64, MVT::v16f32, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v16f64, MVT::v16f32, Expand);

setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f16, Expand);
		setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::bf16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v3f64, MVT::v3f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v3f64, MVT::v3f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::v16f64, MVT::v16f16, Expand);		setLoadExtAction(ISD::EXTLOAD, MVT::v16f64, MVT::v16f16, Expand);

setOperationAction(ISD::STORE, MVT::f32, Promote);		setOperationAction(ISD::STORE, MVT::f32, Promote);
AddPromotedToType(ISD::STORE, MVT::f32, MVT::i32);		AddPromotedToType(ISD::STORE, MVT::f32, MVT::i32);
▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines	AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
setTruncStoreAction(MVT::i64, MVT::i16, Expand);		setTruncStoreAction(MVT::i64, MVT::i16, Expand);
setTruncStoreAction(MVT::i64, MVT::i32, Expand);		setTruncStoreAction(MVT::i64, MVT::i32, Expand);

setTruncStoreAction(MVT::v2i64, MVT::v2i1, Expand);		setTruncStoreAction(MVT::v2i64, MVT::v2i1, Expand);
setTruncStoreAction(MVT::v2i64, MVT::v2i8, Expand);		setTruncStoreAction(MVT::v2i64, MVT::v2i8, Expand);
setTruncStoreAction(MVT::v2i64, MVT::v2i16, Expand);		setTruncStoreAction(MVT::v2i64, MVT::v2i16, Expand);
setTruncStoreAction(MVT::v2i64, MVT::v2i32, Expand);		setTruncStoreAction(MVT::v2i64, MVT::v2i32, Expand);

		setTruncStoreAction(MVT::f32, MVT::bf16, Expand);
setTruncStoreAction(MVT::f32, MVT::f16, Expand);		setTruncStoreAction(MVT::f32, MVT::f16, Expand);
setTruncStoreAction(MVT::v2f32, MVT::v2f16, Expand);		setTruncStoreAction(MVT::v2f32, MVT::v2f16, Expand);
setTruncStoreAction(MVT::v3f32, MVT::v3f16, Expand);		setTruncStoreAction(MVT::v3f32, MVT::v3f16, Expand);
setTruncStoreAction(MVT::v4f32, MVT::v4f16, Expand);		setTruncStoreAction(MVT::v4f32, MVT::v4f16, Expand);
setTruncStoreAction(MVT::v8f32, MVT::v8f16, Expand);		setTruncStoreAction(MVT::v8f32, MVT::v8f16, Expand);
setTruncStoreAction(MVT::v16f32, MVT::v16f16, Expand);		setTruncStoreAction(MVT::v16f32, MVT::v16f16, Expand);
setTruncStoreAction(MVT::v32f32, MVT::v32f16, Expand);		setTruncStoreAction(MVT::v32f32, MVT::v32f16, Expand);

		setTruncStoreAction(MVT::f64, MVT::bf16, Expand);
setTruncStoreAction(MVT::f64, MVT::f16, Expand);		setTruncStoreAction(MVT::f64, MVT::f16, Expand);
setTruncStoreAction(MVT::f64, MVT::f32, Expand);		setTruncStoreAction(MVT::f64, MVT::f32, Expand);

setTruncStoreAction(MVT::v2f64, MVT::v2f32, Expand);		setTruncStoreAction(MVT::v2f64, MVT::v2f32, Expand);
setTruncStoreAction(MVT::v2f64, MVT::v2f16, Expand);		setTruncStoreAction(MVT::v2f64, MVT::v2f16, Expand);

setTruncStoreAction(MVT::v3i64, MVT::v3i32, Expand);		setTruncStoreAction(MVT::v3i64, MVT::v3i32, Expand);
setTruncStoreAction(MVT::v3i64, MVT::v3i16, Expand);		setTruncStoreAction(MVT::v3i64, MVT::v3i16, Expand);

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

Show First 20 Lines • Show All 138 Lines • ▼ Show 20 Lines	SITargetLowering::SITargetLowering(const TargetMachine &TM,
addRegisterClass(MVT::v8f64, TRI->getVGPRClassForBitWidth(512));		addRegisterClass(MVT::v8f64, TRI->getVGPRClassForBitWidth(512));

addRegisterClass(MVT::v16i64, &AMDGPU::SGPR_1024RegClass);		addRegisterClass(MVT::v16i64, &AMDGPU::SGPR_1024RegClass);
addRegisterClass(MVT::v16f64, TRI->getVGPRClassForBitWidth(1024));		addRegisterClass(MVT::v16f64, TRI->getVGPRClassForBitWidth(1024));

if (Subtarget->has16BitInsts()) {		if (Subtarget->has16BitInsts()) {
addRegisterClass(MVT::i16, &AMDGPU::SReg_32RegClass);		addRegisterClass(MVT::i16, &AMDGPU::SReg_32RegClass);
addRegisterClass(MVT::f16, &AMDGPU::SReg_32RegClass);		addRegisterClass(MVT::f16, &AMDGPU::SReg_32RegClass);

		arsenm: Do you really need to add this to a register class? The only thing this would be useful for is for the calling convention contexts, which should promote to i16. If you do that you don't need most of the rest of this patch
// Unless there are also VOP3P operations, not operations are really legal.		// Unless there are also VOP3P operations, not operations are really legal.
addRegisterClass(MVT::v2i16, &AMDGPU::SReg_32RegClass);		addRegisterClass(MVT::v2i16, &AMDGPU::SReg_32RegClass);
addRegisterClass(MVT::v2f16, &AMDGPU::SReg_32RegClass);		addRegisterClass(MVT::v2f16, &AMDGPU::SReg_32RegClass);
addRegisterClass(MVT::v4i16, &AMDGPU::SReg_64RegClass);		addRegisterClass(MVT::v4i16, &AMDGPU::SReg_64RegClass);
addRegisterClass(MVT::v4f16, &AMDGPU::SReg_64RegClass);		addRegisterClass(MVT::v4f16, &AMDGPU::SReg_64RegClass);
addRegisterClass(MVT::v8i16, &AMDGPU::SGPR_128RegClass);		addRegisterClass(MVT::v8i16, &AMDGPU::SGPR_128RegClass);
addRegisterClass(MVT::v8f16, &AMDGPU::SGPR_128RegClass);		addRegisterClass(MVT::v8f16, &AMDGPU::SGPR_128RegClass);
addRegisterClass(MVT::v16i16, &AMDGPU::SGPR_256RegClass);		addRegisterClass(MVT::v16i16, &AMDGPU::SGPR_256RegClass);
▲ Show 20 Lines • Show All 314 Lines • ▼ Show 20 Lines	else
setOperationAction({ISD::FCEIL, ISD::FTRUNC, ISD::FRINT, ISD::FFLOOR},		setOperationAction({ISD::FCEIL, ISD::FTRUNC, ISD::FRINT, ISD::FFLOOR},
MVT::f64, Custom);		MVT::f64, Custom);

setOperationAction(ISD::FFLOOR, MVT::f64, Legal);		setOperationAction(ISD::FFLOOR, MVT::f64, Legal);

setOperationAction({ISD::FSIN, ISD::FCOS, ISD::FDIV}, MVT::f32, Custom);		setOperationAction({ISD::FSIN, ISD::FCOS, ISD::FDIV}, MVT::f32, Custom);
setOperationAction(ISD::FDIV, MVT::f64, Custom);		setOperationAction(ISD::FDIV, MVT::f64, Custom);

		setOperationAction(ISD::BF16_TO_FP, {MVT::i16, MVT::f32, MVT::f64}, Expand);
		setOperationAction(ISD::FP_TO_BF16, {MVT::i16, MVT::f32, MVT::f64}, Expand);

if (Subtarget->has16BitInsts()) {		if (Subtarget->has16BitInsts()) {
setOperationAction({ISD::Constant, ISD::SMIN, ISD::SMAX, ISD::UMIN,		setOperationAction({ISD::Constant, ISD::SMIN, ISD::SMAX, ISD::UMIN,
ISD::UMAX, ISD::UADDSAT, ISD::USUBSAT},		ISD::UMAX, ISD::UADDSAT, ISD::USUBSAT},
MVT::i16, Legal);		MVT::i16, Legal);

AddPromotedToType(ISD::SIGN_EXTEND, MVT::i16, MVT::i32);		AddPromotedToType(ISD::SIGN_EXTEND, MVT::i16, MVT::i32);

setOperationAction({ISD::ROTR, ISD::ROTL, ISD::SELECT_CC, ISD::BR_CC},		setOperationAction({ISD::ROTR, ISD::ROTL, ISD::SELECT_CC, ISD::BR_CC},
▲ Show 20 Lines • Show All 355 Lines • ▼ Show 20 Lines	MVT SITargetLowering::getRegisterTypeForCallingConv(LLVMContext &Context,
EVT VT) const {		EVT VT) const {
if (CC == CallingConv::AMDGPU_KERNEL)		if (CC == CallingConv::AMDGPU_KERNEL)
return TargetLowering::getRegisterTypeForCallingConv(Context, CC, VT);		return TargetLowering::getRegisterTypeForCallingConv(Context, CC, VT);

if (VT.isVector()) {		if (VT.isVector()) {
EVT ScalarVT = VT.getScalarType();		EVT ScalarVT = VT.getScalarType();
unsigned Size = ScalarVT.getSizeInBits();		unsigned Size = ScalarVT.getSizeInBits();
if (Size == 16) {		if (Size == 16) {
if (Subtarget->has16BitInsts())		if (Subtarget->has16BitInsts()) {
return VT.isInteger() ? MVT::v2i16 : MVT::v2f16;		if (VT.isInteger())
		return MVT::v2i16;
		return (ScalarVT == MVT::bf16 ? MVT::i32 : MVT::v2f16);
		}
return VT.isInteger() ? MVT::i32 : MVT::f32;		return VT.isInteger() ? MVT::i32 : MVT::f32;
}		}

if (Size < 16)		if (Size < 16)
return Subtarget->has16BitInsts() ? MVT::i16 : MVT::i32;		return Subtarget->has16BitInsts() ? MVT::i16 : MVT::i32;
return Size == 32 ? ScalarVT.getSimpleVT() : MVT::i32;		return Size == 32 ? ScalarVT.getSimpleVT() : MVT::i32;
}		}

Show All 36 Lines	unsigned SITargetLowering::getVectorTypeBreakdownForCallingConv(
if (CC != CallingConv::AMDGPU_KERNEL && VT.isVector()) {		if (CC != CallingConv::AMDGPU_KERNEL && VT.isVector()) {
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
EVT ScalarVT = VT.getScalarType();		EVT ScalarVT = VT.getScalarType();
unsigned Size = ScalarVT.getSizeInBits();		unsigned Size = ScalarVT.getSizeInBits();
// FIXME: We should fix the ABI to be the same on targets without 16-bit		// FIXME: We should fix the ABI to be the same on targets without 16-bit
// support, but unless we can properly handle 3-vectors, it will be still be		// support, but unless we can properly handle 3-vectors, it will be still be
// inconsistent.		// inconsistent.
if (Size == 16 && Subtarget->has16BitInsts()) {		if (Size == 16 && Subtarget->has16BitInsts()) {
		if (ScalarVT == MVT::bf16) {
		RegisterVT = MVT::i32;
		IntermediateVT = MVT::v2bf16;
		arsenm: If you wanted the promote to i32, you could have done it here instead of in the tablegen cc handling
		Pierre-vh: Do you mean somewhere else in that function? Changing v2bf16 to i32 here doesn't fix it. I also tried changing the function above, but I kept running into asserts, so I just left the TableGen CC for now.
		arsenm: Yes, that should force the bitcast of the argument type
		} else {
RegisterVT = VT.isInteger() ? MVT::v2i16 : MVT::v2f16;		RegisterVT = VT.isInteger() ? MVT::v2i16 : MVT::v2f16;
IntermediateVT = RegisterVT;		IntermediateVT = RegisterVT;
		}
NumIntermediates = (NumElts + 1) / 2;		NumIntermediates = (NumElts + 1) / 2;
return NumIntermediates;		return NumIntermediates;
}		}

if (Size == 32) {		if (Size == 32) {
RegisterVT = ScalarVT.getSimpleVT();		RegisterVT = ScalarVT.getSimpleVT();
IntermediateVT = RegisterVT;		IntermediateVT = RegisterVT;
NumIntermediates = NumElts;		NumIntermediates = NumElts;
▲ Show 20 Lines • Show All 3,864 Lines • ▼ Show 20 Lines	SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
case ISD::UMULO:		case ISD::UMULO:
return lowerXMULO(Op, DAG);		return lowerXMULO(Op, DAG);
case ISD::SMUL_LOHI:		case ISD::SMUL_LOHI:
case ISD::UMUL_LOHI:		case ISD::UMUL_LOHI:
return lowerXMUL_LOHI(Op, DAG);		return lowerXMUL_LOHI(Op, DAG);
case ISD::DYNAMIC_STACKALLOC:		case ISD::DYNAMIC_STACKALLOC:
return LowerDYNAMIC_STACKALLOC(Op, DAG);		return LowerDYNAMIC_STACKALLOC(Op, DAG);
}		}
return SDValue();		return SDValue();
}		}

// Used for D16: Casts the result of an instruction into the right vector,		// Used for D16: Casts the result of an instruction into the right vector,
// packs values if loads return unpacked values.		// packs values if loads return unpacked values.
static SDValue adjustLoadValueTypeImpl(SDValue Result, EVT LoadVT,		static SDValue adjustLoadValueTypeImpl(SDValue Result, EVT LoadVT,
const SDLoc &DL,		const SDLoc &DL,
SelectionDAG &DAG, bool Unpacked) {		SelectionDAG &DAG, bool Unpacked) {
if (!LoadVT.isVector())		if (!LoadVT.isVector())
return Result;		return Result;

// Cast back to the original packed type or to a larger type that is a		// Cast back to the original packed type or to a larger type that is a
// multiple of 32 bit for D16. Widening the return type is a required for		// multiple of 32 bit for D16. Widening the return type is a required for
		arsenm: The generic legalizer should have handled this?
		Pierre-vh: It looks like those operations are not implemented in the generic legalizer, e.g. I get "Do not know how to promote this operator's operand!"
		arsenm: Right, this is the code that would go there
		Pierre-vh: Do I just copy/paste this code in that PromoteInt function, and keep a copy here too in LowerOperation? (Not really a fan of copy-pasting code in different files, I'd rather keep it all here.) We need to have the lowering too AFAIK; it didn't go well when I tried to remove it.
		arsenm: I'm not following why you need to handle it here
		Pierre-vh: IIRC:
		- I need to handle FP_TO_BF16 in ReplaceNodeResult because that's what the Integer Legalizer calls (through CustomLowerNode).
		- I need to handle both opcodes in LowerOperation because otherwise they'll fail selection. They can be left over from expanding/legalizing other operations.
		arsenm: But why are they custom? We don't have to handle FP16_TO_FP or FP_TO_FP16 there, and they aren't custom lowered. They have the same basic properties. We have this:
		    setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote);
		    AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32);
		    setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote);
		    AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32);
		I'd expect the same basic pattern
		Pierre-vh: PromoteIntegerOperand, PromoteFloatOperand and PromoteIntegerResult don't handle FP_TO_BF16 and BF16_TO_FP, and unless we put a Custom lowering mode it'll assert/unreachable. I tried to make it work (for a while) using the default expand but I can't quite get it to work. It feels like there is some legalizer work missing for handling BF16 like we want to. Even though it's not ideal, I think the custom lowering is easiest.
		arsenm: What about Expand? That's where the implemented part is
		Pierre-vh: Last I tried, Expand will emit a libcall in many cases that we don't handle
		arsenm: Library call is supposed to be a distinct action now, the DAG only did about 5% of the work to migrate to using it. This code can go to the default expand action
		Pierre-vh: Does it need to happen in this commit? I think it'll delay the review quite a bit if other people have to review it. If it needs to happen, then what do I need to do? Use the Expand action & fix the legalizer in places where it needs to be fixed? I feel like it might be better suited for a follow-up patch; I can create a task and pick it up when I come back from vacation if you want.
// legalization.		// legalization.
EVT FittingLoadVT = LoadVT;		EVT FittingLoadVT = LoadVT;
if ((LoadVT.getVectorNumElements() % 2) == 1) {		if ((LoadVT.getVectorNumElements() % 2) == 1) {
FittingLoadVT =		FittingLoadVT =
EVT::getVectorVT(*DAG.getContext(), LoadVT.getVectorElementType(),		EVT::getVectorVT(*DAG.getContext(), LoadVT.getVectorElementType(),
LoadVT.getVectorNumElements() + 1);		LoadVT.getVectorNumElements() + 1);
}		}

▲ Show 20 Lines • Show All 322 Lines • ▼ Show 20 Lines	case ISD::FABS: {

SDValue Op = DAG.getNode(ISD::AND, SL, MVT::i32,		SDValue Op = DAG.getNode(ISD::AND, SL, MVT::i32,
BC,		BC,
DAG.getConstant(0x7fff7fff, SL, MVT::i32));		DAG.getConstant(0x7fff7fff, SL, MVT::i32));
Results.push_back(DAG.getNode(ISD::BITCAST, SL, MVT::v2f16, Op));		Results.push_back(DAG.getNode(ISD::BITCAST, SL, MVT::v2f16, Op));
return;		return;
}		}
default:		default:
break;		break;
}		}
}		}

/// Helper function for LowerBRCOND		/// Helper function for LowerBRCOND
static SDNode *findUser(SDValue Value, unsigned Opcode) {		static SDNode *findUser(SDValue Value, unsigned Opcode) {

SDNode *Parent = Value.getNode();		SDNode *Parent = Value.getNode();
for (SDNode::use_iterator I = Parent->use_begin(), E = Parent->use_end();		for (SDNode::use_iterator I = Parent->use_begin(), E = Parent->use_end();
I != E; ++I) {		I != E; ++I) {

if (I.getUse().get() != Value)		if (I.getUse().get() != Value)
continue;		continue;

		arsenm: Ditto
if (I->getOpcode() == Opcode)		if (I->getOpcode() == Opcode)
return *I;		return *I;
}		}
return nullptr;		return nullptr;
}		}

unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {		unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
if (Intr->getOpcode() == ISD::INTRINSIC_W_CHAIN) {		if (Intr->getOpcode() == ISD::INTRINSIC_W_CHAIN) {
▲ Show 20 Lines • Show All 375 Lines • ▼ Show 20 Lines
}		}

SDValue SITargetLowering::getSegmentAperture(unsigned AS, const SDLoc &DL,		SDValue SITargetLowering::getSegmentAperture(unsigned AS, const SDLoc &DL,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
if (Subtarget->hasApertureRegs()) {		if (Subtarget->hasApertureRegs()) {
const unsigned ApertureRegNo = (AS == AMDGPUAS::LOCAL_ADDRESS)		const unsigned ApertureRegNo = (AS == AMDGPUAS::LOCAL_ADDRESS)
? AMDGPU::SRC_SHARED_BASE		? AMDGPU::SRC_SHARED_BASE
: AMDGPU::SRC_PRIVATE_BASE;		: AMDGPU::SRC_PRIVATE_BASE;
// Note: this feature (register) is broken. When used as a 32-bit operand,		// Note: this feature (register) is broken. When used as a 32-bit operand,
		arsenm: Op.getValueType()
// it returns a wrong value (all zeroes?). The real value is in the upper 32		// it returns a wrong value (all zeroes?). The real value is in the upper 32
// bits.		// bits.
//		//
// To work around the issue, directly emit a 64 bit mov from this register		// To work around the issue, directly emit a 64 bit mov from this register
// then extract the high bits. Note that this shouldn't even result in a		// then extract the high bits. Note that this shouldn't even result in a
		arsenm: Should be specific cast, not FPExtOrRound. I don't think the FP_ROUND case would be correct
		Pierre-vh: But we need to do f32 -> f16, isn't FP_ROUND used for that? I thought it's what we needed
		arsenm: This is just this
// shift being emitted and simply become a pair of registers (e.g.):		// shift being emitted and simply become a pair of registers (e.g.):
// s_mov_b64 s[6:7], src_shared_base		// s_mov_b64 s[6:7], src_shared_base
// v_mov_b32_e32 v1, s7		// v_mov_b32_e32 v1, s7
//		//
// FIXME: It would be more natural to emit a CopyFromReg here, but then copy		// FIXME: It would be more natural to emit a CopyFromReg here, but then copy
// coalescing would kick in and it would think it's okay to use the "HI"		// coalescing would kick in and it would think it's okay to use the "HI"
		arsenm: Should use Op.getValueType() instead of Op->getValueType(0)
// subregister directly (instead of extracting the HI 32 bits) which is an		// subregister directly (instead of extracting the HI 32 bits) which is an
// artificial (unusable) register.		// artificial (unusable) register.
// Register TableGen definitions would need an overhaul to get rid of the		// Register TableGen definitions would need an overhaul to get rid of the
// artificial "HI" aperture registers and prevent this kind of issue from		// artificial "HI" aperture registers and prevent this kind of issue from
// happening.		// happening.
SDNode *Mov = DAG.getMachineNode(AMDGPU::S_MOV_B64, DL, MVT::i64,		SDNode *Mov = DAG.getMachineNode(AMDGPU::S_MOV_B64, DL, MVT::i64,
DAG.getRegister(ApertureRegNo, MVT::i64));		DAG.getRegister(ApertureRegNo, MVT::i64));
arsenm: ExpandNode covers lowering BF16_TO_FP. It also has a shift by 16 bits into the high bits. Is this correct?
Pierre-vh: Ah, I didn't know that. Though as long as we use custom lowering and our FP_TO_BF16/BF16_TO_FP methods are consistent, it should be fine, no?
arsenm: bfloat16 has the same number of exponent bits in the same high bits as f32; I kind of think the idea is you can just do a bitshift and then operate on f32? I think the fp_extend here is wrong.
arsenm: The default legalization also looks wrong to me. I don't understand why it isn't shifting down the mantissa bit.
Pierre-vh: Indeed, it was terribly wrong. I rewrote both legalizations following what I found online: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
bf16 is designed to be very easily convertible from/to f32, save for some edge cases with denormalized numbers I think, thus:
bf16 -> f32 is just a left shift by 16, filling the least-significant bits with zeroes.
f32 -> bf16 is just cutting off the 16 least-significant bits.
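For illustration only, the conversion described in the comment above can be sketched in standalone C++. This is not code from the patch: the helper names `f32ToBF16Truncate` and `bf16ToF32` are hypothetical, `float` is assumed to be a 32-bit IEEE-754 value, and the narrowing direction simply drops the low 16 bits (truncation), as the comment describes, rather than rounding to nearest.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the patch): f32 -> bf16 by keeping only the
// upper 16 bits (sign, 8 exponent bits, top 7 mantissa bits).
// Assumption: truncation, no rounding. A NaN whose payload lives only in the
// low 16 bits (e.g. 0x7F800001) would become infinity (0x7F80) this way.
uint16_t f32ToBF16Truncate(float Value) {
  uint32_t Bits;
  std::memcpy(&Bits, &Value, sizeof(Bits)); // reinterpret the float's bits
  return static_cast<uint16_t>(Bits >> 16);
}

// Hypothetical helper (not from the patch): bf16 -> f32 by shifting the 16
// stored bits into the high half and zero-filling the low half.
float bf16ToF32(uint16_t Stored) {
  uint32_t Bits = static_cast<uint32_t>(Stored) << 16;
  float Value;
  std::memcpy(&Value, &Bits, sizeof(Value));
  return Value;
}
```

As a quick check of the idea, `1.0f` has the bit pattern `0x3F800000`, so its truncated bf16 encoding is `0x3F80`, and shifting that back left by 16 reproduces `1.0f` exactly. The same right and left shifts by 16 show up in the expected assembly of the conversion tests in bf16.ll (`v_lshrrev_b32_e32 v0, 16, v0` and `v_lshlrev_b32_e32 v0, 16, v0`).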
return DAG.getNode(
arsenm: can just hardcode i32
    ISD::TRUNCATE, DL, MVT::i32,
    DAG.getNode(ISD::SRL, DL, MVT::i64,
        {SDValue(Mov, 0), DAG.getConstant(32, DL, MVT::i64)}));
}

// For code object version 5, private_base and shared_base are passed through
// implicit kernargs.
if (AMDGPU::getAmdhsaCodeObjectVersion() == 5) {
  ImplicitParameter Param =
      (AS == AMDGPUAS::LOCAL_ADDRESS) ? SHARED_BASE : PRIVATE_BASE;
  return loadImplicitKernelArgument(DAG, MVT::i32, DL, Align(4), Param);
}
arsenm: This is just this

MachineFunction &MF = DAG.getMachineFunction();
SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
Register UserSGPR = Info->getQueuePtrUserSGPR();
if (UserSGPR == AMDGPU::NoRegister) {
  // We probably are in a function incorrectly marked with
arsenm: This can be any_extend and the combiner will probably turn it into one
  // amdgpu-no-queue-ptr. This is undefined.
  return DAG.getUNDEF(MVT::i32);
arsenm: Can just hardcode i32
}

SDValue QueuePtr = CreateLiveInRegister(
    DAG, &AMDGPU::SReg_64RegClass, UserSGPR, MVT::i64);

// Offset into amd_queue_t for group_segment_aperture_base_hi /
// private_segment_aperture_base_hi.
uint32_t StructOffset = (AS == AMDGPUAS::LOCAL_ADDRESS) ? 0x40 : 0x44;

llvm/test/CodeGen/AMDGPU/bf16-ops.ll

This file was added.

				; RUN: not llc < %s -march=amdgcn -mcpu=hawaii
				; RUN: not llc < %s -march=amdgcn -mcpu=tonga
				; RUN: not llc < %s -march=amdgcn -mcpu=gfx900
				; RUN: not llc < %s -march=amdgcn -mcpu=gfx1010

				arsenm: Drop -verify-machineinstrs
				; TODO: Add GlobalISel tests, currently it silently miscompiles as GISel does not handle BF16 at all.

				; We only have storage-only BF16 support so check codegen fails if we attempt to do operations on bfloats.

				define void @test_fneg(bfloat %a, ptr addrspace(1) %out) {
				%result = fneg bfloat %a
				store bfloat %result, ptr addrspace(1) %out
				ret void
				}

				define void @test_fabs(bfloat %a, ptr addrspace(1) %out) {
				%result = fabs bfloat %a
				store bfloat %result, ptr addrspace(1) %out
				ret void
				}

				define void @test_add(bfloat %a, bfloat %b, ptr addrspace(1) %out) {
				%result = fadd bfloat %a, %b
				arsenm: Use opaque pointers
				store bfloat %result, ptr addrspace(1) %out
				ret void
				}

				define void @test_mul(bfloat %a, bfloat %b, ptr addrspace(1) %out) {
				%result = fmul bfloat %a, %b
				store bfloat %result, ptr addrspace(1) %out
				ret void
				}

llvm/test/CodeGen/AMDGPU/bf16.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -march=amdgcn -verify-machineinstrs | FileCheck %s -check-prefixes=GCN
				; RUN: llc < %s -march=amdgcn -mcpu=hawaii -verify-machineinstrs | FileCheck %s -check-prefixes=GFX7
				; RUN: llc < %s -march=amdgcn -mcpu=tonga -verify-machineinstrs | FileCheck %s -check-prefixes=GFX8
				; RUN: llc < %s -march=amdgcn -mcpu=gfx900 -verify-machineinstrs | FileCheck %s -check-prefixes=GFX9
				; RUN: llc < %s -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs | FileCheck %s -check-prefixes=GFX10

				; We only have storage-only BF16 support. We can load/store those values as we treat them as u16, but
				; we don't support operations on them. As such, codegen is expected to fail for any operation other
				; than simple load/stores.

				define void @test_load_store(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				arsenm: Doesn't cover the different f32/f64 conversions?
				; GCN-LABEL: test_load_store:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_ushort v0, v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: flat_store_short v[2:3], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_ushort v0, v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_short v[2:3], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_ushort v0, v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_short v[2:3], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load bfloat, ptr addrspace(1) %in
				store bfloat %val, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_f32_to_bf16(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_f32_to_bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_dword v0, v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_f32_to_bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_dword v0, v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_f32_to_bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_dword v0, v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: flat_store_short v[2:3], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_f32_to_bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_dword v0, v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_short_d16_hi v[2:3], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_f32_to_bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_dword v0, v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_short_d16_hi v[2:3], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load float, ptr addrspace(1) %in
				%val.bf16 = fptrunc float %val to bfloat
				store bfloat %val.bf16, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_f64_to_bf16(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_f64_to_bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_dwordx2 v[0:1], v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_f64_to_bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_dwordx2 v[0:1], v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_f64_to_bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_dwordx2 v[0:1], v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
				; GFX8-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: flat_store_short v[2:3], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_f64_to_bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_dwordx2 v[0:1], v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
				; GFX9-NEXT: global_store_short_d16_hi v[2:3], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_f64_to_bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_dwordx2 v[0:1], v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
				; GFX10-NEXT: global_store_short_d16_hi v[2:3], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load double, ptr addrspace(1) %in
				%val.bf16 = fptrunc double %val to bfloat
				store bfloat %val.bf16, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_bf16_to_f32(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_bf16_to_f32:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GCN-NEXT: buffer_store_dword v0, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_bf16_to_f32:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_dword v0, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_bf16_to_f32:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_ushort v0, v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: flat_store_dword v[2:3], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_bf16_to_f32:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v4, 0
				; GFX9-NEXT: global_load_short_d16_hi v4, v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_dword v[2:3], v4, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_bf16_to_f32:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v4, 0
				; GFX10-NEXT: global_load_short_d16_hi v4, v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_dword v[2:3], v4, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load bfloat, ptr addrspace(1) %in
				%val.f32 = fpext bfloat %val to float
				store float %val.f32, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_bf16_to_f64(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_bf16_to_f64:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GCN-NEXT: v_cvt_f64_f32_e32 v[0:1], v0
				; GCN-NEXT: buffer_store_dwordx2 v[0:1], v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_bf16_to_f64:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: v_cvt_f64_f32_e32 v[0:1], v0
				; GFX7-NEXT: buffer_store_dwordx2 v[0:1], v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_bf16_to_f64:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_ushort v0, v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: v_cvt_f64_f32_e32 v[0:1], v0
				; GFX8-NEXT: flat_store_dwordx2 v[2:3], v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_bf16_to_f64:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v4, 0
				; GFX9-NEXT: global_load_short_d16_hi v4, v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_cvt_f64_f32_e32 v[0:1], v4
				; GFX9-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_bf16_to_f64:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v4, 0
				; GFX10-NEXT: global_load_short_d16_hi v4, v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_cvt_f64_f32_e32 v[0:1], v4
				; GFX10-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load bfloat, ptr addrspace(1) %in
				%val.f64 = fpext bfloat %val to double
				store double %val.f64, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_v2bf16(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_v2bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_dword v0, v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_dword v0, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_v2bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_dword v0, v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_dword v0, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_v2bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_dword v0, v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: flat_store_dword v[2:3], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_v2bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_dword v0, v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_dword v[2:3], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_v2bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_dword v0, v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_dword v[2:3], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load <2 x bfloat>, ptr addrspace(1) %in
				store <2 x bfloat> %val, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_v4bf16(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_v4bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_dwordx2 v[0:1], v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_dwordx2 v[0:1], v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_v4bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_dwordx2 v[0:1], v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_dwordx2 v[0:1], v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_v4bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_dwordx2 v[0:1], v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: flat_store_dwordx2 v[2:3], v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_v4bf16:
				; GFX9: ; %bb.0:
				arsenm: Missing v3 test
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_dwordx2 v[0:1], v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_v4bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_dwordx2 v[0:1], v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load <4 x bfloat>, ptr addrspace(1) %in
				store <4 x bfloat> %val, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_v8bf16(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_v8bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_dwordx4 v[4:7], v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_v8bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_dwordx4 v[4:7], v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_v8bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_dwordx4 v[4:7], v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: flat_store_dwordx4 v[2:3], v[4:7]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_v8bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_dwordx4 v[4:7], v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_dwordx4 v[2:3], v[4:7], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_v8bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_dwordx4 v[4:7], v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_dwordx4 v[2:3], v[4:7], off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load <8 x bfloat>, ptr addrspace(1) %in
				store <8 x bfloat> %val, ptr addrspace(1) %out
				ret void
				}

				define void @test_load_store_v16bf16(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_load_store_v16bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[4:7], 0 addr64 offset:16
				; GCN-NEXT: buffer_load_dwordx4 v[8:11], v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(1)
				; GCN-NEXT: buffer_store_dwordx4 v[4:7], v[2:3], s[4:7], 0 addr64 offset:16
				; GCN-NEXT: s_waitcnt vmcnt(1)
				; GCN-NEXT: buffer_store_dwordx4 v[8:11], v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_load_store_v16bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[4:7], 0 addr64 offset:16
				; GFX7-NEXT: buffer_load_dwordx4 v[8:11], v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(1)
				; GFX7-NEXT: buffer_store_dwordx4 v[4:7], v[2:3], s[4:7], 0 addr64 offset:16
				; GFX7-NEXT: s_waitcnt vmcnt(1)
				; GFX7-NEXT: buffer_store_dwordx4 v[8:11], v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_load_store_v16bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v4, vcc, 16, v0
				; GFX8-NEXT: v_addc_u32_e32 v5, vcc, 0, v1, vcc
				; GFX8-NEXT: flat_load_dwordx4 v[4:7], v[4:5]
				; GFX8-NEXT: flat_load_dwordx4 v[8:11], v[0:1]
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v2
				; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v3, vcc
				; GFX8-NEXT: s_waitcnt vmcnt(1)
				; GFX8-NEXT: flat_store_dwordx4 v[0:1], v[4:7]
				; GFX8-NEXT: s_waitcnt vmcnt(1)
				; GFX8-NEXT: flat_store_dwordx4 v[2:3], v[8:11]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_load_store_v16bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_dwordx4 v[4:7], v[0:1], off offset:16
				; GFX9-NEXT: global_load_dwordx4 v[8:11], v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(1)
				; GFX9-NEXT: global_store_dwordx4 v[2:3], v[4:7], off offset:16
				; GFX9-NEXT: s_waitcnt vmcnt(1)
				; GFX9-NEXT: global_store_dwordx4 v[2:3], v[8:11], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_load_store_v16bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_clause 0x1
				; GFX10-NEXT: global_load_dwordx4 v[4:7], v[0:1], off offset:16
				; GFX10-NEXT: global_load_dwordx4 v[8:11], v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(1)
				; GFX10-NEXT: global_store_dwordx4 v[2:3], v[4:7], off offset:16
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_dwordx4 v[2:3], v[8:11], off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load <16 x bfloat>, ptr addrspace(1) %in
				store <16 x bfloat> %val, ptr addrspace(1) %out
				ret void
				}

				define void @test_arg_store(bfloat %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_arg_store:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_store_short v0, v[1:2], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_arg_store:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_store_short v0, v[1:2], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_arg_store:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: flat_store_short v[1:2], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_arg_store:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_store_short_d16_hi v[1:2], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_arg_store:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_store_short_d16_hi v[1:2], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store bfloat %in, ptr addrspace(1) %out
				ret void
				}

				define void @test_arg_store_v2bf16(<2 x bfloat> %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_arg_store_v2bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: v_alignbit_b32 v0, v1, v0, 16
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_store_dword v0, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_arg_store_v2bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: v_alignbit_b32 v0, v1, v0, 16
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_store_dword v0, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_arg_store_v2bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_store_dword v[1:2], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_arg_store_v2bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_store_dword v[1:2], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_arg_store_v2bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_store_dword v[1:2], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store <2 x bfloat> %in, ptr addrspace(1) %out
				ret void
				}

				define void @test_arg_store_v3bf16(<3 x bfloat> %in, <3 x bfloat> addrspace(1)* %out) {
				; GCN-LABEL: test_arg_store_v3bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: v_alignbit_b32 v0, v1, v0, 16
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_store_short v2, v[3:4], s[4:7], 0 addr64 offset:4
				; GCN-NEXT: buffer_store_dword v0, v[3:4], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_arg_store_v3bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: v_alignbit_b32 v0, v1, v0, 16
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v2
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_store_short v1, v[3:4], s[4:7], 0 addr64 offset:4
				; GFX7-NEXT: buffer_store_dword v0, v[3:4], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_arg_store_v3bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_store_dword v[2:3], v0
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 4, v2
				; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
				; GFX8-NEXT: flat_store_short v[2:3], v1
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_arg_store_v3bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_store_short v[2:3], v1, off offset:4
				; GFX9-NEXT: global_store_dword v[2:3], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_arg_store_v3bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_store_short v[2:3], v1, off offset:4
				; GFX10-NEXT: global_store_dword v[2:3], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store <3 x bfloat> %in, <3 x bfloat> addrspace(1)* %out
				ret void
				}

				define void @test_arg_store_v4bf16(<4 x bfloat> %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_arg_store_v4bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GCN-NEXT: v_lshrrev_b32_e32 v6, 16, v1
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: v_alignbit_b32 v1, v3, v2, 16
				; GCN-NEXT: v_alignbit_b32 v0, v6, v0, 16
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_store_dwordx2 v[0:1], v[4:5], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_arg_store_v4bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: v_alignbit_b32 v2, v3, v2, 16
				; GFX7-NEXT: v_alignbit_b32 v1, v1, v0, 16
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_store_dwordx2 v[1:2], v[4:5], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_arg_store_v4bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_store_dwordx2 v[2:3], v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_arg_store_v4bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_arg_store_v4bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store <4 x bfloat> %in, ptr addrspace(1) %out
				ret void
				}

				define void @test_arg_store_v8bf16(<8 x bfloat> %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_arg_store_v8bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GCN-NEXT: v_lshrrev_b32_e32 v10, 16, v5
				; GCN-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: v_alignbit_b32 v5, v7, v6, 16
				; GCN-NEXT: v_alignbit_b32 v4, v10, v4, 16
				; GCN-NEXT: v_alignbit_b32 v3, v3, v2, 16
				; GCN-NEXT: v_alignbit_b32 v2, v1, v0, 16
				; GCN-NEXT: buffer_store_dwordx4 v[2:5], v[8:9], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_arg_store_v8bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GFX7-NEXT: v_lshrrev_b32_e32 v5, 16, v5
				; GFX7-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: v_alignbit_b32 v6, v7, v6, 16
				; GFX7-NEXT: v_alignbit_b32 v5, v5, v4, 16
				; GFX7-NEXT: v_alignbit_b32 v4, v3, v2, 16
				; GFX7-NEXT: v_alignbit_b32 v3, v1, v0, 16
				; GFX7-NEXT: buffer_store_dwordx4 v[3:6], v[8:9], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_arg_store_v8bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_store_dwordx4 v[4:5], v[0:3]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_arg_store_v8bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_store_dwordx4 v[4:5], v[0:3], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_arg_store_v8bf16:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_store_dwordx4 v[4:5], v[0:3], off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store <8 x bfloat> %in, ptr addrspace(1) %out
				ret void
				}

				define void @test_arg_store_v16bf16(<16 x bfloat> %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_arg_store_v16bf16:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GCN-NEXT: v_lshrrev_b32_e32 v18, 16, v5
				; GCN-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: v_lshrrev_b32_e32 v15, 16, v15
				; GCN-NEXT: v_lshrrev_b32_e32 v19, 16, v13
				; GCN-NEXT: v_lshrrev_b32_e32 v11, 16, v11
				; GCN-NEXT: v_lshrrev_b32_e32 v9, 16, v9
				; GCN-NEXT: v_alignbit_b32 v5, v7, v6, 16
				; GCN-NEXT: v_alignbit_b32 v4, v18, v4, 16
				; GCN-NEXT: v_alignbit_b32 v3, v3, v2, 16
				; GCN-NEXT: v_alignbit_b32 v2, v1, v0, 16
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: v_alignbit_b32 v13, v15, v14, 16
				; GCN-NEXT: v_alignbit_b32 v12, v19, v12, 16
				; GCN-NEXT: v_alignbit_b32 v11, v11, v10, 16
				; GCN-NEXT: v_alignbit_b32 v10, v9, v8, 16
				arsenm: Ret of vectors
				; GCN-NEXT: buffer_store_dwordx4 v[10:13], v[16:17], s[4:7], 0 addr64 offset:16
				; GCN-NEXT: buffer_store_dwordx4 v[2:5], v[16:17], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_arg_store_v16bf16:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v5, 16, v5
				; GFX7-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: v_alignbit_b32 v5, v5, v4, 16
				; GFX7-NEXT: v_alignbit_b32 v4, v3, v2, 16
				; GFX7-NEXT: v_alignbit_b32 v3, v1, v0, 16
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v15
				; GFX7-NEXT: v_alignbit_b32 v14, v0, v14, 16
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v13
				; GFX7-NEXT: v_alignbit_b32 v13, v0, v12, 16
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v11
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: v_alignbit_b32 v12, v0, v10, 16
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v9
				; GFX7-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: v_alignbit_b32 v11, v0, v8, 16
				; GFX7-NEXT: v_alignbit_b32 v6, v7, v6, 16
				; GFX7-NEXT: buffer_store_dwordx4 v[11:14], v[16:17], s[4:7], 0 addr64 offset:16
				; GFX7-NEXT: buffer_store_dwordx4 v[3:6], v[16:17], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_arg_store_v16bf16:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_store_dwordx4 v[8:9], v[0:3]
				; GFX8-NEXT: s_nop 0
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v8
				; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v9, vcc
				; GFX8-NEXT: flat_store_dwordx4 v[0:1], v[4:7]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_arg_store_v16bf16:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_store_dwordx4 v[8:9], v[4:7], off offset:16
				; GFX9-NEXT: global_store_dwordx4 v[8:9], v[0:3], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_arg_store_v16bf16:
				; GFX10: ; %bb.0:
				arsenm: Don't use anonymous values. Also use opaque pointers
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_store_dwordx4 v[8:9], v[4:7], off offset:16
				arsenm: Should also test call argument, call return, passed in byval, sret, implicit sret, and passed in argument in overflow stack area
				; GFX10-NEXT: global_store_dwordx4 v[8:9], v[0:3], off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store <16 x bfloat> %in, ptr addrspace(1) %out
				ret void
				}

				define amdgpu_gfx void @test_inreg_arg_store(bfloat inreg %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_inreg_arg_store:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_lshr_b32 s34, s4, 16
				; GCN-NEXT: s_mov_b32 s38, 0
				; GCN-NEXT: s_mov_b32 s39, 0xf000
				; GCN-NEXT: s_mov_b32 s36, s38
				; GCN-NEXT: s_mov_b32 s37, s38
				; GCN-NEXT: v_mov_b32_e32 v2, s34
				; GCN-NEXT: buffer_store_short v2, v[0:1], s[36:39], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_inreg_arg_store:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_lshr_b32 s34, s4, 16
				; GFX7-NEXT: s_mov_b32 s38, 0
				; GFX7-NEXT: s_mov_b32 s39, 0xf000
				; GFX7-NEXT: s_mov_b32 s36, s38
				; GFX7-NEXT: s_mov_b32 s37, s38
				; GFX7-NEXT: v_mov_b32_e32 v2, s34
				; GFX7-NEXT: buffer_store_short v2, v[0:1], s[36:39], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_inreg_arg_store:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_lshr_b32 s34, s4, 16
				; GFX8-NEXT: v_mov_b32_e32 v2, s34
				; GFX8-NEXT: flat_store_short v[0:1], v2
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_inreg_arg_store:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v2, s4
				; GFX9-NEXT: global_store_short_d16_hi v[0:1], v2, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_inreg_arg_store:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v2, s4
				; GFX10-NEXT: global_store_short_d16_hi v[0:1], v2, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store bfloat %in, ptr addrspace(1) %out
				ret void
				}

				define bfloat @test_byval(ptr addrspace(5) byval(bfloat) %bv, bfloat %val) {
				; GCN-LABEL: test_byval:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v0
				; GCN-NEXT: buffer_store_short v1, off, s[0:3], s32
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_byval:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v0
				; GFX7-NEXT: buffer_store_short v1, off, s[0:3], s32
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_byval:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: v_lshrrev_b32_e32 v1, 16, v0
				; GFX8-NEXT: buffer_store_short v1, off, s[0:3], s32
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_byval:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], s32
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_byval:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], s32
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store bfloat %val, ptr addrspace(5) %bv
				%retval = load bfloat, ptr addrspace(5) %bv
				ret bfloat %retval
				}

				define void @test_sret(ptr addrspace(5) sret(bfloat) %sret, bfloat %val) {
				; GCN-LABEL: test_sret:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_sret:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_sret:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX8-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_sret:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_sret:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				store bfloat %val, ptr addrspace(5) %sret
				ret void
				}

				define void @test_bitcast_from_bfloat(ptr addrspace(1) %in, ptr addrspace(1) %out) {
				; GCN-LABEL: test_bitcast_from_bfloat:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_bitcast_from_bfloat:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_ushort v0, v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_short v0, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_bitcast_from_bfloat:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_ushort v0, v[0:1]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: flat_store_short v[2:3], v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_bitcast_from_bfloat:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_ushort v0, v[0:1], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_short v[2:3], v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_bitcast_from_bfloat:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_ushort v0, v[0:1], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_short v[2:3], v0, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load bfloat, ptr addrspace(1) %in
				%val_int = bitcast bfloat %val to i16
				store i16 %val_int, ptr addrspace(1) %out
				ret void
				}

				define void @test_bitcast_to_bfloat(ptr addrspace(1) %out, ptr addrspace(1) %in) {
				; GCN-LABEL: test_bitcast_to_bfloat:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s6, 0
				; GCN-NEXT: s_mov_b32 s7, 0xf000
				; GCN-NEXT: s_mov_b32 s4, s6
				; GCN-NEXT: s_mov_b32 s5, s6
				; GCN-NEXT: buffer_load_ushort v2, v[2:3], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v2, v[0:1], s[4:7], 0 addr64
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_bitcast_to_bfloat:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_mov_b32 s6, 0
				; GFX7-NEXT: s_mov_b32 s7, 0xf000
				; GFX7-NEXT: s_mov_b32 s4, s6
				; GFX7-NEXT: s_mov_b32 s5, s6
				; GFX7-NEXT: buffer_load_ushort v2, v[2:3], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_short v2, v[0:1], s[4:7], 0 addr64
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_bitcast_to_bfloat:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: flat_load_ushort v2, v[2:3]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: flat_store_short v[0:1], v2
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_bitcast_to_bfloat:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: global_load_ushort v2, v[2:3], off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: global_store_short v[0:1], v2, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_bitcast_to_bfloat:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: global_load_ushort v2, v[2:3], off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: global_store_short v[0:1], v2, off
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%val = load i16, ptr addrspace(1) %in
				%val_fp = bitcast i16 %val to bfloat
				store bfloat %val_fp, ptr addrspace(1) %out
				ret void
				}

				define bfloat @test_ret(bfloat %in) {
				; GCN-LABEL: test_ret:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_ret:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_ret:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_ret:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_ret:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				ret bfloat %in
				}

				define <2 x bfloat> @test_ret_v2bf16(<2 x bfloat> %in) {
				; GCN-LABEL: test_ret_v2bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_ret_v2bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_ret_v2bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_ret_v2bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_ret_v2bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				ret <2 x bfloat> %in
				}

				define <3 x bfloat> @test_ret_v3bf16(<3 x bfloat> %in) {
				; GCN-LABEL: test_ret_v3bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_ret_v3bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_ret_v3bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: v_and_b32_e32 v1, 0xffff, v1
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_ret_v3bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_and_b32_e32 v2, 0xffff0000, v0
				; GFX9-NEXT: s_mov_b32 s4, 0xffff
				; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v2
				; GFX9-NEXT: v_and_b32_e32 v1, 0xffff, v1
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_ret_v3bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_and_b32_e32 v2, 0xffff0000, v0
				; GFX10-NEXT: v_and_b32_e32 v1, 0xffff, v1
				; GFX10-NEXT: v_and_or_b32 v0, 0xffff, v0, v2
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				ret <3 x bfloat> %in
				}

				define <4 x bfloat> @test_ret_v4bf16(<4 x bfloat> %in) {
				; GCN-LABEL: test_ret_v4bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_ret_v4bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_ret_v4bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_ret_v4bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_ret_v4bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				ret <4 x bfloat> %in
				}

				define <8 x bfloat> @test_ret_v8bf16(<8 x bfloat> %in) {
				; GCN-LABEL: test_ret_v8bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_ret_v8bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_ret_v8bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_ret_v8bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_ret_v8bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				ret <8 x bfloat> %in
				}

				define <16 x bfloat> @test_ret_v16bf16(<16 x bfloat> %in) {
				; GCN-LABEL: test_ret_v16bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_ret_v16bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_ret_v16bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_ret_v16bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_ret_v16bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				ret <16 x bfloat> %in
				}

				define void @test_call(bfloat %in, ptr addrspace(5) %out) {
				; GCN-LABEL: test_call:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_writelane_b32 v2, s33, 2
				; GCN-NEXT: s_mov_b32 s33, s32
				; GCN-NEXT: s_addk_i32 s32, 0x400
				; GCN-NEXT: v_writelane_b32 v2, s30, 0
				; GCN-NEXT: v_writelane_b32 v2, s31, 1
				; GCN-NEXT: s_getpc_b64 s[4:5]
				; GCN-NEXT: s_add_u32 s4, s4, test_arg_store@gotpcrel32@lo+4
				; GCN-NEXT: s_addc_u32 s5, s5, test_arg_store@gotpcrel32@hi+12
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: buffer_store_short v0, v1, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_readlane_b32 s31, v2, 1
				; GCN-NEXT: v_readlane_b32 s30, v2, 0
				; GCN-NEXT: s_addk_i32 s32, 0xfc00
				; GCN-NEXT: v_readlane_b32 s33, v2, 2
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_call:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: v_writelane_b32 v2, s33, 2
				; GFX7-NEXT: s_mov_b32 s33, s32
				; GFX7-NEXT: s_addk_i32 s32, 0x400
				; GFX7-NEXT: s_getpc_b64 s[4:5]
				; GFX7-NEXT: s_add_u32 s4, s4, test_arg_store@gotpcrel32@lo+4
				; GFX7-NEXT: s_addc_u32 s5, s5, test_arg_store@gotpcrel32@hi+12
				; GFX7-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX7-NEXT: v_writelane_b32 v2, s30, 0
				; GFX7-NEXT: v_writelane_b32 v2, s31, 1
				; GFX7-NEXT: s_waitcnt lgkmcnt(0)
				; GFX7-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v0, v1, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_readlane_b32 s31, v2, 1
				; GFX7-NEXT: v_readlane_b32 s30, v2, 0
				; GFX7-NEXT: s_addk_i32 s32, 0xfc00
				; GFX7-NEXT: v_readlane_b32 s33, v2, 2
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_call:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: v_writelane_b32 v2, s33, 2
				; GFX8-NEXT: s_mov_b32 s33, s32
				; GFX8-NEXT: s_addk_i32 s32, 0x400
				; GFX8-NEXT: s_getpc_b64 s[4:5]
				; GFX8-NEXT: s_add_u32 s4, s4, test_arg_store@gotpcrel32@lo+4
				; GFX8-NEXT: s_addc_u32 s5, s5, test_arg_store@gotpcrel32@hi+12
				; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX8-NEXT: v_writelane_b32 v2, s30, 0
				; GFX8-NEXT: v_writelane_b32 v2, s31, 1
				; GFX8-NEXT: s_waitcnt lgkmcnt(0)
				; GFX8-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX8-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: buffer_store_short v0, v1, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_readlane_b32 s31, v2, 1
				; GFX8-NEXT: v_readlane_b32 s30, v2, 0
				; GFX8-NEXT: s_addk_i32 s32, 0xfc00
				; GFX8-NEXT: v_readlane_b32 s33, v2, 2
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_call:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: v_writelane_b32 v2, s33, 2
				; GFX9-NEXT: s_mov_b32 s33, s32
				; GFX9-NEXT: s_addk_i32 s32, 0x400
				; GFX9-NEXT: s_getpc_b64 s[4:5]
				; GFX9-NEXT: s_add_u32 s4, s4, test_arg_store@gotpcrel32@lo+4
				; GFX9-NEXT: s_addc_u32 s5, s5, test_arg_store@gotpcrel32@hi+12
				; GFX9-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX9-NEXT: v_writelane_b32 v2, s30, 0
				; GFX9-NEXT: v_writelane_b32 v2, s31, 1
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX9-NEXT: buffer_store_short_d16_hi v0, v1, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_readlane_b32 s31, v2, 1
				; GFX9-NEXT: v_readlane_b32 s30, v2, 0
				; GFX9-NEXT: s_addk_i32 s32, 0xfc00
				; GFX9-NEXT: v_readlane_b32 s33, v2, 2
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_call:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: v_writelane_b32 v2, s33, 2
				; GFX10-NEXT: s_mov_b32 s33, s32
				; GFX10-NEXT: s_addk_i32 s32, 0x200
				; GFX10-NEXT: s_getpc_b64 s[4:5]
				; GFX10-NEXT: s_add_u32 s4, s4, test_arg_store@gotpcrel32@lo+4
				; GFX10-NEXT: s_addc_u32 s5, s5, test_arg_store@gotpcrel32@hi+12
				; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX10-NEXT: v_writelane_b32 v2, s30, 0
				; GFX10-NEXT: v_writelane_b32 v2, s31, 1
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX10-NEXT: buffer_store_short_d16_hi v0, v1, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_readlane_b32 s31, v2, 1
				; GFX10-NEXT: v_readlane_b32 s30, v2, 0
				; GFX10-NEXT: s_addk_i32 s32, 0xfe00
				; GFX10-NEXT: v_readlane_b32 s33, v2, 2
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				%result = call bfloat @test_arg_store(bfloat %in)
				store volatile bfloat %result, ptr addrspace(5) %out
				ret void
				}

				define void @test_call_v2bf16(<2 x bfloat> %in, ptr addrspace(5) %out) {
				; GCN-LABEL: test_call_v2bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_writelane_b32 v3, s33, 2
				; GCN-NEXT: s_mov_b32 s33, s32
				; GCN-NEXT: s_addk_i32 s32, 0x400
				; GCN-NEXT: v_writelane_b32 v3, s30, 0
				; GCN-NEXT: v_writelane_b32 v3, s31, 1
				; GCN-NEXT: s_getpc_b64 s[4:5]
				; GCN-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GCN-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: v_add_i32_e32 v4, vcc, 2, v2
				; GCN-NEXT: buffer_store_short v1, v4, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v0, v2, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_readlane_b32 s31, v3, 1
				; GCN-NEXT: v_readlane_b32 s30, v3, 0
				; GCN-NEXT: s_addk_i32 s32, 0xfc00
				; GCN-NEXT: v_readlane_b32 s33, v3, 2
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_call_v2bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: v_writelane_b32 v3, s33, 2
				; GFX7-NEXT: s_mov_b32 s33, s32
				; GFX7-NEXT: s_addk_i32 s32, 0x400
				; GFX7-NEXT: s_getpc_b64 s[4:5]
				; GFX7-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX7-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX7-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX7-NEXT: v_writelane_b32 v3, s30, 0
				; GFX7-NEXT: v_writelane_b32 v3, s31, 1
				; GFX7-NEXT: s_waitcnt lgkmcnt(0)
				; GFX7-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: v_add_i32_e32 v4, vcc, 2, v2
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v1, v4, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_short v0, v2, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_readlane_b32 s31, v3, 1
				; GFX7-NEXT: v_readlane_b32 s30, v3, 0
				; GFX7-NEXT: s_addk_i32 s32, 0xfc00
				; GFX7-NEXT: v_readlane_b32 s33, v3, 2
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_call_v2bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: v_writelane_b32 v2, s33, 2
				; GFX8-NEXT: s_mov_b32 s33, s32
				; GFX8-NEXT: s_addk_i32 s32, 0x400
				; GFX8-NEXT: s_getpc_b64 s[4:5]
				; GFX8-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX8-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX8-NEXT: v_writelane_b32 v2, s30, 0
				; GFX8-NEXT: v_writelane_b32 v2, s31, 1
				; GFX8-NEXT: s_waitcnt lgkmcnt(0)
				; GFX8-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX8-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_readlane_b32 s31, v2, 1
				; GFX8-NEXT: v_readlane_b32 s30, v2, 0
				; GFX8-NEXT: s_addk_i32 s32, 0xfc00
				; GFX8-NEXT: v_readlane_b32 s33, v2, 2
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_call_v2bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: v_writelane_b32 v2, s33, 2
				; GFX9-NEXT: s_mov_b32 s33, s32
				; GFX9-NEXT: s_addk_i32 s32, 0x400
				; GFX9-NEXT: s_getpc_b64 s[4:5]
				; GFX9-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX9-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX9-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX9-NEXT: v_writelane_b32 v2, s30, 0
				; GFX9-NEXT: v_writelane_b32 v2, s31, 1
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_readlane_b32 s31, v2, 1
				; GFX9-NEXT: v_readlane_b32 s30, v2, 0
				; GFX9-NEXT: s_addk_i32 s32, 0xfc00
				; GFX9-NEXT: v_readlane_b32 s33, v2, 2
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_call_v2bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_store_dword v2, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: v_writelane_b32 v2, s33, 2
				; GFX10-NEXT: s_mov_b32 s33, s32
				; GFX10-NEXT: s_addk_i32 s32, 0x200
				; GFX10-NEXT: s_getpc_b64 s[4:5]
				; GFX10-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX10-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX10-NEXT: v_writelane_b32 v2, s30, 0
				; GFX10-NEXT: v_writelane_b32 v2, s31, 1
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX10-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_readlane_b32 s31, v2, 1
				; GFX10-NEXT: v_readlane_b32 s30, v2, 0
				; GFX10-NEXT: s_addk_i32 s32, 0xfe00
				; GFX10-NEXT: v_readlane_b32 s33, v2, 2
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				%result = call <2 x bfloat> @test_arg_store_v2bf16(<2 x bfloat> %in)
				store volatile <2 x bfloat> %result, ptr addrspace(5) %out
				ret void
				}

				define void @test_call_v3bf16(<3 x bfloat> %in, ptr addrspace(5) %out) {
				; GCN-LABEL: test_call_v3bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_store_dword v4, off, s[0:3], s32 ; 4-byte Folded Spill
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_writelane_b32 v4, s33, 2
				; GCN-NEXT: s_mov_b32 s33, s32
				; GCN-NEXT: s_addk_i32 s32, 0x400
				; GCN-NEXT: v_writelane_b32 v4, s30, 0
				; GCN-NEXT: v_writelane_b32 v4, s31, 1
				; GCN-NEXT: s_getpc_b64 s[4:5]
				; GCN-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GCN-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GCN-NEXT: v_add_i32_e32 v5, vcc, 4, v3
				; GCN-NEXT: v_alignbit_b32 v0, v1, v0, 16
				; GCN-NEXT: buffer_store_short v2, v5, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_dword v0, v3, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_readlane_b32 s31, v4, 1
				; GCN-NEXT: v_readlane_b32 s30, v4, 0
				; GCN-NEXT: s_addk_i32 s32, 0xfc00
				; GCN-NEXT: v_readlane_b32 s33, v4, 2
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_load_dword v4, off, s[0:3], s32 ; 4-byte Folded Reload
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_call_v3bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_store_dword v4, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: v_writelane_b32 v4, s33, 2
				; GFX7-NEXT: s_mov_b32 s33, s32
				; GFX7-NEXT: s_addk_i32 s32, 0x400
				; GFX7-NEXT: s_getpc_b64 s[4:5]
				; GFX7-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX7-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX7-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX7-NEXT: v_writelane_b32 v4, s30, 0
				; GFX7-NEXT: v_writelane_b32 v4, s31, 1
				; GFX7-NEXT: s_waitcnt lgkmcnt(0)
				; GFX7-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: v_alignbit_b32 v0, v1, v0, 16
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v2
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 4, v3
				; GFX7-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_dword v0, v3, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_readlane_b32 s31, v4, 1
				; GFX7-NEXT: v_readlane_b32 s30, v4, 0
				; GFX7-NEXT: s_addk_i32 s32, 0xfc00
				; GFX7-NEXT: v_readlane_b32 s33, v4, 2
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_load_dword v4, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_call_v3bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: v_writelane_b32 v3, s33, 2
				; GFX8-NEXT: s_mov_b32 s33, s32
				; GFX8-NEXT: s_addk_i32 s32, 0x400
				; GFX8-NEXT: s_getpc_b64 s[4:5]
				; GFX8-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX8-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX8-NEXT: v_writelane_b32 v3, s30, 0
				; GFX8-NEXT: v_and_b32_e32 v1, 0xffff, v1
				; GFX8-NEXT: v_writelane_b32 v3, s31, 1
				; GFX8-NEXT: s_waitcnt lgkmcnt(0)
				; GFX8-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX8-NEXT: v_add_u32_e32 v4, vcc, 4, v2
				; GFX8-NEXT: buffer_store_short v1, v4, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_store_dword v0, v2, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_readlane_b32 s31, v3, 1
				; GFX8-NEXT: v_readlane_b32 s30, v3, 0
				; GFX8-NEXT: s_addk_i32 s32, 0xfc00
				; GFX8-NEXT: v_readlane_b32 s33, v3, 2
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_call_v3bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: v_writelane_b32 v3, s33, 2
				; GFX9-NEXT: s_mov_b32 s33, s32
				; GFX9-NEXT: s_addk_i32 s32, 0x400
				; GFX9-NEXT: v_and_b32_e32 v4, 0xffff0000, v0
				; GFX9-NEXT: s_mov_b32 s4, 0xffff
				; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v4
				; GFX9-NEXT: s_getpc_b64 s[4:5]
				; GFX9-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX9-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX9-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX9-NEXT: v_writelane_b32 v3, s30, 0
				; GFX9-NEXT: v_and_b32_e32 v1, 0xffff, v1
				; GFX9-NEXT: v_writelane_b32 v3, s31, 1
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX9-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_dword v0, v2, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_readlane_b32 s31, v3, 1
				; GFX9-NEXT: v_readlane_b32 s30, v3, 0
				; GFX9-NEXT: s_addk_i32 s32, 0xfc00
				; GFX9-NEXT: v_readlane_b32 s33, v3, 2
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_call_v3bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: v_writelane_b32 v3, s33, 2
				; GFX10-NEXT: s_mov_b32 s33, s32
				; GFX10-NEXT: s_addk_i32 s32, 0x200
				; GFX10-NEXT: s_getpc_b64 s[4:5]
				; GFX10-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX10-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX10-NEXT: v_and_b32_e32 v4, 0xffff0000, v0
				; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX10-NEXT: v_writelane_b32 v3, s30, 0
				; GFX10-NEXT: v_and_b32_e32 v1, 0xffff, v1
				; GFX10-NEXT: v_and_or_b32 v0, 0xffff, v0, v4
				; GFX10-NEXT: v_writelane_b32 v3, s31, 1
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX10-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen offset:4
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_dword v0, v2, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_readlane_b32 s31, v3, 1
				; GFX10-NEXT: v_readlane_b32 s30, v3, 0
				; GFX10-NEXT: s_addk_i32 s32, 0xfe00
				; GFX10-NEXT: v_readlane_b32 s33, v3, 2
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				%result = call <3 x bfloat> @test_arg_store_v2bf16(<3 x bfloat> %in)
				store volatile <3 x bfloat> %result, ptr addrspace(5) %out
				ret void
				}

				define void @test_call_v4bf16(<4 x bfloat> %in, ptr addrspace(5) %out) {
				; GCN-LABEL: test_call_v4bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_store_dword v5, off, s[0:3], s32 ; 4-byte Folded Spill
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_writelane_b32 v5, s33, 2
				; GCN-NEXT: s_mov_b32 s33, s32
				; GCN-NEXT: s_addk_i32 s32, 0x400
				; GCN-NEXT: v_writelane_b32 v5, s30, 0
				; GCN-NEXT: v_writelane_b32 v5, s31, 1
				; GCN-NEXT: s_getpc_b64 s[4:5]
				; GCN-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GCN-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GCN-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GCN-NEXT: v_add_i32_e32 v6, vcc, 6, v4
				; GCN-NEXT: v_add_i32_e32 v7, vcc, 4, v4
				; GCN-NEXT: v_add_i32_e32 v8, vcc, 2, v4
				; GCN-NEXT: buffer_store_short v3, v6, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v2, v7, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v1, v8, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v0, v4, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_readlane_b32 s31, v5, 1
				; GCN-NEXT: v_readlane_b32 s30, v5, 0
				; GCN-NEXT: s_addk_i32 s32, 0xfc00
				; GCN-NEXT: v_readlane_b32 s33, v5, 2
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_load_dword v5, off, s[0:3], s32 ; 4-byte Folded Reload
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_call_v4bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_store_dword v5, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: v_writelane_b32 v5, s33, 2
				; GFX7-NEXT: s_mov_b32 s33, s32
				; GFX7-NEXT: s_addk_i32 s32, 0x400
				; GFX7-NEXT: s_getpc_b64 s[4:5]
				; GFX7-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX7-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX7-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX7-NEXT: v_writelane_b32 v5, s30, 0
				; GFX7-NEXT: v_writelane_b32 v5, s31, 1
				; GFX7-NEXT: s_waitcnt lgkmcnt(0)
				; GFX7-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX7-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GFX7-NEXT: v_add_i32_e32 v6, vcc, 6, v4
				; GFX7-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GFX7-NEXT: buffer_store_short v3, v6, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v3, vcc, 4, v4
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: buffer_store_short v2, v3, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 2, v4
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_short v0, v4, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_readlane_b32 s31, v5, 1
				; GFX7-NEXT: v_readlane_b32 s30, v5, 0
				; GFX7-NEXT: s_addk_i32 s32, 0xfc00
				; GFX7-NEXT: v_readlane_b32 s33, v5, 2
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_load_dword v5, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_call_v4bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: v_writelane_b32 v3, s33, 2
				; GFX8-NEXT: s_mov_b32 s33, s32
				; GFX8-NEXT: s_addk_i32 s32, 0x400
				; GFX8-NEXT: s_getpc_b64 s[4:5]
				; GFX8-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX8-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX8-NEXT: v_writelane_b32 v3, s30, 0
				; GFX8-NEXT: v_writelane_b32 v3, s31, 1
				; GFX8-NEXT: s_waitcnt lgkmcnt(0)
				; GFX8-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX8-NEXT: v_add_u32_e32 v6, vcc, 4, v2
				; GFX8-NEXT: v_lshrrev_b32_e32 v4, 16, v0
				; GFX8-NEXT: v_lshrrev_b32_e32 v5, 16, v1
				; GFX8-NEXT: buffer_store_short v1, v6, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_store_short v0, v2, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 6, v2
				; GFX8-NEXT: buffer_store_short v5, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 2, v2
				; GFX8-NEXT: buffer_store_short v4, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_readlane_b32 s31, v3, 1
				; GFX8-NEXT: v_readlane_b32 s30, v3, 0
				; GFX8-NEXT: s_addk_i32 s32, 0xfc00
				; GFX8-NEXT: v_readlane_b32 s33, v3, 2
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_call_v4bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: v_writelane_b32 v3, s33, 2
				; GFX9-NEXT: s_mov_b32 s33, s32
				; GFX9-NEXT: s_addk_i32 s32, 0x400
				; GFX9-NEXT: s_getpc_b64 s[4:5]
				; GFX9-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX9-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX9-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX9-NEXT: v_writelane_b32 v3, s30, 0
				; GFX9-NEXT: v_writelane_b32 v3, s31, 1
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX9-NEXT: buffer_store_short_d16_hi v1, v2, s[0:3], 0 offen offset:6
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v0, v2, s[0:3], 0 offen offset:2
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v0, v2, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_readlane_b32 s31, v3, 1
				; GFX9-NEXT: v_readlane_b32 s30, v3, 0
				; GFX9-NEXT: s_addk_i32 s32, 0xfc00
				; GFX9-NEXT: v_readlane_b32 s33, v3, 2
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_call_v4bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_store_dword v3, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: v_writelane_b32 v3, s33, 2
				; GFX10-NEXT: s_mov_b32 s33, s32
				; GFX10-NEXT: s_addk_i32 s32, 0x200
				; GFX10-NEXT: s_getpc_b64 s[4:5]
				; GFX10-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX10-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX10-NEXT: v_writelane_b32 v3, s30, 0
				; GFX10-NEXT: v_writelane_b32 v3, s31, 1
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX10-NEXT: buffer_store_short_d16_hi v1, v2, s[0:3], 0 offen offset:6
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen offset:4
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v0, v2, s[0:3], 0 offen offset:2
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v0, v2, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_readlane_b32 s31, v3, 1
				; GFX10-NEXT: v_readlane_b32 s30, v3, 0
				; GFX10-NEXT: s_addk_i32 s32, 0xfe00
				; GFX10-NEXT: v_readlane_b32 s33, v3, 2
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_load_dword v3, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				%result = call <4 x bfloat> @test_arg_store_v2bf16(<4 x bfloat> %in)
				store volatile <4 x bfloat> %result, ptr addrspace(5) %out
				ret void
				}

				define void @test_call_v8bf16(<8 x bfloat> %in, ptr addrspace(5) %out) {
				; GCN-LABEL: test_call_v8bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_store_dword v9, off, s[0:3], s32 ; 4-byte Folded Spill
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_writelane_b32 v9, s33, 2
				; GCN-NEXT: s_mov_b32 s33, s32
				; GCN-NEXT: s_addk_i32 s32, 0x400
				; GCN-NEXT: v_writelane_b32 v9, s30, 0
				; GCN-NEXT: v_writelane_b32 v9, s31, 1
				; GCN-NEXT: s_getpc_b64 s[4:5]
				; GCN-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GCN-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GCN-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GCN-NEXT: v_lshrrev_b32_e32 v4, 16, v4
				; GCN-NEXT: v_lshrrev_b32_e32 v5, 16, v5
				; GCN-NEXT: v_lshrrev_b32_e32 v6, 16, v6
				; GCN-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GCN-NEXT: v_add_i32_e32 v10, vcc, 14, v8
				; GCN-NEXT: v_add_i32_e32 v11, vcc, 12, v8
				; GCN-NEXT: v_add_i32_e32 v12, vcc, 10, v8
				; GCN-NEXT: v_add_i32_e32 v13, vcc, 8, v8
				; GCN-NEXT: v_add_i32_e32 v14, vcc, 6, v8
				; GCN-NEXT: v_add_i32_e32 v15, vcc, 4, v8
				; GCN-NEXT: v_add_i32_e32 v16, vcc, 2, v8
				; GCN-NEXT: buffer_store_short v7, v10, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v6, v11, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v5, v12, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v4, v13, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v3, v14, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v2, v15, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v1, v16, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v0, v8, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_readlane_b32 s31, v9, 1
				; GCN-NEXT: v_readlane_b32 s30, v9, 0
				; GCN-NEXT: s_addk_i32 s32, 0xfc00
				; GCN-NEXT: v_readlane_b32 s33, v9, 2
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_load_dword v9, off, s[0:3], s32 ; 4-byte Folded Reload
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_call_v8bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_store_dword v9, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: v_writelane_b32 v9, s33, 2
				; GFX7-NEXT: s_mov_b32 s33, s32
				; GFX7-NEXT: s_addk_i32 s32, 0x400
				; GFX7-NEXT: s_getpc_b64 s[4:5]
				; GFX7-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX7-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX7-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX7-NEXT: v_writelane_b32 v9, s30, 0
				; GFX7-NEXT: v_writelane_b32 v9, s31, 1
				; GFX7-NEXT: s_waitcnt lgkmcnt(0)
				; GFX7-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX7-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GFX7-NEXT: v_add_i32_e32 v10, vcc, 14, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v6, 16, v6
				; GFX7-NEXT: buffer_store_short v7, v10, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v7, vcc, 12, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v5, 16, v5
				; GFX7-NEXT: buffer_store_short v6, v7, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v6, vcc, 10, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v4, 16, v4
				; GFX7-NEXT: buffer_store_short v5, v6, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v5, vcc, 8, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GFX7-NEXT: buffer_store_short v4, v5, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v4, vcc, 6, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GFX7-NEXT: buffer_store_short v3, v4, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v3, vcc, 4, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: buffer_store_short v2, v3, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 2, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_short v0, v8, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_readlane_b32 s31, v9, 1
				; GFX7-NEXT: v_readlane_b32 s30, v9, 0
				; GFX7-NEXT: s_addk_i32 s32, 0xfc00
				; GFX7-NEXT: v_readlane_b32 s33, v9, 2
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_load_dword v9, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_call_v8bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_store_dword v5, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: v_writelane_b32 v5, s33, 2
				; GFX8-NEXT: s_mov_b32 s33, s32
				; GFX8-NEXT: s_addk_i32 s32, 0x400
				; GFX8-NEXT: s_getpc_b64 s[4:5]
				; GFX8-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX8-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX8-NEXT: v_writelane_b32 v5, s30, 0
				; GFX8-NEXT: v_writelane_b32 v5, s31, 1
				; GFX8-NEXT: s_waitcnt lgkmcnt(0)
				; GFX8-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX8-NEXT: v_add_u32_e32 v10, vcc, 12, v4
				; GFX8-NEXT: v_lshrrev_b32_e32 v9, 16, v3
				; GFX8-NEXT: buffer_store_short v3, v10, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v3, vcc, 8, v4
				; GFX8-NEXT: v_lshrrev_b32_e32 v8, 16, v2
				; GFX8-NEXT: buffer_store_short v2, v3, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 4, v4
				; GFX8-NEXT: v_lshrrev_b32_e32 v6, 16, v0
				; GFX8-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_store_short v0, v4, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 14, v4
				; GFX8-NEXT: buffer_store_short v9, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 10, v4
				; GFX8-NEXT: v_lshrrev_b32_e32 v7, 16, v1
				; GFX8-NEXT: buffer_store_short v8, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 6, v4
				; GFX8-NEXT: buffer_store_short v7, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 2, v4
				; GFX8-NEXT: buffer_store_short v6, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_readlane_b32 s31, v5, 1
				; GFX8-NEXT: v_readlane_b32 s30, v5, 0
				; GFX8-NEXT: s_addk_i32 s32, 0xfc00
				; GFX8-NEXT: v_readlane_b32 s33, v5, 2
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_load_dword v5, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_call_v8bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_store_dword v5, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: v_writelane_b32 v5, s33, 2
				; GFX9-NEXT: s_mov_b32 s33, s32
				; GFX9-NEXT: s_addk_i32 s32, 0x400
				; GFX9-NEXT: s_getpc_b64 s[4:5]
				; GFX9-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX9-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX9-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX9-NEXT: v_writelane_b32 v5, s30, 0
				; GFX9-NEXT: v_writelane_b32 v5, s31, 1
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX9-NEXT: buffer_store_short_d16_hi v3, v4, s[0:3], 0 offen offset:14
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v3, v4, s[0:3], 0 offen offset:12
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v2, v4, s[0:3], 0 offen offset:10
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v2, v4, s[0:3], 0 offen offset:8
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v1, v4, s[0:3], 0 offen offset:6
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v1, v4, s[0:3], 0 offen offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v0, v4, s[0:3], 0 offen offset:2
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v0, v4, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_readlane_b32 s31, v5, 1
				; GFX9-NEXT: v_readlane_b32 s30, v5, 0
				; GFX9-NEXT: s_addk_i32 s32, 0xfc00
				; GFX9-NEXT: v_readlane_b32 s33, v5, 2
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_load_dword v5, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_call_v8bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_store_dword v5, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: v_writelane_b32 v5, s33, 2
				; GFX10-NEXT: s_mov_b32 s33, s32
				; GFX10-NEXT: s_addk_i32 s32, 0x200
				; GFX10-NEXT: s_getpc_b64 s[4:5]
				; GFX10-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX10-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX10-NEXT: v_writelane_b32 v5, s30, 0
				; GFX10-NEXT: v_writelane_b32 v5, s31, 1
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX10-NEXT: buffer_store_short_d16_hi v3, v4, s[0:3], 0 offen offset:14
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v3, v4, s[0:3], 0 offen offset:12
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v2, v4, s[0:3], 0 offen offset:10
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v2, v4, s[0:3], 0 offen offset:8
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v1, v4, s[0:3], 0 offen offset:6
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v1, v4, s[0:3], 0 offen offset:4
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v0, v4, s[0:3], 0 offen offset:2
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v0, v4, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_readlane_b32 s31, v5, 1
				; GFX10-NEXT: v_readlane_b32 s30, v5, 0
				; GFX10-NEXT: s_addk_i32 s32, 0xfe00
				; GFX10-NEXT: v_readlane_b32 s33, v5, 2
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_load_dword v5, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				%result = call <8 x bfloat> @test_arg_store_v2bf16(<8 x bfloat> %in)
				store volatile <8 x bfloat> %result, ptr addrspace(5) %out
				ret void
				}

				define void @test_call_v16bf16(<16 x bfloat> %in, ptr addrspace(5) %out) {
				; GCN-LABEL: test_call_v16bf16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_store_dword v17, off, s[0:3], s32 ; 4-byte Folded Spill
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_writelane_b32 v17, s33, 2
				; GCN-NEXT: s_mov_b32 s33, s32
				; GCN-NEXT: s_addk_i32 s32, 0x400
				; GCN-NEXT: v_writelane_b32 v17, s30, 0
				; GCN-NEXT: v_writelane_b32 v17, s31, 1
				; GCN-NEXT: s_getpc_b64 s[4:5]
				; GCN-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GCN-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GCN-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GCN-NEXT: v_lshrrev_b32_e32 v4, 16, v4
				; GCN-NEXT: v_lshrrev_b32_e32 v5, 16, v5
				; GCN-NEXT: v_lshrrev_b32_e32 v6, 16, v6
				; GCN-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GCN-NEXT: v_lshrrev_b32_e32 v8, 16, v8
				; GCN-NEXT: v_lshrrev_b32_e32 v9, 16, v9
				; GCN-NEXT: v_lshrrev_b32_e32 v10, 16, v10
				; GCN-NEXT: v_lshrrev_b32_e32 v11, 16, v11
				; GCN-NEXT: v_lshrrev_b32_e32 v12, 16, v12
				; GCN-NEXT: v_lshrrev_b32_e32 v13, 16, v13
				; GCN-NEXT: v_lshrrev_b32_e32 v14, 16, v14
				; GCN-NEXT: v_lshrrev_b32_e32 v15, 16, v15
				; GCN-NEXT: v_add_i32_e32 v18, vcc, 30, v16
				; GCN-NEXT: v_add_i32_e32 v19, vcc, 28, v16
				; GCN-NEXT: v_add_i32_e32 v20, vcc, 26, v16
				; GCN-NEXT: buffer_store_short v15, v18, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v15, vcc, 24, v16
				; GCN-NEXT: v_add_i32_e32 v18, vcc, 22, v16
				; GCN-NEXT: buffer_store_short v14, v19, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v14, vcc, 20, v16
				; GCN-NEXT: v_add_i32_e32 v19, vcc, 18, v16
				; GCN-NEXT: buffer_store_short v13, v20, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v13, vcc, 16, v16
				; GCN-NEXT: v_add_i32_e32 v20, vcc, 14, v16
				; GCN-NEXT: buffer_store_short v12, v15, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v12, vcc, 12, v16
				; GCN-NEXT: v_add_i32_e32 v15, vcc, 10, v16
				; GCN-NEXT: buffer_store_short v11, v18, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v11, vcc, 8, v16
				; GCN-NEXT: v_add_i32_e32 v18, vcc, 6, v16
				; GCN-NEXT: buffer_store_short v10, v14, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v10, vcc, 4, v16
				; GCN-NEXT: v_add_i32_e32 v14, vcc, 2, v16
				; GCN-NEXT: buffer_store_short v9, v19, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v8, v13, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v7, v20, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v6, v12, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v5, v15, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v4, v11, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v3, v18, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v2, v10, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v1, v14, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: buffer_store_short v0, v16, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_readlane_b32 s31, v17, 1
				; GCN-NEXT: v_readlane_b32 s30, v17, 0
				; GCN-NEXT: s_addk_i32 s32, 0xfc00
				; GCN-NEXT: v_readlane_b32 s33, v17, 2
				; GCN-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GCN-NEXT: buffer_load_dword v17, off, s[0:3], s32 ; 4-byte Folded Reload
				; GCN-NEXT: s_mov_b64 exec, s[4:5]
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_call_v16bf16:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_store_dword v17, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: v_writelane_b32 v17, s33, 2
				; GFX7-NEXT: s_mov_b32 s33, s32
				; GFX7-NEXT: s_addk_i32 s32, 0x400
				; GFX7-NEXT: s_getpc_b64 s[4:5]
				; GFX7-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX7-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX7-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX7-NEXT: v_writelane_b32 v17, s30, 0
				; GFX7-NEXT: v_writelane_b32 v17, s31, 1
				; GFX7-NEXT: s_waitcnt lgkmcnt(0)
				; GFX7-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX7-NEXT: v_lshrrev_b32_e32 v15, 16, v15
				; GFX7-NEXT: v_add_i32_e32 v18, vcc, 30, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v14, 16, v14
				; GFX7-NEXT: buffer_store_short v15, v18, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v15, vcc, 28, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v13, 16, v13
				; GFX7-NEXT: buffer_store_short v14, v15, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v14, vcc, 26, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v12, 16, v12
				; GFX7-NEXT: buffer_store_short v13, v14, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v13, vcc, 24, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v11, 16, v11
				; GFX7-NEXT: buffer_store_short v12, v13, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v12, vcc, 22, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v10, 16, v10
				; GFX7-NEXT: buffer_store_short v11, v12, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v11, vcc, 20, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v9, 16, v9
				; GFX7-NEXT: buffer_store_short v10, v11, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v10, vcc, 18, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v8, 16, v8
				; GFX7-NEXT: buffer_store_short v9, v10, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v9, vcc, 16, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v7, 16, v7
				; GFX7-NEXT: buffer_store_short v8, v9, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v8, vcc, 14, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v6, 16, v6
				; GFX7-NEXT: buffer_store_short v7, v8, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v7, vcc, 12, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v5, 16, v5
				; GFX7-NEXT: buffer_store_short v6, v7, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v6, vcc, 10, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v4, 16, v4
				; GFX7-NEXT: buffer_store_short v5, v6, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v5, vcc, 8, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v3, 16, v3
				; GFX7-NEXT: buffer_store_short v4, v5, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v4, vcc, 6, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v2, 16, v2
				; GFX7-NEXT: buffer_store_short v3, v4, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v3, vcc, 4, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: buffer_store_short v2, v3, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 2, v16
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_short v0, v16, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_readlane_b32 s31, v17, 1
				; GFX7-NEXT: v_readlane_b32 s30, v17, 0
				; GFX7-NEXT: s_addk_i32 s32, 0xfc00
				; GFX7-NEXT: v_readlane_b32 s33, v17, 2
				; GFX7-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX7-NEXT: buffer_load_dword v17, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX7-NEXT: s_mov_b64 exec, s[4:5]
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_call_v16bf16:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_store_dword v9, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: v_writelane_b32 v9, s33, 2
				; GFX8-NEXT: s_mov_b32 s33, s32
				; GFX8-NEXT: s_addk_i32 s32, 0x400
				; GFX8-NEXT: s_getpc_b64 s[4:5]
				; GFX8-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX8-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX8-NEXT: v_writelane_b32 v9, s30, 0
				; GFX8-NEXT: v_writelane_b32 v9, s31, 1
				; GFX8-NEXT: s_waitcnt lgkmcnt(0)
				; GFX8-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX8-NEXT: v_add_u32_e32 v18, vcc, 28, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v17, 16, v7
				; GFX8-NEXT: buffer_store_short v7, v18, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v7, vcc, 24, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v16, 16, v6
				; GFX8-NEXT: buffer_store_short v6, v7, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v6, vcc, 20, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v15, 16, v5
				; GFX8-NEXT: buffer_store_short v5, v6, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v5, vcc, 16, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v14, 16, v4
				; GFX8-NEXT: buffer_store_short v4, v5, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v4, vcc, 12, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v13, 16, v3
				; GFX8-NEXT: buffer_store_short v3, v4, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v3, vcc, 8, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v12, 16, v2
				; GFX8-NEXT: buffer_store_short v2, v3, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 4, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v10, 16, v0
				; GFX8-NEXT: buffer_store_short v1, v2, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_store_short v0, v8, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 30, v8
				; GFX8-NEXT: buffer_store_short v17, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 26, v8
				; GFX8-NEXT: buffer_store_short v16, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 22, v8
				; GFX8-NEXT: buffer_store_short v15, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 18, v8
				; GFX8-NEXT: buffer_store_short v14, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 14, v8
				; GFX8-NEXT: buffer_store_short v13, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 10, v8
				; GFX8-NEXT: v_lshrrev_b32_e32 v11, 16, v1
				; GFX8-NEXT: buffer_store_short v12, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 6, v8
				; GFX8-NEXT: buffer_store_short v11, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 2, v8
				; GFX8-NEXT: buffer_store_short v10, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_readlane_b32 s31, v9, 1
				; GFX8-NEXT: v_readlane_b32 s30, v9, 0
				; GFX8-NEXT: s_addk_i32 s32, 0xfc00
				; GFX8-NEXT: v_readlane_b32 s33, v9, 2
				; GFX8-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX8-NEXT: buffer_load_dword v9, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX8-NEXT: s_mov_b64 exec, s[4:5]
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_call_v16bf16:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_store_dword v9, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: v_writelane_b32 v9, s33, 2
				; GFX9-NEXT: s_mov_b32 s33, s32
				; GFX9-NEXT: s_addk_i32 s32, 0x400
				; GFX9-NEXT: s_getpc_b64 s[4:5]
				; GFX9-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX9-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX9-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX9-NEXT: v_writelane_b32 v9, s30, 0
				; GFX9-NEXT: v_writelane_b32 v9, s31, 1
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX9-NEXT: buffer_store_short_d16_hi v7, v8, s[0:3], 0 offen offset:30
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v7, v8, s[0:3], 0 offen offset:28
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v6, v8, s[0:3], 0 offen offset:26
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v6, v8, s[0:3], 0 offen offset:24
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v5, v8, s[0:3], 0 offen offset:22
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v5, v8, s[0:3], 0 offen offset:20
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v4, v8, s[0:3], 0 offen offset:18
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v4, v8, s[0:3], 0 offen offset:16
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v3, v8, s[0:3], 0 offen offset:14
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v3, v8, s[0:3], 0 offen offset:12
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v2, v8, s[0:3], 0 offen offset:10
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v2, v8, s[0:3], 0 offen offset:8
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v1, v8, s[0:3], 0 offen offset:6
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v1, v8, s[0:3], 0 offen offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v0, v8, s[0:3], 0 offen offset:2
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: buffer_store_short v0, v8, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_readlane_b32 s31, v9, 1
				; GFX9-NEXT: v_readlane_b32 s30, v9, 0
				; GFX9-NEXT: s_addk_i32 s32, 0xfc00
				; GFX9-NEXT: v_readlane_b32 s33, v9, 2
				; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
				; GFX9-NEXT: buffer_load_dword v9, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX9-NEXT: s_mov_b64 exec, s[4:5]
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_call_v16bf16:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_store_dword v9, off, s[0:3], s32 ; 4-byte Folded Spill
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: v_writelane_b32 v9, s33, 2
				; GFX10-NEXT: s_mov_b32 s33, s32
				; GFX10-NEXT: s_addk_i32 s32, 0x200
				; GFX10-NEXT: s_getpc_b64 s[4:5]
				; GFX10-NEXT: s_add_u32 s4, s4, test_arg_store_v2bf16@gotpcrel32@lo+4
				; GFX10-NEXT: s_addc_u32 s5, s5, test_arg_store_v2bf16@gotpcrel32@hi+12
				; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GFX10-NEXT: v_writelane_b32 v9, s30, 0
				; GFX10-NEXT: v_writelane_b32 v9, s31, 1
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; GFX10-NEXT: buffer_store_short_d16_hi v7, v8, s[0:3], 0 offen offset:30
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v7, v8, s[0:3], 0 offen offset:28
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v6, v8, s[0:3], 0 offen offset:26
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v6, v8, s[0:3], 0 offen offset:24
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v5, v8, s[0:3], 0 offen offset:22
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v5, v8, s[0:3], 0 offen offset:20
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v4, v8, s[0:3], 0 offen offset:18
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v4, v8, s[0:3], 0 offen offset:16
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v3, v8, s[0:3], 0 offen offset:14
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v3, v8, s[0:3], 0 offen offset:12
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v2, v8, s[0:3], 0 offen offset:10
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v2, v8, s[0:3], 0 offen offset:8
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v1, v8, s[0:3], 0 offen offset:6
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v1, v8, s[0:3], 0 offen offset:4
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short_d16_hi v0, v8, s[0:3], 0 offen offset:2
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_store_short v0, v8, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_readlane_b32 s31, v9, 1
				; GFX10-NEXT: v_readlane_b32 s30, v9, 0
				; GFX10-NEXT: s_addk_i32 s32, 0xfe00
				; GFX10-NEXT: v_readlane_b32 s33, v9, 2
				; GFX10-NEXT: s_or_saveexec_b32 s4, -1
				; GFX10-NEXT: buffer_load_dword v9, off, s[0:3], s32 ; 4-byte Folded Reload
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_mov_b32 exec_lo, s4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				%result = call <16 x bfloat> @test_arg_store_v2bf16(<16 x bfloat> %in)
				store volatile <16 x bfloat> %result, ptr addrspace(5) %out
				ret void
				}

				define bfloat @test_alloca_load_store_ret(bfloat %in) {
				; GCN-LABEL: test_alloca_load_store_ret:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GCN-NEXT: buffer_store_short v0, off, s[0:3], s32
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: buffer_load_ushort v0, off, s[0:3], s32 glc
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_alloca_load_store_ret:
				; GFX7: ; %bb.0: ; %entry
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: buffer_store_short v0, off, s[0:3], s32
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_load_ushort v0, off, s[0:3], s32 glc
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_alloca_load_store_ret:
				; GFX8: ; %bb.0: ; %entry
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: v_lshrrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: buffer_store_short v0, off, s[0:3], s32
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_load_ushort v0, off, s[0:3], s32 glc
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: v_lshlrev_b32_e32 v0, 16, v0
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_alloca_load_store_ret:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], s32
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 0
				; GFX9-NEXT: buffer_load_short_d16_hi v0, off, s[0:3], s32 glc
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_alloca_load_store_ret:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v1, 0
				; GFX10-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], s32
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: buffer_load_short_d16_hi v1, off, s[0:3], s32 glc dlc
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, v1
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				entry:
				%in.addr = alloca bfloat, align 2, addrspace(5)
				store volatile bfloat %in, ptr addrspace(5) %in.addr, align 2
				%loaded = load volatile bfloat, ptr addrspace(5) %in.addr, align 2
				ret bfloat %loaded
				}

				define { <32 x i32>, bfloat } @test_overflow_stack(bfloat %a, <32 x i32> %b) {
				; GCN-LABEL: test_overflow_stack:
				; GCN: ; %bb.0:
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: buffer_load_dword v2, off, s[0:3], s32 offset:8
				; GCN-NEXT: v_add_i32_e32 v31, vcc, 0x7c, v0
				; GCN-NEXT: buffer_load_dword v32, off, s[0:3], s32 offset:4
				; GCN-NEXT: buffer_load_dword v33, off, s[0:3], s32
				; GCN-NEXT: s_waitcnt vmcnt(2)
				; GCN-NEXT: buffer_store_dword v2, v31, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v2, vcc, 0x78, v0
				; GCN-NEXT: s_waitcnt vmcnt(2)
				; GCN-NEXT: buffer_store_dword v32, v2, s[0:3], 0 offen
				; GCN-NEXT: v_add_i32_e32 v2, vcc, 0x74, v0
				; GCN-NEXT: s_waitcnt vmcnt(2)
				; GCN-NEXT: buffer_store_dword v33, v2, s[0:3], 0 offen
				; GCN-NEXT: v_add_i32_e32 v2, vcc, 0x70, v0
				; GCN-NEXT: v_add_i32_e32 v31, vcc, 0x6c, v0
				; GCN-NEXT: buffer_store_dword v30, v2, s[0:3], 0 offen
				; GCN-NEXT: v_add_i32_e32 v2, vcc, 0x68, v0
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v30, vcc, 0x64, v0
				; GCN-NEXT: buffer_store_dword v29, v31, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v29, vcc, 0x60, v0
				; GCN-NEXT: v_add_i32_e32 v31, vcc, 0x5c, v0
				; GCN-NEXT: buffer_store_dword v28, v2, s[0:3], 0 offen
				; GCN-NEXT: v_add_i32_e32 v2, vcc, 0x58, v0
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v28, vcc, 0x54, v0
				; GCN-NEXT: buffer_store_dword v27, v30, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v27, vcc, 0x50, v0
				; GCN-NEXT: v_add_i32_e32 v30, vcc, 0x4c, v0
				; GCN-NEXT: buffer_store_dword v26, v29, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v26, vcc, 0x48, v0
				; GCN-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GCN-NEXT: v_add_i32_e32 v29, vcc, 0x44, v0
				; GCN-NEXT: buffer_store_dword v25, v31, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v25, vcc, 64, v0
				; GCN-NEXT: v_add_i32_e32 v31, vcc, 60, v0
				; GCN-NEXT: buffer_store_dword v24, v2, s[0:3], 0 offen
				; GCN-NEXT: v_add_i32_e32 v2, vcc, 56, v0
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v24, vcc, 52, v0
				; GCN-NEXT: buffer_store_dword v23, v28, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v23, vcc, 48, v0
				; GCN-NEXT: v_add_i32_e32 v28, vcc, 44, v0
				; GCN-NEXT: buffer_store_dword v22, v27, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v22, vcc, 40, v0
				; GCN-NEXT: v_add_i32_e32 v27, vcc, 36, v0
				; GCN-NEXT: buffer_store_dword v21, v30, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v21, vcc, 32, v0
				; GCN-NEXT: v_add_i32_e32 v30, vcc, 28, v0
				; GCN-NEXT: buffer_store_dword v20, v26, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v20, vcc, 24, v0
				; GCN-NEXT: v_add_i32_e32 v26, vcc, 20, v0
				; GCN-NEXT: buffer_store_dword v19, v29, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v19, vcc, 16, v0
				; GCN-NEXT: v_add_i32_e32 v29, vcc, 12, v0
				; GCN-NEXT: buffer_store_dword v18, v25, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: v_add_i32_e32 v18, vcc, 8, v0
				; GCN-NEXT: v_add_i32_e32 v25, vcc, 4, v0
				; GCN-NEXT: v_add_i32_e32 v0, vcc, 0x80, v0
				; GCN-NEXT: buffer_store_dword v17, v31, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v16, v2, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v15, v24, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v14, v23, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v13, v28, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v12, v22, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v11, v27, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v10, v21, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v9, v30, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v8, v20, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v7, v26, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v6, v19, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v5, v29, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v4, v18, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v3, v25, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-LABEL: test_overflow_stack:
				; GFX7: ; %bb.0:
				; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen
				; GFX7-NEXT: buffer_load_dword v2, off, s[0:3], s32 offset:8
				; GFX7-NEXT: v_add_i32_e32 v31, vcc, 0x7c, v0
				; GFX7-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_dword v2, v31, s[0:3], 0 offen
				; GFX7-NEXT: buffer_load_dword v2, off, s[0:3], s32 offset:4
				; GFX7-NEXT: v_add_i32_e32 v31, vcc, 0x78, v0
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_dword v2, v31, s[0:3], 0 offen
				; GFX7-NEXT: buffer_load_dword v2, off, s[0:3], s32
				; GFX7-NEXT: v_add_i32_e32 v31, vcc, 0x74, v0
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: buffer_store_dword v2, v31, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x70, v0
				arsenm: Use poison instead of undef in tests
				; GFX7-NEXT: buffer_store_dword v30, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x6c, v0
				; GFX7-NEXT: buffer_store_dword v29, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x68, v0
				; GFX7-NEXT: buffer_store_dword v28, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x64, v0
				; GFX7-NEXT: buffer_store_dword v27, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x60, v0
				; GFX7-NEXT: buffer_store_dword v26, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x5c, v0
				; GFX7-NEXT: buffer_store_dword v25, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x58, v0
				; GFX7-NEXT: buffer_store_dword v24, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x54, v0
				; GFX7-NEXT: buffer_store_dword v23, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x50, v0
				; GFX7-NEXT: buffer_store_dword v22, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x4c, v0
				; GFX7-NEXT: buffer_store_dword v21, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x48, v0
				; GFX7-NEXT: buffer_store_dword v20, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 0x44, v0
				; GFX7-NEXT: buffer_store_dword v19, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 64, v0
				; GFX7-NEXT: buffer_store_dword v18, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 60, v0
				; GFX7-NEXT: buffer_store_dword v17, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 56, v0
				; GFX7-NEXT: buffer_store_dword v16, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 52, v0
				; GFX7-NEXT: buffer_store_dword v15, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 48, v0
				; GFX7-NEXT: buffer_store_dword v14, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 44, v0
				; GFX7-NEXT: buffer_store_dword v13, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 40, v0
				; GFX7-NEXT: buffer_store_dword v12, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 36, v0
				; GFX7-NEXT: buffer_store_dword v11, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 32, v0
				; GFX7-NEXT: buffer_store_dword v10, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 28, v0
				; GFX7-NEXT: buffer_store_dword v9, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 24, v0
				; GFX7-NEXT: buffer_store_dword v8, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 20, v0
				; GFX7-NEXT: buffer_store_dword v7, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 16, v0
				; GFX7-NEXT: buffer_store_dword v6, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 12, v0
				; GFX7-NEXT: buffer_store_dword v5, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 8, v0
				; GFX7-NEXT: buffer_store_dword v4, v2, s[0:3], 0 offen
				; GFX7-NEXT: v_add_i32_e32 v2, vcc, 4, v0
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, 0x80, v0
				; GFX7-NEXT: buffer_store_dword v3, v2, s[0:3], 0 offen
				; GFX7-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX8-LABEL: test_overflow_stack:
				; GFX8: ; %bb.0:
				; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX8-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen
				; GFX8-NEXT: buffer_load_dword v2, off, s[0:3], s32 offset:8
				; GFX8-NEXT: v_add_u32_e32 v31, vcc, 0x7c, v0
				; GFX8-NEXT: v_lshrrev_b32_e32 v1, 16, v1
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_store_dword v2, v31, s[0:3], 0 offen
				; GFX8-NEXT: buffer_load_dword v2, off, s[0:3], s32 offset:4
				; GFX8-NEXT: v_add_u32_e32 v31, vcc, 0x78, v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_store_dword v2, v31, s[0:3], 0 offen
				; GFX8-NEXT: buffer_load_dword v2, off, s[0:3], s32
				; GFX8-NEXT: v_add_u32_e32 v31, vcc, 0x74, v0
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: buffer_store_dword v2, v31, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x70, v0
				; GFX8-NEXT: buffer_store_dword v30, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x6c, v0
				; GFX8-NEXT: buffer_store_dword v29, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x68, v0
				; GFX8-NEXT: buffer_store_dword v28, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x64, v0
				; GFX8-NEXT: buffer_store_dword v27, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x60, v0
				; GFX8-NEXT: buffer_store_dword v26, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x5c, v0
				; GFX8-NEXT: buffer_store_dword v25, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x58, v0
				; GFX8-NEXT: buffer_store_dword v24, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x54, v0
				; GFX8-NEXT: buffer_store_dword v23, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x50, v0
				; GFX8-NEXT: buffer_store_dword v22, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x4c, v0
				; GFX8-NEXT: buffer_store_dword v21, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x48, v0
				; GFX8-NEXT: buffer_store_dword v20, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 0x44, v0
				; GFX8-NEXT: buffer_store_dword v19, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 64, v0
				; GFX8-NEXT: buffer_store_dword v18, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 60, v0
				; GFX8-NEXT: buffer_store_dword v17, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 56, v0
				; GFX8-NEXT: buffer_store_dword v16, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 52, v0
				; GFX8-NEXT: buffer_store_dword v15, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 48, v0
				; GFX8-NEXT: buffer_store_dword v14, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 44, v0
				; GFX8-NEXT: buffer_store_dword v13, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 40, v0
				; GFX8-NEXT: buffer_store_dword v12, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 36, v0
				; GFX8-NEXT: buffer_store_dword v11, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 32, v0
				; GFX8-NEXT: buffer_store_dword v10, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 28, v0
				; GFX8-NEXT: buffer_store_dword v9, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 24, v0
				; GFX8-NEXT: buffer_store_dword v8, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 20, v0
				; GFX8-NEXT: buffer_store_dword v7, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 16, v0
				; GFX8-NEXT: buffer_store_dword v6, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 12, v0
				; GFX8-NEXT: buffer_store_dword v5, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 8, v0
				; GFX8-NEXT: buffer_store_dword v4, v2, s[0:3], 0 offen
				; GFX8-NEXT: v_add_u32_e32 v2, vcc, 4, v0
				; GFX8-NEXT: v_add_u32_e32 v0, vcc, 0x80, v0
				; GFX8-NEXT: buffer_store_dword v3, v2, s[0:3], 0 offen
				; GFX8-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen
				; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: test_overflow_stack:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: buffer_store_dword v30, v0, s[0:3], 0 offen offset:112
				; GFX9-NEXT: buffer_store_dword v29, v0, s[0:3], 0 offen offset:108
				; GFX9-NEXT: buffer_store_dword v28, v0, s[0:3], 0 offen offset:104
				; GFX9-NEXT: buffer_store_dword v27, v0, s[0:3], 0 offen offset:100
				; GFX9-NEXT: buffer_store_dword v26, v0, s[0:3], 0 offen offset:96
				; GFX9-NEXT: buffer_load_dword v26, off, s[0:3], s32 offset:4
				; GFX9-NEXT: s_nop 0
				; GFX9-NEXT: buffer_load_dword v27, off, s[0:3], s32 offset:8
				; GFX9-NEXT: s_nop 0
				; GFX9-NEXT: buffer_store_dword v25, v0, s[0:3], 0 offen offset:92
				; GFX9-NEXT: buffer_load_dword v25, off, s[0:3], s32
				; GFX9-NEXT: s_nop 0
				; GFX9-NEXT: buffer_store_dword v24, v0, s[0:3], 0 offen offset:88
				; GFX9-NEXT: buffer_store_dword v23, v0, s[0:3], 0 offen offset:84
				; GFX9-NEXT: buffer_store_dword v22, v0, s[0:3], 0 offen offset:80
				; GFX9-NEXT: buffer_store_dword v21, v0, s[0:3], 0 offen offset:76
				; GFX9-NEXT: buffer_store_dword v20, v0, s[0:3], 0 offen offset:72
				; GFX9-NEXT: buffer_store_dword v19, v0, s[0:3], 0 offen offset:68
				; GFX9-NEXT: buffer_store_dword v18, v0, s[0:3], 0 offen offset:64
				; GFX9-NEXT: buffer_store_dword v17, v0, s[0:3], 0 offen offset:60
				; GFX9-NEXT: buffer_store_dword v16, v0, s[0:3], 0 offen offset:56
				; GFX9-NEXT: buffer_store_dword v15, v0, s[0:3], 0 offen offset:52
				; GFX9-NEXT: buffer_store_dword v14, v0, s[0:3], 0 offen offset:48
				; GFX9-NEXT: buffer_store_dword v13, v0, s[0:3], 0 offen offset:44
				; GFX9-NEXT: buffer_store_dword v12, v0, s[0:3], 0 offen offset:40
				; GFX9-NEXT: buffer_store_dword v11, v0, s[0:3], 0 offen offset:36
				; GFX9-NEXT: buffer_store_dword v10, v0, s[0:3], 0 offen offset:32
				; GFX9-NEXT: buffer_store_dword v9, v0, s[0:3], 0 offen offset:28
				; GFX9-NEXT: buffer_store_dword v8, v0, s[0:3], 0 offen offset:24
				; GFX9-NEXT: buffer_store_dword v7, v0, s[0:3], 0 offen offset:20
				; GFX9-NEXT: buffer_store_dword v6, v0, s[0:3], 0 offen offset:16
				; GFX9-NEXT: buffer_store_dword v5, v0, s[0:3], 0 offen offset:12
				; GFX9-NEXT: buffer_store_dword v4, v0, s[0:3], 0 offen offset:8
				; GFX9-NEXT: buffer_store_dword v3, v0, s[0:3], 0 offen offset:4
				; GFX9-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(25)
				; GFX9-NEXT: buffer_store_dword v27, v0, s[0:3], 0 offen offset:124
				; GFX9-NEXT: buffer_store_dword v26, v0, s[0:3], 0 offen offset:120
				; GFX9-NEXT: s_waitcnt vmcnt(25)
				; GFX9-NEXT: buffer_store_dword v25, v0, s[0:3], 0 offen offset:116
				; GFX9-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen offset:128
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: test_overflow_stack:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_clause 0x2
				; GFX10-NEXT: buffer_load_dword v31, off, s[0:3], s32 offset:8
				; GFX10-NEXT: buffer_load_dword v32, off, s[0:3], s32 offset:4
				; GFX10-NEXT: buffer_load_dword v33, off, s[0:3], s32
				; GFX10-NEXT: buffer_store_dword v30, v0, s[0:3], 0 offen offset:112
				; GFX10-NEXT: buffer_store_dword v29, v0, s[0:3], 0 offen offset:108
				; GFX10-NEXT: buffer_store_dword v28, v0, s[0:3], 0 offen offset:104
				; GFX10-NEXT: buffer_store_dword v27, v0, s[0:3], 0 offen offset:100
				; GFX10-NEXT: buffer_store_dword v26, v0, s[0:3], 0 offen offset:96
				; GFX10-NEXT: buffer_store_dword v25, v0, s[0:3], 0 offen offset:92
				; GFX10-NEXT: buffer_store_dword v24, v0, s[0:3], 0 offen offset:88
				; GFX10-NEXT: buffer_store_dword v23, v0, s[0:3], 0 offen offset:84
				; GFX10-NEXT: buffer_store_dword v22, v0, s[0:3], 0 offen offset:80
				; GFX10-NEXT: buffer_store_dword v21, v0, s[0:3], 0 offen offset:76
				; GFX10-NEXT: buffer_store_dword v20, v0, s[0:3], 0 offen offset:72
				; GFX10-NEXT: buffer_store_dword v19, v0, s[0:3], 0 offen offset:68
				; GFX10-NEXT: buffer_store_dword v18, v0, s[0:3], 0 offen offset:64
				; GFX10-NEXT: buffer_store_dword v17, v0, s[0:3], 0 offen offset:60
				; GFX10-NEXT: buffer_store_dword v16, v0, s[0:3], 0 offen offset:56
				; GFX10-NEXT: buffer_store_dword v15, v0, s[0:3], 0 offen offset:52
				; GFX10-NEXT: buffer_store_dword v14, v0, s[0:3], 0 offen offset:48
				; GFX10-NEXT: buffer_store_dword v13, v0, s[0:3], 0 offen offset:44
				; GFX10-NEXT: buffer_store_dword v12, v0, s[0:3], 0 offen offset:40
				; GFX10-NEXT: buffer_store_dword v11, v0, s[0:3], 0 offen offset:36
				; GFX10-NEXT: buffer_store_dword v10, v0, s[0:3], 0 offen offset:32
				; GFX10-NEXT: buffer_store_dword v9, v0, s[0:3], 0 offen offset:28
				; GFX10-NEXT: buffer_store_dword v8, v0, s[0:3], 0 offen offset:24
				; GFX10-NEXT: buffer_store_dword v7, v0, s[0:3], 0 offen offset:20
				; GFX10-NEXT: buffer_store_dword v6, v0, s[0:3], 0 offen offset:16
				; GFX10-NEXT: buffer_store_dword v5, v0, s[0:3], 0 offen offset:12
				; GFX10-NEXT: buffer_store_dword v4, v0, s[0:3], 0 offen offset:8
				; GFX10-NEXT: buffer_store_dword v3, v0, s[0:3], 0 offen offset:4
				; GFX10-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen
				; GFX10-NEXT: s_waitcnt vmcnt(2)
				; GFX10-NEXT: buffer_store_dword v31, v0, s[0:3], 0 offen offset:124
				; GFX10-NEXT: s_waitcnt vmcnt(1)
				; GFX10-NEXT: buffer_store_dword v32, v0, s[0:3], 0 offen offset:120
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: buffer_store_dword v33, v0, s[0:3], 0 offen offset:116
				; GFX10-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen offset:128
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%ins.0 = insertvalue { <32 x i32>, bfloat } poison, <32 x i32> %b, 0
				%ins.1 = insertvalue { <32 x i32>, bfloat } %ins.0, bfloat %a, 1
				ret { <32 x i32>, bfloat } %ins.1
				}