This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Propose to redefine image load/store intrinsics
AbandonedPublic

Authored by cfang on Aug 8 2016, 4:56 PM.

Details

Reviewers

tstellarAMD
arsenm

Summary

This is a proposal to sync the definition of the image load/store intrinsics with that of the samplers: https://reviews.llvm.org/D22838.

Define the vdata type to be llvm_anyfloat_ty, the address type to be llvm_anyfloat_ty, and the rsrc type to be llvm_anyint_ty. As a result, we expect the intrinsic name to have three suffixes, one to overload each of these three types.

D128 as well as two other flags are implied in the three types. For example, if you use v8i32 as rsrc type, then r128 is true!

Don't expose the TFE flag or the unorm flag (set to 1); the other flags are exposed in instruction order: unorm, glc, slc, lwe and da.

The LIT tests are not fully updated yet!

Diff Detail

Event Timeline

cfang updated this revision to Diff 67249. Aug 8 2016, 4:56 PM

cfang retitled this revision from to AMDGPU/SI: Propose to redefine image load/store intrinsics.

cfang updated this object.

cfang added reviewers: arsenm, tstellarAMD.

cfang added subscribers: arsenm, llvm-commits.

Herald added a subscriber: kzhuravl. Aug 8 2016, 4:56 PM

Why wouldn't unorm be exposed?

include/llvm/IR/IntrinsicsAMDGPU.td
255	da should not be exposed since it is the mangling of the coordinate type

tstellarAMD added inline comments. Aug 8 2016, 8:20 PM

include/llvm/IR/IntrinsicsAMDGPU.td
255	As far as I know da doesn't impact the coordinate type. Mesa doesn't seem to change anything about the coordinates when setting them.

Update the LIT tests based on the new intrinsic definitions for image load and image store:

Each image intrinsic name has three suffixes, for the types of vdata, address and resource, respectively. Something like:

declare void @llvm.amdgcn.image.store.v4f32.v2i32.v8i32(<4 x float>, <2 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #0
declare <4 x float> @llvm.amdgcn.image.load.v4f32.v4i32.v8i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
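As a rough illustration of the naming scheme described above, the following Python sketch assembles the three-suffix name from the vdata, address and resource types. The helpers `type_suffix` and `image_intrinsic_name` are hypothetical and not part of LLVM. The suffix spelling simply mirrors the declarations shown in this review.

```python
def type_suffix(elem: str, count: int = 1) -> str:
    """Return an overload suffix such as 'v4f32' or 'i32' (hypothetical helper)."""
    if count < 1:
        raise ValueError("element count must be positive")
    # Scalars carry no 'vN' prefix; vectors are spelled v<count><elem>.
    return elem if count == 1 else f"v{count}{elem}"


def image_intrinsic_name(op: str, vdata: str, vaddr: str, rsrc: str) -> str:
    """Join the base name with the vdata, address and resource suffixes, in that order."""
    return f"llvm.amdgcn.image.{op}.{vdata}.{vaddr}.{rsrc}"


name = image_intrinsic_name(
    "load",
    type_suffix("f32", 4),  # vdata: <4 x float>
    type_suffix("i32", 2),  # address: <2 x i32>
    type_suffix("i32", 8),  # resource: <8 x i32>
)
print(name)
```

A scalar address drops the vector prefix, so `type_suffix("i32")` yields `i32`, matching names such as `llvm.amdgcn.image.load.v4f32.i32.v8i32` in the updated tests.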

The four flags exposed are in the following order: glc, slc, lwe and da

unorm is not exposed and defaults to 1
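To make the flag handling concrete, here is a small Python model of how the exposed arguments might map onto the instruction's immediate operands. It follows the operand order of the updated pattern in SIInstructions.td in this revision. The function `lower_image_flags` is hypothetical, and the names given to the two constant-zero slots (`r128`, `tfe`) are an assumption inferred from the old pattern and the summary.

```python
def lower_image_flags(dmask: int, glc: bool, slc: bool, lwe: bool, da: bool) -> dict:
    """Map the exposed intrinsic arguments onto the instruction's immediate operands."""
    if not 0 <= dmask <= 0xF:
        raise ValueError("dmask is expected to be a 4-bit mask")
    # Python dicts preserve insertion order, so this also records operand order.
    return {
        "dmask": dmask,
        "unorm": 1,        # not exposed; fixed to 1
        "glc": int(glc),
        "slc": int(slc),
        "r128": 0,         # assumed slot name; fixed to 0 in the new pattern
        "tfe": 0,          # assumed slot name; not exposed
        "lwe": int(lwe),
        "da": int(da),
    }


operands = lower_image_flags(dmask=15, glc=False, slc=False, lwe=True, da=False)
print(list(operands.items()))
```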

I don't know about making vdata (input & output) overloaded. If you do, please at least add a machine instruction verifier that checks the type against dmask.

include/llvm/IR/IntrinsicsAMDGPU.td
255	da does affect how coordinates are interpreted by the TA block, but not in a way that LLVM could extract from the coordinate type, since LLVM cannot know everything it needs to know to do so (it doesn't know the image dimensions).

"unorm" and "da" must be exposed as parameters. They don't change the type, but they change the behavior of the TA hardware block. In all cases, the type is always floating-point.

"r128" doesn't have to be exposed and it's kinda useless. We don't have any use case for it and I think the next-gen hardware (after Polaris) doesn't have it either.

> if you use v8i32 as rsrc type, then r128 is true!

Also, the quoted sentence is nonsense. In 100% of our cases, r128 must be 0.

In D23286#511221, @mareko wrote:

> > if you use v8i32 as rsrc type, then r128 is true!
>
> Also, the quoted sentence is nonsense. In 100% of our cases, r128 must be 0.

Yes, the quoted comment is wrong. Do you think I can drop r128 bit support,
i.e. set the resource type to v8i32 and the r128 bit to 0 all the time?

I heard GL somehow uses the R128 bit.

In D23286#511210, @mareko wrote:

> "unorm" and "da" must be exposed as parameters. They don't change the type, but they change the behavior of the TA hardware block. In all cases, the type is always floating-point.
>
> "r128" doesn't have to be exposed and it's kinda useless. We don't have any use case for it and I think the next-gen hardware (after Polaris) doesn't have it either.

This patch also considers image load and image store. For image store, the unorm bit must be 1. I haven't seen any restriction regarding image load.
Are you sure the coordinate type is always floating-point? I know that is the case for image_sample, but I am not sure about image load and image store.

In D23286#511418, @cfang wrote:

> In D23286#511221, @mareko wrote:
>
> > > if you use v8i32 as rsrc type, then r128 is true!
> >
> > Also, the quoted sentence is nonsense. In 100% of our cases, r128 must be 0.
>
> Yes, the quoted comment is wrong. Do you think I can drop r128 bit support,
> i.e. set the resource type to v8i32 and the r128 bit to 0 all the time?
>
> I heard GL somehow uses the R128 bit.

Mesa GL doesn't use R128. If the closed GL does, I don't know about it.

In D23286#511442, @cfang wrote:

> In D23286#511210, @mareko wrote:
>
> > "unorm" and "da" must be exposed as parameters. They don't change the type, but they change the behavior of the TA hardware block. In all cases, the type is always floating-point.
> >
> > "r128" doesn't have to be exposed and it's kinda useless. We don't have any use case for it and I think the next-gen hardware (after Polaris) doesn't have it either.
>
> This patch also considers image load and image store. For image store, the unorm bit must be 1. I haven't seen any restriction regarding image load.
> Are you sure the coordinate type is always floating-point? I know that is the case for image_sample, but I am not sure about image load and image store.

Image loads/stores are a different category. There are several categories, but you can split all image opcodes into two:

takes a floating-point address (all sample and gather opcodes) and also a sampler descriptor; the "unorm" and "da" bits apply here
takes an integer address (all load, store and atomic opcodes); the "unorm" bit is irrelevant (or must be 1 in some cases) and only the "da" bit applies
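
The two-way split described above can be written down as a small lookup. This is only a sketch in Python: the `ImageOpcodeCategory` type and the `categorize` function are hypothetical names used for illustration, and the opcode kinds are the ones listed in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageOpcodeCategory:
    address_type: str        # "float" or "integer"
    needs_sampler: bool      # whether a sampler descriptor is also passed
    flags_that_apply: tuple  # addressing-related bits that matter


# Sample and gather opcodes: floating-point address plus a sampler descriptor.
SAMPLE_LIKE = ImageOpcodeCategory("float", True, ("unorm", "da"))
# Load, store and atomic opcodes: integer address, only "da" applies.
LOAD_STORE_LIKE = ImageOpcodeCategory("integer", False, ("da",))


def categorize(kind: str) -> ImageOpcodeCategory:
    """Return the category for an opcode kind such as 'sample' or 'store'."""
    if kind in ("sample", "gather"):
        return SAMPLE_LIKE
    if kind in ("load", "store", "atomic"):
        return LOAD_STORE_LIKE
    raise ValueError(f"unknown image opcode kind: {kind}")


print(categorize("store").flags_that_apply)
```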

> Mesa GL doesn't use R128. If the closed GL does, I don't know about it.

I just discussed this with other compiler engineers. The compiler should expose existing hardware features whether or not they are currently used.

> Image loads/stores are a different category. There are several categories, but you can split all image opcodes into two:
>
> takes a floating-point address (all sample and gather opcodes) and also a sampler descriptor; the "unorm" and "da" bits apply here
>
> takes an integer address (all load, store and atomic opcodes); the "unorm" bit is irrelevant (or must be 1 in some cases) and only the "da" bit applies

Thanks, this is what we are doing currently.
This patch handles image load/store: unorm == 1, and da is exposed.

Another patch (https://reviews.llvm.org/D22838) handles the sampler-related image intrinsics, which expose unorm and da.

In D23286#511206, @nhaehnle wrote:

> I don't know about making vdata (input & output) overloaded. If you do, please at least add a machine instruction verifier that checks the type against dmask.

The data could be of type float or half, so we need to overload the vdata type. We also need to set the D16 bit for vectors of half type.

Could you outline what I should do in order to "add a machine instruction verifier that checks the type against dmask"? Thanks.
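
One possible reading of the requested check, sketched in Python rather than as backend C++: count the bits enabled in dmask and compare that with the number of vdata elements. The function names are hypothetical. The exact-match rule is an assumption, not something settled in this thread, and the handling of half data with D16 is left out because the thread does not specify it.

```python
def dmask_component_count(dmask: int) -> int:
    """Number of components enabled by a 4-bit dmask."""
    if not 0 <= dmask <= 0xF:
        raise ValueError("dmask is expected to be a 4-bit mask")
    return bin(dmask).count("1")


def vdata_matches_dmask(num_elements: int, dmask: int) -> bool:
    """Assumed rule: one vdata element per enabled dmask bit."""
    return num_elements == dmask_component_count(dmask)


# <4 x float> with dmask 0xf enables all four components.
print(vdata_matches_dmask(4, 0xF))
# <4 x float> with dmask 0x1 would be flagged under the assumed rule.
print(vdata_matches_dmask(4, 0x1))
```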

Hello, All:

Do you agree with the proposed image load/store intrinsic definitions? If yes, what would you suggest for existing applications that use the "old" (existing) image load/store intrinsics? Thanks so much; we need this change urgently!

The changes look good to me. Existing users (Mesa) will need to be fixed manually, but we can take care of that.

Was this already committed?

Withdrawing this revision because the same patch has been committed to trunk.

Revision Contents

Path                                            Size
include/llvm/IR/IntrinsicsAMDGPU.td             21 lines
lib/Target/AMDGPU/SIInstructions.td             41 lines
test/CodeGen/AMDGPU/llvm.amdgcn.image.ll        58 lines
test/CodeGen/AMDGPU/llvm.amdgcn.s.waitcnt.ll    12 lines
test/CodeGen/AMDGPU/wqm.ll                      10 lines

Diff 67415

include/llvm/IR/IntrinsicsAMDGPU.td

Context not available.
 >;

 class AMDGPUImageLoad : Intrinsic <
-  [llvm_v4f32_ty], // vdata(VGPR)
+  [llvm_anyfloat_ty], // vdata(VGPR)
   [llvm_anyint_ty, // vaddr(VGPR)
-   llvm_v8i32_ty, // rsrc(SGPR)
+   llvm_anyint_ty, // rsrc(SGPR)
    llvm_i32_ty, // dmask(imm)
-   llvm_i1_ty, // r128(imm)
-   llvm_i1_ty, // da(imm)
    llvm_i1_ty, // glc(imm)
-   llvm_i1_ty], // slc(imm)
+   llvm_i1_ty, // slc(imm)
+   llvm_i1_ty, // lwe(imm)
+   llvm_i1_ty], // da(imm)
   [IntrReadMem]>;

 def int_amdgcn_image_load : AMDGPUImageLoad;
 def int_amdgcn_image_load_mip : AMDGPUImageLoad;
+def int_amdgcn_image_getresinfo : AMDGPUImageLoad;

 class AMDGPUImageStore : Intrinsic <
   [],
-  [llvm_v4f32_ty, // vdata(VGPR)
+  [llvm_anyfloat_ty, // vdata(VGPR)
    llvm_anyint_ty, // vaddr(VGPR)
-   llvm_v8i32_ty, // rsrc(SGPR)
+   llvm_anyint_ty, // rsrc(SGPR)
    llvm_i32_ty, // dmask(imm)
-   llvm_i1_ty, // r128(imm)
-   llvm_i1_ty, // da(imm)
    llvm_i1_ty, // glc(imm)
-   llvm_i1_ty], // slc(imm)
+   llvm_i1_ty, // slc(imm)
+   llvm_i1_ty, // lwe(imm)
+   llvm_i1_ty], // da(imm)
    arsenm: da should not be exposed since it is the mangling of the coordinate type
    tstellarAMD: As far as I know da doesn't impact the coordinate type. Mesa doesn't seem to change anything about the coordinates when setting them.
    nhaehnle: da does affect how coordinates are interpreted by the TA block, but not in a way that LLVM could extract from the coordinate type, since LLVM cannot know everything it needs to know to do so (it doesn't know the image dimensions).
   []>;

 def int_amdgcn_image_store : AMDGPUImageStore;
Context not available.

lib/Target/AMDGPU/SIInstructions.td

Context not available.
 def : ImagePattern<name, !cast<MIMG>(opcode # _V4_V4), v4i32>;
 }

-class ImageLoadPattern<SDPatternOperator name, MIMG opcode, ValueType vt> : Pat <
-  (name vt:$addr, v8i32:$rsrc, imm:$dmask, imm:$r128, imm:$da, imm:$glc,
-   imm:$slc),
-  (opcode $addr, $rsrc,
+multiclass ImageLoadPattern<SDPatternOperator name, MIMG opcode, ValueType vt> {
+  def : Pat <
+    (v4f32 (name vt:$addr, v8i32:$rsrc, i32:$dmask, i1:$glc, i1:$slc, i1:$lwe,
+     i1:$da)),
+    (opcode $addr, $rsrc,
     (as_i32imm $dmask), 1, (as_i1imm $glc), (as_i1imm $slc),
-    (as_i1imm $r128), 0, 0, (as_i1imm $da))
+    0, 0, (as_i1imm $lwe), (as_i1imm $da))
   >;
+}

 multiclass ImageLoadPatterns<SDPatternOperator name, string opcode> {
-  def : ImageLoadPattern<name, !cast<MIMG>(opcode # _V4_V1), i32>;
-  def : ImageLoadPattern<name, !cast<MIMG>(opcode # _V4_V2), v2i32>;
-  def : ImageLoadPattern<name, !cast<MIMG>(opcode # _V4_V4), v4i32>;
+  defm : ImageLoadPattern<name, !cast<MIMG>(opcode # _V4_V1), i32>;
+  defm : ImageLoadPattern<name, !cast<MIMG>(opcode # _V4_V2), v2i32>;
+  defm : ImageLoadPattern<name, !cast<MIMG>(opcode # _V4_V4), v4i32>;
 }

-class ImageStorePattern<SDPatternOperator name, MIMG opcode, ValueType vt> : Pat <
-  (name v4f32:$data, vt:$addr, v8i32:$rsrc, i32:$dmask, imm:$r128, imm:$da,
-   imm:$glc, imm:$slc),
-  (opcode $data, $addr, $rsrc,
+multiclass ImageStorePattern<SDPatternOperator name, MIMG opcode, ValueType vt> {
+  def : Pat <
+    (name v4f32:$data, vt:$addr, v8i32:$rsrc, i32:$dmask, i1:$glc, i1:$slc,
+     i1:$lwe, i1:$da),
+    (opcode $data, $addr, $rsrc,
     (as_i32imm $dmask), 1, (as_i1imm $glc), (as_i1imm $slc),
-    (as_i1imm $r128), 0, 0, (as_i1imm $da))
+    0, 0, (as_i1imm $lwe), (as_i1imm $da))
   >;
+}

 multiclass ImageStorePatterns<SDPatternOperator name, string opcode> {
-  def : ImageStorePattern<name, !cast<MIMG>(opcode # _V4_V1), i32>;
-  def : ImageStorePattern<name, !cast<MIMG>(opcode # _V4_V2), v2i32>;
-  def : ImageStorePattern<name, !cast<MIMG>(opcode # _V4_V4), v4i32>;
+  defm : ImageStorePattern<name, !cast<MIMG>(opcode # _V4_V1), i32>;
+  defm : ImageStorePattern<name, !cast<MIMG>(opcode # _V4_V2), v2i32>;
+  defm : ImageStorePattern<name, !cast<MIMG>(opcode # _V4_V4), v4i32>;
 }

 class ImageAtomicPattern<SDPatternOperator name, MIMG opcode, ValueType vt> : Pat <
Context not available.
 defm : ImagePatterns<int_SI_image_load_mip, "IMAGE_LOAD_MIP">;
 defm : ImageLoadPatterns<int_amdgcn_image_load, "IMAGE_LOAD">;
 defm : ImageLoadPatterns<int_amdgcn_image_load_mip, "IMAGE_LOAD_MIP">;
+defm : ImageLoadPattern<int_amdgcn_image_getresinfo, IMAGE_GET_RESINFO_V4_V1, i32>;
 defm : ImageStorePatterns<int_amdgcn_image_store, "IMAGE_STORE">;
 defm : ImageStorePatterns<int_amdgcn_image_store_mip, "IMAGE_STORE_MIP">;
 defm : ImageAtomicPatterns<int_amdgcn_image_atomic_swap, "IMAGE_ATOMIC_SWAP">;
Context not available.

test/CodeGen/AMDGPU/llvm.amdgcn.image.ll

Context not available.
 ;CHECK: s_waitcnt vmcnt(0)
 define amdgpu_ps <4 x float> @image_load_v4i32(<8 x i32> inreg %rsrc, <4 x i32> %c) {
 main_body:
-  %tex = call <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %tex = call <4 x float> @llvm.amdgcn.image.load.v4f32.v4i32.v8i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret <4 x float> %tex
 }

Context not available.
 ;CHECK: s_waitcnt vmcnt(0)
 define amdgpu_ps <4 x float> @image_load_v2i32(<8 x i32> inreg %rsrc, <2 x i32> %c) {
 main_body:
-  %tex = call <4 x float> @llvm.amdgcn.image.load.v2i32(<2 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %tex = call <4 x float> @llvm.amdgcn.image.load.v4f32.v2i32.v8i32(<2 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret <4 x float> %tex
 }

Context not available.
 ;CHECK: s_waitcnt vmcnt(0)
 define amdgpu_ps <4 x float> @image_load_i32(<8 x i32> inreg %rsrc, i32 %c) {
 main_body:
-  %tex = call <4 x float> @llvm.amdgcn.image.load.i32(i32 %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %tex = call <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32 %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret <4 x float> %tex
 }

Context not available.
 ;CHECK: s_waitcnt vmcnt(0)
 define amdgpu_ps <4 x float> @image_load_mip(<8 x i32> inreg %rsrc, <4 x i32> %c) {
 main_body:
-  %tex = call <4 x float> @llvm.amdgcn.image.load.mip.v4i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %tex = call <4 x float> @llvm.amdgcn.image.load.mip.v4f32.v4i32.v8i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret <4 x float> %tex
 }

Context not available.
 ;CHECK: s_waitcnt vmcnt(0)
 define amdgpu_ps float @image_load_1(<8 x i32> inreg %rsrc, <4 x i32> %c) {
 main_body:
-  %tex = call <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %tex = call <4 x float> @llvm.amdgcn.image.load.v4f32.v4i32.v8i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   %elt = extractelement <4 x float> %tex, i32 0
   ; Only first component used, test that dmask etc. is changed accordingly
   ret float %elt
Context not available.
 ;CHECK: image_store v[0:3], v[4:7], s[0:7] dmask:0xf unorm
 define amdgpu_ps void @image_store_v4i32(<8 x i32> inreg %rsrc, <4 x float> %data, <4 x i32> %coords) {
 main_body:
-  call void @llvm.amdgcn.image.store.v4i32(<4 x float> %data, <4 x i32> %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.v4i32.v8i32(<4 x float> %data, <4 x i32> %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret void
 }

Context not available.
 ;CHECK: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm
 define amdgpu_ps void @image_store_v2i32(<8 x i32> inreg %rsrc, <4 x float> %data, <2 x i32> %coords) {
 main_body:
-  call void @llvm.amdgcn.image.store.v2i32(<4 x float> %data, <2 x i32> %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.v2i32.v8i32(<4 x float> %data, <2 x i32> %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret void
 }

Context not available.
 ;CHECK: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
 define amdgpu_ps void @image_store_i32(<8 x i32> inreg %rsrc, <4 x float> %data, i32 %coords) {
 main_body:
-  call void @llvm.amdgcn.image.store.i32(<4 x float> %data, i32 %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %data, i32 %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret void
 }

Context not available.
 ;CHECK: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm
 define amdgpu_ps void @image_store_mip(<8 x i32> inreg %rsrc, <4 x float> %data, <4 x i32> %coords) {
 main_body:
-  call void @llvm.amdgcn.image.store.mip.v4i32(<4 x float> %data, <4 x i32> %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.mip.v4f32.v4i32.v8i32(<4 x float> %data, <4 x i32> %coords, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret void
 }

+;CHECK-LABEL: {{^}}getresinfo:
+;CHECK: image_get_resinfo {{v\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}} dmask:0xf
+define amdgpu_ps void @getresinfo() {
+main_body:
+  %r = call <4 x float> @llvm.amdgcn.image.getresinfo.v4f32.i32.v8i32(i32 undef, <8 x i32> undef, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %r0 = extractelement <4 x float> %r, i32 0
+  %r1 = extractelement <4 x float> %r, i32 1
+  %r2 = extractelement <4 x float> %r, i32 2
+  %r3 = extractelement <4 x float> %r, i32 3
+  call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 1, float %r0, float %r1, float %r2, float %r3)
+  ret void
+}
+

 ; Ideally, the register allocator would avoid the wait here
 ;
 ;CHECK-LABEL: {{^}}image_store_wait:
Context not available.
 ;CHECK: image_store v[0:3], v4, s[16:23] dmask:0xf unorm
 define amdgpu_ps void @image_store_wait(<8 x i32> inreg, <8 x i32> inreg, <8 x i32> inreg, <4 x float>, i32) {
 main_body:
-  call void @llvm.amdgcn.image.store.i32(<4 x float> %3, i32 %4, <8 x i32> %0, i32 15, i1 0, i1 0, i1 0, i1 0)
-  %data = call <4 x float> @llvm.amdgcn.image.load.i32(i32 %4, <8 x i32> %1, i32 15, i1 0, i1 0, i1 0, i1 0)
-  call void @llvm.amdgcn.image.store.i32(<4 x float> %data, i32 %4, <8 x i32> %2, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %3, i32 %4, <8 x i32> %0, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %data = call <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32 %4, <8 x i32> %1, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %data, i32 %4, <8 x i32> %2, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret void
 }

-declare void @llvm.amdgcn.image.store.i32(<4 x float>, i32, <8 x i32>, i32, i1, i1, i1, i1) #0
-declare void @llvm.amdgcn.image.store.v2i32(<4 x float>, <2 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #0
-declare void @llvm.amdgcn.image.store.v4i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #0
-declare void @llvm.amdgcn.image.store.mip.v4i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #0
+declare void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float>, i32, <8 x i32>, i32, i1, i1, i1, i1) #0
+declare void @llvm.amdgcn.image.store.v4f32.v2i32.v8i32(<4 x float>, <2 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #0
+declare void @llvm.amdgcn.image.store.v4f32.v4i32.v8i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #0
+declare void @llvm.amdgcn.image.store.mip.v4f32.v4i32.v8i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #0

+declare <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #1
+declare <4 x float> @llvm.amdgcn.image.load.v4f32.v2i32.v8i32(<2 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
+declare <4 x float> @llvm.amdgcn.image.load.v4f32.v4i32.v8i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
+declare <4 x float> @llvm.amdgcn.image.load.mip.v4f32.v4i32.v8i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
+
+declare <4 x float> @llvm.amdgcn.image.getresinfo.v4f32.i32.v8i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #0
+
-declare <4 x float> @llvm.amdgcn.image.load.i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #1
-declare <4 x float> @llvm.amdgcn.image.load.v2i32(<2 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
-declare <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
-declare <4 x float> @llvm.amdgcn.image.load.mip.v4i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
+declare void @llvm.SI.export(i32, i32, i32, i32, i32, float, float, float, float)

 attributes #0 = { nounwind }
 attributes #1 = { nounwind readonly }
Context not available.

test/CodeGen/AMDGPU/llvm.amdgcn.s.waitcnt.ll

Context not available.
 ; CHECK-NEXT: image_store
 ; CHECK-NEXT: s_endpgm
 define amdgpu_ps void @test1(<8 x i32> inreg %rsrc, <4 x float> %d0, <4 x float> %d1, i32 %c0, i32 %c1) {
-  call void @llvm.amdgcn.image.store.i32(<4 x float> %d0, i32 %c0, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 1, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %d0, i32 %c0, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 1, i1 0)
   call void @llvm.amdgcn.s.waitcnt(i32 3840) ; 0xf00
-  call void @llvm.amdgcn.image.store.i32(<4 x float> %d1, i32 %c1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 1, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %d1, i32 %c1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 1, i1 0)
   ret void
 }

Context not available.
 ; CHECK: s_waitcnt
 ; CHECK-NEXT: image_store
 define amdgpu_ps void @test2(<8 x i32> inreg %rsrc, i32 %c) {
-  %t = call <4 x float> @llvm.amdgcn.image.load.i32(i32 %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %t = call <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32 %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   call void @llvm.amdgcn.s.waitcnt(i32 3840) ; 0xf00
   %c.1 = mul i32 %c, 2
-  call void @llvm.amdgcn.image.store.i32(<4 x float> %t, i32 %c.1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %t, i32 %c.1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret void
 }

 declare void @llvm.amdgcn.s.waitcnt(i32) #0

-declare <4 x float> @llvm.amdgcn.image.load.i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #1
-declare void @llvm.amdgcn.image.store.i32(<4 x float>, i32, <8 x i32>, i32, i1, i1, i1, i1) #0
+declare <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #1
+declare void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float>, i32, <8 x i32>, i32, i1, i1, i1, i1) #0

 attributes #0 = { nounwind }
 attributes #1 = { nounwind readonly }
Context not available.

test/CodeGen/AMDGPU/wqm.ll

Context not available.
 ;CHECK-NOT: s_wqm
 define amdgpu_ps <4 x float> @test1(<8 x i32> inreg %rsrc, <4 x i32> %c) {
 main_body:
-  %tex = call <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
-  call void @llvm.amdgcn.image.store.v4i32(<4 x float> %tex, <4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  %tex = call <4 x float> @llvm.amdgcn.image.load.v4f32.v4i32.v8i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.v4i32.v8i32(<4 x float> %tex, <4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
   ret <4 x float> %tex
 }

Context not available.
 ; CHECK: ; return
 define amdgpu_ps <4 x float> @test_loop_vcc(<4 x float> %in) nounwind {
 entry:
-  call void @llvm.amdgcn.image.store.v4i32(<4 x float> %in, <4 x i32> undef, <8 x i32> undef, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.image.store.v4f32.v4i32.v8i32(<4 x float> %in, <4 x i32> undef, <8 x i32> undef, i32 15, i1 0, i1 0, i1 0, i1 0)
   br label %loop

 loop:
Context not available.
 }


-declare void @llvm.amdgcn.image.store.v4i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
+declare void @llvm.amdgcn.image.store.v4f32.v4i32.v8i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
 declare void @llvm.amdgcn.buffer.store.f32(float, <4 x i32>, i32, i32, i1, i1) #1
 declare void @llvm.amdgcn.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i1, i1) #1

-declare <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #2
+declare <4 x float> @llvm.amdgcn.image.load.v4f32.v4i32.v8i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #2
 declare float @llvm.amdgcn.buffer.load.f32(<4 x i32>, i32, i32, i1, i1) #2

 declare <4 x float> @llvm.SI.image.sample.i32(i32, <8 x i32>, <4 x i32>, i32, i32, i32, i32, i32, i32, i32, i32) #3
Context not available.