This is an archive of the discontinued LLVM Phabricator instance.

Differential D49642

AMDGPU: Rework extract-lowbits test
Abandoned · Public

Authored by jvesely on Jul 21 2018, 9:35 PM.

Details

Reviewers

arsenm

Summary

This will make it easier to add r600

Diff Detail

Repository: rL LLVM

Event Timeline

jvesely created this revision. Jul 21 2018, 9:35 PM

Herald added subscribers: llvm-commits, t-tye, tpr and 5 others. Jul 21 2018, 9:35 PM

jvesely added a child revision: D49641: AMDGPU/R600: Add MOV instructions to BFE patterns. Jul 21 2018, 9:35 PM

I'd rather stop trying to share tests with r600 at all. I would like to split out most of the shared tests as-i

In D49642#1171132, @arsenm wrote:

I'd rather stop trying to share tests with r600 at all. I would like to split out most of the shared tests as-i

Any reason for that? Both BFE instructions use the same patterns, so it'd just be a copy-paste.

In D49642#1171179, @jvesely wrote:

Any reason for that? Both BFE instructions use the same patterns, so it'd just be a copy-paste.

A lot of tests have too many RUN lines as is, and adding more for r600 increases the mess. In this case you are actually changing the tested content: the original used VGPR inputs for everything, and this changes everything to SGPR inputs. Both would be useful as separate tests, but we don't try particularly hard to match scalar BFEs currently. Also, I want to stop artificially sharing some of the intrinsics.

In D49642#1172185, @arsenm wrote:

A lot of tests have too many RUN lines as is, and adding more for r600 increases the mess. In this case you are actually changing the tested content: the original used VGPR inputs for everything, and this changes everything to SGPR inputs. Both would be useful as separate tests, but we don't try particularly hard to match scalar BFEs currently. Also, I want to stop artificially sharing some of the intrinsics.

Fair enough. D49641 has been updated to include a copy of the file with EG/CM checks.
This sounds like chasing fool's gold, though. Most RUN lines are added for new GPU generations (gfx9, ...) and more for HSA, so two lines for Cayman and Cypress barely make a difference.
The bit-extract patterns should be recognized irrespective of whether they can be scalarized or not.

In D49642#1172989, @jvesely wrote:

Fair enough. D49641 has been updated to include a copy of the file with EG/CM checks.
This sounds like chasing fool's gold, though. Most RUN lines are added for new GPU generations (gfx9, ...) and more for HSA, so two lines for Cayman and Cypress barely make a difference.
The bit-extract patterns should be recognized irrespective of whether they can be scalarized or not.

Part of the problem is I often try to add tests to the relevant files, and then I hit some crash or other issue in r

In D49642#1172989, @jvesely wrote:

Fair enough. D49641 has been updated to include a copy of the file with EG/CM checks.
This sounds like chasing fool's gold, though. Most RUN lines are added for new GPU generations (gfx9, ...) and more for HSA, so two lines for Cayman and Cypress barely make a difference.
The bit-extract patterns should be recognized irrespective of whether they can be scalarized or not.

Another reason would be that I sometimes try to add new tests to the files where they logically go, and then hit some problem in r600, for example a failure to handle function return values, so I then have to split the tests in more arbitrary ways.

Revision Contents

Path: test/CodeGen/AMDGPU/extract-lowbits.ll
Size: 256 lines

Diff 156698

test/CodeGen/AMDGPU/extract-lowbits.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -march=amdgcn -mtriple=amdgcn-- -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s
-; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s
+; RUN: llc -march=amdgcn -mtriple=amdgcn-- -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=SI -check-prefix=AMDGPU %s
+; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VI -check-prefix=AMDGPU %s

 ; Loosely based on test/CodeGen/{X86,AArch64}/extract-lowbits.ll,
 ; but with all 64-bit tests, and tests with loads dropped.
Context not available.
 ; Pattern a. 32-bit
 ; ---------------------------------------------------------------------------- ;

-define i32 @bzhi32_a0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_a0:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_a0:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: v_mov_b32_e32 [[BITS:v[0-9]+]], s[[NUM]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_a0(i32 %val, i32 %numlowbits, i32 addrspace(1)* %out) {
 %onebit = shl i32 1, %numlowbits
 %mask = add nsw i32 %onebit, -1
 %masked = and i32 %mask, %val
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

-define i32 @bzhi32_a1_indexzext(i32 %val, i8 zeroext %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_a1_indexzext:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_a1_indexzext:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: s_and_b32 [[ZEXT:s[0-9]+]], s[[NUM]]
+; GCN: v_mov_b32_e32 [[BITS:v[0-9]+]], [[ZEXT]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_a1_indexzext(i32 %val, i8 zeroext %numlowbits, i32 addrspace(1)* %out) {
 %conv = zext i8 %numlowbits to i32
 %onebit = shl i32 1, %conv
 %mask = add nsw i32 %onebit, -1
 %masked = and i32 %mask, %val
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

-define i32 @bzhi32_a4_commutative(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_a4_commutative:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_a4_commutative:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: v_mov_b32_e32 [[BITS:v[0-9]+]], s[[NUM]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_a4_commutative(i32 %val, i32 %numlowbits, i32 addrspace(1)* %out) {
 %onebit = shl i32 1, %numlowbits
 %mask = add nsw i32 %onebit, -1
 %masked = and i32 %val, %mask ; swapped order
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

 ; ---------------------------------------------------------------------------- ;
 ; Pattern b. 32-bit
 ; ---------------------------------------------------------------------------- ;

-define i32 @bzhi32_b0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_b0:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_b0:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: v_mov_b32_e32 [[BITS:v[0-9]+]], s[[NUM]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_b0(i32 %val, i32 %numlowbits, i32 addrspace(1)* %out) {
 %notmask = shl i32 -1, %numlowbits
 %mask = xor i32 %notmask, -1
 %masked = and i32 %mask, %val
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

-define i32 @bzhi32_b1_indexzext(i32 %val, i8 zeroext %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_b1_indexzext:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_b1_indexzext:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: s_and_b32 [[ZEXT:s[0-9]+]], s[[NUM]]
+; GCN: v_mov_b32_e32 [[BITS:v[0-9]+]], [[ZEXT]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_b1_indexzext(i32 %val, i8 zeroext %numlowbits, i32 addrspace(1)* %out) {
 %conv = zext i8 %numlowbits to i32
 %notmask = shl i32 -1, %conv
 %mask = xor i32 %notmask, -1
 %masked = and i32 %mask, %val
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

-define i32 @bzhi32_b4_commutative(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_b4_commutative:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_b4_commutative:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: v_mov_b32_e32 [[BITS:v[0-9]+]], s[[NUM]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_b4_commutative(i32 %val, i32 %numlowbits, i32 addrspace(1)* %out) {
 %notmask = shl i32 -1, %numlowbits
 %mask = xor i32 %notmask, -1
 %masked = and i32 %val, %mask ; swapped order
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

 ; ---------------------------------------------------------------------------- ;
 ; Pattern c. 32-bit
 ; ---------------------------------------------------------------------------- ;

-define i32 @bzhi32_c0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_c0:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_c0:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: v_mov_b32_e32 [[BITS:v[0-9]+]], s[[NUM]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_c0(i32 %val, i32 %numlowbits, i32 addrspace(1)* %out) {
 %numhighbits = sub i32 32, %numlowbits
 %mask = lshr i32 -1, %numhighbits
 %masked = and i32 %mask, %val
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

-define i32 @bzhi32_c1_indexzext(i32 %val, i8 %numlowbits) nounwind {
-; SI-LABEL: bzhi32_c1_indexzext:
-; SI: ; %bb.0:
-; SI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; SI-NEXT: v_sub_i32_e32 v1, vcc, 32, v1
-; SI-NEXT: v_and_b32_e32 v1, 0xff, v1
-; SI-NEXT: v_lshr_b32_e32 v1, -1, v1
-; SI-NEXT: v_and_b32_e32 v0, v1, v0
-; SI-NEXT: s_setpc_b64 s[30:31]
-;
-; VI-LABEL: bzhi32_c1_indexzext:
-; VI: ; %bb.0:
-; VI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; VI-NEXT: v_sub_u16_e32 v1, 32, v1
-; VI-NEXT: v_mov_b32_e32 v2, -1
-; VI-NEXT: v_lshrrev_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT: v_and_b32_e32 v0, v1, v0
-; VI-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_c1_indexzext:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: s_sub_i32 [[SUB:s[0-9]+]], 32, s[[NUM]]
+; GCN-NEXT: s_and_b32 [[ZEXT:s[0-9]+]], [[SUB]], 0xff
+; GCN-NEXT: s_lshr_b32 [[MASK:s[0-9]+]], -1, [[ZEXT]]
+; GCN-NEXT: s_and_b32 [[SRES:s[0-9]+]], [[MASK]], s[[VAL]]
+; GCN: v_mov_b32_e32 [[RES:v[0-9]+]], [[SRES]]
+; SI: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_c1_indexzext(i32 %val, i8 %numlowbits, i32 addrspace(1)* %out) {
 %numhighbits = sub i8 32, %numlowbits
 %sh_prom = zext i8 %numhighbits to i32
 %mask = lshr i32 -1, %sh_prom
 %masked = and i32 %mask, %val
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

-define i32 @bzhi32_c4_commutative(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_c4_commutative:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_c4_commutative:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: v_mov_b32_e32 [[BITS:v[0-9]+]], s[[NUM]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_c4_commutative(i32 %val, i32 %numlowbits, i32 addrspace(1)* %out) {
 %numhighbits = sub i32 32, %numlowbits
 %mask = lshr i32 -1, %numhighbits
 %masked = and i32 %val, %mask ; swapped order
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

 ; ---------------------------------------------------------------------------- ;
 ; Pattern d. 32-bit.
 ; ---------------------------------------------------------------------------- ;

-define i32 @bzhi32_d0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_d0:
-; GCN: ; %bb.0:
-; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT: v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_d0:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: v_mov_b32_e32 [[BITS:v[0-9]+]], s[[NUM]]
+; GCN-NEXT: v_bfe_u32 [[RES:v[0-9]*]], s[[VAL]], 0, [[BITS]]
+; SI-NEXT: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_d0(i32 %val, i32 %numlowbits, i32 addrspace(1)* %out) {
 %numhighbits = sub i32 32, %numlowbits
 %highbitscleared = shl i32 %val, %numhighbits
 %masked = lshr i32 %highbitscleared, %numhighbits
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }

-define i32 @bzhi32_d1_indexzext(i32 %val, i8 %numlowbits) nounwind {
-; SI-LABEL: bzhi32_d1_indexzext:
-; SI: ; %bb.0:
-; SI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; SI-NEXT: v_sub_i32_e32 v1, vcc, 32, v1
-; SI-NEXT: v_and_b32_e32 v1, 0xff, v1
-; SI-NEXT: v_lshl_b32_e32 v0, v0, v1
-; SI-NEXT: v_lshr_b32_e32 v0, v0, v1
-; SI-NEXT: s_setpc_b64 s[30:31]
-;
-; VI-LABEL: bzhi32_d1_indexzext:
-; VI: ; %bb.0:
-; VI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; VI-NEXT: v_sub_u16_e32 v1, 32, v1
-; VI-NEXT: v_and_b32_e32 v1, 0xff, v1
-; VI-NEXT: v_lshlrev_b32_e32 v0, v1, v0
-; VI-NEXT: v_lshrrev_b32_e32 v0, v1, v0
-; VI-NEXT: s_setpc_b64 s[30:31]
+; AMDGPU-LABEL: bzhi32_d1_indexzext:
+; SI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x9
+; SI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0xb
+; VI-DAG: s_load_dwordx2 s{{\[}}[[VAL:[0-9]+]]:[[NUM:[0-9]+]]{{\]}}, s[0:1], 0x24
+; VI-DAG: s_load_dwordx2 [[OUT:s\[[0-9]+:[0-9]+\]]], s[0:1], 0x2c
+; GCN: s_waitcnt
+; GCN-NEXT: s_sub_i32 [[SUB:s[0-9]+]], 32, s[[NUM]]
+; GCN-NEXT: s_and_b32 [[ZEXT:s[0-9]+]], [[SUB]], 0xff
+; GCN-NEXT: s_lshl_b32 [[SHL:s[0-9]+]], s[[VAL]], [[ZEXT]]
+; GCN-NEXT: s_lshr_b32 [[SHR:s[0-9]+]], [[SHL]], [[ZEXT]]
+; GCN: v_mov_b32_e32 [[RES:v[0-9]+]], [[SHR]]
+; SI: buffer_store_dword [[RES]]
+; VI: flat_store_dword {{.*}}, [[RES]]
+define amdgpu_kernel void @bzhi32_d1_indexzext(i32 %val, i8 %numlowbits, i32 addrspace(1)* %out) {
 %numhighbits = sub i8 32, %numlowbits
 %sh_prom = zext i8 %numhighbits to i32
 %highbitscleared = shl i32 %val, %sh_prom
 %masked = lshr i32 %highbitscleared, %sh_prom
-ret i32 %masked
+store i32 %masked, i32 addrspace(1)* %out
+ret void
 }
Context not available.
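
Editorial note: the four IR shapes in this test (patterns a to d) are all ways of keeping the low `numlowbits` bits of `val`, which is why most of them are expected to select to a single `v_bfe_u32 ..., 0, ...`. The Python sketch below models each pattern on 32-bit values and compares it against a simplified bitfield-extract helper. The helper `bfe_u32` and the `pattern_*` names are hypothetical and only mirror the IR in the diff; `bfe_u32` is not a model of the exact hardware instruction. The comparison is restricted to `1 <= numlowbits <= 31`, since LLVM shifts by the full bit width produce poison.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def bfe_u32(val, offset, width):
    """Simplified unsigned bitfield extract: `width` bits of `val` from `offset`."""
    return (val >> offset) & ((1 << width) - 1)


def pattern_a(val, n):
    # %onebit = shl 1, n ; %mask = add %onebit, -1 ; %masked = and %mask, %val
    onebit = (1 << n) & MASK32
    mask = (onebit - 1) & MASK32
    return mask & val


def pattern_b(val, n):
    # %notmask = shl -1, n ; %mask = xor %notmask, -1 ; %masked = and %mask, %val
    notmask = (MASK32 << n) & MASK32
    mask = notmask ^ MASK32
    return mask & val


def pattern_c(val, n):
    # %numhighbits = sub 32, n ; %mask = lshr -1, %numhighbits ; and %mask, %val
    numhighbits = 32 - n
    mask = MASK32 >> numhighbits
    return mask & val


def pattern_d(val, n):
    # %numhighbits = sub 32, n ; shl %val, %numhighbits ; lshr by the same amount
    numhighbits = 32 - n
    highbitscleared = (val << numhighbits) & MASK32
    return highbitscleared >> numhighbits


def patterns_agree(val, n):
    """True if all four patterns match bfe_u32(val, 0, n)."""
    expected = bfe_u32(val, 0, n)
    patterns = (pattern_a, pattern_b, pattern_c, pattern_d)
    return all(p(val, n) == expected for p in patterns)
```

For example, `patterns_agree(0xDEADBEEF, 8)` holds because every pattern reduces to `0xDEADBEEF & 0xFF`. The `_indexzext` variants of patterns c and d perform the subtraction in `i8` before extending, which this sketch does not model.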