This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Make ds fp atomics overloadable
Closed, Public

Authored by rampitec on Sep 18 2020, 2:43 PM.

Details

Reviewers

arsenm
b-sumner
yaxunl

Commits

rG59691dc8740c: [AMDGPU] Make ds fp atomics overloadable

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Sep 18 2020, 2:43 PM

Herald added a project: Restricted Project. Sep 18 2020, 2:43 PM

Herald added subscribers: kerbowa, t-tye, tpr and 5 others.

rampitec requested review of this revision. Sep 18 2020, 2:43 PM

Herald added a subscriber: wdng. Sep 18 2020, 2:43 PM

arsenm added inline comments. Sep 18 2020, 4:07 PM

clang/lib/CodeGen/CGBuiltin.cpp
14772	I don't think you need a cast here (at least an addrspacecast)

rampitec marked an inline comment as done. Sep 18 2020, 4:09 PM

rampitec added inline comments.

clang/lib/CodeGen/CGBuiltin.cpp
14772	If the cast is removed, builtins-amdgcn.cu fails. That test is CUDA, where the LDS pointer is passed as flat, i.e. it arrives as a cast from addrspace(3) to flat. The generic builtin handling in this file does the same.

arsenm added inline comments. Sep 18 2020, 4:11 PM

clang/lib/CodeGen/CGBuiltin.cpp
14772	I thought these casts would be present in the AST?

rampitec marked an inline comment as done. Sep 18 2020, 4:13 PM

rampitec added inline comments.

clang/lib/CodeGen/CGBuiltin.cpp
14772	It comes as a flat pointer. I am just replicating what generic code does.

rampitec added inline comments. Sep 18 2020, 4:20 PM

clang/lib/CodeGen/CGBuiltin.cpp
14772	Check the code around line 4440 in the same file. It does even more than that.

rampitec added a reviewer: yaxunl. Sep 21 2020, 12:09 PM

yaxunl accepted this revision. Sep 23 2020, 11:37 AM

yaxunl added inline comments.

clang/lib/CodeGen/CGBuiltin.cpp
14772	There was a TargetInfo hook, getCUDABuiltinAddressSpace, which was introduced by Matt. The default implementation maps any address space to the default address space 0. For some reason, it was never implemented per target to map the address space specified by the builtin definition to the real one. As a result, all builtin functions have generic pointer parameters for CUDA, so the cast is needed here when calling the intrinsic. We could consider fixing that. For this patch, I think we still need the cast.
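
The situation described in this thread can be modeled in a few lines. This is a toy sketch, not Clang code: `AddrSpace`, `Pointer`, `builtinParamSpace`, and `castForIntrinsic` are hypothetical names introduced only for illustration. It assumes the numbering used in this review (0 for the generic/flat space, 3 for LDS) and that, under CUDA, every builtin pointer parameter ends up in space 0.

```cpp
#include <optional>
#include <string>

// Toy model only; none of these names exist in Clang or LLVM.
enum class AddrSpace : unsigned { Flat = 0, LDS = 3 };

struct Pointer {
  std::string Name;
  AddrSpace Space;
};

// Address space the frontend gives a builtin's pointer parameter.
// Assumption taken from the review: under CUDA the default hook maps every
// declared space to 0, while OpenCL keeps the declared space.
AddrSpace builtinParamSpace(AddrSpace Declared, bool IsCUDA) {
  return IsCUDA ? AddrSpace::Flat : Declared;
}

// Returns a textual description of the cast needed before calling an
// intrinsic that requires `Wanted`, or nothing when the argument is
// already in that address space.
std::optional<std::string> castForIntrinsic(const Pointer &Arg,
                                            AddrSpace Wanted) {
  if (Arg.Space == Wanted)
    return std::nullopt;
  return "addrspacecast " + Arg.Name + " to addrspace(" +
         std::to_string(static_cast<unsigned>(Wanted)) + ")";
}
```

In this model the OpenCL argument already sits in addrspace(3) and needs no cast, while the CUDA argument arrives as flat and must be cast back, which is consistent with the CUDA test failing when the cast is dropped.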

This revision is now accepted and ready to land. Sep 23 2020, 11:37 AM

Closed by commit rG59691dc8740c: [AMDGPU] Make ds fp atomics overloadable (authored by rampitec). Sep 23 2020, 11:40 AM

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rG59691dc8740c: [AMDGPU] Make ds fp atomics overloadable.

Herald added a project: Restricted Project. Sep 23 2020, 11:40 AM

Herald added a subscriber: cfe-commits.

Revision Contents

Path                                              Size
clang/lib/CodeGen/CGBuiltin.cpp                   26 lines
clang/test/CodeGenCUDA/builtins-amdgcn.cu         2 lines
clang/test/CodeGenOpenCL/builtins-amdgcn-vi.cl    6 lines
llvm/include/llvm/IR/IntrinsicsAMDGPU.td          15 lines
llvm/test/CodeGen/AMDGPU/lds_atomic_f32.ll        24 lines

Diff 293822

clang/lib/CodeGen/CGBuiltin.cpp

Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
case AMDGPU::BI__builtin_amdgcn_ds_append:		case AMDGPU::BI__builtin_amdgcn_ds_append:
case AMDGPU::BI__builtin_amdgcn_ds_consume: {		case AMDGPU::BI__builtin_amdgcn_ds_consume: {
Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?		Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?
Intrinsic::amdgcn_ds_append : Intrinsic::amdgcn_ds_consume;		Intrinsic::amdgcn_ds_append : Intrinsic::amdgcn_ds_consume;
Value *Src0 = EmitScalarExpr(E->getArg(0));		Value *Src0 = EmitScalarExpr(E->getArg(0));
Function *F = CGM.getIntrinsic(Intrin, { Src0->getType() });		Function *F = CGM.getIntrinsic(Intrin, { Src0->getType() });
return Builder.CreateCall(F, { Src0, Builder.getFalse() });		return Builder.CreateCall(F, { Src0, Builder.getFalse() });
}		}
		case AMDGPU::BI__builtin_amdgcn_ds_faddf:
		case AMDGPU::BI__builtin_amdgcn_ds_fminf:
		case AMDGPU::BI__builtin_amdgcn_ds_fmaxf: {
		Intrinsic::ID Intrin;
		switch (BuiltinID) {
		case AMDGPU::BI__builtin_amdgcn_ds_faddf:
		Intrin = Intrinsic::amdgcn_ds_fadd;
		break;
		case AMDGPU::BI__builtin_amdgcn_ds_fminf:
		Intrin = Intrinsic::amdgcn_ds_fmin;
		break;
		case AMDGPU::BI__builtin_amdgcn_ds_fmaxf:
		Intrin = Intrinsic::amdgcn_ds_fmax;
		break;
		}
		llvm::Value *Src0 = EmitScalarExpr(E->getArg(0));
		llvm::Value *Src1 = EmitScalarExpr(E->getArg(1));
		llvm::Value *Src2 = EmitScalarExpr(E->getArg(2));
		llvm::Value *Src3 = EmitScalarExpr(E->getArg(3));
		llvm::Value *Src4 = EmitScalarExpr(E->getArg(4));
		llvm::Function *F = CGM.getIntrinsic(Intrin, { Src1->getType() });
		llvm::FunctionType *FTy = F->getFunctionType();
		llvm::Type *PTy = FTy->getParamType(0);
		Src0 = Builder.CreatePointerBitCastOrAddrSpaceCast(Src0, PTy);
		return Builder.CreateCall(F, { Src0, Src1, Src2, Src3, Src4 });
		}
case AMDGPU::BI__builtin_amdgcn_read_exec: {		case AMDGPU::BI__builtin_amdgcn_read_exec: {
CallInst *CI = cast<CallInst>(		CallInst *CI = cast<CallInst>(
EmitSpecialRegisterBuiltin(*this, E, Int64Ty, Int64Ty, NormalRead, "exec"));		EmitSpecialRegisterBuiltin(*this, E, Int64Ty, Int64Ty, NormalRead, "exec"));
CI->setConvergent();		CI->setConvergent();
return CI;		return CI;
}		}
case AMDGPU::BI__builtin_amdgcn_read_exec_lo:		case AMDGPU::BI__builtin_amdgcn_read_exec_lo:
case AMDGPU::BI__builtin_amdgcn_read_exec_hi: {		case AMDGPU::BI__builtin_amdgcn_read_exec_hi: {
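
The new code passes the value operand's type to `CGM.getIntrinsic(Intrin, { Src1->getType() })`, and the updated tests expect names such as `llvm.amdgcn.ds.fmax.f32`. The following is a minimal sketch of that naming convention, assuming a single overloaded floating-point type. `FPType`, `typeSuffix`, and `overloadedName` are hypothetical helpers, not LLVM APIs; LLVM's real mangling covers many more type kinds.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helpers, not LLVM APIs: build the name of an overloaded
// intrinsic by appending a suffix for its single overloaded type.
enum class FPType { Half, Float, Double };

// Suffix LLVM uses in intrinsic names for these scalar FP types.
std::string typeSuffix(FPType Ty) {
  switch (Ty) {
  case FPType::Half:
    return "f16";
  case FPType::Float:
    return "f32";
  case FPType::Double:
    return "f64";
  }
  throw std::invalid_argument("unknown floating-point type");
}

// Base name plus "." plus the type suffix.
std::string overloadedName(const std::string &Base, FPType Ty) {
  return Base + "." + typeSuffix(Ty);
}
```

Only the `f32` form appears in this revision's tests. Whether the backend can select other types is not shown here, so the `f16` and `f64` cases above illustrate the naming only.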

clang/test/CodeGenCUDA/builtins-amdgcn.cu

	// RUN: %clang_cc1 -triple amdgcn -fcuda-is-device -emit-llvm %s -o - | FileCheck %s			// RUN: %clang_cc1 -triple amdgcn -fcuda-is-device -emit-llvm %s -o - | FileCheck %s
	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	// CHECK-LABEL: @_Z16use_dispatch_ptrPi(			// CHECK-LABEL: @_Z16use_dispatch_ptrPi(
	// CHECK: %[[PTR:.*]] = call align 4 dereferenceable(64) i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()			// CHECK: %[[PTR:.*]] = call align 4 dereferenceable(64) i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
	// CHECK: %{{.*}} = addrspacecast i8 addrspace(4)* %[[PTR]] to i8*			// CHECK: %{{.*}} = addrspacecast i8 addrspace(4)* %[[PTR]] to i8*
	__global__ void use_dispatch_ptr(int* out) {			__global__ void use_dispatch_ptr(int* out) {
	const int* dispatch_ptr = (const int*)__builtin_amdgcn_dispatch_ptr();			const int* dispatch_ptr = (const int*)__builtin_amdgcn_dispatch_ptr();
	*out = *dispatch_ptr;			*out = *dispatch_ptr;
	}			}

	// CHECK-LABEL: @_Z12test_ds_fmaxf(			// CHECK-LABEL: @_Z12test_ds_fmaxf(
	// CHECK: call contract float @llvm.amdgcn.ds.fmax(float addrspace(3)* @_ZZ12test_ds_fmaxfE6shared, float %{{[^,]*}}, i32 0, i32 0, i1 false)			// CHECK: call contract float @llvm.amdgcn.ds.fmax.f32(float addrspace(3)* @_ZZ12test_ds_fmaxfE6shared, float %{{[^,]*}}, i32 0, i32 0, i1 false)
	__global__			__global__
	void test_ds_fmax(float src) {			void test_ds_fmax(float src) {
	__shared__ float shared;			__shared__ float shared;
	volatile float x = __builtin_amdgcn_ds_fmaxf(&shared, src, 0, 0, false);			volatile float x = __builtin_amdgcn_ds_fmaxf(&shared, src, 0, 0, false);
	}			}

clang/test/CodeGenOpenCL/builtins-amdgcn-vi.cl

	// CHECK-LABEL: @test_update_dpp			// CHECK-LABEL: @test_update_dpp
	// CHECK: call i32 @llvm.amdgcn.update.dpp.i32(i32 %arg1, i32 %arg2, i32 0, i32 0, i32 0, i1 false)			// CHECK: call i32 @llvm.amdgcn.update.dpp.i32(i32 %arg1, i32 %arg2, i32 0, i32 0, i32 0, i1 false)
	void test_update_dpp(global int* out, int arg1, int arg2)			void test_update_dpp(global int* out, int arg1, int arg2)
	{			{
	*out = __builtin_amdgcn_update_dpp(arg1, arg2, 0, 0, 0, false);			*out = __builtin_amdgcn_update_dpp(arg1, arg2, 0, 0, 0, false);
	}			}

	// CHECK-LABEL: @test_ds_fadd			// CHECK-LABEL: @test_ds_fadd
	// CHECK: call float @llvm.amdgcn.ds.fadd(float addrspace(3)* %out, float %src, i32 0, i32 0, i1 false)			// CHECK: call float @llvm.amdgcn.ds.fadd.f32(float addrspace(3)* %out, float %src, i32 0, i32 0, i1 false)
	void test_ds_faddf(local float *out, float src) {			void test_ds_faddf(local float *out, float src) {
	*out = __builtin_amdgcn_ds_faddf(out, src, 0, 0, false);			*out = __builtin_amdgcn_ds_faddf(out, src, 0, 0, false);
	}			}

	// CHECK-LABEL: @test_ds_fmin			// CHECK-LABEL: @test_ds_fmin
	// CHECK: call float @llvm.amdgcn.ds.fmin(float addrspace(3)* %out, float %src, i32 0, i32 0, i1 false)			// CHECK: call float @llvm.amdgcn.ds.fmin.f32(float addrspace(3)* %out, float %src, i32 0, i32 0, i1 false)
	void test_ds_fminf(local float *out, float src) {			void test_ds_fminf(local float *out, float src) {
	*out = __builtin_amdgcn_ds_fminf(out, src, 0, 0, false);			*out = __builtin_amdgcn_ds_fminf(out, src, 0, 0, false);
	}			}

	// CHECK-LABEL: @test_ds_fmax			// CHECK-LABEL: @test_ds_fmax
	// CHECK: call float @llvm.amdgcn.ds.fmax(float addrspace(3)* %out, float %src, i32 0, i32 0, i1 false)			// CHECK: call float @llvm.amdgcn.ds.fmax.f32(float addrspace(3)* %out, float %src, i32 0, i32 0, i1 false)
	void test_ds_fmaxf(local float *out, float src) {			void test_ds_fmaxf(local float *out, float src) {
	*out = __builtin_amdgcn_ds_fmaxf(out, src, 0, 0, false);			*out = __builtin_amdgcn_ds_fmaxf(out, src, 0, 0, false);
	}			}

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

class AMDGPUAtomicIncIntrin : Intrinsic<[llvm_anyint_ty],
[IntrArgMemOnly, IntrWillReturn, NoCapture<ArgIndex<0>>,		[IntrArgMemOnly, IntrWillReturn, NoCapture<ArgIndex<0>>,
ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>, ImmArg<ArgIndex<4>>], "",		ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>, ImmArg<ArgIndex<4>>], "",
[SDNPMemOperand]		[SDNPMemOperand]
>;		>;

def int_amdgcn_atomic_inc : AMDGPUAtomicIncIntrin;		def int_amdgcn_atomic_inc : AMDGPUAtomicIncIntrin;
def int_amdgcn_atomic_dec : AMDGPUAtomicIncIntrin;		def int_amdgcn_atomic_dec : AMDGPUAtomicIncIntrin;

class AMDGPULDSF32Intrin<string clang_builtin> :		class AMDGPULDSIntrin :
GCCBuiltin<clang_builtin>,		Intrinsic<[llvm_any_ty],
Intrinsic<[llvm_float_ty],		[LLVMQualPointerType<LLVMMatchType<0>, 3>,
[LLVMQualPointerType<llvm_float_ty, 3>,		LLVMMatchType<0>,
llvm_float_ty,
llvm_i32_ty, // ordering		llvm_i32_ty, // ordering
llvm_i32_ty, // scope		llvm_i32_ty, // scope
llvm_i1_ty], // isVolatile		llvm_i1_ty], // isVolatile
[IntrArgMemOnly, IntrWillReturn, NoCapture<ArgIndex<0>>,		[IntrArgMemOnly, IntrWillReturn, NoCapture<ArgIndex<0>>,
ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>, ImmArg<ArgIndex<4>>]		ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>, ImmArg<ArgIndex<4>>]
>;		>;

// FIXME: The m0 argument should be moved after the normal arguments		// FIXME: The m0 argument should be moved after the normal arguments
Show All 28 Lines

def int_amdgcn_ds_ordered_add : AMDGPUDSOrderedIntrinsic;		def int_amdgcn_ds_ordered_add : AMDGPUDSOrderedIntrinsic;
def int_amdgcn_ds_ordered_swap : AMDGPUDSOrderedIntrinsic;		def int_amdgcn_ds_ordered_swap : AMDGPUDSOrderedIntrinsic;

// The pointer argument is assumed to be dynamically uniform if a VGPR.		// The pointer argument is assumed to be dynamically uniform if a VGPR.
def int_amdgcn_ds_append : AMDGPUDSAppendConsumedIntrinsic;		def int_amdgcn_ds_append : AMDGPUDSAppendConsumedIntrinsic;
def int_amdgcn_ds_consume : AMDGPUDSAppendConsumedIntrinsic;		def int_amdgcn_ds_consume : AMDGPUDSAppendConsumedIntrinsic;

def int_amdgcn_ds_fadd : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_faddf">;		def int_amdgcn_ds_fadd : AMDGPULDSIntrin;
def int_amdgcn_ds_fmin : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_fminf">;		def int_amdgcn_ds_fmin : AMDGPULDSIntrin;
def int_amdgcn_ds_fmax : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_fmaxf">;		def int_amdgcn_ds_fmax : AMDGPULDSIntrin;

} // TargetPrefix = "amdgcn"		} // TargetPrefix = "amdgcn"

// New-style image intrinsics		// New-style image intrinsics

//////////////////////////////////////////////////////////////////////////		//////////////////////////////////////////////////////////////////////////
// Dimension-aware image intrinsics framework		// Dimension-aware image intrinsics framework
//////////////////////////////////////////////////////////////////////////		//////////////////////////////////////////////////////////////////////////
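
In the new `AMDGPULDSIntrin` class, the result is `llvm_any_ty` and both the pointee of the first operand and the second operand are `LLVMMatchType<0>`, so all three must be the same type. The sketch below checks a call shape against that pattern. It is a toy model under that reading of the TableGen definition: `Ty` and `matchesLDSIntrin` are hypothetical names, and types are compared as plain strings rather than LLVM types.

```cpp
#include <array>
#include <string>

// Toy type description; hypothetical, for illustration only.
struct Ty {
  std::string Elem;       // scalar or pointee type, e.g. "float", "i32"
  bool IsPointer = false; // true for pointer operands
  unsigned AddrSpace = 0; // only meaningful when IsPointer is true
};

// Checks a signature against the pattern from the diff:
//   T (T addrspace(3)*, T, i32 ordering, i32 scope, i1 isVolatile)
bool matchesLDSIntrin(const Ty &Ret, const std::array<Ty, 5> &Args) {
  auto IsScalar = [](const Ty &A, const std::string &Elem) {
    return !A.IsPointer && A.Elem == Elem;
  };
  return !Ret.IsPointer &&
         // Operand 0: pointer into LDS whose pointee matches the result.
         Args[0].IsPointer && Args[0].AddrSpace == 3 &&
         Args[0].Elem == Ret.Elem &&
         // Operand 1: value operand, same type as the result.
         IsScalar(Args[1], Ret.Elem) &&
         // Operands 2..4: fixed types regardless of the overload.
         IsScalar(Args[2], "i32") && IsScalar(Args[3], "i32") &&
         IsScalar(Args[4], "i1");
}
```

A flat pointer or a pointee type that differs from the result fails the check, which mirrors why the frontend casts the pointer to the intrinsic's parameter type before emitting the call.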

llvm/test/CodeGen/AMDGPU/lds_atomic_f32.ll

	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,VI %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,VI %s
	; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s			; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s

	declare float @llvm.amdgcn.ds.fadd(float addrspace(3)* nocapture, float, i32, i32, i1)			declare float @llvm.amdgcn.ds.fadd.f32(float addrspace(3)* nocapture, float, i32, i32, i1)
	declare float @llvm.amdgcn.ds.fmin(float addrspace(3)* nocapture, float, i32, i32, i1)			declare float @llvm.amdgcn.ds.fmin.f32(float addrspace(3)* nocapture, float, i32, i32, i1)
	declare float @llvm.amdgcn.ds.fmax(float addrspace(3)* nocapture, float, i32, i32, i1)			declare float @llvm.amdgcn.ds.fmax.f32(float addrspace(3)* nocapture, float, i32, i32, i1)

	; GCN-LABEL: {{^}}lds_ds_fadd:			; GCN-LABEL: {{^}}lds_ds_fadd:
	; VI-DAG: s_mov_b32 m0			; VI-DAG: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0
	; GCN-DAG: v_mov_b32_e32 [[V0:v[0-9]+]], 0x42280000			; GCN-DAG: v_mov_b32_e32 [[V0:v[0-9]+]], 0x42280000
	; GCN: ds_add_rtn_f32 [[V2:v[0-9]+]], [[V1:v[0-9]+]], [[V0]] offset:32			; GCN: ds_add_rtn_f32 [[V2:v[0-9]+]], [[V1:v[0-9]+]], [[V0]] offset:32
	; GCN: ds_add_f32 [[V3:v[0-9]+]], [[V0]] offset:64			; GCN: ds_add_f32 [[V3:v[0-9]+]], [[V0]] offset:64
	; GCN: s_waitcnt lgkmcnt(1)			; GCN: s_waitcnt lgkmcnt(1)
	; GCN: ds_add_rtn_f32 {{v[0-9]+}}, {{v[0-9]+}}, [[V2]]			; GCN: ds_add_rtn_f32 {{v[0-9]+}}, {{v[0-9]+}}, [[V2]]
	define amdgpu_kernel void @lds_ds_fadd(float addrspace(1)* %out, float addrspace(3)* %ptrf, i32 %idx) {			define amdgpu_kernel void @lds_ds_fadd(float addrspace(1)* %out, float addrspace(3)* %ptrf, i32 %idx) {
	%idx.add = add nuw i32 %idx, 4			%idx.add = add nuw i32 %idx, 4
	%shl0 = shl i32 %idx.add, 3			%shl0 = shl i32 %idx.add, 3
	%shl1 = shl i32 %idx.add, 4			%shl1 = shl i32 %idx.add, 4
	%ptr0 = inttoptr i32 %shl0 to float addrspace(3)*			%ptr0 = inttoptr i32 %shl0 to float addrspace(3)*
	%ptr1 = inttoptr i32 %shl1 to float addrspace(3)*			%ptr1 = inttoptr i32 %shl1 to float addrspace(3)*
	%a1 = call float @llvm.amdgcn.ds.fadd(float addrspace(3)* %ptr0, float 4.2e+1, i32 0, i32 0, i1 false)			%a1 = call float @llvm.amdgcn.ds.fadd.f32(float addrspace(3)* %ptr0, float 4.2e+1, i32 0, i32 0, i1 false)
	%a2 = call float @llvm.amdgcn.ds.fadd(float addrspace(3)* %ptr1, float 4.2e+1, i32 0, i32 0, i1 false)			%a2 = call float @llvm.amdgcn.ds.fadd.f32(float addrspace(3)* %ptr1, float 4.2e+1, i32 0, i32 0, i1 false)
	%a3 = call float @llvm.amdgcn.ds.fadd(float addrspace(3)* %ptrf, float %a1, i32 0, i32 0, i1 false)			%a3 = call float @llvm.amdgcn.ds.fadd.f32(float addrspace(3)* %ptrf, float %a1, i32 0, i32 0, i1 false)
	store float %a3, float addrspace(1)* %out			store float %a3, float addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}lds_ds_fmin:			; GCN-LABEL: {{^}}lds_ds_fmin:
	; VI-DAG: s_mov_b32 m0			; VI-DAG: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0
	; GCN-DAG: v_mov_b32_e32 [[V0:v[0-9]+]], 0x42280000			; GCN-DAG: v_mov_b32_e32 [[V0:v[0-9]+]], 0x42280000
	; GCN: ds_min_rtn_f32 [[V2:v[0-9]+]], [[V1:v[0-9]+]], [[V0]] offset:32			; GCN: ds_min_rtn_f32 [[V2:v[0-9]+]], [[V1:v[0-9]+]], [[V0]] offset:32
	; GCN: ds_min_f32 [[V3:v[0-9]+]], [[V0]] offset:64			; GCN: ds_min_f32 [[V3:v[0-9]+]], [[V0]] offset:64
	; GCN: s_waitcnt lgkmcnt(1)			; GCN: s_waitcnt lgkmcnt(1)
	; GCN: ds_min_rtn_f32 {{v[0-9]+}}, {{v[0-9]+}}, [[V2]]			; GCN: ds_min_rtn_f32 {{v[0-9]+}}, {{v[0-9]+}}, [[V2]]
	define amdgpu_kernel void @lds_ds_fmin(float addrspace(1)* %out, float addrspace(3)* %ptrf, i32 %idx) {			define amdgpu_kernel void @lds_ds_fmin(float addrspace(1)* %out, float addrspace(3)* %ptrf, i32 %idx) {
	%idx.add = add nuw i32 %idx, 4			%idx.add = add nuw i32 %idx, 4
	%shl0 = shl i32 %idx.add, 3			%shl0 = shl i32 %idx.add, 3
	%shl1 = shl i32 %idx.add, 4			%shl1 = shl i32 %idx.add, 4
	%ptr0 = inttoptr i32 %shl0 to float addrspace(3)*			%ptr0 = inttoptr i32 %shl0 to float addrspace(3)*
	%ptr1 = inttoptr i32 %shl1 to float addrspace(3)*			%ptr1 = inttoptr i32 %shl1 to float addrspace(3)*
	%a1 = call float @llvm.amdgcn.ds.fmin(float addrspace(3)* %ptr0, float 4.2e+1, i32 0, i32 0, i1 false)			%a1 = call float @llvm.amdgcn.ds.fmin.f32(float addrspace(3)* %ptr0, float 4.2e+1, i32 0, i32 0, i1 false)
	%a2 = call float @llvm.amdgcn.ds.fmin(float addrspace(3)* %ptr1, float 4.2e+1, i32 0, i32 0, i1 false)			%a2 = call float @llvm.amdgcn.ds.fmin.f32(float addrspace(3)* %ptr1, float 4.2e+1, i32 0, i32 0, i1 false)
	%a3 = call float @llvm.amdgcn.ds.fmin(float addrspace(3)* %ptrf, float %a1, i32 0, i32 0, i1 false)			%a3 = call float @llvm.amdgcn.ds.fmin.f32(float addrspace(3)* %ptrf, float %a1, i32 0, i32 0, i1 false)
	store float %a3, float addrspace(1)* %out			store float %a3, float addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}lds_ds_fmax:			; GCN-LABEL: {{^}}lds_ds_fmax:
	; VI-DAG: s_mov_b32 m0			; VI-DAG: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0
	; GCN-DAG: v_mov_b32_e32 [[V0:v[0-9]+]], 0x42280000			; GCN-DAG: v_mov_b32_e32 [[V0:v[0-9]+]], 0x42280000
	; GCN: ds_max_rtn_f32 [[V2:v[0-9]+]], [[V1:v[0-9]+]], [[V0]] offset:32			; GCN: ds_max_rtn_f32 [[V2:v[0-9]+]], [[V1:v[0-9]+]], [[V0]] offset:32
	; GCN: ds_max_f32 [[V3:v[0-9]+]], [[V0]] offset:64			; GCN: ds_max_f32 [[V3:v[0-9]+]], [[V0]] offset:64
	; GCN: s_waitcnt lgkmcnt(1)			; GCN: s_waitcnt lgkmcnt(1)
	; GCN: ds_max_rtn_f32 {{v[0-9]+}}, {{v[0-9]+}}, [[V2]]			; GCN: ds_max_rtn_f32 {{v[0-9]+}}, {{v[0-9]+}}, [[V2]]
	define amdgpu_kernel void @lds_ds_fmax(float addrspace(1)* %out, float addrspace(3)* %ptrf, i32 %idx) {			define amdgpu_kernel void @lds_ds_fmax(float addrspace(1)* %out, float addrspace(3)* %ptrf, i32 %idx) {
	%idx.add = add nuw i32 %idx, 4			%idx.add = add nuw i32 %idx, 4
	%shl0 = shl i32 %idx.add, 3			%shl0 = shl i32 %idx.add, 3
	%shl1 = shl i32 %idx.add, 4			%shl1 = shl i32 %idx.add, 4
	%ptr0 = inttoptr i32 %shl0 to float addrspace(3)*			%ptr0 = inttoptr i32 %shl0 to float addrspace(3)*
	%ptr1 = inttoptr i32 %shl1 to float addrspace(3)*			%ptr1 = inttoptr i32 %shl1 to float addrspace(3)*
	%a1 = call float @llvm.amdgcn.ds.fmax(float addrspace(3)* %ptr0, float 4.2e+1, i32 0, i32 0, i1 false)			%a1 = call float @llvm.amdgcn.ds.fmax.f32(float addrspace(3)* %ptr0, float 4.2e+1, i32 0, i32 0, i1 false)
	%a2 = call float @llvm.amdgcn.ds.fmax(float addrspace(3)* %ptr1, float 4.2e+1, i32 0, i32 0, i1 false)			%a2 = call float @llvm.amdgcn.ds.fmax.f32(float addrspace(3)* %ptr1, float 4.2e+1, i32 0, i32 0, i1 false)
	%a3 = call float @llvm.amdgcn.ds.fmax(float addrspace(3)* %ptrf, float %a1, i32 0, i32 0, i1 false)			%a3 = call float @llvm.amdgcn.ds.fmax.f32(float addrspace(3)* %ptrf, float %a1, i32 0, i32 0, i1 false)
	store float %a3, float addrspace(1)* %out			store float %a3, float addrspace(1)* %out
	ret void			ret void
	}			}
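
Two constants in the CHECK lines above follow directly from the IR and can be verified by hand. `0x42280000` is the IEEE-754 single-precision encoding of `4.2e+1`, and the `offset:32` and `offset:64` fields come from `(idx + 4) << 3` and `(idx + 4) << 4`, which split into a variable base plus a constant. The sketch below reproduces that arithmetic. The function names are hypothetical, and it does not model how the backend decides the fold is legal.

```cpp
#include <cstdint>
#include <cstring>

// Returns the IEEE-754 single-precision bit pattern of a float.
std::uint32_t floatBits(float F) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "expects a 32-bit float");
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof Bits);
  return Bits;
}

// Address as written in the tests: (idx + 4) << Shift.
std::uint32_t ldsAddress(std::uint32_t Idx, unsigned Shift) {
  return (Idx + 4u) << Shift;
}

// The same address split into a variable base and a constant offset.
std::uint32_t baseAddress(std::uint32_t Idx, unsigned Shift) {
  return Idx << Shift;
}

std::uint32_t constantOffset(unsigned Shift) {
  return 4u << Shift; // 32 for Shift == 3, 64 for Shift == 4
}
```

For any `Idx` small enough that the shift does not overflow, `ldsAddress(Idx, Shift)` equals `baseAddress(Idx, Shift) + constantOffset(Shift)`, which is the form the `ds_*_f32` instructions express through their `offset` field.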