This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Generate range metadata for workitem id
ClosedPublic

Authored by rampitec on Apr 7 2017, 1:54 AM.

Details

Summary

If the workgroup size is known, inform LLVM about the range returned by local
id and local size queries.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Apr 7 2017, 1:54 AM
arsenm edited edge metadata.Apr 10 2017, 10:47 AM

Doesn't the library already annotate these with the range metadata? We should probably tighten those bounds in a pass when the required workgroup size is known from the IR metadata.

Generally the library cannot know the workgroup size; it is an attribute on a kernel. Clang then produces amdgpu_flat_work_group_size, which is processed here. Too bad it is flat. There is also the OpenCL-specific reqd_work_group_size attribute, which is currently flattened and translated into amdgpu_flat_work_group_size by clang. Technically it should be possible to get a more precise range by processing the OpenCL-specific reqd_work_group_size, but practically we do not support flat sizes above 256, and AssertZExt cannot give a better range representation than 'extend from byte' anyway. computeKnownBits could do it better, but it would need to process a target opcode, whereas after lowering it is just a load.

On a side note, there are other calls which could be simplified, like get_local_size(). I do not know how to do that though, because in the library these are still just loads; they have neither intrinsics nor target opcodes.
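
For example, with __attribute__((reqd_work_group_size(64, 2, 1))) the flat size is 64*2*1 = 128, and clang emits roughly the following (a sketch; the exact metadata numbering is illustrative):

define amdgpu_kernel void @k() #0 !reqd_work_group_size !0 {
  ret void
}

attributes #0 = { "amdgpu-flat-work-group-size"="128,128" }

!0 = !{i32 64, i32 2, i32 1}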

The library can use the hardware maximum (which I think it does already), and a pass that knows about the attribute can further reduce it. It can do better than 'extend from byte'; it isn't limited to MVT types. Range metadata is already generically lowered to AssertZExt at an arbitrary bitwidth.

Doing it here doesn't really change anything fundamentally, but fixing the range metadata will give the IR passes the same benefit and also wouldn't require reimplementing the logic to turn the range into AssertZExt.
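
For example, if a pass knew the flat workgroup size was at most 128, it could put the range directly on the intrinsic call (a minimal sketch, not any pass's actual output):

declare i32 @llvm.amdgcn.workitem.id.x()

define amdgpu_kernel void @k(i32 addrspace(1)* %out) {
  %id = tail call i32 @llvm.amdgcn.workitem.id.x(), !range !0
  store i32 %id, i32 addrspace(1)* %out, align 4
  ret void
}

!0 = !{i32 0, i32 128} ; ids are 0..127; range metadata is half-open [Lo, Hi)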

I do not see any range metadata, and I also do not think using the HW maximum is the right way to go; a kernel attribute can generally limit it more. For example:

__attribute__((reqd_work_group_size(128, 1, 1)))
kernel void zext_grp_size_256(global uint *a) {
  a[0] = get_local_id(0) & 0xff;
}

compiled to:

; Function Attrs: nounwind
define amdgpu_kernel void @zext_grp_size_256(i32 addrspace(1)* nocapture %a) local_unnamed_addr #0 !kernel_arg_addr_space !2 !kernel_arg_access_qual !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_name !6 !reqd_work_group_size !7 {
entry:
  %call = tail call i64 @_Z12get_local_idj(i32 0) #2
  %0 = trunc i64 %call to i32
  %conv = and i32 %0, 255
  store i32 %conv, i32 addrspace(1)* %a, align 4, !tbaa !8
  ret void
}

; Function Attrs: alwaysinline nounwind readnone
define linkonce_odr protected i64 @_Z12get_local_idj(i32) local_unnamed_addr #1 {
  %2 = tail call i64 @__ockl_get_local_id(i32 %0) #2
  ret i64 %2
}

attributes #1 = { alwaysinline nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-fp16-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" }

attributes #2 = { nounwind readnone }

BTW, I do not see how to use AssertZExt with an arbitrary bitwidth...

It is probably possible to extend the AMDGPULowerIntrinsics pass to generate range metadata. It runs before the inliner pass, but since it runs after opt it should be reasonable to assume these calls are already inlined.
What do you think, Matt?

> I do not see any range metadata, and I also do not think using the HW maximum is the right way to go; a kernel attribute can generally limit it more. For example:

I'm not saying the hardware maximum is the final answer, but it is a useful starting point when there is no fixed workgroup size.

> __attribute__((reqd_work_group_size(128, 1, 1)))
> kernel void zext_grp_size_256(global uint *a) {
>   a[0] = get_local_id(0) & 0xff;
> }
>
> compiled to:
>
> ; Function Attrs: nounwind
> define amdgpu_kernel void @zext_grp_size_256(i32 addrspace(1)* nocapture %a) local_unnamed_addr #0 !kernel_arg_addr_space !2 !kernel_arg_access_qual !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_name !6 !reqd_work_group_size !7 {
> entry:
>   %call = tail call i64 @_Z12get_local_idj(i32 0) #2
>   %0 = trunc i64 %call to i32
>   %conv = and i32 %0, 255
>   store i32 %conv, i32 addrspace(1)* %a, align 4, !tbaa !8
>   ret void
> }
>
> ; Function Attrs: alwaysinline nounwind readnone
> define linkonce_odr protected i64 @_Z12get_local_idj(i32) local_unnamed_addr #1 {
>   %2 = tail call i64 @__ockl_get_local_id(i32 %0) #2
>   ret i64 %2
> }

You need to look a level below this. Ideally these would be annotated as well, but I think just the final intrinsic call has it. Range metadata can also apply to loads, so it works for the library's use for the sizes read out of the dispatch packet.
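
For instance, the workgroup size x read out of the dispatch packet (e.g. via the pointer from llvm.amdgcn.dispatch.ptr) could be annotated like this (a sketch; the packet offset and address space are illustrative):

define i16 @group_size_x(i8 addrspace(2)* %dispatch.ptr) {
  %gep = getelementptr inbounds i8, i8 addrspace(2)* %dispatch.ptr, i64 4
  %cast = bitcast i8 addrspace(2)* %gep to i16 addrspace(2)*
  %size.x = load i16, i16 addrspace(2)* %cast, align 4, !range !0
  ret i16 %size.x
}

!0 = !{i16 1, i16 257} ; a valid workgroup size is 1..256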

> BTW, I do not see how to use AssertZExt with an arbitrary bitwidth...

You can get a value type with an EVT. SelectionDAGBuilder::lowerRangeToAssertZExt does this.

> It is probably possible to extend the AMDGPULowerIntrinsics pass to generate range metadata. It runs before the inliner pass, but since it runs after opt it should be reasonable to assume these calls are already inlined.
> What do you think, Matt?

That might be a place to do it. I think we would want this done earlier, although then there are call graph problems with multiple kernels

I do not see range info anywhere... In fact, it is only generated in AMDGPUPromoteAlloca, for the calls newly inserted during alloca handling.
Anyway, I will try to patch AMDGPULowerIntrinsics now.

rampitec updated this revision to Diff 94757.Apr 10 2017, 5:10 PM
rampitec retitled this revision from [AMDGPU] zero extend workitem id to [AMDGPU] Generate range metadata for workitem id.
rampitec edited the summary of this revision. (Show Details)

Changed the approach to generate range metadata.
Added processing of reqd_work_group_size to refine the results for individual dimensions.
Created a common method in the subtarget to serve all the places where we use it.
Switched the promote alloca pass to the new method; this refines the produced ranges compared to the previous HW limit.
This also fixes a bug in the range info produced by the promote alloca pass: range metadata is [Lo, Hi), but it was incorrectly generated as [0, 2048). Note that for an ID query the range shall be one less than for a size query, yet the same range was produced for both; i.e. if the size could really be 2048, the local size range would be incorrectly assumed to be [0..2047].
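
In half-open [Lo, Hi) form the two queries need different upper bounds, e.g. for a maximum flat size of 2048 (illustrative):

!0 = !{i32 0, i32 2048} ; id query:   values 0..2047
!1 = !{i32 0, i32 2049} ; size query: values 0..2048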

rampitec updated this revision to Diff 94779.Apr 10 2017, 10:00 PM

Fixed a bug in the previous revision: promote alloca should not set a range on the XY component of the local size, since it actually loads two lanes.

rampitec updated this revision to Diff 94874.Apr 11 2017, 12:27 PM

Also produce lower bound range info when possible.

arsenm added inline comments.Apr 11 2017, 1:54 PM
lib/Target/AMDGPU/AMDGPULowerIntrinsics.cpp
106 ↗(On Diff #94874)

Lowercase first letter

lib/Target/AMDGPU/AMDGPUSubtarget.cpp
251–255 ↗(On Diff #94874)

This seems to be unconditionally adding invariant load metadata. Why? This is broken and also unrelated to the range metadata

lib/Target/AMDGPU/AMDGPUSubtarget.h
517 ↗(On Diff #94874)

Putting this in the subtarget is a weird place. Why not leave it in the pass?

test/CodeGen/AMDGPU/zext-lid.ll
1 ↗(On Diff #94874)

This should be a test running the IR pass, with more checks for the specific ranges added

rampitec marked 2 inline comments as done.Apr 11 2017, 2:01 PM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUSubtarget.h
517 ↗(On Diff #94874)

I need to access it from the intrinsic lowering and from promote alloca, so I needed some kind of utility function.

rampitec updated this revision to Diff 94889.Apr 11 2017, 2:15 PM

Renamed function and moved invariant load meta back into promote alloca pass.

test/CodeGen/AMDGPU/zext-lid.ll
1 ↗(On Diff #94874)

How do you propose to run it? With opt?

test/CodeGen/AMDGPU/zext-lid.ll
1 ↗(On Diff #94874)

Actually I have a problem here:
opt -S -mtriple=amdgcn-- -amdgpu-lower-intrinsics
In this situation opt does not create a TargetMachine, so the pass cannot do anything.

rampitec updated this revision to Diff 94896.Apr 11 2017, 2:41 PM

Added a check to not crash if the TM is not created.

rampitec updated this revision to Diff 94898.Apr 11 2017, 3:14 PM
rampitec marked 3 inline comments as done.

Added an IR pass run to the test, plus checks for the specific ranges.
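
For example, the IR run line and one of the range checks might look roughly like this (a sketch, not the exact test contents):

; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-intrinsics < %s | FileCheck -check-prefix=OPT %s

; OPT: tail call i32 @llvm.amdgcn.workitem.id.x(), !range ![[RANGE:[0-9]+]]
; OPT: ![[RANGE]] = !{i32 0, i32 256}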

arsenm accepted this revision.Apr 12 2017, 1:24 PM

LGTM

This revision is now accepted and ready to land.Apr 12 2017, 1:24 PM
This revision was automatically updated to reflect the committed changes.