This is an archive of the discontinued LLVM Phabricator instance.

Differential D154858

[AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic.
Closed, Public

Authored by pravinjagtap on Jul 10 2023, 9:05 AM.

Details

Reviewers

arsenm
yassingh
b-sumner
foad
cdevadas

Group Reviewers

Restricted Project

Commits

rGc48ed93cf8c9: [AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic.

Summary

When the input to the intrinsic is a uniform value, the reduced value is
the same as the input, whereas if the input value is divergent we need
to iterate over all the active lanes to perform the reduction.

The control flow for a loop has been set up, which
iterates over only active lanes to perform reduction.

Introduced the WAVE_REDUCE_UMIN_PSEUDO_U32 and
WAVE_REDUCE_UMAX_PSEUDO_U32 pseudos, which
are lowered post-ISel (in EmitInstrWithCustomInserter).
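
The loop described in the summary can be modelled on the host as a rough sketch. This is not the LLVM lowering itself: `waveReduce`, `findFirstSetBit`, `ReduceOp` and the vector-of-lanes representation are hypothetical names chosen for illustration, and the identity value 0 for umax is an assumption based on ordinary unsigned arithmetic rather than on anything shown in this revision.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical host-side model of the iterative wave reduction; not LLVM code.
enum class ReduceOp { UMin, UMax };

// Index of the lowest set bit (the role s_ff1 plays in the lowering).
// Requires mask != 0.
static unsigned findFirstSetBit(uint64_t mask) {
  assert(mask != 0 && "no active lane left");
  unsigned index = 0;
  while ((mask & 1) == 0) {
    mask >>= 1;
    ++index;
  }
  return index;
}

// laneValues[i] is the value held by lane i; bit i of exec is set when
// lane i is active. Only active lanes contribute to the result.
uint32_t waveReduce(const std::vector<uint32_t> &laneValues, uint64_t exec,
                    ReduceOp op) {
  // Start from the identity: all ones for umin, zero for umax (assumption).
  uint32_t accumulator = (op == ReduceOp::UMin)
                             ? std::numeric_limits<uint32_t>::max()
                             : 0u;
  uint64_t activeBits = exec; // copy of EXEC used as the induction variable
  while (activeBits != 0) {
    unsigned lane = findFirstSetBit(activeBits);
    assert(lane < laneValues.size() && "EXEC names a lane with no value");
    uint32_t laneValue = laneValues[lane]; // the role of v_readlane_b32
    accumulator = (op == ReduceOp::UMin) ? std::min(accumulator, laneValue)
                                         : std::max(accumulator, laneValue);
    activeBits &= ~(uint64_t{1} << lane); // the role of s_bitset0
  }
  return accumulator;
}
```

With lane values {5, 3, 9, 1} and only lanes 0 to 2 active (exec = 0x7), the umin result is 3 because the inactive lane holding 1 is skipped. A uniform input, where every lane holds the same value, reduces to that value, which is why the SGPR case needs no loop.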

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,060 ms	x64 debian > ThreadSanitizer-x86_64.ThreadSanitizer-x86_64::restore_stack.cpp

Event Timeline

pravinjagtap created this revision. Jul 10 2023, 9:05 AM

Herald added a project: Restricted Project. Jul 10 2023, 9:05 AM

Herald added subscribers: foad, kerbowa, hiraditya and 6 others.

pravinjagtap requested review of this revision. Jul 10 2023, 9:05 AM

Herald added a project: Restricted Project. Jul 10 2023, 9:05 AM

Herald added subscribers: llvm-commits, wdng.

pravinjagtap added reviewers: arsenm, yassingh, b-sumner, foad. Jul 10 2023, 9:06 AM

Herald added a subscriber: StephenFan. Jul 10 2023, 9:06 AM

I was thinking an IR expansion would be easier, but it's good to have a machine one (at least for umin)

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1929	Should have a mangled type. I also think it should have an immarg operand for the preferred lowering strategy to use. Also, wave_reduce? Also umin would be a better choice for a first one, given that we want it for dynamic alloca handling
llvm/lib/Target/AMDGPU/SILowerReduceAndScanPseudo.cpp
1 ↗	(On Diff #538679)	Missing header
87 ↗	(On Diff #538679)	There's supposed to be a getWaveRegClass to go through
172 ↗	(On Diff #538679)	This doesn't need to be a separate pass, can be a post isel hook
llvm/test/CodeGen/AMDGPU/llvm.amdgpu.reduce.ll
2 ↗	(On Diff #538679)	test with -global-isel=1/0
19 ↗	(On Diff #538679)	Also add poison and constant tests
25 ↗	(On Diff #538679)	Also add a test where this is under divergent control flow
52 ↗	(On Diff #538679)	Should strip out most of this test
73 ↗	(On Diff #538679)	Drop this, it's redundant with the run line target and breaks adding multiple run targets

Can add MIR tests.

llvm/lib/Target/AMDGPU/SILowerReduceAndScanPseudo.cpp
8 ↗	(On Diff #538679)	Some description about what the pass will do? Or function comment if this is not implemented as a pass.
45–48 ↗	(On Diff #538679)	INITIALIZE_PASS(SIExpandReduceAndScanPseudo, DEBUG_TYPE, "Expand Reduction and Scan Pseudos", false, false)

Harbormaster completed remote builds in B244169: Diff 538679. Jul 10 2023, 10:34 AM

pravinjagtap added inline comments. Jul 10 2023, 11:44 PM

llvm/lib/Target/AMDGPU/SILowerReduceAndScanPseudo.cpp
172 ↗	(On Diff #538679)	Are you referring to the `EmitInstrWithCustomInserter` API, where other pseudos are expanded?

arsenm added inline comments. Jul 11 2023, 5:58 PM

llvm/lib/Target/AMDGPU/SILowerReduceAndScanPseudo.cpp
172 ↗	(On Diff #538679)	Yes, that's generally where the pseudos to hack around the DAG not handling control flow go

Addressed review comments @arsenm.

Implemented umin using post isel hook

pravinjagtap added inline comments. Jul 12 2023, 4:07 AM

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1929	"I also think it should have an immarg operand for the preferred lowering strategy to use": In that case, we need to create two different intrinsics and two pseudo operations, one for an immediate operand and the other for a non-immediate operand. Also, the reduction of a scalar or immediate value is that value itself, so do we really need lowering for this?

foad added inline comments. Jul 12 2023, 4:45 AM

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1929	I think @arsenm meant that the intrinsic should take an extra `immarg i32 %strategy` argument.

In D154858#4492965, @pravinjagtap wrote:

Addressed review comments @arsenm.

Implemented umin using post isel hook

Sorry, I meant umax. We need umax for alloca, not umin

arsenm added inline comments. Jul 12 2023, 4:47 AM

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1929	Yes, so you have a way of requesting the DPP or WWM lowering etc. It doesn't change the main operand

Harbormaster completed remote builds in B244739: Diff 539486. Jul 12 2023, 7:03 AM

Added support for umax

arsenm added inline comments. Jul 12 2023, 7:51 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4089	I was envisioning this as just a hint, and if unimplemented (or the target doesn't support the version), it would just fall back to one that works. Should also add some intrinsic documentation to AMDGPUUsage with the values for this
llvm/lib/Target/AMDGPU/SIInstructions.td
267	These need _U32/_B32 suffixes
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll
2 ↗	(On Diff #539555)	Should test with both wave sizes, and test for every generation, with global-isel=0 and 1
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll
5	Put the immarg on the declarations
127	Use named values

arsenm added inline comments. Jul 12 2023, 7:58 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4089	Also split the argument to strategy decision to a separate function

Harbormaster completed remote builds in B244789: Diff 539555. Jul 12 2023, 11:37 AM

Addressed review comments of @arsenm

Extended support global isel
Updated test

arsenm added inline comments. Jul 13 2023, 6:19 AM

llvm/docs/AMDGPUUsage.rst
984 ↗	(On Diff #539969)	Elaborate that it should work if the target doesn't support the mode (e.g. gfx6/7 have no DPP)
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1933–1937	Define an intrinsic class for these to avoid repeating the signature each time. Also you still should use a type mangled argument instead of hardcoded i32.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4068	static, start with lowercase
4079	static, start with lowercase

arsenm requested changes to this revision. Jul 13 2023, 6:38 AM

This revision now requires changes to proceed. Jul 13 2023, 6:38 AM

pravinjagtap retitled this revision from [WIP] [AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic. to [AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic. Jul 13 2023, 6:38 AM

pravinjagtap edited the summary of this revision.

pravinjagtap added a reviewer: Restricted Project.

Harbormaster completed remote builds in B245074: Diff 539969. Jul 13 2023, 7:43 AM

arsenm added inline comments. Jul 13 2023, 12:20 PM

llvm/docs/AMDGPUUsage.rst
984 ↗	(On Diff #539969)	The default 0 should mean target default preference. The higher values should request a specific strategy

Need MIR tests for pseudo expansion

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4521–4528 ↗	(On Diff #539969)
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4113	typo `iterative`
4115	same
4150	ExecReg

Addressed review comments.

Harbormaster completed remote builds in B245311: Diff 540307. Jul 14 2023, 2:04 AM

Added MIR tests

Harbormaster completed remote builds in B245340: Diff 540341. Jul 14 2023, 4:05 AM

arsenm added inline comments. Jul 17 2023, 4:39 PM

llvm/docs/AMDGPUUsage.rst
982 ↗	(On Diff #540341)	Missing wave from the name. Also, probably should spell out each one individually rather than putting a / in the names
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4081	See my above comment, 0 should be auto

Addressed review comments

Harbormaster completed remote builds in B246076: Diff 541326. Jul 18 2023, 2:33 AM

pravinjagtap added a reviewer: cdevadas. Jul 19 2023, 9:25 PM

arsenm added inline comments. Jul 20 2023, 4:25 PM

llvm/docs/AMDGPUUsage.rst
996 ↗	(On Diff #541326)	unsigned minimum
1005 ↗	(On Diff #541326)	unsigned maximum
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1936	Comment doesn't match the description now
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4069	No point in this wrapper, whenever the new implementation arrives it will add the check
4087	Just return true/false?
4092	Just use the default 0

Addressed review comments.

For now, all the cases (default, iterative and DPP) use the
iterative approach. When the DPP lowering arrives, a strategy
switch needs to be added to decide which implementation to use.
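
A minimal sketch of the "hint with fallback" behaviour discussed in this thread, assuming a hypothetical `pickStrategy` helper and `TargetCaps` struct. The review only fixes 0 as the target default; the numeric values 1 (iterative) and 2 (DPP) and the capability flags are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding of the immarg strategy operand. Only "0 means target
// default" comes from the review; the other values are assumed.
enum class WaveReduceStrategy : uint32_t { Default = 0, Iterative = 1, DPP = 2 };

// Hypothetical description of what a target can do.
struct TargetCaps {
  bool hasDPP;               // e.g. false on targets without DPP
  bool dppReduceImplemented; // false until a DPP lowering exists
};

// The operand is only a hint: any request that is unsupported, unimplemented
// or unknown falls back to the iterative lowering, which always works.
WaveReduceStrategy pickStrategy(uint32_t hint, const TargetCaps &caps) {
  if (hint == static_cast<uint32_t>(WaveReduceStrategy::DPP) && caps.hasDPP &&
      caps.dppReduceImplemented)
    return WaveReduceStrategy::DPP;
  return WaveReduceStrategy::Iterative;
}
```

Under these assumptions, a DPP request on a target without DPP still compiles and simply uses the iterative loop, which matches the reviewer's request that the operand never cause a failure.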

Harbormaster completed remote builds in B247109: Diff 542773. Jul 21 2023, 1:08 AM

Mostly lgtm with a few more cleanups

llvm/docs/AMDGPUUsage.rst
1014 ↗	(On Diff #542773)	Probably should mention it's currently only implemented for i32
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1932	llvm_anyint_ty
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4106	ST.getWaveMaskRegClass
4129	uint32_t? std::numeric_limits<uint32_t>::max()?
4130	No & on the result of any BuildMI
4149	No &
4151	No &
4163	No &
4182	Could have just use the original register to begin with?
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll
7 ↗	(On Diff #542773)	don't specify the wavefrontsize features twice, just use the wave64 override and assume wave32 by default
313 ↗	(On Diff #542773)	In a follow up commit, AMDGPUInstCombineIntrinsic should also fold these constant cases out

Addressed review comments. Mostly, Code cleanup.

arsenm accepted this revision. Jul 21 2023, 11:54 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4068	lowerWaveReduce?
4098–4103	Can use C++17 binding
4119	Reuse the same getRegClass call
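
The "C++17 binding" suggestion refers to structured bindings replacing a pair of pre-declared variables plus `std::tie`. A small standalone sketch of the two forms follows; `splitForLoop` is a hypothetical stand-in that returns two strings, not the real `splitBlockForLoop`.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical stand-in for a function that returns two related results.
std::pair<std::string, std::string> splitForLoop(const std::string &name) {
  return {name + ".loop", name + ".end"};
}

// Pre-C++17 style: declare both variables first, then assign through std::tie.
std::string viaTie(const std::string &name) {
  std::string loop, end;
  std::tie(loop, end) = splitForLoop(name);
  return loop + "," + end;
}

// C++17 structured binding: declaration and unpacking in one statement.
std::string viaBinding(const std::string &name) {
  auto [loop, end] = splitForLoop(name);
  return loop + "," + end;
}
```

Both functions produce the same result; the binding form just removes the separate declarations.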

This revision is now accepted and ready to land. Jul 21 2023, 11:54 AM

As a follow up please do the constant folds in AMDGPUInstCombine. Also can you prepare another to introduce this in the lowering of divergent dynamic alloca?

Harbormaster completed remote builds in B247296: Diff 543018. Jul 21 2023, 6:55 PM

Addressed comments

Harbormaster completed remote builds in B247375: Diff 543145. Jul 22 2023, 5:12 AM

This revision was landed with ongoing or failed builds. Jul 23 2023, 9:11 PM

Closed by commit rGc48ed93cf8c9: [AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic. (authored by pravinjagtap).

This revision was automatically updated to reflect the committed changes.

pravinjagtap added a commit: rGc48ed93cf8c9: [AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic..

In D154858#4523488, @arsenm wrote:

As a follow up please do the constant folds in AMDGPUInstCombine. Also can you prepare another to introduce this in the lowering of divergent dynamic alloca?

Constant folds: D156077. Will start looking into the lowering of divergent dynamic alloca.

I think this breaks the expensive-checks CI:
https://lab.llvm.org/buildbot/#/builders/16/builds/51955

In D154858#4527008, @steakhal wrote:

I think this breaks the expensive-checks CI:
https://lab.llvm.org/buildbot/#/builders/16/builds/51955

Hello @steakhal, I am looking into it.

In D154858#4527185, @pravinjagtap wrote:

Hello @steakhal, I am looking into it.

Unless you think you've almost got it solved, can you revert the changes so the bots go back to green?

In D154858#4527699, @aaron.ballman wrote:

Unless you think you've almost got it solved, can you revert the changes so the bots go back to green?

Fix : https://reviews.llvm.org/rGd163b76ce348516db7abe3a462ae4cb78f922c75

CC: @steakhal, @aaron.ballman

In D154858#4528245, @pravinjagtap wrote:

Fix : https://reviews.llvm.org/rGd163b76ce348516db7abe3a462ae4cb78f922c75

CC: @steakhal, @aaron.ballman

Thank you! I can confirm this resolved the issues I was seeing.

Revision Contents

Path	Size
llvm/include/llvm/IR/IntrinsicsAMDGPU.td	4 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	121 lines
llvm/lib/Target/AMDGPU/SIInstructions.td	7 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll	138 lines

Diff 539486

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

    Intrinsic<[llvm_anyint_ty], [llvm_anyfloat_ty, LLVMMatchType<1>, llvm_i32_ty],
              [IntrNoMem, IntrConvergent,
               ImmArg<ArgIndex<2>>, IntrWillReturn, IntrNoCallback, IntrNoFree]>;

  def int_amdgcn_ballot :
    Intrinsic<[llvm_anyint_ty], [llvm_i1_ty],
              [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;

  def int_amdgcn_inverse_ballot :
    Intrinsic<[llvm_i1_ty], [llvm_anyint_ty],
arsenm: Should have a mangled type. I also think it should have an immarg operand for the preferred lowering strategy to use. Also, wave_reduce? Also umin would be a better choice for a first one, given that we want it for dynamic alloca handling
pravinjagtap: "I also think it should have an immarg operand for the preferred lowering strategy to use": In that case, we need to create two different intrinsics and two pseudo operations, one for an immediate operand and the other for a non-immediate operand. Also, the reduction of a scalar or immediate value is that value itself, so do we really need lowering for this?
foad: I think @arsenm meant that the intrinsic should take an extra `immarg i32 %strategy` argument.
arsenm: Yes, so you have a way of requesting the DPP or WWM lowering etc. It doesn't change the main operand
              [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;

+ def int_amdgcn_wave_reduce_umin :
arsenm: llvm_anyint_ty
+   Intrinsic<[llvm_i32_ty], [llvm_i32_ty],
+             [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+
  def int_amdgcn_readfirstlane :
arsenm: Comment doesn't match the description now
    ClangBuiltin<"__builtin_amdgcn_readfirstlane">,
arsenm: Define an intrinsic class for these to avoid repeating the signature each time. Also you still should use a type mangled argument instead of hardcoded i32.
    Intrinsic<[llvm_i32_ty], [llvm_i32_ty],
              [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;

  // The lane argument must be uniform across the currently active threads of the
  // current wave. Otherwise, the result is undefined.
  def int_amdgcn_readlane :
    ClangBuiltin<"__builtin_amdgcn_readlane">,
    Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

      BuildMI(*LoopBB, InsPt, DL, MovRelDesc, Dst)
          .add(*Val)
          .addImm(AMDGPU::sub0);
    }

    MI.eraseFromParent();
    return LoopBB;
  }

  MachineBasicBlock *SITargetLowering::EmitInstrWithCustomInserter(
arsenm: static, start with lowercase
arsenm: lowerWaveReduce?
      MachineInstr &MI, MachineBasicBlock *BB) const {
arsenm: No point in this wrapper, whenever the new implementation arrives it will add the check
    const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
    MachineFunction *MF = BB->getParent();
    SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();

    switch (MI.getOpcode()) {
+   case AMDGPU::WAVE_REDUCE_UMIN_PSEUDO: {
+     MachineRegisterInfo &MRI = BB->getParent()->getRegInfo();
+     const GCNSubtarget &ST = MF->getSubtarget<GCNSubtarget>();
arsenm: static, start with lowercase
+     const SIRegisterInfo *TRI = ST.getRegisterInfo();
+     const DebugLoc &DL = MI.getDebugLoc();
arsenm: See my above comment, 0 should be auto
+     // Reduction operations depend on whether the input operand is SGPR or VGPR.
+     bool isSGPR = TRI->isSGPRClass(MRI.getRegClass(SrcReg));
+     MachineBasicBlock *RetBB = nullptr;
arsenm: Just return true/false?
+     if (isSGPR) {
+       // These operations with a uniform value i.e. SGPR are idempotent.
arsenm: I was envisioning this as just a hint, and if unimplemented (or the target doesn't support the version), it would just fallback to one that works. Should also add some intrinsic documentation to AMDGPUUsage with the values for this
arsenm: Also split the argument to strategy decision to a separate function
+       // Reduced value will be same as given sgpr.
+       BuildMI(*BB, MI, DL, TII->get(AMDGPU::S_MOV_B32), DstReg).addReg(SrcReg);
+       RetBB = BB;
arsenm: Just use the default 0
+     } else {
+       // To reduce the VGPR, we need to iterative over all the active lanes.
+       // Lowering consists of ComputeLoop, which iterative over only active
+       // lanes. We use copy of EXEC register as induction variable and
+       // every active lane modifies it using bitset0 so that
+       // we will get the next active lane for next iteration.
+       MachineBasicBlock::iterator I = BB->end();
+       // Create Control flow for loop
+       MachineBasicBlock *ComputeLoop;
arsenm: Can use C++17 binding. Suggested edit:
        - // Create Control flow for loop
        - MachineBasicBlock *ComputeLoop;
        - MachineBasicBlock *ComputeEnd;
          // Split MI's Machine Basic block into For loop
        - std::tie(ComputeLoop, ComputeEnd) = splitBlockForLoop(MI, BB, true);
        + auto [ComputeLoop, ComputeEnd] = splitBlockForLoop(MI, BB, true);
          bool IsWave32 = ST.isWave32();
+       MachineBasicBlock *ComputeEnd;
+       // Split MI's Machine Basic block into For loop
arsenm: ST.getWaveMaskRegClass
+       std::tie(ComputeLoop, ComputeEnd) = splitBlockForLoop(MI, *BB, true);
+       bool IsWave32 = ST.isWave32();
+       const TargetRegisterClass *RegClass =
+           IsWave32 ? &AMDGPU::SReg_32RegClass : &AMDGPU::SReg_64RegClass;
+       // Create Registers required for lowering.
yassingh: typo `iterative`
yassingh: same
+           MRI.createVirtualRegister(MRI.getRegClass(DstReg));
arsenm: Reuse the same getRegClass call
+           MRI.createVirtualRegister(MRI.getRegClass(DstReg));
arsenm: uint32_t? std::numeric_limits<uint32_t>::max()?
+       unsigned MovOpc = IsWave32 ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
arsenm: No & on the result of any BuildMI
+       unsigned ExecOpc = IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
+       // Create initail values of induction variable from Exec, Accumulator and
+       // Branch to ComputeBlock
+       auto &TmpSReg =
+           BuildMI(*BB, I, DL, TII->get(MovOpc), LoopIterator).addReg(ExecOpc);
+       BuildMI(*BB, I, DL, TII->get(AMDGPU::S_MOV_B32), InitalValReg)
+           .addImm(UINT_MAX);
+       BuildMI(*BB, I, DL, TII->get(AMDGPU::S_BRANCH)).addMBB(ComputeLoop);
+       // Start constructing ComputeLoop
+       I = ComputeLoop->end();
+       auto Accumulator =
+           BuildMI(*ComputeLoop, I, DL, TII->get(AMDGPU::PHI), AccumulatorReg)
+               .addReg(InitalValReg)
+               .addMBB(BB);
+       auto ActiveBits =
+           BuildMI(*ComputeLoop, I, DL, TII->get(AMDGPU::PHI), ActiveBitsReg)
+               .addReg(TmpSReg->getOperand(0).getReg())
arsenm: No &
+               .addMBB(BB);
yassingh: ExecReg
arsenm: No &
+       // Perform the computations
+       unsigned SFFOpc =
+           IsWave32 ? AMDGPU::S_FF1_I32_B32 : AMDGPU::S_FF1_I32_B64;
+       auto &FF1 = BuildMI(*ComputeLoop, I, DL, TII->get(SFFOpc), FF1Reg)
+                       .addReg(ActiveBits->getOperand(0).getReg());
+       auto &LaneValue = BuildMI(*ComputeLoop, I, DL,
+                                 TII->get(AMDGPU::V_READLANE_B32), LaneValueReg)
+                             .addReg(SrcReg)
+                             .addReg(FF1->getOperand(0).getReg());
+       auto &NewAccumulator =
+           BuildMI(*ComputeLoop, I, DL, TII->get(AMDGPU::S_MIN_U32),
+                   NewAccumulatorReg)
arsenm: No &
+               .addReg(Accumulator->getOperand(0).getReg())
+               .addReg(LaneValue->getOperand(0).getReg());
+       // Manipulate the iterator to get the next active lane
+       unsigned BITSETOpc =
+           IsWave32 ? AMDGPU::S_BITSET0_B32 : AMDGPU::S_BITSET0_B64;
+       auto &NewActiveBits =
+           BuildMI(*ComputeLoop, I, DL, TII->get(BITSETOpc), NewActiveBitsReg)
+               .addReg(FF1->getOperand(0).getReg())
+               .addReg(ActiveBits->getOperand(0).getReg());
+       // Add phi nodes
+       Accumulator.addReg(NewAccumulator->getOperand(0).getReg())
+           .addMBB(ComputeLoop);
+       ActiveBits.addReg(NewActiveBits->getOperand(0).getReg())
+           .addMBB(ComputeLoop);
+       // Creating branching
+       unsigned CMPOpc = IsWave32 ? AMDGPU::S_CMP_LG_U32 : AMDGPU::S_CMP_LG_U64;
arsenm: Could have just use the original register to begin with?
+       BuildMI(*ComputeLoop, I, DL, TII->get(CMPOpc))
+           .addReg(NewActiveBits->getOperand(0).getReg())
+           .addImm(0);
+       BuildMI(*ComputeLoop, I, DL, TII->get(AMDGPU::S_CBRANCH_SCC1))
+           .addMBB(ComputeLoop);
+       MRI.replaceRegWith(DstReg, NewAccumulator->getOperand(0).getReg());
+       RetBB = ComputeEnd;
+     }
+     MI.eraseFromParent();
+     return RetBB;
+   }
    case AMDGPU::S_UADDO_PSEUDO:
    case AMDGPU::S_USUBO_PSEUDO: {
      const DebugLoc &DL = MI.getDebugLoc();
      MachineOperand &Dest0 = MI.getOperand(0);
      MachineOperand &Dest1 = MI.getOperand(1);
      MachineOperand &Src0 = MI.getOperand(2);
      MachineOperand &Src1 = MI.getOperand(3);

llvm/lib/Target/AMDGPU/SIInstructions.td

  }

  def V_SET_INACTIVE_B64 : VPseudoInstSI <(outs VReg_64:$vdst),
    (ins VSrc_b64: $src, VSrc_b64:$inactive),
    [(set i64:$vdst, (int_amdgcn_set_inactive i64:$src, i64:$inactive))]> {
  }
  } // End Defs = [SCC]

+ let usesCustomInserter = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0, Uses = [EXEC] in {
+   def WAVE_REDUCE_UMIN_PSEUDO : VPseudoInstSI <(outs SGPR_32:$sdst),
+     (ins VSrc_b32: $src),
+     [(set i32:$sdst, (int_amdgcn_wave_reduce_umin i32:$src))]> {
+   }
+ }
+
arsenm: These need _U32/_B32 suffixes
  let usesCustomInserter = 1, Defs = [VCC, EXEC] in {
  def V_ADD_U64_PSEUDO : VPseudoInstSI <
    (outs VReg_64:$vdst), (ins VSrc_b64:$src0, VSrc_b64:$src1),
    [(set VReg_64:$vdst, (DivergentBinFrag<add> i64:$src0, i64:$src1))]
  >;

  def V_SUB_U64_PSEUDO : VPseudoInstSI <
    (outs VReg_64:$vdst), (ins VSrc_b64:$src0, VSrc_b64:$src1),

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=+wavefrontsize32,-wavefrontsize64 < %s | FileCheck %s


				declare i32 @llvm.amdgcn.wave.reduce.umin(i32)
				arsenm: Put the immarg on the declarations
				declare i32 @llvm.amdgcn.workitem.id.x()

				define amdgpu_kernel void @uniform_value(ptr addrspace(1) %out, i32 %in) {
				; CHECK-LABEL: uniform_value:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_clause 0x1
				; CHECK-NEXT: s_load_b32 s2, s[0:1], 0x2c
				; CHECK-NEXT: s_load_b64 s[0:1], s[0:1], 0x24
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
				; CHECK-NEXT: global_store_b32 v0, v1, s[0:1]
				; CHECK-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
				; CHECK-NEXT: s_endpgm
				entry:
				%result = call i32 @llvm.amdgcn.wave.reduce.umin(i32 %in)
				store i32 %result, ptr addrspace(1) %out
				ret void
				}

				define amdgpu_kernel void @const_value(ptr addrspace(1) %out) {
				; CHECK-LABEL: const_value:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_b64 s[0:1], s[0:1], 0x24
				; CHECK-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0x7b
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: global_store_b32 v0, v1, s[0:1]
				; CHECK-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
				; CHECK-NEXT: s_endpgm
				entry:
				%result = call i32 @llvm.amdgcn.wave.reduce.umin(i32 123)
				store i32 %result, ptr addrspace(1) %out
				ret void
				}

				define amdgpu_kernel void @poison_value(ptr addrspace(1) %out, i32 %in) {
				; CHECK-LABEL: poison_value:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_b64 s[0:1], s[0:1], 0x24
				; CHECK-NEXT: v_mov_b32_e32 v0, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: global_store_b32 v0, v0, s[0:1]
				; CHECK-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
				; CHECK-NEXT: s_endpgm
				entry:
				%result = call i32 @llvm.amdgcn.wave.reduce.umin(i32 poison)
				store i32 %result, ptr addrspace(1) %out
				ret void
				}

				define amdgpu_kernel void @divergent_value(ptr addrspace(1) %out, i32 %in) {
				; CHECK-LABEL: divergent_value:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_b64 s[0:1], s[0:1], 0x24
				; CHECK-NEXT: v_mov_b32_e32 v1, 0
				; CHECK-NEXT: s_mov_b32 s3, exec_lo
				; CHECK-NEXT: s_mov_b32 s2, -1
				; CHECK-NEXT: .LBB3_1: ; =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: s_ctz_i32_b32 s4, s3
				; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
				; CHECK-NEXT: v_readlane_b32 s5, v0, s4
				; CHECK-NEXT: s_bitset0_b32 s3, s4
				; CHECK-NEXT: s_min_u32 s2, s2, s5
				; CHECK-NEXT: s_cmp_lg_u32 s3, 0
				; CHECK-NEXT: s_cbranch_scc1 .LBB3_1
				; CHECK-NEXT: ; %bb.2:
				; CHECK-NEXT: v_mov_b32_e32 v0, s2
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: global_store_b32 v1, v0, s[0:1]
				; CHECK-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
				; CHECK-NEXT: s_endpgm
				entry:
				%id.x = call i32 @llvm.amdgcn.workitem.id.x()
				%result = call i32 @llvm.amdgcn.wave.reduce.umin(i32 %id.x)
				store i32 %result, ptr addrspace(1) %out
				ret void
				}

				define amdgpu_kernel void @divergent_cfg(ptr addrspace(1) %out, i32 %in) {
				; CHECK-LABEL: divergent_cfg:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_mov_b32 s2, exec_lo
				; CHECK-NEXT: ; implicit-def: $sgpr3
				; CHECK-NEXT: v_cmpx_lt_u32_e32 15, v0
				; CHECK-NEXT: s_xor_b32 s2, exec_lo, s2
				; CHECK-NEXT: s_cbranch_execz .LBB4_2
				; CHECK-NEXT: ; %bb.1: ; %else
				; CHECK-NEXT: s_load_b32 s3, s[0:1], 0x2c
				; CHECK-NEXT: ; implicit-def: $vgpr0
				; CHECK-NEXT: .LBB4_2: ; %Flow
				; CHECK-NEXT: s_or_saveexec_b32 s2, s2
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: v_mov_b32_e32 v1, s3
				; CHECK-NEXT: s_xor_b32 exec_lo, exec_lo, s2
				; CHECK-NEXT: s_cbranch_execz .LBB4_6
				; CHECK-NEXT: ; %bb.3: ; %if
				; CHECK-NEXT: s_mov_b32 s4, exec_lo
				; CHECK-NEXT: s_mov_b32 s3, -1
				; CHECK-NEXT: .LBB4_4: ; =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: s_ctz_i32_b32 s5, s4
				; CHECK-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
				; CHECK-NEXT: v_readlane_b32 s6, v0, s5
				; CHECK-NEXT: s_bitset0_b32 s4, s5
				; CHECK-NEXT: s_min_u32 s3, s3, s6
				; CHECK-NEXT: s_cmp_lg_u32 s4, 0
				; CHECK-NEXT: s_cbranch_scc1 .LBB4_4
				; CHECK-NEXT: ; %bb.5:
				; CHECK-NEXT: v_mov_b32_e32 v1, s3
				; CHECK-NEXT: .LBB4_6: ; %endif
				; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s2
				; CHECK-NEXT: s_load_b64 s[0:1], s[0:1], 0x24
				; CHECK-NEXT: v_mov_b32_e32 v0, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: global_store_b32 v0, v1, s[0:1]
				; CHECK-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
				; CHECK-NEXT: s_endpgm
				entry:
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%d_cmp = icmp ult i32 %tid, 16
				br i1 %d_cmp, label %if, label %else

				if:
				%0 = call i32 @llvm.amdgcn.wave.reduce.umin(i32 %tid)
				arsenm: Use named values
				br label %endif

				else:
				%1 = call i32 @llvm.amdgcn.wave.reduce.umin(i32 %in)
				br label %endif

				endif:
				%2 = phi i32 [%0, %if], [%1, %else]
				store i32 %2, ptr addrspace(1) %out
				ret void
				}
