This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Detect uniformness of TID / wavefrontsize
ClosedPublic

Authored by rampitec on Aug 23 2022, 3:40 PM.

Details

Summary

A value of 'workitemid / wavefrontsize' or 'workitemid & ~(wavefrontsize - 1)'
is wave uniform.
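As an illustrative sketch (plain Python, not LLVM code), the claim can be checked by modeling a wave as 64 consecutive workitem ids: dividing by the wavefront size, or masking off the low lane bits, gives the same value for every lane of a wave, whereas masking with (wavefrontsize - 1) yields the per-lane id:

```python
# Illustrative sketch, not LLVM code: model waves of 64 consecutive
# workitem ids and check which derived values are wave uniform.
WAVEFRONT_SIZE = 64

uniform_div, uniform_mask = True, True
for start in range(0, 256, WAVEFRONT_SIZE):          # four full waves
    wave = range(start, start + WAVEFRONT_SIZE)
    # tid / wavefrontsize: the same for every lane of the wave.
    uniform_div &= len({tid // WAVEFRONT_SIZE for tid in wave}) == 1
    # tid & ~(wavefrontsize - 1): clears the lane bits, also uniform.
    uniform_mask &= len({tid & ~(WAVEFRONT_SIZE - 1) for tid in wave}) == 1
    # By contrast, tid & (wavefrontsize - 1) is the lane id, i.e. divergent.
    assert len({tid & (WAVEFRONT_SIZE - 1) for tid in wave}) == WAVEFRONT_SIZE

assert uniform_div and uniform_mask
print("TID / wavefrontsize and TID & ~(wavefrontsize - 1) are wave uniform")
```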

Diff Detail

Event Timeline

rampitec created this revision.Aug 23 2022, 3:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 3:40 PM
rampitec requested review of this revision.Aug 23 2022, 3:40 PM
Herald added a subscriber: wdng. · View Herald Transcript

A bit of explanation of why the change to isUniformLoad is needed. The DA is a stateless analysis, and we can attach amdgpu.uniform metadata to the GEP, but when it comes to the SDag the DA cannot help, and the SDag sees the node as divergent. Even if the divergent bit is unset on the node during DAG creation, that hardly helps after DAG combines lose it when killing the original shift node. Moreover, DAG combines create a pattern which is much more difficult to recognize. So we check the MMO metadata instead (much like GlobalISel does; it relies on the MMO metadata completely for that).

rampitec updated this revision to Diff 455308.Aug 24 2022, 11:41 AM

This looks good to me. The patch enables the compiler to generate s_load when the user writes code that divides threadIdx.x by the wavefront size. The only suggestion I have is to add some test cases showing explicitly that the amdgpu.uniform metadata is added via the divergence analysis (by the AnnotateUniformValues pass). The test cases provided rely upon that working correctly, though they show the end result rather than the steps needed to get there.

> This looks good to me. The patch enables the compiler to generate s_load when the user writes code that divides threadIdx.x by the wavefront size. The only suggestion I have is to add some test cases showing explicitly that the amdgpu.uniform metadata is added via the divergence analysis (by the AnnotateUniformValues pass). The test cases provided rely upon that working correctly, though they show the end result rather than the steps needed to get there.

I will update the same testcase to run only the annotate-uniform-values pass and check the metadata.

What if the operation happens in divergent control flow, e.g.

int x = 0;
x = threadIdx.x > 32 ? threadIdx.x / 64 : 0;

will this patch still work?

> What if the operation happens in divergent control flow, e.g.
>
> int x = 0;
> x = threadIdx.x > 32 ? threadIdx.x / 64 : 0;
>
> will this patch still work?

Regardless of the CFG, the value of TID / 64 is always uniform. The value of 'x' here is another value, derived from that uniform one. It is no different from x = cc ? sgpr0 : sgpr1; the LHS and RHS are uniform, but 'x' is not.
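A small Python sketch of the point above (the values are made up): selecting between two wave-uniform values under a divergent condition produces a divergent result, even though both operands are uniform.

```python
# Hypothetical sketch: a wave of 64 lanes selects between two wave-uniform
# values (sgpr0, sgpr1) under a per-lane (divergent) condition.
WAVEFRONT_SIZE = 64
sgpr0, sgpr1 = 7, 9                      # uniform: same value in every lane

lanes = range(WAVEFRONT_SIZE)
cc = [lane > 32 for lane in lanes]       # divergent condition
x = [sgpr0 if c else sgpr1 for c in cc]  # x = cc ? sgpr0 : sgpr1, per lane

# Both operands are uniform, but the selected value differs across lanes.
assert len(set(x)) == 2
print("x is divergent although both select operands are uniform")
```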

rampitec updated this revision to Diff 455974.Aug 26 2022, 11:43 AM

Added test run lines to check amdgpu-annotate-uniform metadata directly.

bcahoon accepted this revision.Aug 26 2022, 2:42 PM

LGTM. Thanks for adding this patch!

This revision is now accepted and ready to land.Aug 26 2022, 2:42 PM
This revision was automatically updated to reflect the committed changes.
bcl5980 added a subscriber: bcl5980.EditedAug 27 2022, 5:10 AM

> What if the operation happens in divergent control flow, e.g.
>
> int x = 0;
> x = threadIdx.x > 32 ? threadIdx.x / 64 : 0;
>
> will this patch still work?

> Regardless of the CFG, the value of TID / 64 is always uniform. The value of 'x' here is another value, derived from that uniform one. It is no different from x = cc ? sgpr0 : sgpr1; the LHS and RHS are uniform, but 'x' is not.

What if blockDim.x is not a multiple of 64, like 65, and blockDim.y is not 1?
For example, if the workgroup shape is <65, 2, 1>:
warp 0 is [0,0] to [63,0]
warp 1 is [64,0] to [62,1]
warp 2 is [63,1] to [64,1]
threadIdx.x / 64 in warps 1 and 2 should still be divergent.
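This corner case can be reproduced with a small Python model (a hypothetical sketch; work items are linearized as x + blockDim.x * y and packed into 64-lane waves):

```python
# Hypothetical model of the <65, 2, 1> workgroup example: linearize work
# items as x + 65 * y and split the flat order into 64-lane waves.
DIM_X, DIM_Y, WAVE = 65, 2, 64

items = [(x, y) for y in range(DIM_Y) for x in range(DIM_X)]   # flat order
waves = [items[i:i + WAVE] for i in range(0, len(items), WAVE)]

# Wave 0 spans (0,0)..(63,0); wave 1 spans (64,0)..(62,1); wave 2 the rest.
assert waves[0][0] == (0, 0) and waves[0][-1] == (63, 0)
assert waves[1][0] == (64, 0) and waves[1][-1] == (62, 1)
assert waves[2][0] == (63, 1) and waves[2][-1] == (64, 1)

# threadIdx.x // 64 is uniform in wave 0 but divergent in waves 1 and 2.
assert len({x // WAVE for x, _ in waves[0]}) == 1
assert len({x // WAVE for x, _ in waves[1]}) > 1
assert len({x // WAVE for x, _ in waves[2]}) > 1
print("threadIdx.x / 64 is divergent in waves 1 and 2")
```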

rampitec added a comment.EditedAug 29 2022, 10:49 AM

> What if blockDim.x is not a multiple of 64, like 65, and blockDim.y is not 1?
> For example, if the workgroup shape is <65, 2, 1>:
> warp 0 is [0,0] to [63,0]
> warp 1 is [64,0] to [62,1]
> warp 2 is [63,1] to [64,1]
> threadIdx.x / 64 in warps 1 and 2 should still be divergent.

Yes, you are right, thanks! It looks like I need to limit this to the case where the amdgpu-no-workitem-id-y attribute is present on the function.

>> What if blockDim.x is not a multiple of 64, like 65, and blockDim.y is not 1?
>> For example, if the workgroup shape is <65, 2, 1>:
>> warp 0 is [0,0] to [63,0]
>> warp 1 is [64,0] to [62,1]
>> warp 2 is [63,1] to [64,1]
>> threadIdx.x / 64 in warps 1 and 2 should still be divergent.
>
> Yes, you are right, thanks! It looks like I need to limit this to the case where the amdgpu-no-workitem-id-y attribute is present on the function.

D132879 limits it.

Does this address the same issue as D124385?

> Does this address the same issue as D124385?

Yes, although in practice the divisor is a power of 2, and in fact the wavefront size (that is how people use it), so there is no SDiv or UDiv; there is a shift, as in this patch.
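For unsigned values this equivalence is exact, which a quick Python check illustrates (the constants here are illustrative): dividing by a power of two is a right shift, and the remainder is a bitwise AND, so only shift and mask patterns need to be matched.

```python
# For unsigned values, division by a power of two equals a right shift,
# and the remainder equals a bitwise AND with (divisor - 1); that is why
# the patch matches a shift rather than SDiv/UDiv nodes.
WAVEFRONT_SIZE = 64
SHIFT = WAVEFRONT_SIZE.bit_length() - 1       # log2(64) == 6

for tid in range(1024):
    assert tid // WAVEFRONT_SIZE == tid >> SHIFT
    assert tid % WAVEFRONT_SIZE == tid & (WAVEFRONT_SIZE - 1)
print("x / 64 == x >> 6 and x % 64 == x & 63 for unsigned x")
```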