This is an archive of the discontinued LLVM Phabricator instance.

Differential D37205

AMDGPU: Make worst-case assumption about the wait states in inline assembly
Closed, Public

Authored by nhaehnle on Aug 28 2017, 2:56 AM.

Details

Reviewers

arsenm

Commits

rG523827145b55: AMDGPU: Make worst-case assumption about the wait states in inline assembly
rL312635: AMDGPU: Make worst-case assumption about the wait states in inline assembly

Summary

Mesa still uses a hack where empty inline assembly is used as a kind of
optimization barrier. This exposed a problem where not enough wait states
were inserted, because the hazard recognizer implicitly assumed that each
inline assembly "instruction" has at least one wait state.
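
To make the problem concrete, here is a minimal, self-contained model of the wait-state counting. It is a sketch, not the actual GCNHazardRecognizer code: the Op enum, waitStatesSince, kEmitted and the asmCountsAsWaitState flag are hypothetical names introduced for illustration, and the instruction list is assumed to be ordered newest first.

```cpp
#include <array>
#include <cstddef>
#include <limits>

// Hypothetical stand-ins for machine opcodes; not LLVM's real enum.
enum class Op { SMovM0, InlineAsm, ImplicitDef, VInterp };

// Simplified model: walk the already-emitted instructions (assumed newest
// first) and count wait states until the hazardous instruction is found.
// asmCountsAsWaitState == true models the old implicit assumption;
// false models the worst-case assumption (inline asm may expand to nothing).
template <std::size_t N>
constexpr int waitStatesSince(const std::array<Op, N> &newestFirst, Op hazard,
                              bool asmCountsAsWaitState) {
  int waitStates = 0;
  for (std::size_t i = 0; i < N; ++i) {
    const Op op = newestFirst[i];
    if (op == hazard)
      return waitStates;
    if (op == Op::ImplicitDef)
      continue; // emits no machine code, so it provides no wait state
    if (op == Op::InlineAsm && !asmCountsAsWaitState)
      continue; // worst case: the asm string is empty or only a comment
    ++waitStates;
  }
  return std::numeric_limits<int>::max(); // hazard not found in the window
}

// Instructions preceding V_INTERP_P1_F32 in the hazard_inlineasm test:
// an empty inline asm, and before it the S_MOV_B32 that writes m0.
constexpr std::array<Op, 2> kEmitted = {{Op::InlineAsm, Op::SMovM0}};
```

In this model, waitStatesSince(kEmitted, Op::SMovM0, true) reports one wait state since the m0 write, so no padding would be requested even though the inline asm emitted nothing. With the flag set to false the count is zero, which leaves it to the recognizer to insert the S_NOP that the GFX9 check lines expect.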

Diff Detail

Repository: rL LLVM

Event Timeline

nhaehnle created this revision. Aug 28 2017, 2:56 AM

Herald added subscribers: t-tye, tpr, dstuttard and 3 others. Aug 28 2017, 2:56 AM

arsenm: LGTM. What is the barrier for? That sounds disturbing.

lib/Target/AMDGPU/GCNHazardRecognizer.cpp, line 229 (on Diff #112865): If you want to be fancy, you could check getInlineAsmLength to approximate the number of instructions, and assume they each have 1 wait state.
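
One way to read this suggestion, sketched under assumptions: estimate the instruction count from the assembly text itself. The helper below is hypothetical and is not LLVM's getInlineAsmLength; it simply counts lines that contain something other than whitespace or a comment, assuming ';' starts a comment (as in the "; no-op" string used by the test in this revision).

```cpp
// Hypothetical helper, not an LLVM API: estimate how many instructions an
// inline-assembly string contains. A line counts as one instruction if it
// has any non-blank character before a ';' comment (assumed comment marker).
constexpr int countAsmInstructions(const char *text) {
  int count = 0;
  bool lineHasInstr = false; // saw non-blank, non-comment text on this line
  bool inComment = false;    // saw ';' on this line
  for (; *text != '\0'; ++text) {
    const char c = *text;
    if (c == '\n') {
      if (lineHasInstr)
        ++count;
      lineHasInstr = false;
      inComment = false;
    } else if (c == ';') {
      inComment = true;
    } else if (!inComment && c != ' ' && c != '\t') {
      lineHasInstr = true;
    }
  }
  if (lineHasInstr)
    ++count; // last line without a trailing newline
  return count;
}
```

For the comment-only string "; no-op" this estimate is zero, which agrees with the worst-case assumption the revision makes. Any nonzero estimate would only be safe if every counted line really produced at least one instruction.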

This revision is now accepted and ready to land. Aug 29 2017, 10:34 AM

In D37205#855583, @arsenm wrote:

> LGTM.

Thanks.

> What is the barrier for? That sounds disturbing

I agree. It's a hack for the "ballot" instruction, which takes a per-thread boolean value and returns a bitmask holding that value for each live thread. The Mesa frontend currently translates this to an llvm.amdgcn.icmp.i32 (though it'd be nice to have a dedicated intrinsic, as that would allow better codegen; it hasn't been a high priority). The issue is code like:

if (if_cond) {
   threadmask_true = ballot(ballot_cond)
   do something...
} else {
   threadmask_false = ballot(ballot_cond)
   do something...
}

... which LLVM will happily hoist to

threadmask_all = ballot(ballot_cond)
if (if_cond) {
   do something
} else {
   do something
}

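which is incorrect, because each branch now sees the mask for all live threads instead of only the threads that took that branch. The following hold:

(threadmask_true & threadmask_false) == 0
threadmask_all == (threadmask_true | threadmask_false)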
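
The two identities can be checked with a small model of ballot. This is an illustrative sketch only: the ballot function below is a hypothetical stand-in that works on plain bitmasks (bit i describes thread i), not the llvm.amdgcn.icmp.i32 intrinsic, and the 8-thread example values are made up.

```cpp
#include <cstdint>

// Hypothetical model: ballot returns the per-thread condition restricted
// to the threads that are live at the point of the call.
constexpr std::uint64_t ballot(std::uint64_t liveThreads, std::uint64_t cond) {
  return liveThreads & cond;
}

// Made-up example with 8 threads.
constexpr std::uint64_t kAllThreads = 0xFF; // threads live before the branch
constexpr std::uint64_t kIfCond = 0x0F;     // threads 0-3 take the "if" side
constexpr std::uint64_t kBallotCond = 0x3C; // threads 2-5 have ballot_cond set

// Inside each branch only the threads that took that branch are live.
constexpr std::uint64_t kThreadmaskTrue =
    ballot(kAllThreads & kIfCond, kBallotCond);
constexpr std::uint64_t kThreadmaskFalse =
    ballot(kAllThreads & ~kIfCond, kBallotCond);
// After hoisting, the single ballot runs with all threads live.
constexpr std::uint64_t kThreadmaskAll = ballot(kAllThreads, kBallotCond);

static_assert((kThreadmaskTrue & kThreadmaskFalse) == 0,
              "per-branch masks are disjoint");
static_assert(kThreadmaskAll == (kThreadmaskTrue | kThreadmaskFalse),
              "hoisted mask is the union of the per-branch masks");
static_assert(kThreadmaskAll != kThreadmaskTrue,
              "so the hoisted result differs from what the branch expects");
```

In this example the "if" branch expects 0x0C but would observe 0x3C after hoisting, which is why the result of ballot depends on where in the control flow it executes.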

In order to convince LLVM not to hoist the ballot, we're passing the input argument of llvm.amdgcn.icmp.i32 through a no-op inline assembly statement which is marked as having side effects and which pretends to be different between the two branches.

It's ugly all around and we should really fix it, but adding the corresponding semantics to LLVM IR runs against the wall of people who work on compilers for normal machines...

Closed by commit rL312635: AMDGPU: Make worst-case assumption about the wait states in inline assembly (authored by nha). Sep 6 2017, 6:51 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                  Size
llvm/trunk/lib/Target/AMDGPU/GCNHazardRecognizer.cpp  3 lines
llvm/trunk/test/CodeGen/AMDGPU/hazard.mir             29 lines

Diff 114003

llvm/trunk/lib/Target/AMDGPU/GCNHazardRecognizer.cpp

[219 preceding lines not shown]
 int GCNHazardRecognizer::getWaitStatesSince(
     function_ref<bool(MachineInstr *)> IsHazard) {
   int WaitStates = 0;
   for (MachineInstr *MI : EmittedInstrs) {
     if (MI) {
       if (IsHazard(MI))
         return WaitStates;

       unsigned Opcode = MI->getOpcode();
-      if (Opcode == AMDGPU::DBG_VALUE || Opcode == AMDGPU::IMPLICIT_DEF)
+      if (Opcode == AMDGPU::DBG_VALUE || Opcode == AMDGPU::IMPLICIT_DEF ||
+          Opcode == AMDGPU::INLINEASM)
         continue;
     }
     ++WaitStates;
   }
   return std::numeric_limits<int>::max();
 }

 int GCNHazardRecognizer::getWaitStatesSinceDef(
[351 following lines not shown]

llvm/trunk/test/CodeGen/AMDGPU/hazard.mir

 # RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs -run-pass post-RA-hazard-rec %s -o - | FileCheck -check-prefix=GCN -check-prefix=VI %s
 # RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -run-pass post-RA-hazard-rec %s -o - | FileCheck -check-prefix=GCN -check-prefix=GFX9 %s

+# GCN-LABEL: name: hazard_implicit_def
 # GCN: bb.0.entry:
 # GCN: %m0 = S_MOV_B32
 # GFX9: S_NOP 0
 # VI-NOT: S_NOP_0
 # GCN: V_INTERP_P1_F32

 ---
 name: hazard_implicit_def
[12 lines not shown]
   bb.0.entry:
     liveins: %sgpr7, %vgpr4

     %m0 = S_MOV_B32 killed %sgpr7
     %vgpr5 = IMPLICIT_DEF
     %vgpr0 = V_INTERP_P1_F32 killed %vgpr4, 0, 0, implicit %m0, implicit %exec
     SI_RETURN_TO_EPILOG killed %vgpr5, killed %vgpr0

 ...
+
+# GCN-LABEL: name: hazard_inlineasm
+# GCN: bb.0.entry:
+# GCN: %m0 = S_MOV_B32
+# GFX9: S_NOP 0
+# VI-NOT: S_NOP_0
+# GCN: V_INTERP_P1_F32
+---
+name: hazard_inlineasm
+alignment: 0
+exposesReturnsTwice: false
+legalized: false
+regBankSelected: false
+selected: false
+tracksRegLiveness: true
+registers:
+liveins:
+  - { reg: '%sgpr7', virtual-reg: '' }
+  - { reg: '%vgpr4', virtual-reg: '' }
+body: |
+  bb.0.entry:
+    liveins: %sgpr7, %vgpr4
+
+    %m0 = S_MOV_B32 killed %sgpr7
+    INLINEASM $"; no-op", 1, 327690, def %vgpr5
+    %vgpr0 = V_INTERP_P1_F32 killed %vgpr4, 0, 0, implicit %m0, implicit %exec
+    SI_RETURN_TO_EPILOG killed %vgpr5, killed %vgpr0
+...