This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Make worst-case assumption about the wait states in inline assembly
ClosedPublic

Authored by nhaehnle on Aug 28 2017, 2:56 AM.

Details

Summary

Mesa still uses a hack where empty inline assembly is used as a kind of
optimization barrier. This exposed a problem where not enough wait states
were inserted, because the hazard recognizer implicitly assumed that each
inline assembly "instruction" has at least one wait state.

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle created this revision.Aug 28 2017, 2:56 AM
arsenm accepted this revision.Aug 29 2017, 10:34 AM

LGTM. What is the barrier for? That sounds disturbing

lib/Target/AMDGPU/GCNHazardRecognizer.cpp
229 ↗(On Diff #112865)

If you want to be fancy you could check getInlineAsmLength to approximate the number of instructions, and assume they each have 1 wait state.

This revision is now accepted and ready to land.Aug 29 2017, 10:34 AM

LGTM.

Thanks.

What is the barrier for? That sounds disturbing

I agree. It's a hack for the "ballot" instruction, which takes a per-thread boolean value and returns a bit-mask of that value for each live thread. The Mesa frontend currently translates this to a llvm.amdgcn.icmp.i32 (though it'd be nice to have a dedicated intrinsic, as that would allow better codegen; hasn't been high priority). The issue is code like:

if (if_cond) {
   threadmask_true = ballot(ballot_cond)
  do something...
} else {
   threadmask_false = ballot(ballot_cond)
   do something...
}

... which LLVM will happily hoist to

threadmask_all = ballot(ballot_cond)
if (if_cond) {
   do something
} else {
   do something
}

which is incorrect, because the following hold:

  1. threadmask_true & threadmask_false == 0
  2. threadmask_all = threadmask_true | threadmask_false

In order to convince LLVM not to do that, we're passing the input argument of llvm.amdgcn.icmp.i32 through a no-op inline assembly statement which is said to have side-effects and which pretends to be different between the two branches.

It's ugly all around and we should really fix it, but adding the corresponding semantics to LLVM IR runs against the wall of people who work on compilers for normal machines...

This revision was automatically updated to reflect the committed changes.