Mesa still uses a hack where empty inline assembly is used as a kind of
optimization barrier. This exposed a problem where not enough wait states
were inserted, because the hazard recognizer implicitly assumed that each
inline assembly "instruction" has at least one wait state.
Details
Diff Detail
- Build Status
Buildable 9685 Build 9685: arc lint + arc unit
Event Timeline
LGTM. What is the barrier for? That sounds disturbing
lib/Target/AMDGPU/GCNHazardRecognizer.cpp | ||
---|---|---|
229 | If you want to be fancy you could check getInlineAsmLength to approximate the number of instructions, and assume they each have 1 wait state. |
Thanks.
What is the barrier for? That sounds disturbing
I agree. It's a hack for the "ballot" instruction, which takes a per-thread boolean value and returns a bit-mask of that value for each live thread. The Mesa frontend currently translates this to a llvm.amdgcn.icmp.i32 (though it'd be nice to have a dedicated intrinsic, as that would allow better codegen; hasn't been high priority). The issue is code like:
if (if_cond) { threadmask_true = ballot(ballot_cond) do something... } else { threadmask_false = ballot(ballot_cond) do something... }
... which LLVM will happily hoist to
threadmask_all = ballot(ballot_cond) if (if_cond) { do something } else { do something }
which is incorrect, because the following hold:
- threadmask_true & threadmask_false == 0
- threadmask_all = threadmask_true | threadmask_false
In order to convince LLVM not to do that, we're passing the input argument of llvm.amdgcn.icmp.i32 through a no-op inline assembly statement which is said to have side-effects and which pretends to be different between the two branches.
It's ugly all around and we should really fix it, but adding the corresponding semantics to LLVM IR runs against the wall of people who work on compilers for normal machines...
If you want to be fancy you could check getInlineAsmLength to approximate the number of instructions, and assume they each have 1 wait state.