Page MenuHomePhabricator

AMDGPU/InsertWaitcnts: Untangle some semi-global state
ClosedPublic

Authored by nhaehnle on Nov 7 2018, 2:15 PM.

Details

Summary

Reduce the statefulness of the algorithm in two ways:

  1. More clearly split generateWaitcntInstBefore into two phases: the first one which determines the required wait, if any, without changing the ScoreBrackets, and the second one which actually inserts the wait and updates the brackets.
  1. Communicate pre-existing s_waitcnt instructions using an argument to generateWaitcntInstBefore instead of through the ScoreBrackets.

To simplify these changes, a Waitcnt structure is introduced which carries
the counts of an s_waitcnt instruction in decoded form.

There are some functional changes:

  1. The FIXME for the VCCZ bug workaround was implemented: we only wait for SMEM instructions as required instead of waiting on all counters.
  1. There are some cases where we previously merged some waitcnt instructions together non-locally due to the somewhat odd OldWaitcnt tracking, e.g. we would produce code like this:

    ds_read_b32 v0, ... ds_read_b32 v1, ... s_waitcnt lgkmcnt(0) <-- this is a merged wait for both uses use(v0) more code use(v1)

    In these cases we will now always first emit a wait for lgkmcnt(1), and then later for lgkmcnt(0). This should basically always be a win, although theoretically there could be cases where it's very slightly worse due to the increased code size. The worst code size regressions in my shader-db are:

    WORST REGRESSIONS - Code Size Before After Delta Percentage 1724 1736 12 0.70 % shaders/private/f1-2015/1334.shader_test [0] 2276 2284 8 0.35 % shaders/private/f1-2015/1306.shader_test [0] 4632 4640 8 0.17 % shaders/private/ue4_elemental/62.shader_test [0] 2376 2384 8 0.34 % shaders/private/f1-2015/1308.shader_test [0] 3284 3292 8 0.24 % shaders/private/talos_principle/1955.shader_test [0]

    ... so I'm not particularly worried about the rather theoretical downside.

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle created this revision.Nov 7 2018, 2:15 PM
arsenm added a comment.Nov 7 2018, 2:27 PM

Testcase for the 2nd change?

nhaehnle updated this revision to Diff 173327.Nov 9 2018, 7:14 AM

Turns out I was a bit too quick in my analysis of the second point.
I thought the overly conservative waitcnt was due to the control flow
in the shader I was looking at, but it was actually due to a pre-existing
waitcnt.

Note, the new waitcnt-preexisting.mir test shows this change.

nhaehnle updated this revision to Diff 175040.Nov 22 2018, 5:16 AM
  • Early-out in generateWaitcntInstBefore when no wait is needed. This helps keep the nesting complexity in check for later changes.
This revision is now accepted and ready to land.Nov 22 2018, 9:38 AM
This revision was automatically updated to reflect the committed changes.