This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix missing waitcnt issue
ClosedPublic

Authored by piotr on Jan 18 2022, 12:19 AM.

Details

Summary

Ignore out of order counters when merging brackets. The fact that
there was a pending event in the old state does not guarantee that
the waitcnt was generated, so we still need to conservatively re-process
the block.

The patch fixes a correctness issue where the block was not re-processed
and the waitcnt not inserted in consequence.

Diff Detail

Event Timeline

piotr created this revision.Jan 18 2022, 12:19 AM
piotr requested review of this revision.Jan 18 2022, 12:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 18 2022, 12:19 AM
foad added inline comments.Jan 18 2022, 12:27 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428

As an experiment, would changing the condition to RegStrictDom && !OldOutOfOrder && MyPending != 0 also fix the bug?

foad added inline comments.Jan 18 2022, 12:29 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428

As an experiment, would changing the condition to RegStrictDom && !OldOutOfOrder && MyPending != 0 also fix the bug?

Sorry I meant RegStrictDom && !(OldOutOfOrder && MyPending != 0).

piotr added inline comments.Jan 18 2022, 1:10 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428

As an experiment, would changing the condition to RegStrictDom && !OldOutOfOrder && MyPending != 0 also fix the bug?

Sorry I meant RegStrictDom && !(OldOutOfOrder && MyPending != 0).

Do you mean this:

if (RegStrictDom && !(OldOutOfOrder && MyPending == 0))

I think that would also fix the bug in the test, but I wasn't convinced that was a complete fix, so I went for a safer and simpler approach.

foad added a comment.Jan 18 2022, 1:15 AM

As a bit of background, the OldOutOfOrder test was introduced by @nhaehnle in a large refactoring in D54231. I think it's just a performance optimisation: if the old state had events that completed out of order, then any use of the corresponding registers would have to be preceded by a "waitcnt 0", so there's no need to reprocess that block because the waitcnts are already as strict as they can be. But this goes wrong in the case where the merge introduces a wait on a particular register that had no wait in the old state (so no waitcnt would have been generated for it the last time the block was processed).

foad added a comment.Jan 18 2022, 1:16 AM

Did you notice any compile-time degradation from your fix?

piotr added a comment.Jan 18 2022, 1:27 AM

In our usual compile-time tests it shows 0.056% degradation on average, worst case 0.9%.

foad accepted this revision.Jan 18 2022, 1:44 AM

In our usual compile-time tests it shows 0.056% degradation on average, worst case 0.9%.

In that case I agree we should go with your fix as it is the simplest.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428

Do you mean this:

I'm not sure what I meant; sorry for the noise. Now I think it *might* be possible to salvage the original optimisation by remembering whether any individual register has a wait in the merged state, when it had no wait in the original state. But I don't think it's worth the complexity.

This revision is now accepted and ready to land.Jan 18 2022, 1:44 AM
piotr updated this revision to Diff 400815.Jan 18 2022, 4:53 AM

Remove references to "waw" as the issue can also trigger in other scenarios, as Jay pointed out to me (thanks).

foad accepted this revision.Jan 18 2022, 5:23 AM
This revision was landed with ongoing or failed builds.Jan 19 2022, 1:55 AM
This revision was automatically updated to reflect the committed changes.