This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Fix handling of infinite loops in fragment shaders
ClosedPublic

Authored by cwabbott on Nov 27 2019, 6:21 AM.

Details

Summary

Due to the fact that kill is just a normal intrinsic, even though it's
supposed to terminate the thread, we can end up with provably infinite
loops that are actually supposed to end successfully. The
AMDGPUUnifyDivergentExitNodes pass breaks up these loops, but because
there's no obvious place to make the loop branch to, it just makes it
return immediately, which skips the exports that are supposed to happen
at the end and hangs the GPU if all the threads end up being killed.

While it would be nice if the fact that kill terminates the thread were
modeled in the IR, I think that the structurizer as-is would make a mess if we
did that when the kill is inside control flow. For now, we just add a null
export at the end to make sure that it always exports something, which fixes
the immediate problem without penalizing the more common case. This means that
we sometimes do two "done" exports when only some of the threads enter the
discard loop, but from tests the hardware seems ok with that.

This fixes dEQP-VK.graphicsfuzz.while-inside-switch with radv.

Event Timeline

cwabbott created this revision.Nov 27 2019, 6:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 27 2019, 6:21 AM
cwabbott updated this revision to Diff 231241.Nov 27 2019, 6:38 AM

Fix leftover extraneous change from an earlier version.

I think there are two potential problems with this change (with the caveat that I have a heavy cold so I might not be thinking clearly):

  1. The extra export is for not great for performance as it introduces an unnecessary stall at the end of the shader.
  2. The extra export overwrites the set of active lanes set by a correct export done earlier in the shader.
arsenm added a comment.Dec 2 2019, 7:37 AM

This seems fine to me, although I know nothing about exports, and I would prefer if we could eventually fix the kill intrinsic design

The extra export is for not great for performance as it introduces an unnecessary stall at the end of the shader.

I don't expect this specific case (an otherwise-infinite loop with a discard) to happen often enough with "real" shaders for performance to matter. After all, this went completely unnoticed until it showed up in a CTS test that was created by a fuzzer.

The extra export overwrites the set of active lanes set by a correct export done earlier in the shader.

I think this isn't an issue because the exec mask at the end of the program is going to be the same as the mask when the last "real" export happens. The way kill is implemented means that control flow never reconverges for a killed thread, so it stays dead until the very end. I've done some manual tests with the aforementioned CTS test, and it does seem to be properly discarding the right pixels too. But that's a good question :)

This seems fine to me, although I know nothing about exports, and I would prefer if we could eventually fix the kill intrinsic design

That would be nice. I've thought about it a little, and it's tricky. Your first instinct might be to try to model the actual thread-level semantics, and have this new kill intrinsic do something like "jump here (which is an empty block with just a return) if this thread is killed but some threads are still live, otherwise jump to a block that just does a null export and then returns." So you'd have three possible exits, the normal one, the "I'm killed but some threads are still live" one, and "all threads are killed." However that's problematic for a number of reasons:

  1. You'd have to either worry about handling cases where it doesn't jump to a block containing just a return or add some validation that this is always the case. The former case adds unnecessary complexity, the latter case makes the intrinsic still quite "magic," just in a different way.
  2. The exec mask has to be zero when executing the null export in the all-threads-are-killed case. This sort of breaks the idea that this is "just a normal jump".
  3. radeonsi actually relies on the exec mask not containing killed threads when returning, because the exports actually happen in an epilog which is a separate function whose code is pasted after the main shader. In other words, for threads that are killed, reconvergance actually doesn't conceptually happen until *after* the epilog. So the semantics here are still misleading, and you have to rely on the "optimization" of making it just turn off the bits in the exec mask for correctness.

I think that more doable would be an intrinsic that would do something like "jump to this block that does a null export if all threads are killed, otherwise jump to this block with the killed threads turned off in exec," and then in the frontend you'd make it jump to the next "normal" block in the case where not everything is killed. So now there are only two branch destinations. This is a little less honest, since if a thread is killed but some remain, then the thread that's killed doesn't actually continue on to the next normal block. However it has less problems.

One remaining problem, which is also a problem with the other idea, is that now you somehow have to get a mask of threads that haven't been killed or returned yet, and computing it is going to involve inserting instructions that are largely redundant with the control-flow instructions inserted by SILowerControlFlow, but there's currently no good way to get what we want directly from SILowerControlFlow. With some work you'll probably be able to write optimizations to remove these instructions in the simplest of cases, but that still leaves more complex cases with redundant instructions and no easy way to fix it. And of course, because the final null-export block has to be done with an exec mask of 0, you still can't fit it directly into the control-flow machinery without some special cases.

The extra export is for not great for performance as it introduces an unnecessary stall at the end of the shader.

I don't expect this specific case (an otherwise-infinite loop with a discard) to happen often enough with "real" shaders for performance to matter. After all, this went completely unnoticed until it showed up in a CTS test that was created by a fuzzer.

Thanks for debugging my thoughts here. I missed that this is only triggered in the dummy return block case. I agree this is totally reasonable.

The extra export overwrites the set of active lanes set by a correct export done earlier in the shader.

I think this isn't an issue because the exec mask at the end of the program is going to be the same as the mask when the last "real" export happens. The way kill is implemented means that control flow never reconverges for a killed thread, so it stays dead until the very end. I've done some manual tests with the aforementioned CTS test, and it does seem to be properly discarding the right pixels too. But that's a good question :)

I agree in the general case that should hold; however, if there was a discard after the true export done then it wouldn't. I don't have a good use case for why such a thing would happen -- exports should in the main be the last thing a (PS) shader does. So if we consider kill after export done semantically invalid then this is fine.

I think this isn't an issue because the exec mask at the end of the program is going to be the same as the mask when the last "real" export happens. The way kill is implemented means that control flow never reconverges for a killed thread, so it stays dead until the very end. I've done some manual tests with the aforementioned CTS test, and it does seem to be properly discarding the right pixels too. But that's a good question :)

I agree in the general case that should hold; however, if there was a discard after the true export done then it wouldn't. I don't have a good use case for why such a thing would happen -- exports should in the main be the last thing a (PS) shader does. So if we consider kill after export done semantically invalid then this is fine.

Yeah, we definitely don't ever do such a thing in Mesa, presumably not in AMDVLK/PAL either, which covers all the frontends, and I can't think of a reason for doing that either. So I don't think it's a concern. Should I add a comment to explain this?

I think this isn't an issue because the exec mask at the end of the program is going to be the same as the mask when the last "real" export happens. The way kill is implemented means that control flow never reconverges for a killed thread, so it stays dead until the very end. I've done some manual tests with the aforementioned CTS test, and it does seem to be properly discarding the right pixels too. But that's a good question :)

I agree in the general case that should hold; however, if there was a discard after the true export done then it wouldn't. I don't have a good use case for why such a thing would happen -- exports should in the main be the last thing a (PS) shader does. So if we consider kill after export done semantically invalid then this is fine.

Yeah, we definitely don't ever do such a thing in Mesa, presumably not in AMDVLK/PAL either, which covers all the frontends, and I can't think of a reason for doing that either. So I don't think it's a concern. Should I add a comment to explain this?

A brief comment seems reasonable in case someone runs into it in future.

cwabbott updated this revision to Diff 232593.Dec 6 2019, 9:41 AM

Update comment to explain why this works even when only some threads are killed.

This mostly looks good, except I strongly suspect that all other export intrinsics should have their done bit set to 0 in this case.

If two exports with done bit are executed, I suspect we could enter race conditions where the export allocation is freed after the first export with done, and then another wave gets the same export memory spot and its data could be overwritten by the second, newly introduced, export with done bit. Or enter some hang condition in the hardware. Who knows.

cwabbott updated this revision to Diff 232786.Dec 9 2019, 2:37 AM
  • Fix wrong indentation and missing "immarg" on intrinsic declaration
  • Make sure that we remove the done bit from existing exports, and test it

If two exports with done bit are executed, I suspect we could enter race conditions where the export allocation is freed after the first export with done, and then another wave gets the same export memory spot and its data could be overwritten by the second, newly introduced, export with done bit. Or enter some hang condition in the hardware. Who knows.

Thanks for mentioning this. I wasn't entirely sure if what I was doing is kosher, since the public documentation on exports is a little sparse, even though it worked for the test. I've implemented clearing the done bit like you've suggested... it shouldn't hurt at least.

nhaehnle accepted this revision.Dec 9 2019, 11:33 PM

Thanks, LGTM

This revision is now accepted and ready to land.Dec 9 2019, 11:33 PM

Can someone else please commit this? @nhaehnle? It's been almost a month, excluding Christmas, and my commit access situation still hasn't been resolved. This fix is necessary for radv to pass VK 1.2 conformance, and we'd like it to cherry-pick it to LLVM 10 before it's released.

This revision was automatically updated to reflect the committed changes.
cwabbott reopened this revision.Jan 29 2020, 9:03 AM
This revision is now accepted and ready to land.Jan 29 2020, 9:03 AM
cwabbott closed this revision.Jan 29 2020, 9:08 AM