This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Insert waitcnt after returning from call
ClosedPublic

Authored by Flakebi on Sep 15 2020, 1:15 AM.

Details

Summary

When memory operations are outstanding across a function call, either the
caller or the callee can insert a waitcnt to ensure that all reads have
finished.
Calls take some time to execute, so if the callee inserts the waitcnt,
filling the instruction buffer and waiting for memory are interleaved,
hiding some of the latency. This comes at the cost of a waitcnt inside
functions where it may not be needed, because no memory operations are
outstanding.

For function calls, this is already implemented: the callee waits on entry,
so the caller does not have to wait before the call. The same principle
applies to returns: if the caller inserts a waitcnt after the call, the
callee does not have to wait before returning, and the return and the
memory operations can run in parallel.

This commit implements waiting in the caller after returning from a
function call.
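
For illustration, a minimal sketch of how this choice can be encoded in SIInsertWaitcnts.cpp, using the two helper hooks discussed later in this review (the committed diff wires this up differently in detail):

    #include "llvm/CodeGen/MachineInstr.h"
    using namespace llvm;

    // The callee already waits on function entry before its first memory
    // access, so the caller does not need to wait before the call (this is
    // the behavior that was already implemented).
    static bool callWaitsOnFunctionEntry(const MachineInstr &MI) {
      return true;
    }

    // With this change the callee no longer waits before returning; instead
    // the caller inserts an s_waitcnt after the call, so the return and the
    // outstanding memory operations can overlap.
    static bool callWaitsOnFunctionReturn(const MachineInstr &MI) {
      return false;
    }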

Diff Detail

Event Timeline

Flakebi created this revision.Sep 15 2020, 1:15 AM
Herald added a project: Restricted Project. Sep 15 2020, 1:15 AM
Flakebi requested review of this revision.Sep 15 2020, 1:15 AM
arsenm added inline comments.Sep 15 2020, 7:21 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
858

In principle this depends on the calling convention, so it should retain the argument

1216

Return early and reduce indentation

1226

Why drop the debug loc?

Flakebi updated this revision to Diff 291945.Sep 15 2020, 8:29 AM

Use DebugLoc from call for waitcnt and return early.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
858

I guess callWaitsOnFunctionEntry should be similar, but we do not have an instruction when it is called in runOnMachineFunction.
When callWaitsOnFunctionEntry or callWaitsOnFunctionReturn is called for a callee, we can pass the calling convention directly.
Can we somehow get the calling convention from a call instruction?

madhur13490 added inline comments.Sep 15 2020, 8:44 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
893

please fuse these two "if" conditions. They are already unreadable.

1220

nit: "Don't insert waitcnt if this function returns *immediately*"

arsenm added inline comments.Sep 15 2020, 8:48 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
858

I think you have to get it from the call target global (which may be null). We would probably need to track this in the call somehow
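
For illustration, a hypothetical sketch of that suggestion; the helper name and the fallback for indirect calls are assumptions, not part of the patch:

    #include "llvm/CodeGen/MachineInstr.h"
    #include "llvm/IR/Function.h"
    using namespace llvm;

    // Look for the call target global among the call's operands and read its
    // calling convention. Indirect calls have no global operand, so fall back
    // to the default convention.
    static CallingConv::ID getCalleeCallingConv(const MachineInstr &Call) {
      for (const MachineOperand &Op : Call.operands())
        if (Op.isGlobal())
          if (const auto *Callee = dyn_cast<Function>(Op.getGlobal()))
            return Callee->getCallingConv();
      return CallingConv::C;
    }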

Flakebi updated this revision to Diff 291952.Sep 15 2020, 9:03 AM

Improve comment as suggested

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
893

Fusing these conditions would change the behavior, as the else branch would then be executed (and insert waitcnts), so I can’t :)

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
893

Ah! I didn't notice the else part. I think the conditions in the first "if" (and also in the "else if") could be wrapped in a function for better readability, but I'll leave it to you whether you want to clean this up :)

Flakebi updated this revision to Diff 292132.Sep 16 2020, 1:12 AM

Move the is-gs-done check into its own function.

Flakebi added inline comments.Sep 16 2020, 1:14 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
858

Should I do this as part of this patch, or leave it for later, when a calling convention wants different behavior?

arsenm added inline comments.Sep 16 2020, 7:28 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
858

It's a separate change. We should add a FIXME that this is a function of the calling convention.

madhur13490 accepted this revision.Sep 16 2020, 7:29 AM
This revision is now accepted and ready to land.Sep 16 2020, 7:29 AM
Flakebi updated this revision to Diff 292216.Sep 16 2020, 7:39 AM

Add a FIXME that waiting on function entry/return should depend on the calling convention.

This revision was automatically updated to reflect the committed changes.

This completely broke the Mesa radeonsi driver on Navi 14. Xorg + xterm come up with major corruption & psychedelic colours.

Thanks for the heads-up, I reverted it for now.

mareko added a subscriber: mareko.Sep 23 2020, 9:40 AM

Yes, this commit is incorrect. It completely breaks code linking in Mesa OpenGL. s_waitcnt is required at the end of all global functions that return values.

Please revert. @nhaehnle

I don't understand why it would fail. This patch just moves the s_waitcnt to the caller, so it would be executed anyway. I think I am missing something. It would help to root-cause this if we can isolate it to a small test case.

Shader returns aren't real returns and the "caller" doesn't wait

I see. So how should this be implemented? Maybe we should conditionalize this patch just for compute?

mareko added a comment (edited). Sep 23 2020, 9:57 AM

That's right. There are no callers. Functions that return SGPRs and VGPRs don't jump anywhere. Instead, a return means that another function will be appended at the shader-binary level after the last instruction, consuming those SGPRs and VGPRs. The other function doesn't know whether it was appended at the end of another shader binary, or whether it is the real beginning of the shader and the SGPRs and VGPRs were initialized by the register-init hardware.

How waitcnts are handled is a function of the calling convention, so that's what should be tested here. Calls to and returns from functions with a default calling convention can be handled in the new way, while the other cases need to be left unchanged for now. The mistake was to believe that this could be deferred as a FIXME.
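
As a hedged sketch of such a test (the helper name and the exact set of conventions are assumptions), only ordinary function calling conventions would get the new caller-side wait:

    #include "llvm/IR/CallingConv.h"
    using namespace llvm;

    // Only default (non-shader) calling conventions let the caller take over
    // the wait; shader "returns" fall through into code appended at link time
    // and must keep their own s_waitcnt.
    static bool callerWaitsAfterCall(CallingConv::ID CC) {
      switch (CC) {
      case CallingConv::C:
      case CallingConv::Fast:
        return true;
      default:
        return false;
      }
    }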

This could also just check the specific return opcodes (we could maybe even remove the isReturn from the shader pseudo-return)
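
A sketch of that alternative, assuming the backend's existing pseudo-return opcode names (whether this is the complete set of shader pseudo-returns is an assumption):

    #include "SIInstrInfo.h"
    using namespace llvm;

    // SI_RETURN_TO_EPILOG is the shader pseudo-return that merely falls
    // through into the next shader part, so it must keep its own wait; only
    // other returns are treated as real function returns here.
    static bool isRealFunctionReturn(const MachineInstr &MI) {
      return MI.isReturn() && MI.getOpcode() != AMDGPU::SI_RETURN_TO_EPILOG;
    }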

The revert also highlights the problem that we don't have a representative test in llvm/test :(

Don't we? Isn't the problem rather that a lit test alone isn't really authoritative on whether the generated code is correct? There are cases where it's easy to convince yourself that a change in codegen is correct even though it isn't, if you're only staring at the assembly.

To me, the revert rather highlights that there still doesn't seem to be even smoke-testing with Mesa before merging LLVM AMDGPU backend changes, let alone automated regression testing.

Using the new return behavior only for the calling conventions that Mesa doesn't use sounds like the right solution.