This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Better handle s_wait insertion
ClosedPublic

Authored by axeldavy on Aug 9 2015, 5:54 AM.

Details

Summary

We can wait on either VM, EXP or LGKM.
The waits are independent.

Without this patch, a wait inserted because of one of them
would also wait for all the previous others.
This patch makes s_wait only wait for the ones we need for the next instruction.

Here's an example of subtle perf reduction this patch solves:

This is without the patch:

buffer_load_format_xyzw v[8:11], v0, s[44:47], 0 idxen
buffer_load_format_xyzw v[12:15], v0, s[48:51], 0 idxen
s_load_dwordx4 s[44:47], s[8:9], 0xc
s_waitcnt lgkmcnt(0)
buffer_load_format_xyzw v[16:19], v0, s[52:55], 0 idxen
s_load_dwordx4 s[48:51], s[8:9], 0x10
s_waitcnt vmcnt(1)
buffer_load_format_xyzw v[20:23], v0, s[44:47], 0 idxen

The s_waitcnt vmcnt(1) is useless.
The reason it is added is because the last
buffer_load_format_xyzw needs s[44:47], which was issued
by the first s_load_dwordx4. It waits for all VM
before that call to have finished.

Internally after every instruction, 3 counters (for VM, EXP and LGTM)
are updated after every instruction. For example buffer_load_format_xyzw will
increase the VM counter, and s_load_dwordx4 the LGKM one.

Without the patch, for every defined register,
the current 3 counters are stored, and are used to know
how long to wait when an instruction needs the register.

Because of that, the s[44:47] counter includes that to use the register
you need to wait for the previous buffer_load_format_xyzw.

Instead this patch stores only the counters that matter for the register,
and puts zero for the other ones, since we don't need any wait for them.

Diff Detail

Event Timeline

axeldavy updated this revision to Diff 31615.Aug 9 2015, 5:54 AM
axeldavy retitled this revision from to AMDGPU/SI: Better handle s_wait insertion.
axeldavy updated this object.
axeldavy added a subscriber: llvm-commits.
test/CodeGen/AMDGPU/wait.ll
7–8

I would add a check for the instruction between these two s_waitcnt instructions. We want to make sure some future change doesn't regress us, and cause us to emit two s_waitcnt instructions in a row.

test/CodeGen/AMDGPU/wait2.ll
1 ↗(On Diff #31615)

Is there are reason why this test was added to a new file and not to wait.ll ?

axeldavy added inline comments.Aug 13 2015, 10:15 PM
test/CodeGen/AMDGPU/wait.ll
7–8

OK sure

test/CodeGen/AMDGPU/wait2.ll
1 ↗(On Diff #31615)

yes, the run command is different (and uses the ilpmax scheduler)

test/CodeGen/AMDGPU/wait2.ll
1 ↗(On Diff #31615)

You can have multiple run lines in the same test file, you just need to use a different --check-prefix if the output is different. It is preferred to put similar test cases in the same file if possible.

axeldavy updated this revision to Diff 32245.Aug 16 2015, 11:53 AM

Take comments into account

This revision was automatically updated to reflect the committed changes.