This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/GlobalISel: Fix move s.buffer.load to VALU
ClosedPublic

Authored by arsenm on Feb 3 2020, 5:10 AM.

Details

Reviewers
nhaehnle
kerbowa
Summary

We were executing this in a waterfall loop as a placeholder, but this
should really be converted to a MUBUF load. Also execute in a
waterfall loop if the resource isn't an SGPR. This is a case where the
DAG handling was wrong because doing the right thing was too hard.

Currently, this will mishandle 96-bit loads. There's currently no way
to track the original memory size with an MMO, so these loads will be
widened andd the resulting memory size will be 128-bits.

Diff Detail

Event Timeline

arsenm created this revision.Feb 3 2020, 5:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 3 2020, 5:10 AM

Thank you for doing this. A couple of concerns...

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1375–1377

What's the justification for the alignment in the NumLoads > 1 case?

1437–1448

This case should waterfall the original load if only the srsrc is a VGPR....

llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir
79

This should still be an s_buffer_load...

arsenm marked 2 inline comments as done.Feb 5 2020, 8:44 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1375–1377

This ensures the last load's offset will fit in the immediate field

llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir
79

How can this be a scalar result with a vector resource?

arsenm marked an inline comment as done.Feb 5 2020, 12:50 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1375–1377

Really the align 4 in the 1 case isn't necessary

nhaehnle accepted this revision.Feb 7 2020, 1:42 AM

One more comment, but apart from that LGTM.

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
1375–1377

Okay, that makes sense. Thanks.

llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir
79

It's not a vector resource anymore after waterfalling...

Okay, so admittedly it's not entirely clear which way ultimately ends up with better code in practice. One case is a loop with

buffer_load_dwordx4 v[0:4], ...

in it. The other case is a loop with:

s_buffer_load_dwordx4 s[0:4], ...
s_waitcnt lgkmcnt(0)
v_mov_b32 v0, s0
v_mov_b32 v1, s1
v_mov_b32 v2, s2
v_mov_b32 v3, s3

Okay, so the second one would likely be better if it has better K$ behavior, e.g. if the resource is often dynamically uniform in practice. This is hard to predict... but it should be a conscious decision. It's not a high priority for graphics at the moment, but at the very least, please take a note of it in a comment the relevant part of the code.

This revision is now accepted and ready to land.Feb 7 2020, 1:42 AM