This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Combine DS operations with offsets bigger than byte
ClosedPublic

Authored by rampitec on Apr 12 2017, 3:34 PM.

Details

Summary

In many cases DS operations can be combined even if their offsets do not
fit into the 8-bit encoding. All it takes is adjusting the base address.
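
For illustration, a hypothetical example (registers and offsets invented for this sketch): ds_read2_b32 encodes two 8-bit offsets scaled by the 4-byte element size, so byte offsets 2400 and 2800 do not fit directly, but do once the base address is adjusted:

ds_read_b32 v1, v0 offset:2400       ; 2400/4 = 600 does not fit into 8 bits
ds_read_b32 v2, v0 offset:2800       ; 2800/4 = 700 does not fit either
; after adding 2400 (0x960) to the base, both element offsets fit:
v_add_i32_e32 v3, vcc, 0x960, v0
ds_read2_b32 v[1:2], v3 offset1:100  ; offset0:0, offset1:(2800-2400)/4 = 100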

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision. Apr 12 2017, 3:34 PM
vpykhtin edited edge metadata. Apr 13 2017, 10:25 AM

Looks good.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
78 (On Diff #95040)

Why 3 offsets?

rampitec marked an inline comment as done. Apr 13 2017, 10:42 AM
rampitec added inline comments.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
78 (On Diff #95040)

Two offsets that will be encoded into the LDS instruction, plus a base offset, which needs to be added to the pointer with v_add_i32 if it is non-zero.
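
A hypothetical sketch of how the three values map onto the emitted code (registers invented):

v_add_i32_e32 v1, vcc, 0x960, v0               ; BaseOff = 0x960, emitted only because it is non-zero
ds_read2_b32 v[2:3], v1 offset0:0 offset1:100  ; Offset0 and Offset1 go into the encoding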

vpykhtin added inline comments. Apr 13 2017, 10:44 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
78 (On Diff #95040)

But one of them would become zero in the instruction?

rampitec marked an inline comment as done. Apr 13 2017, 10:47 AM
rampitec added inline comments.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
78 (On Diff #95040)

For now, yes: either Offset0 or BaseOff is zero.

In the longer term that is not necessarily so. Imagine adding less to the base pointer, using both offsets in the encoding, and then reusing the new base register for another pair of accesses for which this code would otherwise generate a separate v_add_i32.
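
A hypothetical sketch of that longer-term scheme (offsets invented): add less than the full base offset, keep a non-zero Offset0, and let a neighboring pair reuse the adjusted base:

v_add_i32_e32 v1, vcc, 0x800, v0               ; shared, smaller base adjustment
ds_read2_b32 v[2:3], v1 offset0:16 offset1:116 ; first pair: both offsets encoded
ds_read2_b32 v[4:5], v1 offset0:32 offset1:132 ; second pair reuses v1, no extra v_add_i32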

vpykhtin accepted this revision. Apr 13 2017, 10:48 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
78 (On Diff #95040)

Right.

This revision is now accepted and ready to land. Apr 13 2017, 10:48 AM
This revision was automatically updated to reflect the committed changes.
arsenm added inline comments. Apr 13 2017, 11:11 AM
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
387–389

This should use the e64 version with an unused carry. We should add a helper to TII to emit this, since it will change with GFX9.
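
Roughly, the suggested form would look like this (a sketch only; the SGPRs are invented):

s_movk_i32 s2, 0x960               ; VOP3 encodings take no literal, so materialize it first
v_add_i32_e64 v1, s[0:1], s2, v0   ; carry-out goes to s[0:1], which is simply left unused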

392

Put the BuildMI on the next line and shift the .adds to the left.

397–398

.setMemRefs

459–460

Same as with the other add

llvm/trunk/test/CodeGen/AMDGPU/ds-combine-large-stride.ll
1

-march is redundant with the triple. This should also use the GCN check prefix, since the check lines will need to change with GFX9 (and a GFX9 run line should be added so this breaks when the add change is made).

370

These all need to be marked amdgpu_kernel.

rampitec marked 5 inline comments as done. Apr 13 2017, 1:52 PM

Sure, will fix shortly.

Sure, will fix shortly.

Oh, it is already fixed. Thanks!

rampitec marked an inline comment as done. Apr 13 2017, 2:41 PM
rampitec added inline comments.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
387–389

Matt, I doubt we should use the e64 version here. It does not accept an immediate, which would effectively require one more SGPR and one more mov. Thrashing vcc seems to be the lesser issue.

rampitec added inline comments. Apr 13 2017, 2:52 PM
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
387–389

Even the new no-carry variant is VOP3, so it has the same issue.

arsenm added inline comments. Apr 13 2017, 3:37 PM
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
387–389

You can materialize the constant in a register. It will be folded and shrunk later.

rampitec added inline comments. Apr 13 2017, 3:54 PM
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
387–389

It is not folded though:

s_movk_i32 s0, 0x960
v_add_i32_e32 v0, vcc, s0, v4
ds_read2_b32 v[0:1], v0 offset1:100

Also note that the values in this case will mostly be unique, because they are added to the base pointer, not to a previously incremented pointer. In the latter case there would be a high probability that many adds share the same literal.

I would lean towards using the e64 version and an SGPR once the pass is redesigned to first collect all pairs, build a chain of adds, and only then combine. In that case we would use a single SGPR.
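
A hypothetical sketch of that redesign (values invented): materialize the stride once, then chain the adds so every pair shares the single SGPR:

s_movk_i32 s2, 0x800                 ; the common stride lives in one SGPR
v_add_i32_e64 v1, s[0:1], s2, v0     ; base for the first pair, carry unused
ds_read2_b32 v[2:3], v1 offset1:100
v_add_i32_e64 v4, s[0:1], s2, v1     ; chained: next base derived from the previous one
ds_read2_b32 v[5:6], v4 offset1:100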