In many cases DS operations can be combined even if their offsets do not fit into the 8-bit encoding. All it takes is adjusting the base address.
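
For illustration (these numbers come from the ds_read2 example discussed later in this review, not from the patch text): ds_read2_b32 encodes its two offsets as 8-bit counts of 4-byte elements, so it can only reach 255 * 4 = 1020 bytes past the base pointer. Two loads at base+2400 and base+2800 therefore cannot be paired directly, but after a single v_add_i32 of 0x960 (2400) to the pointer they combine into one ds_read2_b32 with offset0:0 and offset1:100.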
Diff Detail

- Repository: rL LLVM

Event Timeline
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

Why 3 offsets?

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

Two offsets, as they will be encoded into an LDS instruction, and then a base offset which needs to be added to the pointer with v_add_i32 if it is non-zero.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

But one of them would become zero in the instruction?
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

For now, yes: either Offset0 or BaseOff is zero. In the longer term that is not necessarily so. Imagine you could add less to the base pointer, use both offsets in the encoding, and then reuse the new base register for another pair, where this code would otherwise generate a separate v_add_i32.
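
To make the decomposition concrete, here is a minimal, self-contained sketch of the arithmetic being described. It is an illustration only, not the code under review; the function and variable names (decomposeOffsets, EltSize, EncOff0, and so on) are made up for this example, and the actual pass may choose the base offset differently.

```cpp
#include <cstdint>

// Illustration only. Split two byte offsets from a common base pointer into:
//   BaseOff          - amount to add to the pointer with a separate VALU add
//   EncOff0, EncOff1 - 8-bit offsets, in units of EltSize, for the merged
//                      ds_read2/ds_write2 instruction
// Returns false if the pair still cannot be encoded.
static bool decomposeOffsets(uint32_t Off0, uint32_t Off1, uint32_t EltSize,
                             uint32_t &BaseOff, uint8_t &EncOff0,
                             uint8_t &EncOff1) {
  // The encoded offsets are scaled by the element size, so both byte
  // offsets must be multiples of it.
  if (Off0 % EltSize != 0 || Off1 % EltSize != 0)
    return false;

  // Simplest choice: fold the smaller offset into the base, so the
  // instruction only has to encode the distance between the two accesses.
  // (A smarter choice could leave both encoded offsets non-zero so the
  // adjusted base can be reused by further pairs, as suggested above.)
  BaseOff = Off0 < Off1 ? Off0 : Off1;
  uint32_t R0 = (Off0 - BaseOff) / EltSize;
  uint32_t R1 = (Off1 - BaseOff) / EltSize;
  if (R0 > 255 || R1 > 255)
    return false; // stride is still too large for the 8-bit fields

  EncOff0 = static_cast<uint8_t>(R0);
  EncOff1 = static_cast<uint8_t>(R1);
  return true;
}
// e.g. Off0 = 2400, Off1 = 2800, EltSize = 4
//      -> BaseOff = 2400 (0x960), EncOff0 = 0, EncOff1 = 100
```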
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

Right.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (On Diff #95170):

- Lines 387–389: This should use the e64 version with an unused carry. We should add a helper to TII to emit this, since it will change with GFX9. (See the sketch after this list.)
- Line 392: Put BuildMI on the next line and shift the .add calls to the left.
- Lines 397–398: Use .setMemRefs (also covered in the sketch below).
- Lines 459–460: Same as with the other add.
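
A rough sketch of how the two suggestions above might look when building the merged pair. This is an illustration against the generic MachineInstrBuilder API, not the patch itself; MBB, InsertPt, DL, TII, MRI, Opc, DestReg, BaseReg, ImmReg, EncOff0/EncOff1, and CI stand in for values the pass would already have, and the exact operand list of V_ADD_I32_e64 may differ by LLVM revision and subtarget.

```cpp
// Illustration only: use the e64 encoding of the add so the carry-out lands
// in a dead virtual SGPR pair instead of clobbering VCC, and propagate the
// original memory operands onto the merged ds_read2.
unsigned CarryReg = MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);
unsigned NewBase  = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);

BuildMI(*MBB, InsertPt, DL, TII->get(AMDGPU::V_ADD_I32_e64), NewBase)
    .addReg(CarryReg, RegState::Define | RegState::Dead) // unused carry-out
    .addReg(ImmReg)   // BaseOff materialized into a register (VOP3 takes no literal)
    .addReg(BaseReg); // original ds address

BuildMI(*MBB, InsertPt, DL, TII->get(Opc), DestReg)
    .addReg(NewBase)
    .addImm(EncOff0) // offset0
    .addImm(EncOff1) // offset1
    .addImm(0)       // gds
    .setMemRefs(CI.I->mergeMemRefsWith(*CI.Paired));
```

Wrapping the add in a TII helper, as suggested, would presumably let GFX9 pick the no-carry VOP3 add instead without touching this code.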
llvm/trunk/test/CodeGen/AMDGPU/ds-combine-large-stride.ll (On Diff #95170):

- Line 1: -march is redundant with the triple. This should also use the GCN check prefix, since the check lines will need to change for GFX9 (and a GFX9 run line should be added so this breaks when the add change is made).
- Line 370: These all need to be marked amdgpu_kernel.
http://lab.llvm.org:8011/builders/sanitizer-ppc64be-linux/builds/2283 is broken by this change.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

Matt, I doubt we should use the e64 version here. It does not accept an immediate, which would effectively require one more SGPR and one more mov. VCC thrashing seems to be the lesser issue.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

Even the new no-carry variant is VOP3, so it has the same issue.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

You can materialize the constant in a register. It will be folded and shrunk later.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

It is not folded, though:

    s_movk_i32    s0, 0x960
    v_add_i32_e32 v0, vcc, s0, v4
    ds_read2_b32  v[0:1], v0 offset1:100

Also note that the values in this case will mostly be unique, because they are added to the base pointer, not to a previously incremented pointer. In the latter case there would be a high probability that many adds share the same literal. I would lean towards using the e64 version and an SGPR once the pass is redesigned to first collect all pairs, build a chain of adds, and only then combine. In that case we would use a single SGPR.