This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Split unaligned 3 DWORD DS operations
ClosedPublic

Authored by rampitec on Apr 11 2022, 10:43 AM.

Details

Summary

I have written a minitest to check the performance. Overall
the benefit of aligned b96 operations on data which is not
known but happens to be aligned is small, while performance
hit of using b96 operations on a really unaligned memory is
high.

The only exception is when data is not aligned even by 4, it
is better to use b96 in this case.

Here is the test output on Vega and Navi:

Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b96                                  aligned: 3.4 sec
ds_write_b32 + ds_write_b64                   aligned: 4.5 sec
ds_write_b32 * 3                              aligned: 4.8 sec
ds_write_b96                          misaligned by 1: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 7.2 sec
ds_write_b32 * 3                      misaligned by 1: 10.0 sec
ds_write_b96                          misaligned by 2: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 7.2 sec
ds_write_b32 * 3                      misaligned by 2: 10.1 sec
ds_write_b96                          misaligned by 4: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.2 sec
ds_write_b32 * 3                      misaligned by 4: 4.9 sec
ds_write_b96                          misaligned by 8: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 4.6 sec
ds_write_b32 * 3                      misaligned by 8: 4.9 sec
ds_read_b96                                   aligned: 3.3 sec
ds_read_b32 + ds_read_b64                     aligned: 4.9 sec
ds_read_b32 * 3                               aligned: 2.6 sec
ds_read_b96                           misaligned by 1: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 7.2 sec
ds_read_b32 * 3                       misaligned by 1: 10.1 sec
ds_read_b96                           misaligned by 2: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 7.2 sec
ds_read_b32 * 3                       misaligned by 2: 10.1 sec
ds_read_b96                           misaligned by 4: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 2.6 sec
ds_read_b32 * 3                       misaligned by 4: 2.6 sec
ds_read_b96                           misaligned by 8: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 4.9 sec
ds_read_b32 * 3                       misaligned by 8: 2.6 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b96                                  aligned: 4.1 sec
ds_write_b32 + ds_write_b64                   aligned: 13.0 sec
ds_write_b32 * 3                              aligned: 4.5 sec
ds_write_b96                          misaligned by 1: 12.5 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 22.0 sec
ds_write_b32 * 3                      misaligned by 1: 31.5 sec
ds_write_b96                          misaligned by 2: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 22.0 sec
ds_write_b32 * 3                      misaligned by 2: 31.5 sec
ds_write_b96                          misaligned by 4: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.0 sec
ds_write_b32 * 3                      misaligned by 4: 4.5 sec
ds_write_b96                          misaligned by 8: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 13.0 sec
ds_write_b32 * 3                      misaligned by 8: 4.5 sec
ds_read_b96                                   aligned: 3.8 sec
ds_read_b32 + ds_read_b64                     aligned: 12.8 sec
ds_read_b32 * 3                               aligned: 4.4 sec
ds_read_b96                           misaligned by 1: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 21.8 sec
ds_read_b32 * 3                       misaligned by 1: 31.5 sec
ds_read_b96                           misaligned by 2: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 21.9 sec
ds_read_b32 * 3                       misaligned by 2: 31.5 sec
ds_read_b96                           misaligned by 4: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 3.8 sec
ds_read_b32 * 3                       misaligned by 4: 4.5 sec
ds_read_b96                           misaligned by 8: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 12.8 sec
ds_read_b32 * 3                       misaligned by 8: 4.5 sec

Fixes: SWDEV-330802

Diff Detail

Event Timeline

rampitec created this revision.Apr 11 2022, 10:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 10:43 AM
rampitec requested review of this revision.Apr 11 2022, 10:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 10:43 AM
Herald added a subscriber: wdng. · View Herald Transcript
rampitec updated this revision to Diff 421978.Apr 11 2022, 11:15 AM
rampitec updated this revision to Diff 422054.Apr 11 2022, 3:31 PM

Fixed comment.

foad accepted this revision.Apr 12 2022, 1:25 AM

Looks OK to me. But there will always be benchmarks that go faster and slower with any change like this, because the compiler does not have perfect knowledge about the (mis)alignment of all data.

llvm/lib/Target/AMDGPU/DSInstructions.td
880

Typo "accesses", "performance"

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1564

Note that Alignment < Align(4) does not prove that the address is not dword aligned, just that the compiler does not know it's dword aligned. But I guess this is the best we can do for now.

This revision is now accepted and ready to land.Apr 12 2022, 1:25 AM
arsenm accepted this revision.Apr 12 2022, 6:03 AM
rampitec updated this revision to Diff 422233.Apr 12 2022, 7:53 AM
rampitec marked an inline comment as done.
rampitec retitled this revision from [AMDGCN] Split unaligned 3 DWORD DS operations to [AMDGPU] Split unaligned 3 DWORD DS operations.

Fixed typos.

Looks OK to me. But there will always be benchmarks that go faster and slower with any change like this, because the compiler does not have perfect knowledge about the (mis)alignment of all data.

Yes, sure. The point of the patch is to minimize the cost of the mistake. If the data is really aligned it will be slower now, but if it is really misaligned it is way slower with a single instruction before the patch.

rampitec marked an inline comment as done.Apr 12 2022, 8:02 AM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1564

Right, then in this case if it is misaligned by 1 or 2 it is faster with a single instruction. If it is misaligned by 4 or 8 it would be slightly faster to split into 32 bit instructions, but this is what we do not know. If it is really aligned this is really the fastest. But without Align(4) check we would have to split it to b8 instructions and that will be really slow in any scenario.

This revision was landed with ongoing or failed builds.Apr 12 2022, 8:05 AM
This revision was automatically updated to reflect the committed changes.
rampitec marked an inline comment as done.