This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Refine 64 bit misaligned LDS ops selection
ClosedPublic

Authored by rampitec on Apr 18 2022, 12:35 PM.

Details

Summary

Here is the performance data:

Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64                       aligned by  8:  3.2 sec
ds_write2_b32                      aligned by  8:  3.2 sec
ds_write_b16 * 4                   aligned by  8:  7.0 sec
ds_write_b8 * 8                    aligned by  8: 13.2 sec
ds_write_b64                       aligned by  1:  7.3 sec
ds_write2_b32                      aligned by  1:  7.5 sec
ds_write_b16 * 4                   aligned by  1: 14.0 sec
ds_write_b8 * 8                    aligned by  1: 13.2 sec
ds_write_b64                       aligned by  2:  7.3 sec
ds_write2_b32                      aligned by  2:  7.5 sec
ds_write_b16 * 4                   aligned by  2:  7.1 sec
ds_write_b8 * 8                    aligned by  2: 13.3 sec
ds_write_b64                       aligned by  4:  4.6 sec
ds_write2_b32                      aligned by  4:  3.2 sec
ds_write_b16 * 4                   aligned by  4:  7.1 sec
ds_write_b8 * 8                    aligned by  4: 13.3 sec
ds_read_b64                        aligned by  8:  2.3 sec
ds_read2_b32                       aligned by  8:  2.2 sec
ds_read_u16 * 4                    aligned by  8:  4.8 sec
ds_read_u8 * 8                     aligned by  8:  8.6 sec
ds_read_b64                        aligned by  1:  4.4 sec
ds_read2_b32                       aligned by  1:  7.3 sec
ds_read_u16 * 4                    aligned by  1: 14.0 sec
ds_read_u8 * 8                     aligned by  1:  8.7 sec
ds_read_b64                        aligned by  2:  4.4 sec
ds_read2_b32                       aligned by  2:  7.3 sec
ds_read_u16 * 4                    aligned by  2:  4.8 sec
ds_read_u8 * 8                     aligned by  2:  8.7 sec
ds_read_b64                        aligned by  4:  4.4 sec
ds_read2_b32                       aligned by  4:  2.3 sec
ds_read_u16 * 4                    aligned by  4:  4.8 sec
ds_read_u8 * 8                     aligned by  4:  8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64                       aligned by  8:  4.4 sec
ds_write2_b32                      aligned by  8:  4.3 sec
ds_write_b16 * 4                   aligned by  8:  7.9 sec
ds_write_b8 * 8                    aligned by  8: 13.0 sec
ds_write_b64                       aligned by  1: 23.2 sec
ds_write2_b32                      aligned by  1: 23.1 sec
ds_write_b16 * 4                   aligned by  1: 44.0 sec
ds_write_b8 * 8                    aligned by  1: 13.0 sec
ds_write_b64                       aligned by  2: 23.2 sec
ds_write2_b32                      aligned by  2: 23.1 sec
ds_write_b16 * 4                   aligned by  2:  7.9 sec
ds_write_b8 * 8                    aligned by  2: 13.1 sec
ds_write_b64                       aligned by  4: 13.5 sec
ds_write2_b32                      aligned by  4:  4.3 sec
ds_write_b16 * 4                   aligned by  4:  7.9 sec
ds_write_b8 * 8                    aligned by  4: 13.1 sec
ds_read_b64                        aligned by  8:  3.5 sec
ds_read2_b32                       aligned by  8:  3.4 sec
ds_read_u16 * 4                    aligned by  8:  5.3 sec
ds_read_u8 * 8                     aligned by  8:  8.5 sec
ds_read_b64                        aligned by  1: 13.1 sec
ds_read2_b32                       aligned by  1: 22.7 sec
ds_read_u16 * 4                    aligned by  1: 43.9 sec
ds_read_u8 * 8                     aligned by  1:  7.9 sec
ds_read_b64                        aligned by  2: 13.1 sec
ds_read2_b32                       aligned by  2: 22.7 sec
ds_read_u16 * 4                    aligned by  2:  5.6 sec
ds_read_u8 * 8                     aligned by  2:  7.9 sec
ds_read_b64                        aligned by  4: 13.1 sec
ds_read2_b32                       aligned by  4:  3.4 sec
ds_read_u16 * 4                    aligned by  4:  5.6 sec
ds_read_u8 * 8                     aligned by  4:  7.9 sec

GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, where on GFX10 even a full split
is better. However, this is a theoretical only gain because splitting
an access to a sub-dword level will require more registers and packing/
unpacking logic, so ignoring this option it is better to use a single
64 bit instruction on a misaligned data with the exception of 4 byte
aligned data where ds_read2_b32/ds_write2_b32 is better.

Diff Detail

Event Timeline

rampitec created this revision.Apr 18 2022, 12:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2022, 12:35 PM
rampitec requested review of this revision.Apr 18 2022, 12:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2022, 12:35 PM
Herald added a subscriber: wdng. · View Herald Transcript
rampitec added a comment.EditedApr 18 2022, 5:02 PM

I think this is it for LDS. 32 byte access is already fine, we do not want to split it even though that is faster on Navi, but increases register pressure.

One remaining issue which may be addressed is LoadStoreVectorization which would need a change of the definition of "fast" itself.

Another one is global isel, it would be better to handle "fast", which is technically easy, but I can only see a place for it in the legalization, which is a layering violation. So it needs a handler in the lowering instead.

Then there shall be a same handling and experiments for global. Likely global isel shall go after that as I don't think this is a generally good thing to distinguish between address spaces rather than just relying on the "allowed" and "fast" alone.

Another potential area is ignoring "fast" with optimization for size.

arsenm accepted this revision.Apr 21 2022, 6:49 AM

LGTM. Is this benchmark getting permanently added to a suite somewhere?

This revision is now accepted and ready to land.Apr 21 2022, 6:49 AM

LGTM. Is this benchmark getting permanently added to a suite somewhere?

I wish to, I do not want to write it again when we need to bringup a new target. We can discuss on the next meeting where is the right place.

JFYI I have written same tests for global, mubuf and flat scratch since then. We seem to do a right thing for global, but better to always prefer a single dword access on swizzled memory regardless of alignment.

This revision was landed with ongoing or failed builds.Apr 21 2022, 9:37 AM
This revision was automatically updated to reflect the committed changes.