This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Split unaligned 4 DWORD DS operations
ClosedPublic

Authored by rampitec on Apr 12 2022, 2:18 PM.

Details

Summary

Similarly to 3 DWORD operations it is better for performance
to split unlaligned operations as long a these are at least
DWORD alignmened. Performance data:

Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b128                      aligned by 16:  4.9 sec
ds_write2_b64                      aligned by 16:  5.1 sec
ds_write2_b32 * 2                  aligned by 16:  5.5 sec
ds_write_b128                      aligned by  1:  8.1 sec
ds_write2_b64                      aligned by  1:  8.7 sec
ds_write2_b32 * 2                  aligned by  1: 14.0 sec
ds_write_b128                      aligned by  2:  8.1 sec
ds_write2_b64                      aligned by  2:  8.7 sec
ds_write2_b32 * 2                  aligned by  2: 14.0 sec
ds_write_b128                      aligned by  4:  5.6 sec
ds_write2_b64                      aligned by  4:  8.7 sec
ds_write2_b32 * 2                  aligned by  4:  5.6 sec
ds_write_b128                      aligned by  8:  5.6 sec
ds_write2_b64                      aligned by  8:  5.1 sec
ds_write2_b32 * 2                  aligned by  8:  5.6 sec
ds_read_b128                       aligned by 16:  3.8 sec
ds_read2_b64                       aligned by 16:  3.8 sec
ds_read2_b32 * 2                   aligned by 16:  4.0 sec
ds_read_b128                       aligned by  1:  4.6 sec
ds_read2_b64                       aligned by  1:  8.1 sec
ds_read2_b32 * 2                   aligned by  1: 14.0 sec
ds_read_b128                       aligned by  2:  4.6 sec
ds_read2_b64                       aligned by  2:  8.1 sec
ds_read2_b32 * 2                   aligned by  2: 14.0 sec
ds_read_b128                       aligned by  4:  4.6 sec
ds_read2_b64                       aligned by  4:  8.1 sec
ds_read2_b32 * 2                   aligned by  4:  4.0 sec
ds_read_b128                       aligned by  8:  4.6 sec
ds_read2_b64                       aligned by  8:  3.8 sec
ds_read2_b32 * 2                   aligned by  8:  4.0 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b128                      aligned by 16:  6.2 sec
ds_write2_b64                      aligned by 16:  7.1 sec
ds_write2_b32 * 2                  aligned by 16:  7.6 sec
ds_write_b128                      aligned by  1: 24.1 sec
ds_write2_b64                      aligned by  1: 25.2 sec
ds_write2_b32 * 2                  aligned by  1: 43.7 sec
ds_write_b128                      aligned by  2: 24.1 sec
ds_write2_b64                      aligned by  2: 25.1 sec
ds_write2_b32 * 2                  aligned by  2: 43.7 sec
ds_write_b128                      aligned by  4: 14.4 sec
ds_write2_b64                      aligned by  4: 25.1 sec
ds_write2_b32 * 2                  aligned by  4:  7.6 sec
ds_write_b128                      aligned by  8: 14.4 sec
ds_write2_b64                      aligned by  8:  7.1 sec
ds_write2_b32 * 2                  aligned by  8:  7.6 sec
ds_read_b128                       aligned by 16:  6.2 sec
ds_read2_b64                       aligned by 16:  6.3 sec
ds_read2_b32 * 2                   aligned by 16:  7.5 sec
ds_read_b128                       aligned by  1: 12.5 sec
ds_read2_b64                       aligned by  1: 24.0 sec
ds_read2_b32 * 2                   aligned by  1: 43.6 sec
ds_read_b128                       aligned by  2: 12.5 sec
ds_read2_b64                       aligned by  2: 24.0 sec
ds_read2_b32 * 2                   aligned by  2: 43.6 sec
ds_read_b128                       aligned by  4: 12.5 sec
ds_read2_b64                       aligned by  4: 24.0 sec
ds_read2_b32 * 2                   aligned by  4:  7.5 sec
ds_read_b128                       aligned by  8: 12.5 sec
ds_read2_b64                       aligned by  8:  6.3 sec
ds_read2_b32 * 2                   aligned by  8:  7.5 sec

Diff Detail

Event Timeline

rampitec created this revision.Apr 12 2022, 2:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2022, 2:18 PM
rampitec requested review of this revision.Apr 12 2022, 2:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2022, 2:18 PM
Herald added a subscriber: wdng. · View Herald Transcript
arsenm added inline comments.Apr 12 2022, 3:27 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.td
591–595

Do you really need these PatFrags? The legalizer made the split decisions and the patterns shouldn't need logic deper than it's able to select (which it can regardless of the alignment)

rampitec added inline comments.Apr 12 2022, 3:43 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.td
591–595

If I am using just load_local/store_local I need to override selection of the ds_read2_b64 for alignments less than 4. Using AddedComplexity does not help, it shall be selected differently for alignment less than 4 and 4.

arsenm accepted this revision.Apr 12 2022, 3:45 PM
This revision is now accepted and ready to land.Apr 12 2022, 3:45 PM
This revision was landed with ongoing or failed builds.Apr 12 2022, 4:07 PM
This revision was automatically updated to reflect the committed changes.