This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Refactor LDS alignment checks.
ClosedPublic

Authored by rampitec on Apr 7 2022, 3:50 PM.

Details

Summary

Move features/bugs checks into the single place
allowsMisalignedMemoryAccessesImpl.

This is mostly NFCI except for the order of selection in couple places.
A separate change may be needed to stop lying about Fast.

Diff Detail

Event Timeline

rampitec created this revision.Apr 7 2022, 3:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2022, 3:50 PM
rampitec requested review of this revision.Apr 7 2022, 3:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2022, 3:50 PM
Herald added a subscriber: wdng. · View Herald Transcript

I am not sure we really want to tell truth about the 'Fast' here. If we tell that DS read misaligned by 1 byte is slow vectorizer will not combine 2 of them and we will get 2 separate ds_read_b32 instead of ds_read2_b32. It is slow, but the ds_read2_b32 is still faster than 2 separate instructions equally misaligned. That is what happens then: https://reviews.llvm.org/differential/diff/421361/

rampitec updated this revision to Diff 422051.Apr 11 2022, 3:30 PM

Fixed comment indentation.

arsenm accepted this revision.Apr 11 2022, 5:43 PM
This revision is now accepted and ready to land.Apr 11 2022, 5:43 PM
This revision was landed with ongoing or failed builds.Apr 12 2022, 7:49 AM
This revision was automatically updated to reflect the committed changes.
foad added a comment.Apr 13 2022, 6:32 AM

I am not sure we really want to tell truth about the 'Fast' here. If we tell that DS read misaligned by 1 byte is slow vectorizer will not combine 2 of them and we will get 2 separate ds_read_b32 instead of ds_read2_b32. It is slow, but the ds_read2_b32 is still faster than 2 separate instructions equally misaligned. That is what happens then: https://reviews.llvm.org/differential/diff/421361/

Maybe LoadStoreVectorizer should be changed to create slow instructions, if the instructions being combined were slow already.

foad added inline comments.Apr 13 2022, 6:56 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1555

You don't need this - it is already handled by using PowerOf2Ceil to initialize RequiredAlignment.

I am not sure we really want to tell truth about the 'Fast' here. If we tell that DS read misaligned by 1 byte is slow vectorizer will not combine 2 of them and we will get 2 separate ds_read_b32 instead of ds_read2_b32. It is slow, but the ds_read2_b32 is still faster than 2 separate instructions equally misaligned. That is what happens then: https://reviews.llvm.org/differential/diff/421361/

Maybe LoadStoreVectorizer should be changed to create slow instructions, if the instructions being combined were slow already.

I have a w/a for this in the D123634, but in general I do not think 'fast' or 'slow' is a right measure. Something is not fast or slow, but faster or slower than something other. This would be a big change though.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1555

Indeed.

rampitec marked an inline comment as done.Apr 13 2022, 11:20 AM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1555

I have a w/a for this in the D123634, but in general I do not think 'fast' or 'slow' is a right measure. Something is not fast or slow, but faster or slower than something other. This would be a big change though.

Speaking of fast and slow this might be modeled with an unsigned speed rank. The direct translation of current bool IsFast will be to 0 and 1. Then target may use more ranks and a higher the number the faster is the access.