Split UnalignedAccessMode into UnalignedDSAccess and UnalignedBufferAccess so
that only UnalignedBufferAccess can be enabled by default for AMDHSA
Inline comment on llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp, line 96:
HSA wants unaligned DS access as well. That is, only underaligned ds_read/write_b128 should not be produced, for performance reasons.
I thought these were still the same hardware control, so they aren't truly different features.
I see some comments from @rampitec and @arsenm, and @rampitec also made the comments below in the related internal JIRA ticket.
Comments from @rampitec in the JIRA ticket: "Disabling unaligned dword access is not acceptable for HSA. My proposal was simple: prefer ds_read2_b64 (and write) over b128 if alignment < 16. I never heard of b64 performance issues, though, so the rest is the same as now: create the widest load/store possible, with that one exception for b128."
From the JIRA comments, I also see that @foad agrees with the above. So it looks like we should avoid emitting ds_read_b128/ds_write_b128 when the alignment is < 16. I am not sure what you think here, but your comments would be very valuable for making further progress.
If we all agree with the above, then it looks like the file DSInstructions.td needs the changes below.
The code
let SubtargetPredicate = HasUnalignedDSAccess in {

foreach vt = VReg_96.RegTypes in {
defm : DSReadPat_mc <DS_READ_B96, vt, "load_local">;
}

foreach vt = VReg_128.RegTypes in {
defm : DSReadPat_mc <DS_READ_B128, vt, "load_local">;
}

} // End SubtargetPredicate = HasUnalignedDSAccess
should change to
let SubtargetPredicate = HasUnalignedDSAccess in {

foreach vt = VReg_96.RegTypes in {
defm : DSReadPat_mc <DS_READ_B96, vt, "load_local">;
}

foreach vt = VReg_128.RegTypes in {
defm : DSReadPat_mc <DS_READ_B128, vt, "load_align16_local">;
}

} // End SubtargetPredicate = HasUnalignedDSAccess
and the code
let SubtargetPredicate = HasUnalignedDSAccess in {

foreach vt = VReg_96.RegTypes in {
defm : DSWritePat_mc <DS_WRITE_B96, vt, "store_local">;
}

foreach vt = VReg_128.RegTypes in {
defm : DSWritePat_mc <DS_WRITE_B128, vt, "store_local">;
}

} // End SubtargetPredicate = HasUnalignedDSAccess
should change to
let SubtargetPredicate = HasUnalignedDSAccess in {

foreach vt = VReg_96.RegTypes in {
defm : DSWritePat_mc <DS_WRITE_B96, vt, "store_local">;
}

foreach vt = VReg_128.RegTypes in {
defm : DSWritePat_mc <DS_WRITE_B128, vt, "store_align16_local">;
}

} // End SubtargetPredicate = HasUnalignedDSAccess
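One caveat with the above: it relies on load_align16_local and store_align16_local being defined as alignment-restricted fragments. I have not checked whether they already exist in AMDGPUInstructions.td; if not, a minimal sketch of what they could look like, modelled on the existing load_align8_local (the exact field names here are my assumption, not verified against the tree):

def load_align16_local : PatFrag <(ops node:$ptr), (load_local node:$ptr)> {
  let IsLoad = 1;
  let IsNonExtLoad = 1;
  // Only match loads whose known alignment is at least 16 bytes.
  let MinAlignment = 16;
}

def store_align16_local : PatFrag <(ops node:$val, node:$ptr),
                                   (store_local node:$val, node:$ptr)> {
  let IsStore = 1;
  let IsTruncStore = 0;
  // Only match stores whose known alignment is at least 16 bytes.
  let MinAlignment = 16;
}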
I strongly disagree with splitting features in a way that does not correspond with the hardware controls to match a desired output.
We have since confirmed that ds_read_b64 has the same performance hit on memory not aligned to 64 bits, so 64-bit operations need an alignment check too.
But I see that the ISel pattern below for DS_READ_B64 does check the alignment requirement, though I am not sure whether it is overridden somewhere else.
DSInstructions.td:717-719
foreach vt = VReg_64.RegTypes in {
defm : DSReadPat_mc <DS_READ_B64, vt, "load_align8_local">;
}
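For context (roughly from memory, so the details may differ from the current tree), DSReadPat_mc only resolves that fragment name for the m0-init and gfx9 variants, so both variants end up using the alignment-restricted fragment:

multiclass DSReadPat_mc <DS_Pseudo inst, ValueType vt, string frag> {
  // GFX6-8: LDS access requires M0 to be initialized; use the "_m0" fragment.
  let OtherPredicates = [LDSRequiresM0Init] in {
    def : DSReadPat<inst, vt, !cast<PatFrag>(frag#"_m0")>;
  }
  // GFX9+: no M0 init required; select the "_gfx9" pseudo with the plain fragment.
  let OtherPredicates = [NotLDSRequiresM0Init] in {
    def : DSReadPat<!cast<DS_Pseudo>(!cast<string>(inst)#"_gfx9"), vt,
                    !cast<PatFrag>(frag)>;
  }
}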
Further, based on a closer look at the code, I think we should in general revert commit 0654ff703d4e99423133165db63083b831efb9b6 in order to avoid the above performance issues caused by unaligned access. But I am not aware of the consequences of reverting that patch.
@mbrkusanin / @foad What is your opinion on reverting the patch 0654ff703d4e99423133165db63083b831efb9b6?
0654ff703d4e99423133165db63083b831efb9b6 (aka D84403) had good effects like selecting ds_write_b96 when the alignment was 16. It would be a shame to lose that.