For performance reasons, ds_read_*/ds_write_* operations require
strict alignment. Avoid selecting them in under-aligned situations,
irrespective of whether "unaligned access mode" is enabled or not.
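For illustration, here is a minimal C++ sketch of that policy. The function name, parameters, and structure below are assumptions made for this sketch; the real change is to SITargetLowering::allowsMisalignedMemoryAccessesImpl() in SIISelLowering.cpp, which has a different signature. Whether the check belongs here at all is debated in the comments below.

```cpp
#include <cstdio>

namespace {

constexpr unsigned LocalAddrSpace = 3; // AMDGPU LDS address space.

// Hypothetical stand-in for the real target hook. It models the policy
// described in the summary: treat an LDS (ds_read_*/ds_write_*) access as
// unsupported unless it is naturally aligned, regardless of whether the
// "unaligned access mode" is enabled.
bool allowsMisalignedLDSAccess(unsigned SizeInBits, unsigned AddrSpace,
                               unsigned AlignInBytes,
                               bool UnalignedAccessModeEnabled,
                               bool *IsFast) {
  if (IsFast)
    *IsFast = false;

  if (AddrSpace != LocalAddrSpace)
    return true; // Other address spaces are not modeled in this sketch.

  // Strict-alignment policy: require natural alignment for DS accesses
  // even when the unaligned access mode is on.
  (void)UnalignedAccessModeEnabled;
  const unsigned NaturalAlignInBytes = SizeInBits / 8;
  if (AlignInBytes >= NaturalAlignInBytes) {
    if (IsFast)
      *IsFast = true;
    return true;
  }
  return false; // Make the legalizer split into narrower accesses.
}

} // namespace

int main() {
  bool Fast = false;
  bool Legal = allowsMisalignedLDSAccess(128, LocalAddrSpace, 4,
                                         /*UnalignedAccessModeEnabled=*/true,
                                         &Fast);
  std::printf("b128 at align 4: legal=%d fast=%d\n", Legal, Fast);
  return 0;
}
```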
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1425–1428
I don't see the reason for this change. Everything the old comment said is still true.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1425–1428
We have updated the .td file to be strictly aligned irrespective of whether "unaligned access mode" is enabled or not. But here we are relaxing it; without the change above, both GlobalISel and SDAG ISel break down in certain cases. Please try the test "lds-misaligned-bug.ll": both GlobalISel and SDAG ISel assert for this test with only the .td file changes and without the change above.
It is actually hard to see the impact without the tests being updated, at least some specific tests.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
I don't like changing this function to lie and say it doesn't work when it does.
After addressing the review comments from Jay, the tests below assert. This is yet to be investigated.
LLVM :: CodeGen/AMDGPU/GlobalISel/lds-misaligned-bug.ll
LLVM :: CodeGen/AMDGPU/GlobalISel/load-unaligned.ll
LLVM :: CodeGen/AMDGPU/lds-misaligned-bug.ll
Apart from the above, the two tests below needed updates, which have been taken care of:
CodeGen/AMDGPU/ds_read2.ll
CodeGen/AMDGPU/ds_write2.ll
Do we have tests where pointers are actually 16-byte aligned?
llvm/test/CodeGen/AMDGPU/ds_read2.ll:1076
I would probably expect ds_read2_b32 for align 4.
Updated the fix: the function allowsMisalignedMemoryAccessesImpl() now
checks for the strict alignment requirement for ds_read/ds_write operations.
Please check the lit test store-local.128.ll, which is now updated within this patch.
llvm/test/CodeGen/AMDGPU/ds_read2.ll:1076
The fix is further updated to take care of this: allowsMisalignedMemoryAccessesImpl() now checks for the strict alignment requirement for ds_read/ds_write operations.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
My understanding is that the function allowsMisalignedMemoryAccessesImpl() should check for the strict alignment requirement for ds_read/ds_write operations. I have now updated this function accordingly.
@foad, can we run it through the gfx perf suite? With the exception of one ds_write_b64 store it seems reasonable to me. Of course these splits to b8 would be unfortunate, but I assume that should not happen too often in real life, and the data is usually better aligned.
llvm/test/CodeGen/AMDGPU/ds_write2.ll:849
It seems it misses the same split to b32.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
You are making this stricter than the instructions really allow.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
That is more or less the point. I think there is some balance here between the perf drop when a wide access is used for misaligned data and the perf loss when a narrow access is used on data for which we cannot prove alignment but which just happens to be aligned. This seems to warrant perf testing on this patch. @cfang, does this solve the perf regression you were working on?
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
The function reports whether it's legal and works. IsFast is separate and can change, but it should still report this as a legal unaligned access.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
Verified with SHOC SGEMM, this patch could resolve the regression.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
Ugh! Do you know if we can use ds_read2_b64 for align 4? The main question here is whether we need to split b128 only, or everything misaligned.
llvm/test/CodeGen/AMDGPU/ds_read2.ll:603
I did, but then we got information from @cfang about a case where dword-aligned loads were causing regressions, although according to the latest comment those were still b128 reads. In essence the problem is that we do not always know the alignment, and the data may be better aligned than declared. In that case a wider load will work better than a narrower one; i.e., when unaligned access is off it becomes a probability question. Note that the chances of hitting the unaligned 128-bit case are higher than for the unaligned 64-bit case. That is why I have requested measurements. We know for sure that b128 is mostly a bad choice in an under-aligned case. I suspect the b8 and b16 splits are overkill, but I am not really sure about splitting 64-bit into 32-bit.
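To make the trade-off concrete, here is a small hypothetical sketch (the helper name and the mapping are assumptions for illustration, not taken from the patch) of picking the widest DS read whose components are provably aligned for a 128-bit LDS load. Whether the hardware would also tolerate a wider, under-aligned access depends on the subtarget and on the unaligned access mode, which is exactly the probability question above.

```cpp
#include <cstdio>

// Hypothetical helper: map the alignment we can *prove* to the widest
// DS read whose components are naturally aligned, for a 128-bit load.
static const char *pickDSReadForB128(unsigned ProvenAlignInBytes) {
  if (ProvenAlignInBytes >= 16)
    return "ds_read_b128";
  if (ProvenAlignInBytes >= 8)
    return "ds_read2_b64";
  if (ProvenAlignInBytes >= 4)
    return "2 x ds_read2_b32";
  if (ProvenAlignInBytes >= 2)
    return "8 x ds_read_u16";
  return "16 x ds_read_u8";
}

int main() {
  for (unsigned Align : {16u, 8u, 4u, 2u, 1u})
    std::printf("proven align %2u -> %s\n", Align, pickDSReadForB128(Align));
  return 0;
}
```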
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1407–1408
They do not *require* the alignment for performance reasons. This should report whether it works, and IsFast for whether we want to prefer it.
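A minimal sketch of the contract this comment describes, with an assumed simplified signature (not the real allowsMisalignedMemoryAccessesImpl() signature): legality reflects what the hardware can execute, and IsFast is only a preference signal.

```cpp
// Illustrative sketch only: report the access as legal whenever the
// hardware supports it; use *IsFast to steer the legalizer towards
// narrower accesses without declaring the wide access illegal.
bool allowsMisalignedLDSAccessLegality(unsigned SizeInBits,
                                       unsigned AlignInBytes,
                                       bool HasUnalignedDSAccess,
                                       bool *IsFast) {
  const unsigned NaturalAlignInBytes = SizeInBits / 8;
  const bool NaturallyAligned = AlignInBytes >= NaturalAlignInBytes;

  if (IsFast)
    *IsFast = NaturallyAligned; // Under-aligned DS access works, just slower.

  // Do not claim "doesn't work" merely to steer instruction selection away.
  return NaturallyAligned || HasUnalignedDSAccess;
}
```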
Vulkan graphics rarely uses DS instructions. Vulkan compute uses them, but we don't have any standard set of benchmarks, and in any case the workloads would be very similar to OpenCL compute. So I don't think we need to do any Vulkan-specific benchmarking.
I am abandoning this patch since an alternative patch, https://reviews.llvm.org/D100008, has landed upstream. The next step is to do some useful performance-related experimentation on unaligned accesses with ds instructions before coming to a common conclusion on a more generic fix.