This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] ds_read_*/ds_write_* operations require strict alignment.
AbandonedPublic

Authored by hsmhsm on Mar 25 2021, 9:52 AM.

Details

Summary

For performance reasons, ds_read_*/ds_write_* operations require
strict alignment. Avoid selecting them in under-aligned situations,
irrespective of whether "unaligned access mode" is enabled.

Diff Detail

Event Timeline

hsmhsm created this revision.Mar 25 2021, 9:52 AM
hsmhsm requested review of this revision.Mar 25 2021, 9:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 25 2021, 9:52 AM
NOTE: As of now, I have intentionally *NOT* updated the lit tests, since updating them is non-trivial. I wanted to initiate the review discussion and address the review comments first, so that I can update the lit tests once and for all at the end.
foad added inline comments.Mar 25 2021, 10:15 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1425–1428

I don't see the reason for this change. Everything the old comment said is still true.

hsmhsm added inline comments.Mar 25 2021, 10:33 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1425–1428

We have updated the .td file to require strict alignment irrespective of whether "unaligned access mode" is enabled. But here we were relaxing it; without the change above, both GlobalISel and SDAG ISel break down in certain cases. Please try the test "lds-misaligned-bug.ll": both GlobalISel and SDAG ISel assert for this test with only the .td file changes and without the change above.

It is actually hard to see the impact without the tests updated, at least some specific ones.

arsenm added inline comments.Mar 25 2021, 10:47 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1426

I don't like changing this function to lie and say it doesn't work when it does

hsmhsm updated this revision to Diff 333360.Mar 25 2021, 11:08 AM

After fixing the review comments by Jay, the tests below assert. This is yet to be investigated.

LLVM :: CodeGen/AMDGPU/GlobalISel/lds-misaligned-bug.ll
LLVM :: CodeGen/AMDGPU/GlobalISel/load-unaligned.ll
LLVM :: CodeGen/AMDGPU/lds-misaligned-bug.ll

Apart from the above, the two tests below needed updates, which have been taken care of.

CodeGen/AMDGPU/ds_read2.ll
CodeGen/AMDGPU/ds_write2.ll

Do we have tests where pointers are actually 16 byte aligned?

llvm/test/CodeGen/AMDGPU/ds_read2.ll
1076

I would probably expect ds_read2_b32 for align 4.

hsmhsm updated this revision to Diff 333501.Mar 25 2021, 10:23 PM

Updated the fix - make sure that the function allowsMisalignedMemoryAccessesImpl()
checks the strict alignment requirement for ds_read/ds_write operations.

Do we have tests where pointers are actually 16 byte aligned?

Please check the lit test - store-local.128.ll which is now updated within this patch.

hsmhsm added inline comments.Mar 25 2021, 10:26 PM
llvm/test/CodeGen/AMDGPU/ds_read2.ll
1076

The fix is further updated to take care of this - allowsMisalignedMemoryAccessesImpl() now checks the strict alignment requirement for ds_read/ds_write operations.

hsmhsm added inline comments.Mar 25 2021, 10:28 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1426

My understanding is that the function allowsMisalignedMemoryAccessesImpl() should check the strict alignment requirement for ds_read/ds_write operations. I have now updated this function accordingly.

@foad can we run it through the gfx perf suite? With the exception of one ds_write_b64 store it seems reasonable to me. Of course these splits to b8 would be unfortunate, but I assume it should not happen in real life too often and the data is usually better aligned.

llvm/test/CodeGen/AMDGPU/ds_write2.ll
849

It seems to miss the same split to b32.

arsenm added inline comments.Mar 26 2021, 10:29 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1426

You are making this stricter than the instructions really allow

rampitec added inline comments.Mar 26 2021, 10:43 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1426

That is more or less the point. I think there is some balance here between the perf drop when a wide access is used for misaligned data, and the perf loss when a narrow access is used on data for which we cannot prove alignment but which just happens to be aligned.

This seems to warrant perf testing on this patch. @cfang does this solve the perf regression you were working on?

arsenm added inline comments.Mar 29 2021, 2:56 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1426

The function reports if it's legal and works. IsFast is separate and can change, but it should still report this as a legal unaligned access

cfang added inline comments.Mar 29 2021, 4:04 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1426

Verified with SHOC SGEMM, this patch could resolve the regression.
For single precision (align4), ds_read2_b32 are generated;
For double precision (align8), ds_read2_b64 are generated.
Both are much better than the unaligned ds_read_b128 (which was the reason for the regression).

rampitec added inline comments.Mar 29 2021, 4:06 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1426

Ugh! Do you know if we can use ds_read2_b64 for align4? The main question here is: do we need to split only b128, or everything misaligned?

foad added inline comments.Mar 30 2021, 3:57 AM
llvm/test/CodeGen/AMDGPU/ds_read2.ll
603

@rampitec didn't you say that you rely on unaligned dword reads to get good performance? So I guess this change is unacceptable.

rampitec added inline comments.Mar 30 2021, 9:27 AM
llvm/test/CodeGen/AMDGPU/ds_read2.ll
603

I did, but then we got info from @cfang about a case where dword-aligned loads were causing regressions. Although according to the latest comment those were still b128 reads.

In essence the problem is that we do not always know the alignment, and data may be better aligned than declared. In this case a wider load will work better than a narrower one. I.e., when unaligned access is off, it becomes a probability question. Note that the chances of hitting the unaligned 128-bit case are higher than of hitting the unaligned 64-bit case.

That is why I have requested measurements. We know for sure that b128 is mostly a bad choice in the under-aligned case. I suspect the b8 and b16 splits are overkill, but I am not really sure about the 64-bit split into 32-bit.

arsenm requested changes to this revision.Mar 30 2021, 9:38 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1407–1408

They do not *require* the alignment for performance reasons. This should report whether it works, and isFast for whether we want to prefer it

This revision now requires changes to proceed.Mar 30 2021, 9:38 AM
foad added a comment.Apr 1 2021, 6:33 AM

@foad can we run it through the gfx perf suite? With the exception of one ds_write_b64 store it seems reasonable to me. Of course these splits to b8 would be unfortunate, but I assume it should not happen in real life too often and the data is usually better aligned.

Vulkan graphics rarely uses DS instructions. Vulkan compute uses them, but we don't have any standard set of benchmarks, and in any case the workloads would be very similar to OpenCL compute. So I don't think we need to do any Vulkan-specific benchmarking.

hsmhsm abandoned this revision.Apr 7 2021, 7:48 PM

I am abandoning this patch since an alternative patch - https://reviews.llvm.org/D100008 - has landed upstream. The next step is to do some useful performance-related experimentation on unaligned access with ds instructions before coming to a common conclusion on a more generic fix.