This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/GlobalISel: Widen non-power-of-2 load results
ClosedPublic

Authored by arsenm on Jan 21 2020, 11:24 AM.

Details

Summary

Load extra bits if suitably aligned. This allows using widened
3-vector loads on SI, and fixes legalization for <9 x s32> (which LSV
apparently forms frequently on lowered kernel argument lists).

Fix incorrectly treating these as legal on SI. This should emit a
64-bit store and a 32-bit store.

I think all of the load and store rules are just about complete, but
due for a rewrite.

Diff Detail

Event Timeline

arsenm created this revision.Jan 21 2020, 11:24 AM
arsenm updated this revision to Diff 239408.Jan 21 2020, 12:39 PM
arsenm edited the summary of this revision. (Show Details)

Fix more incorrect SI handling, handle more 96-bit splits

A couple of notes, in addition to the inline comment:

  • On hardware that supports unaligned loads (CI+), we should just keep dword loads if the load has align=1 (no known alignment at all). This will execute much faster when the pointer happens to be aligned, and will still be faster in the unaligned case. The unaligned-support should be a subtarget feature, because Windows KMD is apparently unable to set the relevant register setting to enable the hardware feature.
  • The same may be true when align=2.
  • All of this probably only applies to global (and possibly local) loads, because loading from scratch has more limitation because of the swizzling.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
720–722

As long as the alignment and size are both at least 32, I believe we can always support it. E.g., a dwordx4 load from a pointer that's 4-byte aligned is okay.

Though... admittedly in that case you may end up loading an additional cache line which you otherwise wouldn't load, because you cross a cacheline boundary. So maybe the right decision is to keep the test as is?

Either way, the decision ought to be documented as a comment.

arsenm marked an inline comment as done.Feb 5 2020, 5:58 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
720–722

This isn't really changing the handling of alignment breakdown, or even the memory access. This is only for the result register type, and just needs to avoid doing something incorrect. The decisions for decomposing based on alignment is already present in the other legalization rules. The rule set here is pretty confusing, and soon due for a rewrite after a few more issues are fixed in the LegalizerHelper

nhaehnle accepted this revision.Feb 12 2020, 2:40 AM
nhaehnle added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
720–722

Okay, fair enough.

This revision is now accepted and ready to land.Feb 12 2020, 2:40 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir