This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][GlobalISel] Fix 96 and 128 local loads and stores
ClosedPublic

Authored by mbrkusanin on Jun 11 2020, 2:41 AM.

Details

Summary

Fix local ds_read/write_b96/b128 so they can be selected if the alignment
allows. Otherwise, either pick appropriate ds_read2/write2 instructions or break
them down.

Diff Detail

Event Timeline

mbrkusanin created this revision.Jun 11 2020, 2:41 AM

Other than on SI, there are 96-bit DS read/write instructions. Is this just working around some later selection problem?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
275 ↗(On Diff #270081)

The address space doesn't make this special? Will this break SI?

mbrkusanin added a comment.EditedJun 12 2020, 6:35 AM

Yes, it basically avoids problems of not being able to select 3x32 for the local address space. SDag was breaking these down into a ds_read_b64 and a ds_read_b32, so I did the same thing for GlobalISel.

I've looked at the .td files and it seems that the following pattern can be added so ds_read_b96 can be selected:

foreach vt = VReg_64.RegTypes in {
  defm : DSReadPat_mc <DS_READ_B64, vt, "load_alignX_local">;
}

but I'm not sure what the minimal alignment (X) should be for this specific instruction. Any ideas? With an alignment of 4 every test passes, but otherwise we'll need to break some cases into b64/b32 pairs.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
275 ↗(On Diff #270081)

SI will be fine because hasDwordx3LoadStores will be false. The local address space uses ds_read and ds_write, so the name is slightly confusing.

> Yes, it basically avoids problems of not being able to select 3x32 for the local address space. SDag was breaking these down into a ds_read_b64 and a ds_read_b32, so I did the same thing for GlobalISel.
>
> I've looked at the .td files and it seems that the following pattern can be added so ds_read_b96 can be selected:
>
> foreach vt = VReg_64.RegTypes in {
>   defm : DSReadPat_mc <DS_READ_B64, vt, "load_alignX_local">;
> }
>
> but I'm not sure what the minimal alignment (X) should be for this specific instruction. Any ideas? With an alignment of 4 every test passes, but otherwise we'll need to break some cases into b64/b32 pairs.

I'm not sure what the alignment requirement is (I think it also depends on whether unaligned DS access is supported, which we never fully implemented in the compiler).

FYI, I am working on completely rewriting the load/store legalization rules and moving this into custom lowering.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
275 ↗(On Diff #270081)

I mean all the logic here should work without special-casing the address space.

mbrkusanin retitled this revision from AMDGPU/GlobalISel: Fix 96-bit local loads to AMDGPU/GlobalISel: Fix 96 and 128 local loads and stores.
mbrkusanin edited the summary of this revision. (Show Details)

Looking at the ISA .pdf docs for SI (gfx6) and onward, I have not found any alignment requirements for local loads and stores. There are mentions of dword alignment for reads and writes of dword and larger for buffer instructions, but nothing more specific for LDS or GDS. SDag likes to break down ds_read/write_b128 in certain cases but does not know about b96. It seems to me that the code was not updated since SI.
Now b96 and b128 will be picked for align 4 and larger (align 2 and 1 are broken down the same way as before). Furthermore, several Vulkan conformance tests that have align-4 loads and stores (96 and 128) will now pass.

An alternative solution would be to break them down into ds_read/write_b64/b32 when we're not sure. But I don't have a test to check if this is necessary.

Since all changes are in .td files, this should not cause you any problems with rewriting the load/store legality rules.

> Looking at the ISA .pdf docs for SI (gfx6) and onward, I have not found any alignment requirements for local loads and stores. There are mentions of dword alignment for reads and writes of dword and larger for buffer instructions, but nothing more specific for LDS or GDS. SDag likes to break down ds_read/write_b128 in certain cases but does not know about b96. It seems to me that the code was not updated since SI.
> Now b96 and b128 will be picked for align 4 and larger (align 2 and 1 are broken down the same way as before). Furthermore, several Vulkan conformance tests that have align-4 loads and stores (96 and 128) will now pass.

They definitely have some alignment requirements, but it's poorly documented. I'm pretty confident b128 requires 16-byte alignment and b64 requires 8. I think gfx9 added unaligned access support (dependent on some config registers), but we never fully handled all of these changes. I think the Linux driver hardcodes this to allow unaligned access.
The code was updated since SI, but probably not since 96-bit types were added to MVT.

> An alternative solution would be to break them down into ds_read/write_b64/b32 when we're not sure. But I don't have a test to check if this is necessary.

> Since all changes are in .td files, this should not cause you any problems with rewriting the load/store legality rules.

mbrkusanin edited the summary of this revision. (Show Details)

Sorry, I was away for a few days.

Now, for CI and VI, b96 and b128 have a minimum alignment of 16; gfx9 and onward have a minimum alignment of 4 (anything lower than 4 is already broken down for all targets).

arsenm added inline comments.Jun 22 2020, 12:52 PM
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
501

I think the Aligned<> subclasses didn't actually work for some reason, but maybe I only half-fixed the patterns?

llvm/lib/Target/AMDGPU/DSInstructions.td
683–706

You shouldn't need to re-consider the legalization logic. The selector can mostly assume legal inputs. If the less aligned version wasn't legal, it should have been broken down. This also depends more specifically on the unaligned features, rather than on gfx7/8.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1365–1386 ↗(On Diff #272441)

Can you commit this as a separate patch? This changes the DAG path too. I also don't think "unaligned" here means it requires 4-byte alignment.

mbrkusanin edited the summary of this revision. (Show Details)
mbrkusanin set the repository for this revision to rG LLVM Github Monorepo.

The code that was changing alignment requirements in SITargetLowering::allowsMisalignedMemoryAccessesImpl is now in D82788.

mbrkusanin marked 3 inline comments as done.Jun 29 2020, 10:24 AM
mbrkusanin added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
501

I changed it so now Aligned<> subclasses are used for both load and store. They seem to work fine.

llvm/lib/Target/AMDGPU/DSInstructions.td
683–706

I couldn't get rid of the subtarget predicates because of the way SDag uses allowsMisalignedMemoryAccessesImpl. For example, on gfx7/8, ds_read_b128 requires an alignment of 16, but we need to say that an alignment of 8 is also okay because we can pick ds_read2_b64. GISel, however, just sees that an alignment of 8 is okay and picks ds_read_b128 instead of ds_read2_b64. If both are acceptable according to DSInstructions.td, GISel will pick the first one. (If I change the order in the .td file and move it up, it will actually pick ds_read2_b64, but that breaks any structure that file had.)

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1365–1386 ↗(On Diff #272441)

Moved to D82788

mbrkusanin marked an inline comment as done.Jun 29 2020, 10:30 AM
mbrkusanin added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1328–1407

This is currently unused but is required because of the DS128Bit8ByteAligned pattern. Without the pattern, GISel does not know how to pick ds_read2/write2_b64.

Currently, if ds_read/write_b128 are not legal because of alignment, they will be broken down into 4 ds_read/write_b32 instructions and later combined in SILoadStoreOptimizer. It seems to me that it should be possible to tell ISel's lowerLoad and lowerStore to pick ds_read2/write2_b64 when appropriate with this pattern, but it would be easier if that was a separate patch.

arsenm added inline comments.Jun 29 2020, 3:54 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1328

This is basically the same as splitting the 64-bit case into read2. Can you factor that out to avoid duplicating all of this?

mbrkusanin updated this revision to Diff 275117.Jul 2 2020, 7:34 AM

Reduced duplicated code for SelectDS64Bit4ByteAligned and SelectDS128Bit8ByteAligned.

  • Rebase
  • Updated to reflect the changes in parent revision.
mbrkusanin retitled this revision from AMDGPU/GlobalISel: Fix 96 and 128 local loads and stores to [AMDGPU][GlobalISel] Fix 96 and 128 local loads and stores.
  • Rebase
mbrkusanin updated this revision to Diff 280440.EditedJul 24 2020, 6:35 AM

Moved tests from here to the parent revision (mistakenly put them in the wrong patch):

  • llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll
  • llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll
  • Rebase
  • Ping
arsenm added inline comments.Aug 10 2020, 12:13 PM
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
501

Did you double-check the generated matcher table? The problem was that the current emitter only checks one of these predicates at a time, so the pattern successfully imports but then doesn't actually perform the check. It's silently ignored.

mbrkusanin added inline comments.Aug 18 2020, 9:10 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
501
  • "GIM_CheckMemoryAlignment, /*MI*/0, /*MMO*/0, /*MinAlign*/16," and "8" does show up throughout .inc files (AMDGPUGenGlobalISel.inc) for ds_read/write instructions.
  • Aligned<> does affect produced code and other tests.
  • Regardless off what is used, "let MinAlignment = X;" or "Aligned<X>", identical .inc files are produced.
arsenm accepted this revision.Aug 18 2020, 9:13 AM
This revision is now accepted and ready to land.Aug 18 2020, 9:13 AM
  • Updated llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-local-128.mir