Fix local ds_read/write_b96/b128 so they can be selected if the alignment
allows. Otherwise, either pick appropriate ds_read2/write2 instructions or break
them down.
Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
Other than on SI, there are 96-bit DS reads and writes. Is this just working around some later selection problem?
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:275 ↗ (On Diff #270081)
The address space doesn't make this special? This will break SI?
Yes, it basically avoids the problem of not being able to select 3x32 for the local address space. SDag was breaking these down to a ds_read_b64 and a ds_read_b32, so I did the same thing for GlobalISel.
I've looked at the .td files and it seems the following pattern can be added so that ds_read_b96 can be selected:
foreach vt = VReg_64.RegTypes in { defm : DSReadPat_mc <DS_READ_B64, vt, "load_alignX_local">; }
but I'm not sure what the minimal alignment (X) should be for this specific instruction. Any ideas? With an alignment of 4 every test will pass, but otherwise we'll need to break some cases into b64/b32 pairs.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:275 ↗ (On Diff #270081)
SI will be fine because hasDwordx3LoadStores will be false. Local address space uses ds_read and ds_write, so the name is slightly confusing.
I'm not sure what the alignment requirement is (I think it also depends on whether unaligned DS access is supported, which we never fully implemented in the compiler).
FYI, I am working on completely rewriting the load/store legalization rules and moving this into custom lowering.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:275 ↗ (On Diff #270081)
I mean all the logic here should work without special-casing the address space.
Looking at the ISA PDF docs for SI (gfx6) and onward, I have not found any alignment requirements for local loads and stores. There are mentions of dword alignment for reads and writes of a dword and larger for buffer instructions, but nothing more specific for LDS or GDS. SDag likes to break down ds_read/write_b128 in certain cases but does not know about b96. It seems to me the code has not been updated since SI.
Now b96 and b128 will be picked for align 4 and larger (align 2 and 1 are broken down the same way as before). Furthermore, there are several Vulkan conformance tests with align-4 loads and stores (96- and 128-bit) that will now pass.
An alternative solution would be to break them down to ds_read/write_b64/b32 when we're not sure, but I don't have a test to check whether this is necessary.
Since all changes are in .td files, this should not cause you any problems with rewriting the load/store legality rules.
They definitely have some alignment requirements, but it's poorly documented. I'm pretty confident b128 requires 16-byte alignment, and b64 requires 8. I think gfx9 added unaligned access support (dependent on some config registers), but I think we never fully handled all of these changes. I think the linux driver hardcodes this to allow unaligned access.
The code was updated since SI, but probably not since 96-bit types were added to MVT.
Sorry, I was away for a few days.
Now, for CI and VI, b96 and b128 have a minimum alignment of 16; gfx9 and onward have a minimum alignment of 4 (anything lower than 4 is already broken down for all targets).
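The per-target rules above can be summarized as a small decision function. This is only an illustrative sketch, not LLVM code: the name `selectDSLoad128` and the `Gen` enum are hypothetical, and "split" stands for breaking the access into smaller b64/b32 pieces as discussed.

```cpp
#include <string>

// Hypothetical generations; SI = gfx6, CI/VI = gfx7/8.
enum class Gen { GFX6, GFX7, GFX8, GFX9Plus };

// Sketch: pick a DS opcode for a 128-bit local load given the
// subtarget generation and the access alignment (in bytes).
std::string selectDSLoad128(Gen gen, unsigned alignBytes) {
  if (gen == Gen::GFX6)
    return "split"; // SI has no ds_read_b96/b128 at all.
  // gfx7/8 need 16-byte alignment for ds_read_b128; gfx9+ allow 4.
  unsigned minAlign = (gen == Gen::GFX9Plus) ? 4 : 16;
  if (alignBytes >= minAlign)
    return "ds_read_b128";
  if (alignBytes >= 8)
    return "ds_read2_b64"; // two naturally aligned 64-bit halves
  return "split"; // align 4 and below: broken into ds_read_b32s
}
```

For example, an align-8 b128 load on gfx8 would come out as ds_read2_b64, while the same load on gfx9 selects ds_read_b128 directly.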
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td:501
I think the Aligned<> subclasses didn't actually work for some reason, but I only half fixed the patterns maybe?

llvm/lib/Target/AMDGPU/DSInstructions.td:683–706
You shouldn't need to re-consider the legalization logic. The selector can mostly assume legal inputs. If the less aligned version wasn't legal, it should have been broken down. This also depends more specifically on the unaligned features, rather than gfx78.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1365–1386 ↗ (On Diff #272441)
Can you commit this as a separate patch? This changes the DAG path too. I also don't think "unaligned" here means requires 4-byte alignment.

The code that was changing alignment requirements in SITargetLowering::allowsMisalignedMemoryAccessesImpl is now in D82788.
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td:501
I changed it so the Aligned<> subclasses are now used for both load and store. They seem to work fine.

llvm/lib/Target/AMDGPU/DSInstructions.td:683–706
I couldn't get rid of the subtarget predicates because of the way SDag uses allowsMisalignedMemoryAccessesImpl. For example, on gfx7/8, ds_read_b128 requires an alignment of 16, but we need to say that an alignment of 8 is also okay because we can pick ds_read2_b64. GISel, however, just sees that an alignment of 8 is okay and picks ds_read_b128 instead of ds_read2_b64. If both are acceptable according to DSInstructions.td, GISel will pick the first one. (If I change the order in the .td file and move it up, it will actually pick ds_read2_b64, but that breaks any structure that file had.)
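For the ds_read2_b64 case, the two accesses are encoded as 8-bit offset fields scaled by the 8-byte element size. A minimal sketch of that address split (the helper name `splitToRead2B64` is illustrative, not the actual SelectDS128Bit8ByteAligned implementation):

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Sketch: split a 16-byte LDS access at byte offset `off` into the two
// consecutive 64-bit slots of a ds_read2_b64. Each offset field is an
// 8-bit value counted in 8-byte elements, so the byte offset must be
// 8-byte aligned and the second slot must still fit in 8 bits.
std::optional<std::pair<uint8_t, uint8_t>> splitToRead2B64(uint32_t off) {
  if (off % 8 != 0)
    return std::nullopt;            // needs 8-byte alignment
  uint32_t slot = off / 8;
  if (slot + 1 > 255)
    return std::nullopt;            // offset fields are 8 bits wide
  return std::make_pair(uint8_t(slot), uint8_t(slot + 1));
}
```

So a 16-byte access at byte offset 16 becomes offset0 = 2, offset1 = 3, while an offset that is not 8-byte aligned (or too large for the 8-bit fields) forces a different lowering.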
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1365–1386 ↗ (On Diff #272441)
Moved to D82788.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp:1328–1407
This is currently unused but is required because of the DS128Bit8ByteAligned pattern. Without the pattern, GISel does not know how to pick ds_read2/write2_b64. Currently, if ds_read/write_b128 are not legal because of alignment, they will be broken down to four ds_read/write_b32 instructions and later combined in SILoadStoreOptimizer. It seems to me that it should be possible to tell ISel's lowerLoad and lowerStore to pick ds_read2/write2_b64 when appropriate with this pattern, but it would be easier if that was a separate patch.
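The fallback path described above is mechanical: a 128-bit access that is only dword-aligned becomes four consecutive dword accesses. A trivial sketch, with hypothetical names (this is not the actual lowering code):

```cpp
#include <array>
#include <cstdint>

// Byte offset of one ds_read_b32 / ds_write_b32 piece.
struct DwordAccess { uint32_t offset; };

// Sketch: break a 128-bit local access at `baseOff` into four dword
// accesses; SILoadStoreOptimizer may later merge adjacent ones back.
std::array<DwordAccess, 4> splitB128ToB32(uint32_t baseOff) {
  return {{{baseOff}, {baseOff + 4}, {baseOff + 8}, {baseOff + 12}}};
}
```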
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp:1328
This is basically the same as splitting the 64-bit case into read2. Can you factor that out to avoid duplicating all of this?
Reduced the duplicated code between SelectDS64Bit4ByteAligned and SelectDS128Bit8ByteAligned.
Moved tests from here to the parent revision (mistakenly put them in the wrong patch):
- llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll
- llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td:501
Did you double-check the generated matcher table? The problem was the current emitter only checks one of these predicates at a time, so it successfully imports but then doesn't actually perform the check. It's silently ignored.