Details
Diff Detail
- Build Status
Buildable 13487 Build 13487: arc lint + arc unit
Event Timeline
Version 2: As discussed on IRC.
HSA is the only one that cares about unaligned loads. Unaligned loads are
broken on all other targets.
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
153 | Would it be better to add a query for whether unaligned accesses were supported? Currently it would only be for AmdHsaOS but would allow it to be easily changed in the future. |
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
153 | We already have that. The problem is working around a variety of issues derived from operation legality not being specified per address space of |
Version 3:
- Much simpler.
- Addrspacecast is never inserted.
- Unaligned loads are unaffected (work fine).
- Only scalar loads support 32-bit pointers. An address in a VGPR will fail to compile.
- D41715 (amdgpu.uniform on loads) is required for enforce scalar loads in some cases.
This needs to update AMDGPUAliasAnalysis. Also needs more test coverage. I don't see this testing unaligned access or some of the other places it was added.
This part is problematic:
"Only scalar loads support 32-bit pointers. An address in a VGPR will fail to compile."
"D41715 (amdgpu.uniform on loads) is required for enforce scalar loads in some cases."
We can't rely on metadata to be able to compile
lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
---|---|---|
1446 | I think this needs to be _XEXEC |
I added CONSTANT_ADDRESS_32BIT to most places that used CONSTANT_ADDRESS. I don't know whether that's necessary or not.
This part is problematic:
"Only scalar loads support 32-bit pointers. An address in a VGPR will fail to compile."
"D41715 (amdgpu.uniform on loads) is required for enforce scalar loads in some cases."We can't rely on metadata to be able to compile
We need to rely on metadata. VMEM loads are simply unsupported because we don't need them with the 32-bit address space. Unaligned loads are also untested and unsupported because we don't need them either.
Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.
Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all uses cases we need for Mesa.
I don't know whether handling CONSTANT_ADDRESS_32BIT everywhere
is necessary. Comments welcome.
This needs documentation in AMDGPUUsage.rst.
Relying on metadata for correctness is indeed not okay. We should either say that CONSTANT_ADDRESS_32BIT just assumes uniformness, and move the address to an SGPR (via v_readfirstlane) if required, *or* support this also with VMEM instructions.
As far as I understand, the point of this change is to use 32-bit pointers for descriptor tables. It doesn't seem too far-fetched that we'll eventually have to supported extensions with divergent resource descriptors, so I vaguely prefer the second solution.
The other question is, why do we need a new address space at all? Can't we synthesize an appropriate pointer via inttoptr casts? I believe this is what SCPC is doing.
Nicolai, I think you mean LLPC is synthesizing a pointer using inttoptr. That is true, but Marek pointed out to us that it is a bad thing because it means alias analysis cannot do such a good job. So we're interested in using 32 bit pointers instead.
For LLPC we need to extend a 32 bit pointer to 64 bit either by a value supplied in an option/feature, or by using the high half of s_getpc. But we don't need to get that into this change -- we can add it later, as long as this change is structured to allow it.
Here is why relying on metadata is OK.
The behavior of 64-bit pointers:
- If the address is in VGPRs and amdgpu.uniform is not dropped, you'll get readfirstlane and correct behavior.
- If the address is in VGPRs and amdgpu.uniform is dropped by a random pass, you'll get SMEM opcodes reading descriptors from VGPRs, so you'll get an invalid binary without an error and a GPU hang.
The behavior for 32-bit pointers:
- If the address is in VGPRs and amdgpu.uniform is not dropped, you'll get readfirstlane and correct behavior.
- If the address is in VGPRs and amdgpu.uniform is dropped by a random pass, you'll get a compile error.
Therefore, 32-bit pointers are a significant improvement in compiler behavior over 64-bit pointers. The current implementation covers everything Mesa will ever need. 32-bit pointers in VMEM opcodes would be a bonus, but it would also be useless for Mesa.
As far as I understand, the point of this change is to use 32-bit pointers for descriptor tables. It doesn't seem too far-fetched that we'll eventually have to supported extensions with divergent resource descriptors, so I vaguely prefer the second solution.
Game developers will be advised to use the readfirstlane intrinsic in a loop, as has happened in the past. As long as AMD doesn't support divergent resource descriptors in other drivers, we are fine.
The other question is, why do we need a new address space at all? Can't we synthesize an appropriate pointer via inttoptr casts? I believe this is what SCPC is doing.
The short story is: We should never use inttoptr if InstCombine can't remove it. inttoptr is unoptimizable by LLVM.
Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.
Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all uses cases we need for Mesa.
Updated documentation.
This is exactly why it's not OK? If it's dropped you get a compile error or miscompile
I'd rather get a compile error than miscompile. Only 64-bit pointers miscompile. This patch makes sure that 32-bit pointers do not miscompile. We've always been relying on metadata for correctness. This patch doesn't change that, nor does it intend to.
I get it. Some people want VMEM with 32-bit pointers. That is, however, not so easy for me. This patch implements the subset of 32-bit pointer functionality that Mesa will use, and only that. I've been using it and testing it for ~26 days now.
Just to clarify, do you mean that the 64-bit pointer will be used in a VMEM load, which would (typically) load a descriptor into VGPRs and then that descriptor would be fed as VGPRs into a MUBUF or MIMG instruction? We should fix that and automatically introduce v_readfirstlanes. I actually thought I had fixed that bug at some point in the past, but maybe it has reappeared :/
I agree that unlike silent miscompilation, compile errors are something we could potentially live with today. But if moving to VMEM is too difficult, I think it would still be better to declare that CONSTANT_ADDRESS_32BIT can only be used with dynamically uniform addresses, and just introduce a v_readfirstlane regardless of whether the metadata is there.
The other question is, why do we need a new address space at all? Can't we synthesize an appropriate pointer via inttoptr casts? I believe this is what SCPC is doing.
The short story is: We should never use inttoptr if InstCombine can't remove it. inttoptr is unoptimizable by LLVM.
Hmm yeah, I vaguely remember this now, although I have to admit that I never properly understood the details. I suppose adding the address space is easier than fighting the fight at the generic LLVM level.
32-bit loads are always considered uniform and so are always translated
to s_loads with possible v_readfirstlane.
If you don't want uniformity, expand the pointer into a descriptor
manually and use SI.load.const.
llvm/trunk/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp | ||
---|---|---|
232 ↗ | (On Diff #133216) | Is it still uniform even if depends on divergent data? %tid = tail call i32 @llvm.amdgcn.workgroup.id.x() %gep = getelementptr i32 addrspace(6)* %tid %val = load i32, i32 addrspace(6)* %gep This is not correct |
llvm/trunk/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp | ||
---|---|---|
232 ↗ | (On Diff #133216) | The address space implies uniformity and is geared towards shader resource descriptor loads where uniformity is required. Non-uniform addresses result in v_readfirstlane. In the future, the implementation can be extended to support non-divergent data and VMEM loads/stores/atomics. |
In fact v_readfirstlane is inserted by the ISel to glue vector input to the unexpected scalar instruction.
This means that compiler user writing valid IR will get unexpected behavior.
Is this documented somewhere?
My objections WRT implementation are:
Bypassing the normal way of processing values divergence is misleading. I was very much surprised to see "amdgpu.uniform" metadata already set at the point (AMDGPUAnnotateUniformValues) where they are expected to be queried from DA.
Moreover they were set for the value that is reported by DA as divergent!
The correct place to do this is TargetTransformInfo::isAlwaysUniform hook that I added specifically for handling target-specific features (like readfirstlane itself BTW).
Using that hook lets the DA to process values that produce uniform result irrelative of their operands divergence correctly. Then the DA computes the divergence for such exceptions in normal way and no hackery on metadata is needed.
We just query the divergence in AMDGPUAnnotateUniformValues as we did before and set metadata accordingly.
The address space is meant to be used for loading shader resource descriptors. It's not a general-purpose address space.
I think this needs to be _XEXEC