This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add 32-bit constant address space
ClosedPublic

Authored by mareko on Dec 31 2017, 8:09 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

mareko created this revision.Dec 31 2017, 8:09 PM

FYI, I'd like to get this into LLVM 6.0 if it's OK.

mareko updated this revision to Diff 128485.Jan 2 2018, 7:42 PM

Version 2: As discussed on IRC.

HSA is the only target that cares about unaligned loads. Unaligned loads are
broken on all other targets.

t-tye added inline comments.Jan 2 2018, 8:08 PM
lib/Target/AMDGPU/SIISelLowering.cpp
153 ↗(On Diff #128485)

Would it be better to add a query for whether unaligned accesses are supported? Currently it would only be true for AmdHsaOS, but it would allow this to be easily changed in the future.

arsenm added inline comments.Jan 2 2018, 8:30 PM
lib/Target/AMDGPU/SIISelLowering.cpp
153 ↗(On Diff #128485)

We already have that. The problem is working around a variety of issues derived from operation legality not being specified per address space.

mareko updated this revision to Diff 128561.Jan 3 2018, 1:47 PM

Version 3:

  • Much simpler.
  • Addrspacecast is never inserted.
  • Unaligned loads are unaffected (work fine).
  • Only scalar loads support 32-bit pointers; an address in a VGPR will fail to compile (see the sketch after this list).
  • D41715 (amdgpu.uniform on loads) is required to enforce scalar loads in some cases.
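
A minimal sketch of what that restriction means in IR (hypothetical function and names; typed-pointer syntax of this era, datalayout omitted): with the amdgpu.uniform annotation the address stays in SGPRs and the load selects to an s_load; without it, the address may end up in a VGPR and compilation fails instead of falling back to a VMEM load.

define amdgpu_ps float @load_const32(float addrspace(6)* inreg %ptr) {
  %gep = getelementptr float, float addrspace(6)* %ptr, i32 4
  %val = load float, float addrspace(6)* %gep, !amdgpu.uniform !0
  ret float %val
}

!0 = !{}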
mareko marked 2 inline comments as done.Jan 4 2018, 9:09 AM
arsenm requested changes to this revision.Jan 9 2018, 7:46 AM

This needs to update AMDGPUAliasAnalysis. It also needs more test coverage; I don't see this testing unaligned access or some of the other places it was added.

This part is problematic:
"Only scalar loads support 32-bit pointers. An address in a VGPR will fail to compile."
"D41715 (amdgpu.uniform on loads) is required for enforce scalar loads in some cases."

We can't rely on metadata to be able to compile.

lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1447 ↗(On Diff #128561)

I think this needs to be _XEXEC

This revision now requires changes to proceed.Jan 9 2018, 7:46 AM
mareko added a comment.Jan 9 2018, 4:38 PM

This needs to update AMDGPUAliasAnalysis. It also needs more test coverage; I don't see this testing unaligned access or some of the other places it was added.

I added CONSTANT_ADDRESS_32BIT to most places that used CONSTANT_ADDRESS. I don't know whether that's necessary or not.

This part is problematic:
"Only scalar loads support 32-bit pointers. An address in a VGPR will fail to compile."
"D41715 (amdgpu.uniform on loads) is required for enforce scalar loads in some cases."

We can't rely on metadata to be able to compile.

We need to rely on metadata. VMEM loads are simply unsupported because we don't need them with the 32-bit address space. Unaligned loads are also untested and unsupported because we don't need them either.

mareko added a comment.Jan 9 2018, 4:41 PM
This comment was removed by mareko.
mareko updated this revision to Diff 129282.Jan 10 2018, 8:50 AM

Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.

Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all use cases we need for Mesa.

I don't know whether handling CONSTANT_ADDRESS_32BIT everywhere
is necessary. Comments welcome.

mareko marked an inline comment as done.Jan 15 2018, 7:40 AM

This needs documentation in AMDGPUUsage.rst.

Relying on metadata for correctness is indeed not okay. We should either say that CONSTANT_ADDRESS_32BIT just assumes uniformity, and move the address to an SGPR (via v_readfirstlane) if required, *or* also support this with VMEM instructions.

As far as I understand, the point of this change is to use 32-bit pointers for descriptor tables. It doesn't seem too far-fetched that we'll eventually have to support extensions with divergent resource descriptors, so I vaguely prefer the second solution.

The other question is, why do we need a new address space at all? Can't we synthesize an appropriate pointer via inttoptr casts? I believe this is what SCPC is doing.

tpr added a comment.Jan 24 2018, 10:40 AM

Nicolai, I think you mean LLPC is synthesizing a pointer using inttoptr. That is true, but Marek pointed out to us that it is a bad thing, because it means alias analysis cannot do as good a job. So we're interested in using 32-bit pointers instead.

tpr added a comment.Jan 24 2018, 10:43 AM

For LLPC, we need to extend a 32-bit pointer to 64 bits either with a value supplied in an option/feature, or by using the high half of s_getpc. But we don't need to get that into this change; we can add it later, as long as this change is structured to allow it.

This needs documentation in AMDGPUUsage.rst.

Relying on metadata for correctness is indeed not okay. We should either say that CONSTANT_ADDRESS_32BIT just assumes uniformity, and move the address to an SGPR (via v_readfirstlane) if required, *or* also support this with VMEM instructions.

Here is why relying on metadata is OK.

The behavior of 64-bit pointers:

  • If the address is in VGPRs and amdgpu.uniform is not dropped, you'll get readfirstlane and correct behavior.
  • If the address is in VGPRs and amdgpu.uniform is dropped by a random pass, you'll get SMEM opcodes reading descriptors from VGPRs, so you'll get an invalid binary without an error and a GPU hang.

The behavior for 32-bit pointers:

  • If the address is in VGPRs and amdgpu.uniform is not dropped, you'll get readfirstlane and correct behavior.
  • If the address is in VGPRs and amdgpu.uniform is dropped by a random pass, you'll get a compile error.

Therefore, 32-bit pointers are a significant improvement in compiler behavior over 64-bit pointers. The current implementation covers everything Mesa will ever need. Support for 32-bit pointers in VMEM opcodes would be a bonus, but it would also be useless for Mesa.

As far as I understand, the point of this change is to use 32-bit pointers for descriptor tables. It doesn't seem too far-fetched that we'll eventually have to support extensions with divergent resource descriptors, so I vaguely prefer the second solution.

Game developers will be advised to use the readfirstlane intrinsic in a loop, as has happened in the past. As long as AMD doesn't support divergent resource descriptors in other drivers, we are fine.
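
For illustration, a hedged sketch of that pattern (hypothetical function; the control flow is illustrative only, and a production loop would also mask off lanes as they complete): readfirstlane picks the value of the first active lane, lanes whose value matches proceed, and the remaining lanes take another trip through the loop.

declare i32 @llvm.amdgcn.readfirstlane(i32)

define amdgpu_ps void @waterfall(i32 %divergent_index) {
entry:
  br label %loop
loop:
  ; pick one lane's value; matching lanes proceed, the rest loop again
  %uniform = call i32 @llvm.amdgcn.readfirstlane(i32 %divergent_index)
  %match = icmp eq i32 %uniform, %divergent_index
  br i1 %match, label %body, label %loop
body:
  ; ...use %uniform as a now-uniform descriptor index...
  ret void
}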

The other question is, why do we need a new address space at all? Can't we synthesize an appropriate pointer via inttoptr casts? I believe this is what SCPC is doing.

The short story is: We should never use inttoptr if InstCombine can't remove it. inttoptr is unoptimizable by LLVM.
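
For comparison, a hedged sketch of the two styles (hypothetical functions; assuming the post-remap numbering in which addrspace(4) is the 64-bit constant space). In the inttoptr form the pointer's provenance is opaque to alias analysis; in the addrspace(6) form it stays visible.

; inttoptr route, roughly what LLPC does today:
define amdgpu_ps float @via_inttoptr(i32 inreg %addr32) {
  %addr64 = zext i32 %addr32 to i64
  %ptr = inttoptr i64 %addr64 to float addrspace(4)*
  %val = load float, float addrspace(4)* %ptr
  ret float %val
}

; 32-bit constant address space route:
define amdgpu_ps float @via_addrspace6(float addrspace(6)* inreg %ptr) {
  %val = load float, float addrspace(6)* %ptr, !amdgpu.uniform !0
  ret float %val
}

!0 = !{}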

mareko updated this revision to Diff 131490.Jan 25 2018, 11:55 AM

Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.

Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all use cases we need for Mesa.

Updated documentation.

This needs documentation in AMDGPUUsage.rst.

Relying on metadata for correctness is indeed not okay. We should either say that CONSTANT_ADDRESS_32BIT just assumes uniformity, and move the address to an SGPR (via v_readfirstlane) if required, *or* also support this with VMEM instructions.

Here is why relying on metadata is OK.

The behavior of 64-bit pointers:

  • If the address is in VGPRs and amdgpu.uniform is not dropped, you'll get readfirstlane and correct behavior.
  • If the address is in VGPRs and amdgpu.uniform is dropped by a random pass, you'll get SMEM opcodes reading descriptors from VGPRs, so you'll get an invalid binary without an error and a GPU hang.

The behavior for 32-bit pointers:

  • If the address is in VGPRs and amdgpu.uniform is not dropped, you'll get readfirstlane and correct behavior.
  • If the address is in VGPRs and amdgpu.uniform is dropped by a random pass, you'll get a compile error.

This is exactly why it's not OK. If it's dropped, you get a compile error or a miscompile.

This is exactly why it's not OK. If it's dropped, you get a compile error or a miscompile.

I'd rather get a compile error than a miscompile. Only 64-bit pointers miscompile; this patch makes sure that 32-bit pointers do not. We've always relied on metadata for correctness. This patch doesn't change that, nor does it intend to.

I get it: some people want VMEM with 32-bit pointers. That is, however, not so easy for me. This patch implements the subset of 32-bit pointer functionality that Mesa will use, and only that. I've been using and testing it for ~26 days now.

Here is why relying on metadata is OK.

The behavior of 64-bit pointers:

  • If the address is in VGPRs and amdgpu.uniform is not dropped, you'll get readfirstlane and correct behavior.
  • If the address is in VGPRs and amdgpu.uniform is dropped by a random pass, you'll get SMEM opcodes reading descriptors from VGPRs, so you'll get an invalid binary without an error and a GPU hang.

Just to clarify, do you mean that the 64-bit pointer will be used in a VMEM load, which would (typically) load a descriptor into VGPRs and then that descriptor would be fed as VGPRs into a MUBUF or MIMG instruction? We should fix that and automatically introduce v_readfirstlanes. I actually thought I had fixed that bug at some point in the past, but maybe it has reappeared :/

I agree that unlike silent miscompilation, compile errors are something we could potentially live with today. But if moving to VMEM is too difficult, I think it would still be better to declare that CONSTANT_ADDRESS_32BIT can only be used with dynamically uniform addresses, and just introduce a v_readfirstlane regardless of whether the metadata is there.

The other question is, why do we need a new address space at all? Can't we synthesize an appropriate pointer via inttoptr casts? I believe this is what SCPC is doing.

The short story is: We should never use inttoptr if InstCombine can't remove it. inttoptr is unoptimizable by LLVM.

Hmm yeah, I vaguely remember this now, although I have to admit that I never properly understood the details. I suppose adding the address space is easier than fighting the fight at the generic LLVM level.

mareko updated this revision to Diff 132216.Jan 31 2018, 11:02 AM

32-bit loads are always considered uniform and so are always translated
to s_loads, with a possible v_readfirstlane.

If you don't want uniformity, expand the pointer into a descriptor
manually and use SI.load.const.
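
A hedged sketch of that escape hatch (hypothetical function; assuming the historical llvm.SI.load.const signature with a <16 x i8> resource descriptor; building the descriptor from the 32-bit address is elided):

declare float @llvm.SI.load.const(<16 x i8>, i32)

define amdgpu_ps float @nonuniform_load(<16 x i8> inreg %desc, i32 %offset) {
  ; %offset may be divergent; the descriptor is constructed manually
  %val = call float @llvm.SI.load.const(<16 x i8> %desc, i32 %offset)
  ret float %val
}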

This revision was not accepted when it landed; it landed in state Needs Review.Feb 7 2018, 8:05 AM
This revision was automatically updated to reflect the committed changes.
alex-t added a subscriber: alex-t.Feb 14 2018, 11:48 AM
alex-t added inline comments.
llvm/trunk/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
232

Is it still uniform even if it depends on divergent data?
Like this:

%tid = tail call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr i32, i32 addrspace(6)* %base, i32 %tid
%val = load i32, i32 addrspace(6)* %gep

This is not correct. Moreover, it violates the Divergence Analysis results.

mareko added inline comments.Feb 14 2018, 3:18 PM
llvm/trunk/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
232

The address space implies uniformity and is geared towards shader resource descriptor loads, where uniformity is required. Non-uniform addresses result in v_readfirstlane. In the future, the implementation can be extended to support divergent data and VMEM loads/stores/atomics.

In fact, v_readfirstlane is inserted by ISel to glue a vector input to the unexpected scalar instruction.
This means that a compiler user writing valid IR will get unexpected behavior.
Is this documented somewhere?

My objections regarding the implementation are:
Bypassing the normal way of processing value divergence is misleading. I was very surprised to see the "amdgpu.uniform" metadata already set at the point (AMDGPUAnnotateUniformValues) where it is expected to be queried from the DA.
Moreover, it was set for a value that the DA reports as divergent!

The correct place to do this is the TargetTransformInfo::isAlwaysUniform hook, which I added specifically for handling target-specific features (like readfirstlane itself, by the way).
Using that hook lets the DA correctly process values that produce a uniform result regardless of their operands' divergence. The DA then computes the divergence for such exceptions in the normal way, and no hackery with metadata is needed.
We would just query the divergence in AMDGPUAnnotateUniformValues as we did before and set the metadata accordingly.

In fact, v_readfirstlane is inserted by ISel to glue a vector input to the unexpected scalar instruction.
This means that a compiler user writing valid IR will get unexpected behavior.
Is this documented somewhere?

My objections regarding the implementation are:
Bypassing the normal way of processing value divergence is misleading. I was very surprised to see the "amdgpu.uniform" metadata already set at the point (AMDGPUAnnotateUniformValues) where it is expected to be queried from the DA.
Moreover, it was set for a value that the DA reports as divergent!

The address space is meant to be used for loading shader resource descriptors. It's not a general-purpose address space.
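
For concreteness, a hedged sketch of that intended use (hypothetical function): loading a <4 x i32> buffer resource descriptor from a descriptor table addressed with a 32-bit pointer.

define amdgpu_ps <4 x i32> @load_desc(<4 x i32> addrspace(6)* inreg %table, i32 inreg %idx) {
  %gep = getelementptr <4 x i32>, <4 x i32> addrspace(6)* %table, i32 %idx
  %desc = load <4 x i32>, <4 x i32> addrspace(6)* %gep, !amdgpu.uniform !0
  ret <4 x i32> %desc
}

!0 = !{}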