
[amdgpu] Add codegen support for HIP dynamic shared memory.
Needs Review, Public

Authored by hliao on Jun 24 2020, 12:53 PM.

Details

Summary
  • HIP uses an unsized extern array, extern __shared__ T s[], to declare dynamic shared memory, whose size is not known at compile time.

Event Timeline

hliao created this revision.Jun 24 2020, 12:53 PM
kpyzhov accepted this revision.Jun 24 2020, 1:04 PM
This revision is now accepted and ready to land.Jun 24 2020, 1:04 PM
arsenm requested changes to this revision.Jun 24 2020, 1:05 PM

This needs to handle GlobalISel too.

I also thought we were trying to get rid of group static size. It's broken with LDS relocations which we need to move towards. Can we just switch directly to using a relocation here?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5579

Why does it specifically need to be a 0 sized array? I think this would depend purely on the linkage, or treat any 0 sized type the same way

This revision now requires changes to proceed.Jun 24 2020, 1:05 PM
scchan added a subscriber: scchan.Jun 24 2020, 1:24 PM
scchan added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5583

Don't you need to check that the static size gives you an offset with the correct alignment?

hliao marked an inline comment as done.Jun 24 2020, 1:49 PM

I just found that change for the non-HSA/-PAL environment. I need to check how it works and how it fits into other tests. For now, this is a critical change that ensures we won't have to change the original source code too much. Is it possible to address the relocation in the longer run (say, 1-3 weeks) to avoid the tight schedule?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5579

That's the only syntax accepted by HIP-Clang. At the very least, zero-sized types should be checked so that developers understand how dynamic shared memory is used.

hliao marked an inline comment as done.Jun 24 2020, 1:50 PM
hliao added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5583

I recall that the size should always be DWORD-aligned. Let me check the code that calculates it.

t-tye added a comment.Jun 24 2020, 3:13 PM

My understanding is this feature is equivalent to the OpenCL dynamic group segment allocation. The runtime would presumably implement it in a similar way.

So the HIP runtime must take the static LDS size, round up to the alignment requirement of the dynamic allocation (OpenCL just uses the maximally aligned OpenCL data type), then add the size of the dynamic LDS. The AQL packet group segment field is set to the total LDS size.

In OpenCL there can be multiple kernel arguments, and the LDS address is passed to each. But for HIP there is only one dynamic area denoted by this weird extern. How is the dynamic LDS storage accessed? Is the address passed as an implicit kernel argument, or does the compiler implicitly use the aligned static LDS size?
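
The size computation described above can be sketched as follows. This is a minimal illustration, not HIP runtime code: the function names `alignUp` and `totalGroupSegmentSize` and their parameters are hypothetical, and the alignment is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Round Size up to the next multiple of Align.
// Align must be a nonzero power of two.
inline std::uint32_t alignUp(std::uint32_t Size, std::uint32_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Size + Align - 1) & ~(Align - 1);
}

// Hypothetical runtime-side helper (not a HIP or ROCm API): the value that
// would be written to the group segment size field of the AQL packet.
inline std::uint32_t totalGroupSegmentSize(std::uint32_t StaticLDSSize,
                                           std::uint32_t DynLDSAlign,
                                           std::uint32_t DynLDSSize) {
  // The dynamic area starts at the aligned-up end of the static area.
  std::uint32_t DynLDSOffset = alignUp(StaticLDSSize, DynLDSAlign);
  return DynLDSOffset + DynLDSSize;
}
```

For example, 100 bytes of static LDS, a 16-byte dynamic alignment, and 256 bytes of dynamic LDS would give 112 + 256 = 368 bytes in this sketch.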

I don't think this actually works since you could have multiple 0 sized objects, and they would both get the same address. I think this has to be an external relocation

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5579

That's a bit too much HIP-specific logic. Also, what does this do if there is more than one? How can these return different addresses?

5583

The global has an explicit alignment that needs to be respected

hliao added a comment.Jun 25 2020, 7:31 AM

> My understanding is this feature is equivalent to the OpenCL dynamic group segment allocation. The runtime would presumably implement it in a similar way.
>
> So the HIP runtime must take the static LDS size, round up to the alignment requirement of the dynamic allocation (OpenCL just uses the maximally aligned OpenCL data type), then add the size of the dynamic LDS. The AQL packet group segment field is set to the total LDS size.
>
> In OpenCL there can be multiple kernel arguments, and the LDS address is passed to each. But for HIP there is only one dynamic area denoted by this weird extern. How is the dynamic LDS storage accessed? Is the address passed as an implicit kernel argument, or does the compiler implicitly use the aligned static LDS size?

This is the point. To stay compatible with CUDA, a kernel that needs multiple dynamically sized arrays must declare a single extern unsized array, and the developer uses its address to divide it into multiple arrays. This is in fact quite similar to the local memory parameter in OpenCL. Thus, all extern unsized __shared__ arrays are mapped onto the same address, and the developer is responsible for dividing that storage into multiple arrays. This is in fact the same across HCC and HIP-Clang, except that we want to maximize compatibility and avoid changing source code too much.
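
To illustrate the manual partitioning described here, the following host-side sketch computes the offsets a developer might use to carve two sub-arrays out of the single dynamic buffer. The struct `DynSharedLayout` and the helper `layoutDynShared` are hypothetical names introduced for illustration; nothing here is part of HIP or of this patch.

```cpp
#include <cassert>
#include <cstddef>

// Byte offsets of two sub-arrays carved out of one dynamic buffer.
struct DynSharedLayout {
  std::size_t CharsOffset;   // start of the char sub-array
  std::size_t DoublesOffset; // start of the double sub-array
  std::size_t TotalBytes;    // dynamic size to request at launch
};

// Round Offset up to the next multiple of Align (a nonzero power of two).
inline std::size_t alignUp(std::size_t Offset, std::size_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Offset + Align - 1) & ~(Align - 1);
}

// Hypothetical helper: lay out NumChars chars followed by NumDoubles doubles
// inside the one buffer that every unsized extern array aliases.
inline DynSharedLayout layoutDynShared(std::size_t NumChars,
                                       std::size_t NumDoubles) {
  DynSharedLayout L;
  L.CharsOffset = 0;
  // The second sub-array must start at an offset suitable for its type.
  L.DoublesOffset =
      alignUp(L.CharsOffset + NumChars * sizeof(char), alignof(double));
  L.TotalBytes = L.DoublesOffset + NumDoubles * sizeof(double);
  return L;
}
```

In a kernel, the base address would be that of the single extern unsized array, each sub-array would start at base plus its offset, and the host would pass `TotalBytes` as the dynamic shared memory size at launch.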

hliao updated this revision to Diff 283833.Fri, Aug 7, 1:05 AM

Address the alignment issue.

arsenm added a comment.Fri, Aug 7, 1:36 PM

This is also missing the GlobalISel handling.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
31

I don't think this should be mutable

34

This doesn't need to be MaybeAlign, just Align. Also expand to Dynamic?

Also should probably elaborate that this is used for the case where a dynamically sized global is used

64

Rename to getAllocatedLDSSize()? I think there should be a separate method to get the size plus the roundup to the dynamic alignment

90

Set is the wrong word here; ensureDynamicLDSAlign()?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5580

This should check if the allocated size of GV->getValueType() is 0, not special case 0 sized arrays

5584

Should be the alignment of the aggregate itself, not the element type

arsenm added a comment.Fri, Aug 7, 1:49 PM

For the globalisel part, you'll need D84638 and the global lowering should introduce the intrinsic, not the machine pseudo

arsenm added inline comments.Fri, Aug 7, 2:01 PM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
34

This also needs to be added to the MachineFunctionInfo serialization

65

I think having this here actually breaks the calculation of the total size for all of the statically known globals

hliao updated this revision to Diff 284255.Sun, Aug 9, 11:11 PM

Add GlobalISel and MIR support.

hliao marked an inline comment as done.Sun, Aug 9, 11:15 PM
hliao added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
64

To report LDS usage accurately, the final LDSSize needs to include the padding due to the dynamic shared memory alignment. The re-alignment of LDS is now done explicitly before the final instruction selection.

65

The re-alignment is done explicitly, and only once, just before final instruction selection. BTW, getLDSSize should only be valid after instruction selection.

arsenm added inline comments.Mon, Aug 10, 12:54 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2808–2811

I think these should remain distinct queries/fields, not fixed up at an arbitrary point. GlobalISel will miss this for example. The asm printer would query the kind that accounts for the dynamic padding

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2228

This should not special case 0 sized arrays and should check the allocated size

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
38

This can still just be Align

92

This isn't a set since it does more than set the value. ensureDynamicLDSAlign?

98–99

This shouldn't be a mutation, but return the aligned up size.

totalLDSAllocSize()?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5580

This should not special case 0 sized arrays. This is 0 allocation size

llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll
50

The default should be 1

hliao marked an inline comment as done.Mon, Aug 10, 1:05 PM
hliao added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2808–2811

GlobalISel calls adjustLDSSizeForDynLDSAlign similarly before the finalization of ISel.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2228

I tend to be restrictive here to follow how this is used in HIP; a zero-sized array always has zero allocated size. If other clients need similar usage with general zero-allocation types, we can extend this accordingly.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
98–99

Don't we need to report LDS usage accurately? Does the alignment padding need to be counted in the final LDSSize as well?

arsenm added inline comments.Mon, Aug 10, 1:07 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2808–2811

Having extra state that needs to be made consistent is bad. It's better to just track the two independent fields and do the roundup when needed at the end
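
The design suggested here, two independent fields with the roundup computed on demand, might look like the sketch below. It is a standalone model rather than the actual AMDGPUMachineFunction code: the method names `ensureDynamicLDSAlign` and `totalLDSAllocSize` come from the suggestions in this review, while the struct `LDSInfo`, its fields, and `allocateStatic` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Standalone model of the bookkeeping; not the real LLVM class.
struct LDSInfo {
  std::uint32_t StaticLDSSize = 0; // bytes used by statically sized globals
  std::uint32_t DynLDSAlign = 1;   // max alignment of dynamically sized globals

  // Round Size up to the next multiple of Align (a nonzero power of two).
  static std::uint32_t alignUp(std::uint32_t Size, std::uint32_t Align) {
    assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
    return (Size + Align - 1) & ~(Align - 1);
  }

  // Allocate a statically sized global and return its offset.
  std::uint32_t allocateStatic(std::uint32_t Size, std::uint32_t Align) {
    std::uint32_t Offset = alignUp(StaticLDSSize, Align);
    StaticLDSSize = Offset + Size;
    return Offset;
  }

  // Only ever raises the recorded alignment, hence "ensure" rather than "set".
  void ensureDynamicLDSAlign(std::uint32_t Align) {
    if (Align > DynLDSAlign)
      DynLDSAlign = Align;
  }

  // Pure query: static size plus padding up to the dynamic alignment.
  // It does not modify StaticLDSSize, so it can be called at any point.
  std::uint32_t totalLDSAllocSize() const {
    return alignUp(StaticLDSSize, DynLDSAlign);
  }
};
```

Because `totalLDSAllocSize()` is a const query, later static allocations still see the unpadded `StaticLDSSize`, and no separate fix-up step has to run at a particular point in instruction selection.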

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2228

Types don't mean anything. Any 0 sized globals are getting allocated to the same address. We're just going to miscompile other 0 sized types. We have no reason to treat other 0 sized types differently

2230–2231

Should use the aggregate alignment, not the element