- HIP uses an unsized extern array extern __shared__ T s[] to declare the dynamic shared memory, which size is not known at the compile time.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Needs to handle globalisel too.
I also thought we were trying to get rid of group static size. It's broken with LDS relocations which we need to move towards. Can we just switch directly to using a relocation here?
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
5584 | Why does it specifically need to be a 0 sized array? I think this would depend purely on the linkage, or treat any 0 sized type the same way |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
5588 | don't you need to make sure whether the static size would give you an offset with the correct alignment? |
I just found that change for non-HSA/-PAL environment. I need to check how it works and fit into other tests. So far, that's a critical change to ensure we won't change the original source code too much. Is it possible to address that relocation in a long run (says 1~3 weeks) to avoid the tight schedule.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
5584 | That's the only syntax accepted under HIP clang. At least, zero-sized types should be checked to ensure developers understand the usage of the dynamic shared memory. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
5588 | I remember that size should be always DWORD aligned. Let me check the code calculated that. |
My understanding is this feature is equivalent to the OpenCL dynamic group segment allocation. The runtime would presumably implement it in a similar way.
So the HIP runtime must take the static LDS size, round up to the alignment requirement of the dynamic allocation (OpenCL just uses the maximally aligned OpenCL data type), then add the size of the dynamic LDS. The AQL packet group segment field is set to the total LDS size.
In OpenCL there can be multiple kernel arguments, and the LDS address is passed to each. But for HIP there is only one dynamic area denoted by this weird extern. How is the dynamic LDS storage accessed? Is the address passed as an implicit kernel argument, or does the compiler implicitly use the aligned static LDS size?
I don't think this actually works since you could have multiple 0 sized objects, and they would both get the same address. I think this has to be an external relocation
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
5584 | That's a bit too much HIP specific logic. Also what does this do if there are more than one? How can these return different addresses? | |
5588 | The global has an explicit alignment that needs to be respected |
This's the point. To keep compatible with CUDA, multiple dynamically sized arrays in a single kernel must declare a single extern unsized array and uses the address to divide it into multiple arrays by developers themselves. It's in fact quite similar to the local memory parameter in OpenCL. Thus, all extern unsized __shared__ arrays are mapped onto the same address. The developer has the responsibility to divide it into multiple ones. That's in fact is same across HCC and HIP-Clang except that we want to maximize the compatibility and avoid changing source code too much.
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h | ||
---|---|---|
34 | This doesn't need to be MaybeAlign, just Align. Also expand to Dynamic? Also should probably elaborate that this is used for the case where a dynamically sized global is used | |
94 | Set is the wrong word here; ensureDynamicLDSAlign()? | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
5585 | This should check if the allocated size of GV->getValueType() is 0, not special case 0 sized arrays | |
5589 | Should be the alignment of the aggregate itself, not the element type |
For the globalisel part, you'll need D84638 and the global lowering should introduce the intrinsic, not the machine pseudo
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h | ||
---|---|---|
68 | To report the LDS usage accurately, the final LDSSize need to count the padding due to dynamic shared memory alignment. Now, the re-alignment on LDS is explicitly done before the final instruction selection. | |
69 | The re-alignment is done explicitly just before final instruction selection and just once. BTW, getLDSSize should be only valid after instruction selection. |
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
---|---|---|
2808–2811 ↗ | (On Diff #284255) | I think these should remain distinct queries/fields, not fixed up at an arbitrary point. GlobalISel will miss this for example. The asm printer would query the kind that accounts for the dynamic padding |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
2284 | This should not special case 0 sized arrays and should check the allocated size | |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h | ||
38 | This can still just be Align | |
96 | This isn't a set since it does more than set the value. ensureDynamicLDSAlign? | |
102–103 | This shouldn't be a mutation, but return the aligned up size. totalLDSAllocSize()? | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
5585 | This should not special case 0 sized arrays. This is 0 allocation size | |
llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll | ||
50 | The default should be 1 |
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
---|---|---|
2808–2811 ↗ | (On Diff #284255) | GlobalISel calls adjustLDSSizeForDynLDSAlign similarly before the finalization of ISel. |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
2284 | I tend to be restrictive here to follow how that is used in HIP, zero-sized array always has zero allocated size. If other clients need similar usage but general zero-allocated type, we may enhance accordingly. | |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h | ||
102–103 | don't we need to report the accurate usage of LDS? Does that alignment padding need counting as well for the final LDSSize? |
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
---|---|---|
2808–2811 ↗ | (On Diff #284255) | Having extra state that needs to be made consistent is bad. It's better to just track the two independent fields and do the roundup when needed at the end |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
2284 | Types don't mean anything. Any 0 sized globals are getting allocated to the same address. We're just going to miscompile other 0 sized types. We have no reason to treat other 0 sized types differently | |
2286–2287 | Should use the aggregate alignment, not the element |
Add preFinalizeLowering so that both DAGISel and GISel shares the same path
to adjust LDS size.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
2286–2287 | Propose to add preFinalizeLowering before pseudo instruction expansion so that both GISel and DAGISel have the chance to adjust LDS size. |
llvm/include/llvm/CodeGen/TargetLowering.h | ||
---|---|---|
2794 ↗ | (On Diff #286135) | The last thing we need is more callbacks called at random points |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
2286–2287 | It's really cleaner to just have this compute what you care about at the point you care about it. Having a point where this needs to be made consistent is both worse from a serialization perspective, and from an optimization point since theoretically we could spill into the padding later |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
2286–2287 | But, as we discuss this is a short-term solution before the linker could perform the per-kernel LDS resolution. As the spilling to LDS is not implemented yet, this short-term solution should be kept as simple as possible and finally reverted. To pad LDS for the shared memory array, we have to wait until all static LDS ones are allocated. That's should be the point where amdgcn_groupstatcisize is about to be expanded, i.e. before finalizing lowering. |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
2286–2287 | Short term workaround or not, the less mutable state the better. I see no advantage to accumulating the allocated + padding size in a single variable vs. keeping the two separate. You have to track both in MFI anyway, and accumulating like this loses information. |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h | ||
---|---|---|
41 | Leftover comment: Just before the final selection, LDSSize is adjusted accordingly. | |
42 | This doesn't need to be MaybeAlign, it can just be Align and default to 1 | |
102–103 | ensureDynamicLDSAlign(), and don't need to conditionally set it. This also does not need to mutate LDSSize | |
llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir | ||
126 | Should be 1 |
llvm/include/llvm/CodeGen/MIRYamlMapping.h | ||
---|---|---|
168–171 | Should add parser tests for these cases | |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp | ||
59 | This is an independent field and should not be changed here | |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h | ||
99 | This should still not be modifying LDSSize. This is still missing an independent query to give the static + rounded size |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp | ||
---|---|---|
59 | As the sequence of static LDS allocation and dynamic LDS alignment updates are processed in the program order or reverse of that order, we need to collect all static LDS usage and dynamic LDS alignment. As we remove the previous the one single point adjustment, we need to update LDSSizze if there's any static LDS allocation or dynamic LDS alignment updates. | |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h | ||
99 | Valid LDS queries should be done after instruction selection. LDSSize is ONLY modified within instruction selection through static LDS allocation and dynamic LDS alignment update. |
llvm/include/llvm/CodeGen/MIRYamlMapping.h | ||
---|---|---|
168–171 | llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir covers that. |
llvm/include/llvm/CodeGen/MIRYamlMapping.h | ||
---|---|---|
168–171 | It doesn't cover the error cases | |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp | ||
59 | OK yes, this is in the right place now. However,r it should be where the alignment is updated | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
5588 | This logic should be moved into allocateLDSGlobal |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp | ||
---|---|---|
59 | As we don't know which one will be processed last, we need to update LDSSize in both cases to ensure the correct one is calculated. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
5588 | the allocation of dynamic LDS is not handled by the compiler. We only collect the alignment. |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp | ||
---|---|---|
59 | This is actually terrible and we're going to burn extra padding, but I guess it's conservatively correct |
llvm/include/llvm/CodeGen/MIRYamlMapping.h | ||
---|---|---|
168–171 | Add an invalid case. |
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp | ||
---|---|---|
59 | By maintaining StaticLDSSize, the padding for dynamic LDS is done only once. However, we need to continuously update it if there's any static LDS allocation or dynamic LDS alignment updates. |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
2284 | Should get global's alignment, not just the type | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
5588 | This should take the alignment from the global, not just the type alignment | |
llvm/test/CodeGen/AMDGPU/GlobalISel/hip.extern.shared.array.ll | ||
11 | Needs some tests with larger explicit alignments | |
llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-dynlds-align-invalid-case.mir | ||
2–3 | This isn't XFAIL, it's run with not and check the error message output |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
2284 | Good catch, fixed in the latest revision. |
Should add parser tests for these cases