- In general, a generic point may alias to pointers in all other address spaces. However, for certain cases enforced by the programming model, we may found a generic point won't alias to pointers to local objects.
- When a generic pointer is loaded from the constant address space, it could only be a pointer to the GLOBAL or CONSTANT address space. Thus, it won't alias to pointers to the PRIVATE or LOCAL address space.
- When a generic pointer is passed as a kernel argument, it also could only be a pointer to the GLOBAL or CONSTANT address space. Thus, it also won't alias to pointers to the PRIVATE or LOCAL address space.
Details
- Reviewers
rampitec arsenm yaxunl nhaehnle - Commits
- rG2a0e4d1c01c9: [amdgpu] Enhance AMDGPU AA.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
104–105 | Do you mean GOT-like relocation for LDS? That's possible if we don't want to clone sub-function per kernel-function. But, as GOT is created in the backend, it needs a PseudoValue for IR instead of a real Value. We could easily tell them from the load from constant memory in user code. |
FYI I've done some Vulkan CTS testing using LLPC with this patch and didn't notice any problems.
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
114 | What is we pass a pointer to LDS allocated on the host side? |
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
114 | No LDS pointer is passed through constant variables in user code. So far, LDS is allocated and assigned during codegen. Late, as LDS could be assigned per kernel and from 0, we could choose clone sub-function (called by more than one kernel functions) so that assignment could be done within linker, or add GOT like relocation if sub-function is not cloned. The late needs to prepare GOT entries on the host side but that load from GOT could be easily distinguished in the backend as it needs to use PseuodSource in MMO. |
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
114 | But that's not constant, it is just Argument? |
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
114 | In HIP, LDS pointer cannot be passed as an argument to a kernel. |
I think they are correct for OpenCL, since in OpenCL shared var can only be declared in kernel function or passed by kernel arg.
However I am not sure whether a constant pointer can pointer to shared memory, i.e, whether the address of a shared variable is compile time constant, or whether the following is valid code:
__shared__ int a; __constant__ int *b = &a;
Currently clang allows it but nvcc does not https://godbolt.org/z/9W8vee
I tends to agree with nvcc's treatment since this allows more flexible way of implementing shared variable supports in backend. @tra for advice
But you are not checking for a constant pointer here!
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
114 | In OpenCL it is legal. |
In HIP __constant__ is a variable attribute, not the address space of the pointee. __constant__ int * means a pointer itself in constant address space and pointing to generic/flat address space.
Where do you check for this specifically in this block:
} else if (const Argument *Arg = dyn_cast<Argument>(ObjA)) { const Function *F = Arg->getParent(); switch (F->getCallingConv()) { case CallingConv::AMDGPU_KERNEL: // In the kernel function, kernel arguments won't alias to (local) // variables in shared or private address space. return NoAlias;
I was talking about semantic check in language. Here is the IR. In IR a kernel arg can pointing to constant or global addr due to promotion. Originally all kernel arg of HIP points to generic addr space only.
I don't think that should be valid regard to the lifetime definition of shared variables, which have a maximal lifetime of the kernel execution. Instead, global CONSTANT and GLOBAL variables have longer lifetime than kernel execution. It's not valid to assign a pointer of LOCAL variables to a CONSTANT or GLOBAL in the static initializer (when no kernel execution is started). Within kernel/function body, CONSTANT cannot be assigned anymore. That's not a valid case.
For OpenCL, since it won't allow generic pointer as kernel function arguments, there never be such a case, a generic pointer argument.
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
114 | But, OpenCL doesn't allow generic pointer as kernel function arguments. We won't have generic pointer arguments for OpenCL |
OK, that makes sense. Now we only need to make sure we never compile anything but HIP and OpenCL <= 2.0. For example that we will not support fortran maybe? And never extend any of the languages to allow different address space arguments.
My point here is that it is language specific, but there is nothing language specific on the AA implementation. A possible solution is to define some attributes or metadata saying that a certain situation may not ever happen and then check it in AA. That property has to ve set by the language though as AA may not know the source semantics.
Unless we change how LDS is used in the future hardware, I don't see any change of the lifetime of LDS variables. Similar to PRIVATE variables, they are only valid once the kernel is executed. Any reference to them outside the kernel execution doesn't make any sense. All languages and models so far have the similar usage of LDS. Any reference to LDS variables outside kernel execution needs persistent LDS support. That should be a significant change that requires more component support, especially the hardware.
LDS and SCRATCH both behave more like TLS. The allocations come into existence when when a thread (or group of threads) get created, and the lifetime ends when those thread(s) terminate. It is UB to reference that memory outside that lifetime. Furthermore, it is UB to dereference the address of LDS and SCRATCH in any thread other than the one that created the address. These rules are defined by the languages although not well explained.
Passing an LDS or SCRATCH address between threads is meaningful provided only the thread(s) that "own" the address dereference it. So storing the address in a global "place" to be read later by an "owning" thread is meaningful. However, some languages may restrict what they allow. So passing as a kernel argument in CUDA appears to not be allowed even though it is meaningful provided the above restricts are met. In OpenCL, there are special rules for passing LDS/Local to a kernel. In OpenCL you actually pass in a byte size, and the kernel dispatch allocates dynamic LDS automatically and passes the address of that to the created thread(s). CUDA has a different syntax for dynamic LDS/Local that is more like TLS.
So how is TLS handled? It seems a TLS address cannot be compile/link time value since it is a runtime concept. So using relocations to initialize global memory program scope variables seems invalid. Initializing a pointer object that is allocated in LDS/SCRATCH to be the address of another LDS/SCRATCH allocated in the same "owning" thread is meaningful and could be implemented using relocations. However, I suspect the languages do not allow this. I am unclear if TLS allows this either.
OK, that makes sense. Now we only need to make sure we never compile anything but HIP and OpenCL <= 2.0. For example that we will not support fortran maybe? And never extend any of the languages to allow different address space arguments.
My point here is that it is language specific, but there is nothing language specific on the AA implementation. A possible solution is to define some attributes or metadata saying that a certain situation may not ever happen and then check it in AA. That property has to ve set by the language though as AA may not know the source semantics.
So you are saying that is always OK to assume no aliasing between a flat pointer which is a kernel argument and a pointer to LDS? OK, thanks!
So you are saying that is always OK to assume no aliasing between a flat pointer which is a kernel argument and a pointer to LDS? OK, thanks!
No I am not quite saying that as some languages are not clears. Having said that, some compiler implementations are assuming that for some languages. Basically the rule is language specific, so AA would need to ask the language if it is permissible to assume that or not. Also bear in mind the OpenCL case for LDS where the kernel argument is not really being passed in from externally, but created independently for each thread/group-of-threads.
Generic pointers are another issue. They are pointers that may point to multiple address spaces. But the rules of dereferncing when they reference the non-global address space are the same. There can be rules that allow a generic pointers to be known to only point to one address space, in which case they can be treated the same as if they were a pointer to that address space. At the hardware level, FLAT instructions can be used to implement language generic pointers. But FLAT instructions can also be used when the address space is fixed, in which case the semantics are the same as the single address space case.
Unlike OpenCL, the CUDA language does not have the address space of pointers as part of the type system. But it still allows allocation of objects to specific address spaces. For CUDA all addressing is conceptually generic, but the allocation address space can be propagated to know the fixed address space of the FLAT operations.
To me deciding point here was that LDS is not actually allocated on host, but instead requested to be allocated at dispatch. If so then host cannot get an actual pointer to it and thus cannot convert it to a generic pointer and pass to a kernel.
Theoretically one can forge a generic pointer which will point to a specific LDS location after it is allocated, but I believe taking a pointer to an unallocated memory is a UB by any language standards.
The LDS/SCRATCH is actually allocated on individual wave/group-of-wave creation which is even smaller granularity than dispatch. The language defines that even if a thread has a valid LDS/SCRATCH address for some other wave, it is UB to access it. So the host (or some other wave) can get a pointer to LDS/SCRATCH, and can pass it to another wave, but only the wave that "owns" the allocation can access it. What the language may say is that passing a generic into a kernel is not allowed if it points to LDS. I do not believe CUDA explicitly states this, but some compilers appear to implement this.
Also note that an LDS pointer passed into a kernel dispatch A cannot be a legal LDS for that dispatch since until the waves of that dispatch are created, they have no LDS. So if a generic pointer is passed in as a kernel argument then best it can be is the LDS for some other already created wave of another dispatch B. The waves of dispatch A cannot access that LDS pointer as then do not "own" it. The best they can do is pass it to the "owning" waves of dispatch B. That cannot be done via kernel arguments as the waves are already executing. So even if the language allows kernel arguments to be generic pointers pointing to LDS, it is safe for single-thread-AA to assume they cannot alias this waves LDS since such pointers are UB to be accessed by this wave.
Theoretically one can forge a generic pointer which will point to a specific LDS location after it is allocated, but I believe taking a pointer to an unallocated memory is a UB by any language standards.
The language models state that addresses must reference allocated objects (or one past the end) that are created, and it is UB to access any pointer that does not reference an allocated object. Forging or type punning (except for char in C) is UB.
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
108 | The discussion so far has been about loads from kernel arguments, but this check is far weaker. Would it be possible to construct a relocation in non-kernel memory that refers to LDS? |
Right, graphics shaders manage LDS such that it is used to pass arguments between certain shader kinds. But that is not the case for compute.
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
114 | It is not legal in OpenCL. In OpenCL you actually pass in a byte size that causes the driver to cause that much LDS to be allocated and the pointer to it passed to the kernel. So the host is not passing an LDS pointer, it is passing a dynamic size for and LDS allocation. As mentioned above, passing a generic pointer to OpenCL also cannot pass a pointer to LDS that the kernel can access. LDS is allocated by waves as they are created, so it is not possible to pass in a pointer to something that does not yet exist. |
llvm/lib/Target/AMDGPU/AMDGPUAliasAnalysis.cpp | ||
---|---|---|
104–105 |
Well it depends how they are defined. Graphics is using LDS relocations to stitch together pipeline parts to pass arguments through LDS. Compute does not allow this. If LDS allocation is managed in a TLS like manner (which is not currently done) then maybe there is a place for LDS relocations. But it is not like a global memory relocation. |
Is it true that we couldn't have a relocation for an LDS value?