This patch changes the HeapToStack optimization to attach the return alignment
attribute information to the created alloca instruction. This would cause
problems when replacing the heap allocation with an alloca did not respect the
alignment of the original heap allocation, which would typically be aligned on
an 8 or 16 byte boundary. Malloc calls now contain alignment attributes,
so we can use that information here.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
Shouldn't/can't you query the alignment for that pointer?
Hardcoding anything like this is a sign of a problem.
I haven't found an interface in LLVM to query default alignment information from malloc. As far as I know, the malloc functions are defined by the GNU documentation to be aligned to 16 on 64-bit systems, and 8 on 32-bit systems and Clang behaves similarly. A more complete solution would be to change the __kmpc_alloc_shared RTL function to be an aligned malloc function. That way in Clang we can just use the natural alignment of the underlying type, or query the target info from there. But I don't think that is explicitly necessary because we just need to mimic the pointer's alignment as it would be from the malloc call since we're replacing it with an alloca. If you know of somewhere I can query the default malloc alignment from within LLVM I can add that, or I can just check the data layout and use the 8 / 16 distinction used in https://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html#Aligned-Memory-Blocks. I'd prefer not to change the OpenMP RTL to include the alignment, as that would change a lot of tests and code for little gain I can see.
I'm not sure if deriving it is a solution here, considering that we are replacing a runtime call with defined default alignment with an alloca that should always at least match that. I can try using AAAlign to query it, my first thought was to use the element type of the bitcast that always follows the __kmpc_alloc_shared call, but @jdoerfert just told me to just pick whatever the default is when I asked. This issue comes from https://github.com/kokkos/kokkos/issues/4224.
Then i guess you need to basically introduce an interface to do what https://en.cppreference.com/w/cpp/types/max_align_t does, but based on a datalayout.
Seems reasonable. I know we can query this information from clang, e.g. https://clang.llvm.org/doxygen/classclang_1_1TargetInfo.html#a01403a5106161d4d3cd0c50c43150f89, but I don't think there is an existing string in the data layout to encode this. Will I be adding a new format this? That would be a reasonably large change so I just want to make sure I'm on the right page.
Err, no. I'm simply thinking that datalayout already specifies the primitive [scalar] types, so you should just need to go through them and pick the one with maximal alignment requirement, and pick it.
llvm/lib/Transforms/IPO/AttributorAttributes.cpp | ||
---|---|---|
5931 | Would a comment help to explain what the hard-coded 16 means? |
The default data layout contains a 128 bit float, so if we just check the maximum alignment we'll always get at least 16, even on 32-bit architectures. I could only consider the ones set explicitly by the data layout string, but doesn't that go against the purpose of the defaults?
For which target/architecture? What happens on other target/architectures?
so if we just check the maximum alignment we'll always get at least 16, even on 32-bit architectures. I could only consider the ones set explicitly by the data layout string, but doesn't that go against the purpose of the defaults?
This is from the documentation on the data layout. It defines the default values used when initializing the data layout. It seems these can only be overridden, and I can't imagine a situation where someone would override it to define a 128 bit float to have 64-bit alignment, so the largest alignment we'll have in the data layout will always be at least 16 bytes.
When constructing the data layout for a given target, LLVM starts with a default set of specifications which are then (possibly) overridden by the specifications in the datalayout keyword. The default specifications are given in this list:
e - little endian p:64:64:64 - 64-bit pointers with 64-bit alignment. p[n]:64:64:64 - Other address spaces are assumed to be the same as the default address space. S0 - natural stack alignment is unspecified i1:8:8 - i1 is 8-bit (byte) aligned i8:8:8 - i8 is 8-bit (byte) aligned i16:16:16 - i16 is 16-bit aligned i32:32:32 - i32 is 32-bit aligned i64:32:64 - i64 has ABI alignment of 32-bits but preferred alignment of 64-bits f16:16:16 - half is 16-bit aligned f32:32:32 - float is 32-bit aligned f64:64:64 - double is 64-bit aligned f128:128:128 - quad is 128-bit aligned v64:64:64 - 64-bit vector is 64-bit aligned v128:128:128 - 128-bit vector is 128-bit aligned
I think the most straightforward way to solve this is to add an alignment attribute to the return value when we generate code, then just copy that when we replace it. It'll change some tests but I'll try that.
Changing the method. Adding alignment information to the runtime call and using
it when we create the alloca. I might need to add the alignment information to
the runtime call to make the implementation sound, but I haven't encountered any
problems with the runtime implementation.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
---|---|---|
1411 | This doesn't work. If the type alignment is > 8 the stack won't fulfill it unless you modify /// Add worst-case padding so that future allocations are properly aligned. constexpr const uint32_t Alignment = 8; in openmp/libomptarget/DeviceRTL/src/State.cpp. Proposal: | |
1475 | Not needed. Will cause a warning, no? | |
llvm/lib/Transforms/IPO/AttributorAttributes.cpp | ||
5940 | This is sensible but needs a test. You can even do it without the else for all allocations. With the proposed changes above alloc_shared would also fall into the aligned_alloc case. |
I will split this into two revisions, one handling the return alignment attribute in the Attributor, and one adding alignment information to the __kmpc_alloc_shared OpenMP runtime call, turning it into an aligned allocation.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
---|---|---|
1411 | That was an original though, I was hoping to avoid the extra work, but I think this is definitely the only way to solve this reasonably, it might also allow us to use the stack more efficiently. We'll still want this alignment information, but we'll need to inform the runtime of the expected alignment. | |
1475 | Forgot about this, not intended to be included. | |
llvm/lib/Transforms/IPO/AttributorAttributes.cpp | ||
5940 | Yes, we want this regardless because all malloc like calls now seem to have alignment attributes, which makes sure we respect the alignment of the original malloc call. I can probably split this into another patch. |
Removing OpenMP code, only adding support for return alignments. Fixing OpenMP
will occur in a following patch.
This doesn't work. If the type alignment is > 8 the stack won't fulfill it unless you modify
in openmp/libomptarget/DeviceRTL/src/State.cpp.
The fact that the state has a fixed alignment right now makes it impossible to allocate higher aligned types anyway.
Proposal:
Add an argument to _alloc_shared that is the alignment as computed above, effecitively making it _alloc_shared_aligned. Modify the stack to actually align the base pointer rather than extend the allocation based on the alignment passed in. Then any type alignment can be handled, including user aligned types.