Points to the start of implicit arguments (appended after explicit arguments)
Repository: rL LLVM
We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just read the offsets directly from there.
I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?
We should fix clover to not read from the end of the arguments. (I thought this is what it already did?)
Some are before the kernel args (global size, local size, ngroups; this is done by the radeonsi/r600 mesa driver). Others are after the kernel arguments (work_dim, global-offset; this is done by clover).
The intention was to move all implicit arguments after the explicit ones, so new implicit arguments can be added without breaking ABI (moving explicit arguments).
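The ABI argument above can be illustrated with a small C sketch. The struct names and the `new_field` member are hypothetical and do not come from clover, libclc, or LLVM; the point is only that appending a field to a trailing implicit block leaves the offsets of the existing fields unchanged.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical implicit-argument block as appended after the explicit
 * kernel arguments (field names chosen for illustration only). */
struct implicit_args_v1 {
    uint32_t work_dim;
    uint32_t global_offset[3];
};

/* A later revision appends one more (hypothetical) implicit argument.
 * Existing fields keep their offsets relative to the start of the block,
 * so code compiled against v1 still reads the right values. */
struct implicit_args_v2 {
    uint32_t work_dim;
    uint32_t global_offset[3];
    uint32_t new_field;
};

/* Reads work_dim from either layout; only the block's start is needed. */
uint32_t read_work_dim(const void *implicit_block)
{
    return ((const struct implicit_args_v1 *)implicit_block)->work_dim;
}
```

If implicit arguments were instead placed before the explicit ones, adding a field would shift every explicit argument, which is the ABI break described above.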
I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.
New implicit arguments were appended to allow newer mesa to work with older llvm (without ifdef hell). Any kind of significant ABI change would break that (otherwise it'd be in one place and we'd avoid this problem).
I'm not sure how much of a problem it is or what other (possible) users of clover might want. Anyway, it's beyond the scope of my attempt to make get_global_offset() work. Would you be OK if I restricted the changes to r600?
I agree. I actually started working on this last week. I think all implicit args that aren't passed in SGPRs should be at the end of the kernarg segment.
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
1574–1579 ↗ | (On Diff #57944) | There should be a separate intrinsic which points to the start of implicit args. As the ABI is currently, the library could use kernarg_segment_ptr to load the workgroup size information. |
This is a bit confusing. If the implicit args should be in SGPRs, wouldn't we need one intrinsic per implicit arg? And how would the number of implicit args be reduced? The only redundant one is global size (libclc computes it as num_groups * local_size).
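As a minimal sketch of the redundancy mentioned above, the global size in one dimension can be derived from the other two values. The function name is hypothetical and this is not libclc's actual source; it also assumes uniform work-groups, where the global size is an exact multiple of the local size.

```c
#include <stddef.h>

/* Derive the global size for one dimension from the number of work-groups
 * and the work-group (local) size. Assumes uniform work-groups. */
size_t global_size_from_groups(size_t num_groups, size_t local_size)
{
    return num_groups * local_size;
}
```

For example, 4 work-groups of 64 work-items each give a global size of 256, so the runtime would not need to pass that value separately.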
My idea was to switch work_dim and newly implemented global_offset, as those are already appended (the rest can be switched by patches to libclc and clover). However, doesn't this contradict Matt's suggestion to pass implicit arguments in SGPRs?
SGPR space is limited, so we won't be able to pass all implicit arguments this way, so some will need to be added to the kernarg buffer. The types of values that should be passed in SGPRs are things that tend to be common across all runtimes, like work-group/work-item size, scratch buffer pointers, etc.
We could do what we are planning for OpenCL and add a struct pointer for future expansion at the end of some decided N bytes to reserve.
If 256 bytes were reserved at the beginning, that would be more than enough for any implicit needs and would keep the user arg base alignment the same (not sure that really matters).
OK, so to be specific about the currently passed information:
workdim, wg_size, and num_group should eventually be passed in SGPRs and therefore should have their own intrinsics, correct?
Should those values also be duplicated in the kernarg segment?
The rest (global_size, global_offset) are loaded via the kernarg segment ptr.
Do you still want two pointers (beginning of kernarg and beginning of implicit args) for mesa, or is it OK to have one if all the information is present at the appended location?
Reserving 256 bytes at the beginning would waste half a KC block for r600 (and might run into limitations for other hw), so clover is probably best off appending the information to minimize space (it should work until there are kernels with variable parameters). The radeonsi driver is free to change this for GCN hw.
If you look at enum PreloadedValue in SIRegisterInfo.h (this may have moved to SIMachineFunctionInfo.h by now), those are the values preloaded into SGPRs by the HSA runtime. If we are going to be changing the ABI for radeonsi clover, I would really like it to match what we use for HSA. For r600, it doesn't really matter to me what the ABI ends up being.
I think there needs to be both. The intrinsic names should correctly describe what they do, so I don't want to have kernarg.segment.ptr mean different things for different arches.
Why not an intrinsic that returns the offset from the base pointer, rather than an intrinsic that returns a pointer to the implicit args?
It does not really matter, but a pointer appeared more flexible. If you decide to move the implicit args elsewhere (somewhere that is not at an offset from the kernarg segment), you only need to reimplement the intrinsic, without updating the library.
Because if the implicit args are stored after the explicit kernel args, then the offset from kernarg.base.ptr to the start of the implicit args is not known until compile time.
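To make the point about the offset concrete, here is a rough C sketch of how the start of the implicit args could depend on a kernel's explicit argument list. The `arg_desc` type, the function names, and the packing rule (each argument aligned to its own alignment, then the implicit block aligned to `implicit_align`) are assumptions for illustration, not the layout rules the backend actually uses.

```c
#include <stddef.h>

/* Hypothetical description of one explicit kernel argument. */
struct arg_desc {
    size_t size;   /* size in bytes */
    size_t align;  /* required alignment in bytes (non-zero) */
};

/* Round v up to the next multiple of a. */
static size_t align_up(size_t v, size_t a)
{
    return (v + a - 1) / a * a;
}

/* Walk the explicit arguments and return the byte offset at which an
 * appended implicit-argument block would start. The result differs per
 * kernel signature, so a precompiled library cannot hard-code it. */
size_t implicit_args_offset(const struct arg_desc *args, size_t n,
                            size_t implicit_align)
{
    size_t off = 0;
    for (size_t i = 0; i < n; ++i) {
        off = align_up(off, args[i].align);
        off += args[i].size;
    }
    return align_up(off, implicit_align);
}
```

Under these assumptions, a kernel taking a 4-byte and an 8-byte argument would place the implicit block at offset 16, while a kernel taking a single 4-byte argument would place it at offset 4.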
This patch LGTM, but I think we should drop segment from the intrinsic name. In HSA 'segment' means address space, and we don't have a separate address space for implicit args.
Just to be sure I understand correctly: your suggestion is to have "implicitarg.offset", so that libclc (or any other user) then uses "_builtin_kernarg_segment_ptr() + _builtin_implict_arg_offset()" to read the implicit arguments?
I have no preference either way on this. Since this patch has been outstanding for a while, I say just commit it. We can always add _builtin_implict_arg_offset() later if we decide it is a better solution.
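The two alternatives discussed in this thread can be compared with a small host-side C sketch. All function names here are hypothetical stand-ins: the kernarg buffer is simulated with a plain byte array, and neither function corresponds to a real builtin. It assumes the first implicit argument is a 32-bit work_dim value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value from a possibly unaligned byte address. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Option A: the library gets the kernarg base pointer plus an offset to
 * the implicit args and adds them itself. */
uint32_t read_work_dim_via_offset(const unsigned char *kernarg_base,
                                  size_t implicit_offset)
{
    return load_u32(kernarg_base + implicit_offset);
}

/* Option B: the library gets a pointer to the implicit args directly.
 * If the implicit args later move outside the kernarg buffer, only the
 * provider of this pointer changes, not the library code. */
uint32_t read_work_dim_via_ptr(const unsigned char *implicit_ptr)
{
    return load_u32(implicit_ptr);
}
```

While the implicit block lives inside the kernarg buffer both options read the same value; they differ only in which side has to change if the block is relocated.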