This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Add implicitarg.ptr intrinsic.
ClosedPublic

Authored by jvesely on May 16 2016, 1:26 PM.

Details

Summary

Points to the start of implicit arguments (appended after explicit arguments)

Diff Detail

Repository
rL LLVM

Event Timeline

jvesely updated this revision to Diff 57388.May 16 2016, 1:26 PM
jvesely retitled this revision from to AMDGPU/SI: Add amdgcn versions of remaining builtins.
jvesely updated this object.
jvesely added a reviewer: tstellarAMD.
jvesely set the repository for this revision to rL LLVM.
jvesely added subscribers: arsenm, llvm-commits.

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?

We should fix clover to not read from the end of the arguments (I thought this is what it already did)?

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?

We should fix clover to not read from the end of the arguments (I thought this is what it already did)?

some are before the kernel args (global size, local size, ngroups -- this is done by radeonsi/r600 mesa driver). others are after the kernel arguments (work_dim, global-offset -- this is done by clover).
The intention was to move all implicit arguments after the explicit ones, so new implicit arguments can be added without breaking ABI (moving explicit arguments).

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?

We should fix clover to not read from the end of the arguments (I thought this is what it already did)?

some are before the kernel args (global size, local size, ngroups -- this is done by radeonsi/r600 mesa driver). others are after the kernel arguments (work_dim, global-offset -- this is done by clover).
The intention was to move all implicit arguments after the explicit ones, so new implicit arguments can be added without breaking ABI (moving explicit arguments).

I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?

We should fix clover to not read from the end of the arguments (I thought this is what it already did)?

some are before the kernel args (global size, local size, ngroups -- this is done by radeonsi/r600 mesa driver). others are after the kernel arguments (work_dim, global-offset -- this is done by clover).
The intention was to move all implicit arguments after the explicit ones, so new implicit arguments can be added without breaking ABI (moving explicit arguments).

I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.

new implicit arguments were appended to allow newer mesa to work with older llvm (without ifdef hell) any kind of significant abi change would break that (otherwise it'd be in one palce and we'd avoi this problem).
I'm not sure how much of a problem it is and what other (possible) users of clover might want. anyway, it's beyond scope of me trying to make get_global_offset() work. would you be OK if I restricted the changes to r600?

jvesely updated this revision to Diff 57944.May 20 2016, 9:58 AM
jvesely retitled this revision from AMDGPU/SI: Add amdgcn versions of remaining builtins to AMDGPU/SI: Make kernarg.segment.ptr point to implicit arguments for non HSA.
jvesely updated this object.

Drop new amdgcn intrinsics. make kernarg segment mesa compatible instead.

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?

We should fix clover to not read from the end of the arguments (I thought this is what it already did)?

some are before the kernel args (global size, local size, ngroups -- this is done by radeonsi/r600 mesa driver). others are after the kernel arguments (work_dim, global-offset -- this is done by clover).
The intention was to move all implicit arguments after the explicit ones, so new implicit arguments can be added without breaking ABI (moving explicit arguments).

I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.

I agree. I actually started working on this last week. I think all implicit args that aren't passed in SGPRs should be at the end of the kernarg segment.

lib/Target/AMDGPU/SIISelLowering.cpp
1554–1559

There should be a separate intrinsic which points to the start of implicit args. As the ABI is currently, the library could use kernarg_segment_ptr to load the workgroup size information.

We don't actually want these. We now have the kernarg.segment.ptr intrinsic, so the library should just directly read the offsets from there

I assume it points to the beginning of the kernel args.
Is the idea to have another intrinsic to read from the end (workdim, global offset), or is it OK to have intrinsics for those?

We should fix clover to not read from the end of the arguments (I thought this is what it already did)?

some are before the kernel args (global size, local size, ngroups -- this is done by radeonsi/r600 mesa driver). others are after the kernel arguments (work_dim, global-offset -- this is done by clover).
The intention was to move all implicit arguments after the explicit ones, so new implicit arguments can be added without breaking ABI (moving explicit arguments).

I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.

This is a bit confusing. If the implicit args should be in SGPRs, wouldn't we need one intrinsic per implicit arg? and how would the number of implicit args be reduced? the only redundant one is global size (libclc computes it as num_groups * local_size).

I agree. I actually started working on this last week. I think all implicit args that aren't passed in SGPRs should be at the end of the kernarg segment.

My idea was to switch work_dim and newly implemented global_offset, as those are already appended (the rest can be switched by patches to libclc and clover). However, doesn't this contradict Matt's suggestion to pass implicit arguments in SGPRs?

I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.

This is a bit confusing. If the implicit args should be in SGPRs, wouldn't we need one intrinsic per implicit arg? and how would the number of implicit args be reduced? the only redundant one is global size (libclc computes it as num_groups * local_size).

I agree. I actually started working on this last week. I think all implicit args that aren't passed in SGPRs should be at the end of the kernarg segment.

My idea was to switch work_dim and newly implemented global_offset, as those are already appended (the rest can be switched by patches to libclc and clover). However, doesn't this contradict Matt's suggestion to pass implicit arguments in SGPRs?

SGPR space is limited, so we won't be able to pass all implicit arguments this way, so some will need to be added to the kernarg buffer. The types of values that should be passed in SGPRs are things that tend to be common across all runtimes, like work-group/work-item size, scratch buffer pointers, etc.

This comment was removed by tstellarAMD.

We could do what we are planning for OpenCL and add a struct pointer for future expansion at the end of some decided N bytes to reserve.

If 256 bytes were reserved at the beginning, that would be more than enough for any implicit needs and would keep the user arg base alignment the same (not sure that really matters).

I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.

This is a bit confusing. If the implicit args should be in SGPRs, wouldn't we need one intrinsic per implicit arg? and how would the number of implicit args be reduced? the only redundant one is global size (libclc computes it as num_groups * local_size).

I agree. I actually started working on this last week. I think all implicit args that aren't passed in SGPRs should be at the end of the kernarg segment.

My idea was to switch work_dim and newly implemented global_offset, as those are already appended (the rest can be switched by patches to libclc and clover). However, doesn't this contradict Matt's suggestion to pass implicit arguments in SGPRs?

SGPR space is limited, so we won't be able to pass all implicit arguments this way, so some will need to be added to the kernarg buffer. The types of values that should be passed in SGPRs are things that tend to be common across all runtimes, like work-group/work-item size, scratch buffer pointers, etc.

OK, so to be specific about currently passed information.
workdim, wg_size, num_group should be eventually passed in SGPRs and therefore should have their own intrinsic, correct?
should those values also be duplicated in the kernarg segment?

the rest (global_size, global_offset) are loaded via kernargs segment ptr

do you still want two pointers (beginning of kernarg and beginning of implicit args) for mesa, or is it ok to have one if all the information is present at the appended location?

We could do what we are planning for OpenCL and add a struct pointer for future expansion at the end of some decided N bytes to reserve.

If 256 bytes were reserved at the beginning, that would be more than enough for any implicit needs and would keep the user arg base alignment the same (not sure that really matters).

reserving 256 bytes at the beginning would waste half a KC block for r600 (and might run into limitations for other hw), so clover is probably best of appending the information to minimize space (it should work until there are kernels with variable parameters). radeonsi driver is free to change this for GCN hw.

I think clover should move towards matching the HSA ABI closer. Most of the implicit arguments would then be user SGPR inputs like HSA uses, and the number of implicit args would be reduced.

This is a bit confusing. If the implicit args should be in SGPRs, wouldn't we need one intrinsic per implicit arg? and how would the number of implicit args be reduced? the only redundant one is global size (libclc computes it as num_groups * local_size).

I agree. I actually started working on this last week. I think all implicit args that aren't passed in SGPRs should be at the end of the kernarg segment.

My idea was to switch work_dim and newly implemented global_offset, as those are already appended (the rest can be switched by patches to libclc and clover). However, doesn't this contradict Matt's suggestion to pass implicit arguments in SGPRs?

SGPR space is limited, so we won't be able to pass all implicit arguments this way, so some will need to be added to the kernarg buffer. The types of values that should be passed in SGPRs are things that tend to be common across all runtimes, like work-group/work-item size, scratch buffer pointers, etc.

OK, so to be specific about currently passed information.
workdim, wg_size, num_group should be eventually passed in SGPRs and therefore should have their own intrinsic, correct?

If you look at enum PreloadedValue in SIRegisterInfo.h (this may have moved to SIMachineFunctionInfo.h by now), those are the values preloaded into SGPRs by the HSA runtime. If we are going to be changing the ABI for radeonsi clover, I would really like it to match what we use for HSA. For r600, it doesn't really matter to me what the ABI ends up being.

should those values also be duplicated in the kernarg segment?

the rest (global_size, global_offset) are loaded via kernargs segment ptr

do you still want two pointers (beginning of kernarg and beginning of implicit args) for mesa, or is it ok to have one if all the information is present at the appended location?

I think there needs to be both. The intrinsics names should correctly describe what they do, so I don't want to have kernarg.segment.ptr mean different things for different arches.

We could do what we are planning for OpenCL and add a struct pointer for future expansion at the end of some decided N bytes to reserve.

If 256 bytes were reserved at the beginning, that would be more than enough for any implicit needs and would keep the user arg base alignment the same (not sure that really matters).

reserving 256 bytes at the beginning would waste half a KC block for r600 (and might run into limitations for other hw), so clover is probably best of appending the information to minimize space (it should work until there are kernels with variable parameters). radeonsi driver is free to change this for GCN hw.

jvesely updated this revision to Diff 58649.May 26 2016, 10:43 AM
jvesely retitled this revision from AMDGPU/SI: Make kernarg.segment.ptr point to implicit arguments for non HSA to AMDGPU/SI: Add implicitarg.segment.ptr intrinsic..
jvesely updated this object.
jvesely edited edge metadata.

Create new intrinsic for implicit args

Why not the offset from the base pointer intrinsic rather than an intrinsic to the offset?

Why not the offset from the base pointer intrinsic rather than an intrinsic to the offset?

Does not really matter, but pointer appeared more flexible. if you decide to move implicit args elsewhere (that is not offset to kernarg segment) you only need to reimplement the intrinsic, without updating library.

Why not the offset from the base pointer intrinsic rather than an intrinsic to the offset?

Because if the implict args are stored after the explict kernel args, then the offset from kernarg.base.ptr to the start of implicit args is not known until compile time.

tstellarAMD added a comment.EditedMay 30 2016, 2:29 PM

This patch LGTM, but I think we should drop segment from the intrinsic name. In HSA 'segment' means address space, and we don't have a separate address space for implicit args.

jvesely updated this revision to Diff 59004.May 30 2016, 4:22 PM
jvesely retitled this revision from AMDGPU/SI: Add implicitarg.segment.ptr intrinsic. to AMDGPU/SI: Add implicitarg.ptr intrinsic..
jvesely updated this object.

drop segment from the intrinsic name

Why not the offset from the base pointer intrinsic rather than an intrinsic to the offset?

Because if the implict args are stored after the explict kernel args, then the offset from kernarg.base.ptr to the start of implicit args is not known until compile time.

Why is this an issue? This won't be known anyway

jvesely marked an inline comment as done.Jun 3 2016, 3:59 PM

Why not the offset from the base pointer intrinsic rather than an intrinsic to the offset?

Because if the implict args are stored after the explict kernel args, then the offset from kernarg.base.ptr to the start of implicit args is not known until compile time.

Why is this an issue? This won't be known anyway

just to be sure I understand correctly. your suggestion is to have "implicitarg.offset" so the libclc(or any other user) then uses "_builtin_kernarg_segment_ptr() + _builtin_implict_arg_offset()" to read the implicit arguments?

jvesely updated this revision to Diff 60512.Jun 13 2016, 6:18 AM

rebase after r272512

tstellarAMD accepted this revision.Jun 17 2016, 8:20 PM
tstellarAMD edited edge metadata.

Why not the offset from the base pointer intrinsic rather than an intrinsic to the offset?

Because if the implict args are stored after the explict kernel args, then the offset from kernarg.base.ptr to the start of implicit args is not known until compile time.

Why is this an issue? This won't be known anyway

just to be sure I understand correctly. your suggestion is to have "implicitarg.offset" so the libclc(or any other user) then uses "_builtin_kernarg_segment_ptr() + _builtin_implict_arg_offset()" to read the implicit arguments?

I have no preference either way on this. Since this patch has been outstanding for a while, I say just commit it. We can always add _builtin_implict_arg_offset() later if we decide it is a better solution.

This revision is now accepted and ready to land.Jun 17 2016, 8:20 PM
jvesely updated this revision to Diff 61438.Jun 21 2016, 1:35 PM
jvesely edited edge metadata.

fix build without D20298 (GRID_OFFSET is still needed)

This revision was automatically updated to reflect the committed changes.