User Details
- User Since
- Aug 18 2016, 12:44 PM (353 w, 6 d)
Fri, May 5
Apr 29 2023
I think we need to similarly remove a wavefrontsize64 function if compiling in wave32 mode.
Apr 27 2023
How confident are we in our ability to strip the convergent attribute off of functions that don't need it? Seems like this could cause performance regressions.
Apr 25 2023
I think a function with wavefrontsize32 is incompatible with a wavefrontsize64 compilation and similarly for a wavefrontsize64 function with a wavefrontsize32 compilation and we should take the usual steps for incompatible functions.
Apr 21 2023
Apr 20 2023
Apr 4 2023
Mar 28 2023
No objection here.
Mar 22 2023
It might be better to loop over only active lanes, found using ctz or clz builtins.
Mar 21 2023
FWIW, there is no desire to read from inactive lanes. The loop is supposed to only be reading from, and writing to, lanes that were active before the for loop is executed by a select single lane.
Mar 5 2023
Feb 14 2023
My current understanding is the c-p will go into already forked clang-16, but not to rocm 5.4. So rocm device-libs will be accompanied by the older clang-16 w/o this and stay compatible. Someone building from scratch will use latest clang-16 and staging device-libs with this change. Do you think this will work?
Feb 9 2023
Also, I see no median intrinsic described in the LLVM programming guide.
The device library is using the builtin, so I assume the build will fail when this change lands.
Jan 23 2023
No objections here...
Nov 22 2022
__builtin_sqrtf does not produce a correctly rounded result. I don't recommend this change.
Oct 28 2022
Thank you!
Oct 12 2022
We're in a lose-lose situation here. No matter what we do, someone will complain. But we're are definitely on more solid footing by going for correctness.
Oct 11 2022
Different functions providing different behaviors can be handled at link time like any other function, instead of the same functions providing different behaviors per translation unit and requires cloning. The current scheme transfers complexity from the device library build system into the driver and user binaries
Also, we may want to use uppercase for other purposes in the future.
I don't particularly see a need for this. I am not opposed to a "did you mean" in the error diagnostic.
Sep 22 2022
Sep 16 2022
Sep 15 2022
Aug 19 2022
Aug 18 2022
Following existing naming, it might make sense to rename "rtn_b32" --> "rtn" and "rtn_b64" --> "rtnl".
Aug 5 2022
Aug 4 2022
Jul 14 2022
Jun 24 2022
The "is_" prefix is not helpful. I suggest either "uses_dynamic_stack" or just "dynamic_stack".
Jun 23 2022
Jun 14 2022
Thanks @bcahoon. If there were another available approach that always worked, we would use it.
Jun 7 2022
This should include updates to https://llvm.org/docs/AMDGPUUsage.html#kernel-descriptor if not planned for a different patch.
Apr 27 2022
t-tye should review this too.
Apr 25 2022
Apr 13 2022
I think this is too focused. There are other image_sample_lz intrinsics and all of them potentially need to be replaced if following this approach.
Apr 11 2022
Apr 7 2022
Looks fine to me!
Mar 21 2022
This change goes along with 1194b9cdda30d and should probably have been landed with it.
Feb 18 2022
Feb 16 2022
Feb 2 2022
Dec 2 2021
Can we agree to drop the macro in LLVM 15 and note that in a comment or elsewhere?
Sep 27 2021
Jun 10 2021
I believe that for the purposes of detecting lane 0 mbcnt_lo is sufficient.
Jun 8 2021
Two approaches for limiting the stores to lane 0 of each wave:
- Write 1 to exec mask, store, and write -1 to exec mask. This works since the exec mask at the start of the wave when this happens is -1
- Check for lane == 0 and branch. The lane can be computed by a) wave64: builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)) b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)
May 13 2021
This looks good to me.
May 6 2021
Mar 19 2021
Jan 25 2021
That is the safest thing to do. However, I'm now hearing some thinking that memory likely to be targeted by f.p. atomics is likely to be cached. But again, better to be safe at least until we're clear this will always be the case.
Jan 20 2021
Thanks. LGTM.
That sequence will give a very accurate result as long as overflow and underflow is avoided. LGTM.
Nov 5 2020
Should this also be IntrConvergent?
Oct 22 2020
Looks good to me.
Jul 9 2020
May 28 2020
LGTM
May 7 2020
May 2 2020
Apr 22 2020
Apr 17 2020
Apr 14 2020
Apr 13 2020
In my opinion, for on-line compile for OpenCL, the platform is responsible for setting __OPENCL_VERSION__. Also, it should be the platform's choice as to how to respond the image support query and how __IMAGE_SUPPORT__ is set. For offline compile, it doesn't seem unreasonable to ask the developer to set these.
Apr 11 2020
I don't think we can guarantee this is or will be supported on all devices. The language runtime makes this decision.
Apr 8 2020
In addition to predefining __ATOMIC_RELAXED, etc., clang also predefines __OPENCL_MEMORY_SCOPE_WORK_ITEM and friends. So it doesn't really seem unreasonable for clang to also predefine its known syncscopes, and to require the argument to be one of those integers.
Apr 3 2020
LGTM
Apr 2 2020
Please go ahead and update to a string for the scope.
Mar 25 2020
Thanks. This looks fine to me.
Mar 20 2020
Mar 12 2020
I'm not aware of any valid explicit uses of this intrinsic by library or developer code, so this seems fine to me.
Feb 20 2020
I recall we agreed that conformance tests using mixed types were broken, so this change should be OK. Hopefully this will not affect users.
Feb 19 2020
LGTM
Jan 7 2020
Should this be looking forward to also handling OpenCL, which does require vector support?
Jan 6 2020
I am OK with proceeding here.
I am OK with proceeding here.
Dec 11 2019
Dec 10 2019
This looks OK to me, although tuning on correctly rounded division any time denorms are enabled is not actually required by OpenCL.
Aug 14 2019
Looks fine to me.
Looks fine to me.
Looks fine to me. Thanks!
Looks fine to me. Thanks!
Jun 14 2019
Jun 10 2019
We need to communicate with anyone generating IR to ensure this is being generated before we change the default. clang is only one of those generators. This change will also need to be documented in the usage document.