User Details
- User Since
- Aug 18 2016, 12:44 PM (306 w, 22 h)
Fri, Jun 24
The "is_" prefix is not helpful. I suggest either "uses_dynamic_stack" or just "dynamic_stack".
Thu, Jun 23
Tue, Jun 14
Thanks @bcahoon. If there were another available approach that always worked, we would use it.
Tue, Jun 7
This should include updates to https://llvm.org/docs/AMDGPUUsage.html#kernel-descriptor if not planned for a different patch.
Apr 27 2022
t-tye should review this too.
Apr 25 2022
Apr 13 2022
I think this is too focused. There are other image_sample_lz intrinsics and all of them potentially need to be replaced if following this approach.
Apr 11 2022
Apr 7 2022
Looks fine to me!
Mar 21 2022
This change goes along with 1194b9cdda30d and should probably have been landed with it.
Feb 18 2022
Feb 16 2022
Feb 2 2022
Dec 2 2021
Can we agree to drop the macro in LLVM 15 and note that in a comment or elsewhere?
Sep 27 2021
Jun 10 2021
I believe that for the purposes of detecting lane 0 mbcnt_lo is sufficient.
Jun 8 2021
Two approaches for limiting the stores to lane 0 of each wave:
- Write 1 to exec mask, store, and write -1 to exec mask. This works since the exec mask at the start of the wave when this happens is -1
- Check for lane == 0 and branch. The lane can be computed by a) wave64: builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)) b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)
May 13 2021
This looks good to me.
May 6 2021
Mar 19 2021
Jan 25 2021
That is the safest thing to do. However, I'm now hearing some thinking that memory likely to be targeted by f.p. atomics is likely to be cached. But again, better to be safe at least until we're clear this will always be the case.
Jan 20 2021
Thanks. LGTM.
That sequence will give a very accurate result as long as overflow and underflow is avoided. LGTM.
Nov 5 2020
Should this also be IntrConvergent?
Oct 22 2020
Looks good to me.
Jul 9 2020
May 28 2020
LGTM
May 7 2020
May 2 2020
Apr 22 2020
Apr 17 2020
Apr 14 2020
Apr 13 2020
In my opinion, for on-line compile for OpenCL, the platform is responsible for setting __OPENCL_VERSION__. Also, it should be the platform's choice as to how to respond the image support query and how __IMAGE_SUPPORT__ is set. For offline compile, it doesn't seem unreasonable to ask the developer to set these.
Apr 11 2020
I don't think we can guarantee this is or will be supported on all devices. The language runtime makes this decision.
Apr 8 2020
In addition to predefining __ATOMIC_RELAXED, etc., clang also predefines __OPENCL_MEMORY_SCOPE_WORK_ITEM and friends. So it doesn't really seem unreasonable for clang to also predefine its known syncscopes, and to require the argument to be one of those integers.
Apr 3 2020
LGTM
Apr 2 2020
Please go ahead and update to a string for the scope.
Mar 25 2020
Thanks. This looks fine to me.
Mar 20 2020
Mar 12 2020
I'm not aware of any valid explicit uses of this intrinsic by library or developer code, so this seems fine to me.
Feb 20 2020
I recall we agreed that conformance tests using mixed types were broken, so this change should be OK. Hopefully this will not affect users.
Feb 19 2020
LGTM
Jan 7 2020
Should this be looking forward to also handling OpenCL, which does require vector support?
Jan 6 2020
I am OK with proceeding here.
I am OK with proceeding here.
Dec 11 2019
Dec 10 2019
This looks OK to me, although tuning on correctly rounded division any time denorms are enabled is not actually required by OpenCL.
Aug 14 2019
Looks fine to me.
Looks fine to me.
Looks fine to me. Thanks!
Looks fine to me. Thanks!
Jun 14 2019
Jun 10 2019
We need to communicate with anyone generating IR to ensure this is being generated before we change the default. clang is only one of those generators. This change will also need to be documented in the usage document.
May 31 2019
Apr 25 2019
Mar 18 2019
Feb 6 2019
Maybe there are already other types like this, but it saddens me that an offline compiled code object could potentially not work properly if the application is using any of these types. Or should the runtime try to detect a problem using argument metadata?
Jan 28 2019
I think it is perfectly reasonable to treat these as essentially relaxed-only atomic RMW operations and require the application to use fences or barriers if necessary. The ordering and scope are only needed if we ever need this operation to act as a non-relaxed atomic RMW.
Nov 1 2018
Actually the conversions don't need non-default-rounded operations, nor are non-default-rounded arithmetic operations required by OpenCL. However, we've had requests to implement functions such as add_rtz(x,y) which computes x+y with round-to-zero rounding. Our competitors offer such functions, and we implemented them for HSAIL. So we are really trying to get back to parity with HSAIL.
One thing we've wanted for compute for quite a while now is a way to request non-default-rounded add, sub, mul, div, fma, and sqrt. Assuming we ever figure out how to represent these in the IR, ideally without falling back on intrinsics, could this approach be used to implement and minimize the mode changes for those as well?
Oct 16 2018
Ping. There's quite a bit of interest in getting this exposed by clang.
Sep 28 2018
Aug 7 2018
This approach seems fine to me.
Jul 10 2018
By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.
This operation only rounds a single time, and unfortunately always flushes f32 denorms. Thus this transformation should only be done when unsafe math is requested.
Jun 27 2018
Thanks, looks good.
Jun 22 2018
LGTM
Jun 13 2018
Looks good to me.
Jun 5 2018
May 8 2018
Thanks! Looks good to me.
Mar 23 2018
LGTM
LGTM, but I'd rather use '+' instead of ',' for the features.
Mar 22 2018
Mar 2 2018
Looks fine to me.
Feb 27 2018
Feb 26 2018
Feb 23 2018
Feb 16 2018
Feb 15 2018
Feb 14 2018
Jan 31 2018
Jan 30 2018
As I understand it, the option users pass to clang++ is --cuda-gpu-arch=<Arch>. Can't we arrange to generate the right triple if they use gfx900 or some other AMD target name for <Arch>?
Jan 29 2018
Should we expect that the last 3 arguments have any effect? Do we want to test to ensure they have the expected effects?
I believe there are 20 distinct samplers. I'd be more comfortable with this if you could arrange that __translate_sampler_initializer would be called with values in [0, 19] instead of [133, 322].
Jan 26 2018
Can this be tested?
Were you going to add min and max separately?