This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget][devicertl][nfc] Remove some cuda intrinsics, simplify
ClosedPublic

Authored by JonChesterfield on Jan 20 2021, 9:41 AM.

Details

Summary

[libomptarget][devicertl][nfc] Remove some cuda intrinsics, simplify

Replace popc, ffs with clang intrinsics. Move kmpc_impl_min to only file
that uses it and replace template with explictly typed.

Diff Detail

Event Timeline

JonChesterfield requested review of this revision.Jan 20 2021, 9:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2021, 9:41 AM
openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.h
68

reordered to closer match nvptx

openmp/libomptarget/deviceRTLs/common/src/reduction.cu
187

Also considered 'min', but expect it to collide with one of the included headers, and 'min_uint32_t' but couldn't see a pretty way to write that inTheNamingConvention.

INLINE static is a slightly strange construct, chosen to match the functions above.

266

Three call sites, all uint32_t. Don't need a template.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
95–96

These might be better changing typed as __kmpc_impl_lanemask_t, implemented in a common header that picks builtin_ffs or builtin_ffs based on sizeof(__kmpc_impl_lanemask_t). This keeps the current name mangling dispatch and separate implementations.

tianshilei1992 added inline comments.Jan 20 2021, 9:50 AM
openmp/libomptarget/deviceRTLs/common/src/reduction.cu
187

The semantics of min(x, y) is same as x < y ? x : y but the implementation could be very different. For some data type (especially integers) we don't even need a branch here. I don't think it's good for performance to just use branch. If we know the type is uint32_t, then we can use bitwise operation here.

openmp/libomptarget/deviceRTLs/common/src/reduction.cu
187

This will codegen as selec (https://godbolt.org/z/8fxPo7)

define dso_local i64 @min(i64 %0, i64 %1) local_unnamed_addr #0 {
  %3 = icmp ult i64 %0, %1
  %4 = select i1 %3, i64 %0, i64 %1
  ret i64 %4
}

which the backend will have an easier time with than decoding the equivalent bit shuffling.

openmp/libomptarget/deviceRTLs/common/src/reduction.cu
187

Okay. If this can solve the problem, __kmpc_impl_min can be removed as well.

openmp/libomptarget/deviceRTLs/common/src/reduction.cu
187

Yep, __kmpc_impl_min is gone.

I've kept kmpcMin as a local function in preference to open coding x<y?x:y at each call site because it's shorter and fractionally easier to read than checking whether the argument order is min or max

This revision is now accepted and ready to land.Jan 20 2021, 11:44 AM
This revision was landed with ongoing or failed builds.Jan 20 2021, 11:45 AM
This revision was automatically updated to reflect the committed changes.