This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Use default grid value for static grid size
ClosedPublic

Authored by jdoerfert on Aug 20 2023, 8:06 PM.

Details

Summary

If the user did not provide any static clause to override the grid size,
we assume the default grid size as upper bound and use it to improve
code generation through vendor specific attributes.

Fixes: https://github.com/llvm/llvm-project/issues/64816

Diff Detail

Event Timeline

jdoerfert created this revision.Aug 20 2023, 8:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2023, 8:06 PM
jdoerfert requested review of this revision.Aug 20 2023, 8:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2023, 8:06 PM
jhuber6 added inline comments.Aug 21 2023, 8:50 AM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
4134

Can't we just get the module from the Function and check the triple?

Is amdgpu-flat-work-group-size a bound or exact value?

Is amdgpu-flat-work-group-size a bound or exact value?

It's a range. What really matters is the upper bound, that is the value you see in the test (max flat ...).

jdoerfert updated this revision to Diff 552052.Aug 21 2023, 9:24 AM

Added checks.

jhuber6 accepted this revision.Aug 21 2023, 9:24 AM

LG, thanks

This revision is now accepted and ready to land.Aug 21 2023, 9:24 AM
This revision was landed with ongoing or failed builds.Aug 23 2023, 11:13 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2023, 11:13 AM

This patch produces the following difference in IR out of CodeGen.

Without this patch:

%nvptx_num_threads.i = tail call i32 @__kmpc_get_hardware_num_threads_in_block() #2
call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 %nvptx_num_threads.i) #2

With this patch:

call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 256) #2

Setting the blocksize to a constant too early would be a problem if the runtime changes the blocksize, e.g. because of an environment variable or because of a low trip count (D152014). Comments? @jdoerfert

This patch produces the following difference in IR out of CodeGen.

Without this patch:

%nvptx_num_threads.i = tail call i32 @__kmpc_get_hardware_num_threads_in_block() #2
call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 %nvptx_num_threads.i) #2

With this patch:

call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 256) #2

Setting the blocksize to a constant too early would be a problem if the runtime changes the blocksize, e.g. because of an environment variable or because of a low trip count (D152014). Comments? @jdoerfert

From OpenMP-Opt:

case OMPRTL___kmpc_get_hardware_num_threads_in_block:
   Changed = Changed | foldKernelFnAttribute(A, "omp_target_thread_limit");
   break;

this is wrong. We should fold thread limit, not num_threads_in_block.
The latter can only be folded if we will not lower it, which we currently cannot guarantee.