The device runtime contains several calls to __kmpc_get_hardware_num_threads_in_block
and __kmpc_get_hardware_num_blocks. If the thread_limit and the num_teams are constant,
these calls can be folded to the constant value.
In this patch we use the already introduced AAFoldRuntimeCall and the NumTeams and
NumThreads kernel attributes (to be introduced in a different patch) to fold these functions.
The code checks all the kernels, and if their attributes match, the functions are folded.
In the future we will explore specializing for multiple values of NumThreads and NumTeams.
Depends on D106390
Probably worth a separate patch to add them.