Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.
Yes, with a complication: for combined directives of worksharing loops (distributed parallel for) clang emits the lower bound and upper bound as distinct parameters besides the global tid, bound tid and the captured variables (aggregated in the struct of this patch). We will need to have a flag in parallel_51 to unwrap the global args array for this case.
It's not offload-specific, so the patch applies to host OpenMP too. The patch, as is, does not introduce any incompatibilities, the OpenMP runtime parallel call interface stays the same and the only change is the aggregation of the outlined function arguments in a struct. However, it makes sense to simplify the OpenMP parallel call runtime interface to avoid variadic arguments. This change is not necessary for this patch, but I'm considering to include it too.
(answered now, I forgot to submit this yesterday -.- )
It is not. It applies to all parallel regions. We still use the variadic call on the CPU but we can now replace it with a non-variadic interface.
This does not bring any compatibility issues, right?
We'll keep the old interface (kmpc_fork_call) for the host (libomp) and code that uses it will continue to work. New code will use it also in a compatible way
because we change caller and callee of kmpc_fork_call to adjust for the aggregate. This will also hold as the new non-variadic interface is introduced.
The device side is simpler, we are only compatible with a device runtime that machtes (as we link it into the app), so there is no compatibility issue to be expected.
We used this kind of codegen initially but later found out that it causes a large overhead when gathering pointers into a record. What about hybrid scheme where the first args are passed as arguments and others (if any) are gathered into a record?
I'm confused, maybe I misunderstand the problem. The parallel function arguments need to go from the main thread to the workers somehow, I don't see how this is done w/o a record. This patch makes it explicit though.