When clang compiles CUDA/HIP programs, device code compilation takes most of the
compilation time, since device code usually contains complicated computation.
Such code is often highly coupled, which results in a few large source files that
become bottlenecks for the whole project. Things get worse when the code is
compiled for multiple GPU archs, since clang compiles for each GPU arch
sequentially. In practice it is common to compile for more than five GPU archs.
To alleviate this issue, this patch implements a simple scheduler that lets the clang
driver compile independent jobs in parallel.
The patch tries to minimize the impact on the existing clang driver: there are no
changes to the action builder or the tool chains. It introduces a driver option
-parallel-jobs=n to control the number of parallel jobs to launch. By default n is 1,
which keeps the existing clang driver behavior (NFC). If llvm/clang is built with
LLVM_ENABLE_THREADS off, this change is also NFC.
The basic design of the scheduler is to compute the dependencies among the jobs and
launch a job on a worker thread once all of the jobs it depends on have finished.
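A minimal, self-contained sketch of that scheduling idea, independent of the actual
clang driver classes (Job, Work, Deps, and runParallel below are hypothetical
placeholders, not the names used in the patch):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Job {
  std::function<void()> Work;   // e.g. spawn one device-side compilation
  std::vector<size_t> Deps;     // indices of jobs that must finish first
};

// Run Jobs with at most NThreads workers; a job starts only after all of its
// dependencies are done. NThreads == 1 degenerates to sequential execution.
void runParallel(std::vector<Job> &Jobs, unsigned NThreads) {
  std::mutex M;
  std::condition_variable CV;
  std::vector<char> Done(Jobs.size(), 0), Started(Jobs.size(), 0);
  unsigned Running = 0, Finished = 0;

  auto Ready = [&](size_t I) {
    for (size_t D : Jobs[I].Deps)
      if (!Done[D])
        return false;
    return !Started[I];
  };

  std::unique_lock<std::mutex> Lock(M);
  while (Finished < Jobs.size()) {
    bool Launched = false;
    for (size_t I = 0; I < Jobs.size(); ++I) {
      if (Running < NThreads && Ready(I)) {
        Started[I] = 1;
        ++Running;
        Launched = true;
        std::thread([&, I] {
          Jobs[I].Work();        // run the job outside the lock
          std::lock_guard<std::mutex> G(M);
          Done[I] = 1;
          --Running;
          ++Finished;
          CV.notify_all();
        }).detach();
      }
    }
    if (!Launched)
      CV.wait(Lock);             // wait for some job to finish, then rescan
  }
}
```

With the default -parallel-jobs=1 (or with LLVM_ENABLE_THREADS off) such a scheduler
never has more than one job in flight, which is what keeps the change NFC for the
existing driver behavior.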
Nit: Using a pointer as a key will result in sub-compilations being run in a different order from run to run, and that may cause build results to change from run to run.
I can't think of a realistic scenario yet. One case where it may make a difference is the generation of dependency files.
We currently leak some output file name flags to device-side compilations. E.g., -fsyntax-only -MD -MF foo.d will write foo.d for each sub-compilation. At best we end up with the result of whichever sub-compilation finished last; at worst we end up with corrupt output. In this case the output-argument leak is the problem, but I suspect there may be other cases where execution order is observable.
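To make the ordering nit concrete, here is a standalone illustration (not code from the patch; Command and the map below are stand-ins): a container keyed by pointer value orders its entries by heap address, so the traversal order of otherwise-independent jobs can differ between runs.

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Command { std::string Name; };

int main() {
  std::vector<std::unique_ptr<Command>> Jobs;
  for (int I = 0; I < 4; ++I)
    Jobs.push_back(std::make_unique<Command>(Command{"job" + std::to_string(I)}));

  // Keyed by pointer: the map sorts entries by the commands' heap addresses.
  std::map<const Command *, int> Pending;
  for (int I = 0; I < 4; ++I)
    Pending[Jobs[I].get()] = I;

  // This traversal may visit the jobs in a different order on another run
  // (heap layout, ASLR), which is what makes any order-dependent output,
  // such as a shared dependency file, unstable.
  for (const auto &KV : Pending)
    std::cout << KV.first->Name << "\n";
}
```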