It is observed that device code compilation takes most of the compilation time when
clang compiles CUDA/HIP programs, since device code usually contains complicated
computation code. Such code is often highly coupled, which results in
a few large source files that become bottlenecks for the whole project. Things get
worse when such code is compiled for multiple GPU archs, since clang compiles for
each GPU arch sequentially. In practice, it is common to compile for more than 5 GPU
archs.

To alleviate this issue, this patch implements a simple scheduler that lets the clang
driver compile independent jobs in parallel.

This patch tries to minimize the impact on the existing clang driver: there are no
changes to the action builder or the tool chains. It introduces a driver option
-parallel-jobs=n to control the number of parallel jobs to launch. The default is 1,
in which case the patch is NFC with respect to clang driver behavior.
If llvm/clang is built with LLVM_ENABLE_THREADS off, this change is also NFC.

The basic design of the scheduler is to find the dependences among the jobs and
use a thread to launch each job once the jobs it depends on are done.
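
The scheduling idea can be illustrated with a small standalone sketch. This is not
the code in this patch: the Job struct and the runJobs function are hypothetical
names made up for illustration, jobs are plain callables rather than driver commands,
and the dependence graph is assumed to be acyclic. A worker thread picks any pending
job whose dependences have all finished, runs it, and then wakes up the other workers.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical job record; not the clang driver's own job type.
struct Job {
  std::function<void()> Run;     // work to perform, e.g. spawn a subprocess
  std::vector<std::size_t> Deps; // indices of jobs that must finish first
};

// Runs all jobs using at most MaxParallel worker threads and returns the
// order in which the jobs finished. Assumes the dependence graph is acyclic.
std::vector<std::size_t> runJobs(const std::vector<Job> &Jobs,
                                 unsigned MaxParallel) {
  enum State { Pending, Running, Done };
  std::vector<State> States(Jobs.size(), Pending);
  std::vector<std::size_t> FinishOrder;
  std::size_t NumRunning = 0;
  std::mutex M;
  std::condition_variable CV;

  // Returns the index of a pending job whose dependences are all done, or
  // Jobs.size() if there is none. The caller must hold M.
  auto findReady = [&]() -> std::size_t {
    for (std::size_t I = 0; I < Jobs.size(); ++I) {
      if (States[I] != Pending)
        continue;
      bool Ready = true;
      for (std::size_t D : Jobs[I].Deps)
        if (States[D] != Done) {
          Ready = false;
          break;
        }
      if (Ready)
        return I;
    }
    return Jobs.size();
  };

  auto Worker = [&]() {
    std::unique_lock<std::mutex> Lock(M);
    for (;;) {
      std::size_t Next = findReady();
      if (Next == Jobs.size()) {
        // Nothing is ready. If nothing is running either, no job can become
        // ready any more, so this worker is finished.
        if (NumRunning == 0)
          return;
        CV.wait(Lock); // woken when some running job completes
        continue;
      }
      States[Next] = Running;
      ++NumRunning;
      Lock.unlock();
      Jobs[Next].Run(); // run the job without holding the lock
      Lock.lock();
      States[Next] = Done;
      --NumRunning;
      FinishOrder.push_back(Next);
      CV.notify_all(); // dependent jobs may have become ready
    }
  };

  std::vector<std::thread> Threads;
  unsigned N = std::max(1u, MaxParallel);
  for (unsigned I = 0; I < N; ++I)
    Threads.emplace_back(Worker);
  for (std::thread &T : Threads)
    T.join();
  return FinishOrder;
}
```

With MaxParallel set to 1 the jobs run one at a time in a dependence-respecting order,
which mirrors the default of -parallel-jobs=1 described above. The sketch rescans all
jobs to find a ready one; this keeps it short, and a real scheduler could track ready
jobs more efficiently.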