(Re-land of r260448, which was reverted in r260522 due to a test failure
in Driver/output-file-cleanup.c that only showed up in fresh builds.)
Previously we attempted to be smart; if one job failed, we'd run all
jobs that didn't depend on the failing job.
Problem is, this doesn't work well for e.g. CUDA compilation without
-save-temps. In this case, the device-side and host-side Assemble
actions (which actually are responsible for preprocess, compile,
backend, and assemble, since we're not saving temps) are necessarily
distinct. So our clever heuristic doesn't help us, and we repeat every
error message once for host and once for each device arch.
The main effect of this change, other than fixing CUDA, is that if you
pass multiple cc files to one instance of clang and you get a compile
error, we'll stop when the first cc1 job fails.