This is an archive of the discontinued LLVM Phabricator instance.

Add -stop-on-failure driver option, and enable it by default for CUDA compiles.
ClosedPublic

Authored by jlebar on Jan 23 2016, 12:57 PM.

Details

Summary

When compiling CUDA, we run the frontend N times, once for each device
arch. This means that if you have a compile error in your file, you'll
see that error N times.

Relatedly, if ptxas fails, we'll output that error and then still try to
pass its output to fatbinary, which then fails because (duh) its input
file doesn't exist.

This patch stops compilations with -stop-on-failure as soon as we
encounter an error. -stop-on-failure is turned on by default for CUDA
compilations.

Diff Detail

Repository
rL LLVM

Event Timeline

jlebar updated this revision to Diff 45803.Jan 23 2016, 12:57 PM
jlebar retitled this revision from to Add -stop-on-failure driver option, and enable it by default for CUDA compiles..
jlebar updated this object.
jlebar added a reviewer: tra.
jlebar added subscribers: echristo, cfe-commits, jhen.

Friendly ping.

tra added inline comments.Jan 28 2016, 10:52 AM
include/clang/Driver/Options.td
1807 ↗(On Diff #45803)

I'd use 'no-' prefix.

lib/Driver/Driver.cpp
650 ↗(On Diff #45803)

Why is StopOnFailure is false in this case? Shouldn't it obey command line options, too?

jlebar updated this revision to Diff 46300.Jan 28 2016, 11:33 AM
jlebar marked an inline comment as done.

Address tra's review comment (rename flag).

lib/Driver/Driver.cpp
650 ↗(On Diff #45803)

This function is called when the compiler has an internal error or crashes. The jobs we're executing here are preprocessor jobs dumping debugging info. I figured we should not stop on failure when outputting that info?

In general it feels like keeping 2 errors might make the most sense:

(Using a multiarch build rather than a cuda command line, but it should still be the same behavior for consistency)

t.c:

#if _NOT_ARCH4_
#error "aiee!"
#endif

clang -arch arch1 -arch arch2 -arch arch3 -arch arch4 t.c

seems like it might be nice to get 3 errors here rather than a single one and fixing that single one, then getting another one, etc. or realizing what the error is here.

I don't feel strongly about this, but I'm still uncertain as to why we want to make things more complicated here :)

-eric

tra accepted this revision.Jan 28 2016, 12:30 PM
tra edited edge metadata.

LGTM.

lib/Driver/Driver.cpp
652 ↗(On Diff #46300)

As far as I can tell, we don't do anything interesting if we've detected that *any* of the commands have failed. That suggests that doing anything beyond the first failing command does not do us any good. That would suggest that we may really want StopOnFailure=true here.

'false' would preserve current behavior, though.

In either case I'm OK with a constant here.

This revision is now accepted and ready to land.Jan 28 2016, 12:30 PM

In general it feels like keeping 2 errors might make the most sense:

#if _NOT_ARCH4_
#error "aiee!"
#endif

clang -arch arch1 -arch arch2 -arch arch3 -arch arch4 t.c

seems like it might be nice to get 3 errors here rather than a single one and fixing that single one, then getting another one, etc. or realizing what the error is here.

Yes, this patch makes that case worse.

But I suspect errors that apply to some but not all archs will be far less common than errors that apply to all arches -- regular C++ errors like missing a semicolon or whatever. It feels pretty overwhelming to output N copies of every error in those cases, especially when you consider multipage template errors.

In addition, iirc there's no separation between errors outputted for different archs, so it really looks like we're just outputting multiple copies of the errors for fun.

I don't feel strongly about this, but I'm still uncertain as to why we want to make things more complicated here :)

The other reason, which is less important, is that when you have one arch and ptxas fails -- which, it shouldn't, but we're not good enough to catch everything yet, and likely won't be for some time -- the error you get is

ptxas: foo is not defined
*FATAL ERROR*: fatbinary failed, /tmp/bar.cubin does not exist.

I'd like not to display that second line, since it hides the actual problem. Once you get used to it, it's not a big deal, but it tripped me up for a few minutes, and I'm the one who added the call to ptxas.

lib/Driver/Driver.cpp
652 ↗(On Diff #46300)

Sorry, I think I'm misunderstanding something. Would you mind rephrasing this?

As far as I can tell, we don't do anything interesting if we've detected that *any* of the commands have failed. That suggests that doing anything beyond the first failing command does not do us any good.

The scenario I thought this change applied to was:

External tool crashes during a call to ExecuteJobs() (not this one). We now want to output preprocessed inputs, so we run this code, which again calls ExecuteJobs(), but these jobs only run the preprocessor on the inputs.

Now suppose one of those preprocessor jobs fails. Maybe it has a bad preprocessor directive, or maybe #error would be enough. It seems to me in this case that we should continue running the other preprocessor jobs, so we dump as much debug info as we can.

Note that if the StopOnFailure flag is false, afaict it's entirely possible for us to have two inputs, one of which has a pp error and the other of which causes a compiler crash -- if we stopped on failure here, we wouldn't output anything for the second input, which is the one we're interested in.

Sorry again, I'm sure I'm missing something.

tra added inline comments.Jan 28 2016, 1:43 PM
lib/Driver/Driver.cpp
652 ↗(On Diff #46300)

Look at the lines below. If there are any failing commands we just report an error and return.
Even if there are multiple preprocessor jobs and if some of them succeed, we would not get to use their output.

jlebar updated this revision to Diff 46315.Jan 28 2016, 2:35 PM
jlebar marked 3 inline comments as done.
jlebar edited edge metadata.

Pass StopOnFailure = true when running the preprocessor after an ICE.

jlebar added inline comments.Jan 28 2016, 2:37 PM
lib/Driver/Driver.cpp
652 ↗(On Diff #46300)

Oh.

Thanks. :)

The other reason, which is less important, is that when you have one arch and ptxas fails -- which, it shouldn't, but we're not good enough to catch everything yet, and likely won't be for some time -- the error you get is

ptxas: foo is not defined
*FATAL ERROR*: fatbinary failed, /tmp/bar.cubin does not exist.

I'd like not to display that second line, since it hides the actual problem. Once you get used to it, it's not a big deal, but it tripped me up for a few minutes, and I'm the one who added the call to ptxas.

This seems like more of a problem - we don't have this with the existing BindArchAction set of things, we stop before trying to call lipo on darwin. Hrm.

-eric

Eric, are you OK with this going in, or do you want to consider alternatives?

Talking to echristo irl, he would like to know why we don't have this problem with mac universal binaries -- or, do we? He would like to be consistent; I'm onboard with that.

jlebar added a comment.Feb 9 2016, 6:32 PM

Okay, I see why things don't work as expected without this patch but do work for e.g. macos universal binaries.

The reason is, we build a completely separate set of actions for each invocation of cc1 -- one for the host compilation, and one for each device arch. Then the logic inside Compilation.cpp, which is in fact trying not to display duplicate errors, doesn't work, because it doesn't know that these compilations are related.

I think I may be able to fix this.

jlebar updated this revision to Diff 47520.Feb 10 2016, 2:16 PM

Per IRL discussion with echristo, updated so that we just bail as soon as one
subjob fails.

echristo accepted this revision.Feb 10 2016, 2:18 PM
echristo added a reviewer: echristo.

This works for me and I can't think of anything it's going to break so LGTM.

Thanks!

-eric

This revision was automatically updated to reflect the committed changes.

espindola reverted this in r260522 because of test failures in Driver/output-file-cleanup.c.

The reason I didn't catch this locally is that the test is non-hermetic -- if it passed once in an objdir, this patch does not make it fail again. You have to nuke (part of) the objdir before it will fail.

I'll send a patch to make the test hermetic. Unsure yet whether it's a bug in this patch or the test that the test fails at all.