This is an archive of the discontinued LLVM Phabricator instance.

[BPF] Emit UNDEF rather than constant 0
Needs ReviewPublic

Authored by tamird on Jul 27 2023, 5:15 PM.

Details

Summary

In certain cases the BPF backend may be confronted with IR that cannot
be expressed as valid BPF assembly. Unfortunately not all LLVM APIs
permit errors to be propagated upwards; instead the backend is expected
to emit diagnostics which are interpreted by the embedding application
(i.e. clang). This means that the backend must sometimes produce
incorrect assembly to avoid tripping assertions upstream and rely on the
embedding application respecting its diagnostics rather than attempting
to use the incorrect assembly it has generated.

This patch changes the strategy taken by the BPF backend when it
encounters:

  • definitions of functions which take more arguments than permitted by the BPF target.
  • calls to functions which take more arguments than permitted by the BPF target.

Prior to this patch the BPF replaced overrunning arguments with
"constant 0"; after this patch it replaces them with UNDEF.

This change is an attempt at being defensive. While most embedders (e.g.
clang) correctly handle "error" diagnostics by aborting code generation,
users of the LLVM API may implement custom diagnostics handlers which do
not exit on "error". This is exacerbated by the lack of documentation
surrounding the diagnostics functions and severity levels.

By emitting UNDEF rather than "constant 0" this patch attempts to
prevent the resulting program from passing the BPF verifier. Prior to
this patch such programs would almost always pass the verifier but
behave incorrectly.

Reverses choices made in 20a8d8e (D20571) and 353a5e0 (D20726).

Diff Detail

Event Timeline

tamird created this revision.Jul 27 2023, 5:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2023, 5:15 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
tamird requested review of this revision.Jul 27 2023, 5:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2023, 5:15 PM
tamird updated this revision to Diff 544985.Jul 27 2023, 5:32 PM

Add missing delegation.

tamird edited the summary of this revision. (Show Details)Jul 27 2023, 5:33 PM
arsenm added a subscriber: arsenm.Jul 27 2023, 5:45 PM
arsenm added inline comments.
llvm/lib/Target/BPF/BPFISelLowering.cpp
872

Just let the proper context error happen, this is just worse

llvm/lib/Target/BPF/BPFISelLowering.h
78

TargetLowering should not contain state, you don't need this

tamird added inline comments.Jul 27 2023, 7:15 PM
llvm/lib/Target/BPF/BPFISelLowering.cpp
872

Please read the description. The problem is that without fataling, the embedder can use the miscompiled code (and we saw this because our diagnostic handler did not treat error as fatal).

llvm/lib/Target/BPF/BPFISelLowering.h
78

Did you read the description?

ast requested changes to this revision.Jul 27 2023, 7:29 PM

Please provide a test case that demonstrates the issue.
Right now it looks to me that aya front-end generates broken IR and it causes all kinds of issues in the backend.
We won't be doing defensive programming to help front-ends debug their issues.

This revision now requires changes to proceed.Jul 27 2023, 7:29 PM

Please provide a test case that demonstrates the issue.
Right now it looks to me that aya front-end generates broken IR and it causes all kinds of issues in the backend.
We won't be doing defensive programming to help front-ends debug their issues.

Please please read the description. Writing a test for this would require use of the C API, and would amount to writing an llc-like program with a diagnostic handler that is arguably wrong.

This diff is intended to start a discussion about the error handling path being error prone. Can you engage with that please?

I grepped the code base and see that finalizeLowering() logic is customized only in a handful of cases:

  • llvm/lib/Target/ARM/ARMISelLowering.cpp
  • llvm/lib/Target/AMDGPU/SIISelLowering.cpp
  • llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

In all these cases there is no similar logic with regards to error handling.

We have the following "fail" messages in the BPFISelLowering:

  • "stack arguments are not supported"
  • "variadic functions are not supported"
  • "aggregate returns are not supported"
  • "too many arguments"
  • "pass by value not supported"
  • "A call to built-in function '$func' is not supported." (this one is for ExternalSymbolSDNode)
  • "aggregate returns are not supported"
  • "only small returns supported"

There are no true warnings here, every case is actually an error. (Maybe with exception for ExternalSymbolSDNode which just should be allowed). From my practice I rarely hit even one instance of such errors => it might be the case that capability to report multiple errors at once is not that important. Maybe just update fail function to call report_fatal_error() internally?

BPFISelLowering is a field in BPFSubtarget, as far as I understand sub-target is a singleton => lowering instance is also a singleton, thus having a global state there is not ideal.

@yonghong-song , what do you think?

I grepped the code base and see that finalizeLowering() logic is customized only in a handful of cases:

  • llvm/lib/Target/ARM/ARMISelLowering.cpp
  • llvm/lib/Target/AMDGPU/SIISelLowering.cpp
  • llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

In all these cases there is no similar logic with regards to error handling.

We have the following "fail" messages in the BPFISelLowering:

  • "stack arguments are not supported"
  • "variadic functions are not supported"
  • "aggregate returns are not supported"
  • "too many arguments"
  • "pass by value not supported"
  • "A call to built-in function '$func' is not supported." (this one is for ExternalSymbolSDNode)
  • "aggregate returns are not supported"
  • "only small returns supported"

There are no true warnings here, every case is actually an error. (Maybe with exception for ExternalSymbolSDNode which just should be allowed). From my practice I rarely hit even one instance of such errors => it might be the case that capability to report multiple errors at once is not that important. Maybe just update fail function to call report_fatal_error() internally?

BPFISelLowering is a field in BPFSubtarget, as far as I understand sub-target is a singleton => lowering instance is also a singleton, thus having a global state there is not ideal.

@yonghong-song , what do you think?

From the description:

Note also that 8e6aab3 (D21368) removed the exit-on-error flag which is necessary for the unit tests with this change.

I think that even if everyone agreed that crashing is acceptable we'd have to find a solution that would allow the tests to pass. Any ideas?

Note also that 8e6aab3 (D21368) removed the exit-on-error flag which is necessary for the unit tests with this change.

I think that even if everyone agreed that crashing is acceptable we'd have to find a solution that would allow the tests to pass. Any ideas?

Whats the problem with tests?
The following snippet works on Linux (tried on a single test, should apply to the rest):

diff --git a/llvm/lib/Target/BPF/BPFISelLowering.cpp b/llvm/lib/Target/BPF/BPFISelLowering.cpp
index ffeef14d413b..857b729b0217 100644
--- a/llvm/lib/Target/BPF/BPFISelLowering.cpp
+++ b/llvm/lib/Target/BPF/BPFISelLowering.cpp
@@ -48,6 +48,7 @@ static void fail(const SDLoc &DL, SelectionDAG &DAG, const Twine &Msg,
   MachineFunction &MF = DAG.getMachineFunction();
   DAG.getContext()->diagnose(DiagnosticInfoUnsupported(
       MF.getFunction(), Twine(Str).concat(Msg), DL.getDebugLoc()));
+  llvm::report_fatal_error("Fatal error in BPFISelLowering, can't proceed");
 }
 
 BPFTargetLowering::BPFTargetLowering(const TargetMachine &TM,
diff --git a/llvm/test/CodeGen/BPF/many_args1.ll b/llvm/test/CodeGen/BPF/many_args1.ll
index 9e9a3fdfc953..4fb709437f91 100644
--- a/llvm/test/CodeGen/BPF/many_args1.ll
+++ b/llvm/test/CodeGen/BPF/many_args1.ll
@@ -1,5 +1,4 @@
-; RUN: not llc -march=bpf < %s 2> %t1
-; RUN: FileCheck %s < %t1
+; RUN: (llc -march=bpf < %s 2>&1 || exit 0) | FileCheck %s
 ; CHECK: error: <unknown>:0:0: in function foo i32 (i32, i32, i32): t10: i64 = GlobalAddress<ptr @bar> 0 too many arguments
 
 ; Function Attrs: nounwind uwtable

But CI is also run on Windows, so probably a more portable incantation is needed (don't have Windows at hand to check).
Separate question is whether to use report_fatal_error or something that does not generate stack trace as error messages have the necessary information after recent changes.

After examining source code for llvm-lit, I found an option --crash for not. Here is the portable incantation:

; RUN: not --crash llc %s -filetype=null -march=bpf 2>&1 | FileCheck %s

This has to be used if command crashes (e.g. exits with report_fatal_error()), it is not needed if command does exit(1). Does not seem to be documented but grep shows wide usage.

tamird updated this revision to Diff 546577.Aug 2 2023, 12:34 PM

Update tests using not --crash

tamird marked 2 inline comments as done.Aug 2 2023, 12:35 PM

Nice! Done. Will wait for @yonghong-song.

tamird edited the summary of this revision. (Show Details)Aug 2 2023, 12:43 PM
ast requested changes to this revision.Aug 2 2023, 4:51 PM

Agree with Eduard. Doing report_fatal_error() in fail() is cleaner. Proceeding with miscompiled code can only cause further issues in other parts of the backend.
Not sure why we didn't do this earlier.

This revision now requires changes to proceed.Aug 2 2023, 4:51 PM
arsenm added inline comments.Aug 2 2023, 4:55 PM
llvm/lib/Target/BPF/BPFISelLowering.cpp
872

Then your compiler driver / embedder / whatever is broken and needs to check the context. TargetLowering must not be stateful and you should not emit a fatal error here

872

check the context for errors

arsenm added inline comments.Aug 2 2023, 4:57 PM
llvm/lib/Target/BPF/BPFISelLowering.cpp
872

emit proper context error is ideal for user facing errors, you just need to ensure you emit some kind of valid output. report_fatal_error can be OK for cases users cannot reasonably reach. deferring to finalizeLowering doesn't make any sense, if you're going to report_fatal_error just do it immediately. doing it here is a middle ground that doesn't make sense

tamird added a comment.Aug 2 2023, 6:13 PM

Agree with Eduard. Doing report_fatal_error() in fail() is cleaner. Proceeding with miscompiled code can only cause further issues in other parts of the backend.
Not sure why we didn't do this earlier.

Counterpoint: D98471. @yonghong-song can you weigh in please?

It's not apparent that there's an obviously right thing to do here. The standard error emitting mechanisms seem to always involve a risk of miscompilation. Does anyone know how other backends deal with this situation, where code generation cannot reasonably proceed? Looks like other backends use UNDEF in at least some similar cases (https://github.com/llvm/llvm-project/blob/cd328c10ad503b243cc333f813c0c6ee66598ffc/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L1265-L1269). From my perspective, as long as we miscompile more severely - that is we produce code that definitely doesn't work, rather than code that might work but do the wrong thing - then I'm happy. @ast I believe you suggested constant zeroes rather than UNDEF in earlier reviews -- can you shed some light on why that was deemed preferable?

ast added a comment.Aug 2 2023, 6:34 PM

Agree with Eduard. Doing report_fatal_error() in fail() is cleaner. Proceeding with miscompiled code can only cause further issues in other parts of the backend.
Not sure why we didn't do this earlier.

Counterpoint: D98471. @yonghong-song can you weigh in please?

It's not apparent that there's an obviously right thing to do here. The standard error emitting mechanisms seem to always involve a risk of miscompilation. Does anyone know how other backends deal with this situation, where code generation cannot reasonably proceed? Looks like other backends use UNDEF in at least some similar cases (https://github.com/llvm/llvm-project/blob/cd328c10ad503b243cc333f813c0c6ee66598ffc/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L1265-L1269). From my perspective, as long as we miscompile more severely - that is we produce code that definitely doesn't work, rather than code that might work but do the wrong thing - then I'm happy. @ast I believe you suggested constant zeroes rather than UNDEF in earlier reviews -- can you shed some light on why that was deemed preferable?

actually. let's step back.
The compilation doesn't report success at the end, so what is the problem?
diagnose() and proceed is a normal pattern used in many backends.

tamird added a comment.Aug 2 2023, 6:59 PM

Agree with Eduard. Doing report_fatal_error() in fail() is cleaner. Proceeding with miscompiled code can only cause further issues in other parts of the backend.
Not sure why we didn't do this earlier.

Counterpoint: D98471. @yonghong-song can you weigh in please?

It's not apparent that there's an obviously right thing to do here. The standard error emitting mechanisms seem to always involve a risk of miscompilation. Does anyone know how other backends deal with this situation, where code generation cannot reasonably proceed? Looks like other backends use UNDEF in at least some similar cases (https://github.com/llvm/llvm-project/blob/cd328c10ad503b243cc333f813c0c6ee66598ffc/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L1265-L1269). From my perspective, as long as we miscompile more severely - that is we produce code that definitely doesn't work, rather than code that might work but do the wrong thing - then I'm happy. @ast I believe you suggested constant zeroes rather than UNDEF in earlier reviews -- can you shed some light on why that was deemed preferable?

actually. let's step back.
The compilation doesn't report success at the end, so what is the problem?
diagnose() and proceed is a normal pattern used in many backends.

This is half-correct. As I write in the description: whether to use the result of compilation is decided by the embedding application. In earlier versions of bpf-linker, these diagnostics were not considered fatal, which resulted in miscompiled code (which uses constant 0 when nothing sensible can be done). This in turn produced very hard to debug failures.

It seems obvious that doing anything other than emitting diagnostics is going against the grain in LLVM. This means we have to allow compilation superficially "succeed". Having said that, I think we should produce obviously broken code rather than producing code that might work if the stars align, but might produce subtly incorrect behavior.

ast added a comment.Aug 2 2023, 8:26 PM

Agree with Eduard. Doing report_fatal_error() in fail() is cleaner. Proceeding with miscompiled code can only cause further issues in other parts of the backend.
Not sure why we didn't do this earlier.

Counterpoint: D98471. @yonghong-song can you weigh in please?

It's not apparent that there's an obviously right thing to do here. The standard error emitting mechanisms seem to always involve a risk of miscompilation. Does anyone know how other backends deal with this situation, where code generation cannot reasonably proceed? Looks like other backends use UNDEF in at least some similar cases (https://github.com/llvm/llvm-project/blob/cd328c10ad503b243cc333f813c0c6ee66598ffc/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L1265-L1269). From my perspective, as long as we miscompile more severely - that is we produce code that definitely doesn't work, rather than code that might work but do the wrong thing - then I'm happy. @ast I believe you suggested constant zeroes rather than UNDEF in earlier reviews -- can you shed some light on why that was deemed preferable?

actually. let's step back.
The compilation doesn't report success at the end, so what is the problem?
diagnose() and proceed is a normal pattern used in many backends.

This is half-correct. As I write in the description: whether to use the result of compilation is decided by the embedding application. In earlier versions of bpf-linker, these diagnostics were not considered fatal, which resulted in miscompiled code (which uses constant 0 when nothing sensible can be done). This in turn produced very hard to debug failures.

It seems obvious that doing anything other than emitting diagnostics is going against the grain in LLVM. This means we have to allow compilation superficially "succeed". Having said that, I think we should produce obviously broken code rather than producing code that might work if the stars align, but might produce subtly incorrect behavior.

Without specific test this all are hand wave excuses. Fix you front-end.

I messed up.
@ast is right, I missed the fact that clang already exits with return code 1 when any of the conditions reported by fail(...) happen, e.g.:

$ cat t.c
struct foo { int a[100]; };
extern void consume(void *);
struct foo bar(struct foo foo) {
    consume(&foo);
    return foo;
}
$ clang --target=bpf t.c; echo "exit code is $?"
t.c:3:12: error: aggregate returns are not supported
    3 | struct foo bar(struct foo foo) {
      |            ^
1 error generated.
exit code is 1

This happens because diagnose() is implemented as follows:

void LLVMContext::diagnose(const DiagnosticInfo &DI) {
  ...
  if (pImpl->DiagHandler &&
      (!pImpl->RespectDiagnosticFilters || isDiagnosticEnabled(DI)) &&
      pImpl->DiagHandler->handleDiagnostics(DI))
    return;
  ...
  // Otherwise, print the message with a prefix based on the severity.
  ...
  if (DI.getSeverity() == DS_Error)
    exit(1);
}

And fail() uses DiagnosticInfoUnsupported which has DS_Error severity. Usage of diagnostic in combination with DiagnosticInfoUnsupported could be found in other backends as well, e.g. in RISCVISelLowering.cpp. So, it looks like no changes are necessary in the BPF lowering and embedder needs to install diagnostics handler and look for DS_Error messages.

tamird updated this revision to Diff 546851.Aug 3 2023, 7:12 AM

Write tests!

tamird retitled this revision from [BPF] report fatal error rather than miscompile to [BPF] Emit UNDEF rather than constant 0.Aug 3 2023, 7:13 AM
tamird edited the summary of this revision. (Show Details)

Without specific test this all are hand wave excuses. Fix you front-end.

Yes, I fixed it in https://github.com/aya-rs/bpf-linker/commit/c36c254ae4db8d7bddd33752dd189334a4c58f83 and https://github.com/aya-rs/bpf-linker/commit/e474f8262db57f1344636466a644cc8929bd25ba.

The purpose of this proposed change is to save future users of the C API the same pain. I have added tests.

tamird marked 3 inline comments as done.Aug 3 2023, 7:14 AM
tamird updated this revision to Diff 546885.Aug 3 2023, 8:15 AM

Appease clang-tidy.

yonghong-song added inline comments.Aug 3 2023, 10:27 PM
llvm/test/Bindings/llvm-c/many_args2.ll
12 ↗(On Diff #546885)

Could you explain what does the following mean?

Before this test was introduced, the BPF backend would generate incorrect-but-viable code that would pass the verifier and violate user expectations at runtime.
llvm/test/Bindings/llvm-c/struct_ret2.ll
7 ↗(On Diff #546885)

The same here. I guess I must miss something. Do you mean bpf backend generates incorrect code, or it detects incorrect condition but proceed and eventually report a success rather than a failure?
I must miss something here.

eddyz87 added inline comments.Aug 4 2023, 3:36 AM
llvm/test/Bindings/llvm-c/struct_ret2.ll
7 ↗(On Diff #546885)

Hi @yonghong-song ,

BPF backend detects and reports unsupported patterns, the report is done using function diagnose(). For diagnostic events with error severity diagnose() calls exit(1) if no other diagnostic handlers are installed (see comment). However, the backend is also designed to proceed after issuing error diagnostics by injecting incorrect instructions. @tamird suggests to change some of these instructions to make it more obvious for BPF verifier that bytecode is bogus. The argument is that this would help during development of standalone tools that embed LLVM and for some reason install error handlers that don't react properly to error level diagnostics. I wonder what is the default error handler for embedders, let me check.

tamird marked an inline comment as done.Aug 4 2023, 4:32 AM
tamird added inline comments.
llvm/test/Bindings/llvm-c/many_args2.ll
12 ↗(On Diff #546885)

It means that in the case that sensible code could not be generated, the BPF backend would generate r0=0; exit, which would appease the verifier but do something other than what the user intended at runtime.

This test asserts that such "broken" functions do not set r0, giving a strong probability that the verifier will reject it.

Resolving this thread, let's continue in the other one.

llvm/test/Bindings/llvm-c/struct_ret2.ll
7 ↗(On Diff #546885)

@eddyz87 is mostly correct.

also wondered what the default handler would do; before I added the current handler in my test, I observed that the handler would exit(1).

However note the following:

  • clang obviously does something more sophisticated; we know this because when multiple diagnostics are issued, we seem them all. This means the diagnostic handler is stateful.
  • it's very common to provide a custom handler to allow filtering on severity, logging to a file, etc.
  • it is not at all obvious that DS_Error is to be considered fatal. When I first discovered the bug in bpf-linker I searched for documentation about these severity levels, and found basically nothing explaining this. Even @eddyz87's earlier comments refer to clang source code rather than documentation.

Fundamentally this is a documentation problem, but it would certainly not hurt to be a bit more defensive in the BPF backend IMO.

DiagnosticSeverity

LLVMContextSetDiagnosticHandler

eddyz87 added inline comments.Aug 4 2023, 5:25 AM
llvm/test/Bindings/llvm-c/struct_ret2.ll
7 ↗(On Diff #546885)

Imho, given the context, changing "get fake zero" to "get fake undef" in a few places is not a big deal. I'm also not sure that we need complicated tests infrastructure for this. ¯\_(ツ)_/¯.

tamird marked an inline comment as done.Aug 4 2023, 5:47 AM
tamird added inline comments.
llvm/test/Bindings/llvm-c/struct_ret2.ll
7 ↗(On Diff #546885)

Please provide a test case that demonstrates the issue.
Right now it looks to me that aya front-end generates broken IR and it causes all kinds of issues in the backend.
We won't be doing defensive programming to help front-ends debug their issues.

Without specific test this all are hand wave excuses. Fix you front-end.

So if I understand correctly, BPF backend can handle clang frontend generated IR properly and issue proper warning. But with rust frontend, the IR is a little bit different and BPF backend will behave differently and allow to generate illegal code. Is this right? The bpf-unsupported-emit.c is exactly the case to demonstrate that rust frontend may generate different IR and the BPF backend will generate incorrect code. Is my understanding correct?

Regarding to diag from constant 0 to undef, I am not 100% confident that this will solve all the problems. What if now r0 is undef but later continued compilation generates an assignment to r0? So we should solve the problem to ensure for buggy code the compilation fail rather succeed.

tamird added a comment.Aug 5 2023, 9:41 AM

So if I understand correctly, BPF backend can handle clang frontend generated IR properly and issue proper warning. But with rust frontend, the IR is a little bit different and BPF backend will behave differently and allow to generate illegal code. Is this right? The bpf-unsupported-emit.c is exactly the case to demonstrate that rust frontend may generate different IR and the BPF backend will generate incorrect code. Is my understanding correct?

No, that's not correct. As you can see, all the IR in the tests I added was generated from C. There's nothing about the IR generated from Rust that is to blame.

The problem is that it's very easy to accidentally *not* exit from the diagnostic handler, as the test demonstrate -- and in case you don't, you will get "success" with broken code.

Regarding to diag from constant 0 to undef, I am not 100% confident that this will solve all the problems. What if now r0 is undef but later continued compilation generates an assignment to r0? So we should solve the problem to ensure for buggy code the compilation fail rather succeed.

Yeah, that's a fair point. Do you have ideas on how to achieve that?

Hi @yonghong-song, what next steps would you like to see here?

Based on the discussion with @eddyz87, the following llvm diagnose implementation shows why clang will through an error rather than letting compilation continue:

void LLVMContext::diagnose(const DiagnosticInfo &DI) {
  if (auto *OptDiagBase = dyn_cast<DiagnosticInfoOptimizationBase>(&DI))
    if (LLVMRemarkStreamer *RS = getLLVMRemarkStreamer())
      RS->emit(*OptDiagBase);

  // If there is a report handler, use it.
  if (pImpl->DiagHandler &&
      (!pImpl->RespectDiagnosticFilters || isDiagnosticEnabled(DI)) &&
      pImpl->DiagHandler->handleDiagnostics(DI))
    return;
      
  if (!isDiagnosticEnabled(DI))
    return;
  
  // Otherwise, print the message with a prefix based on the severity.
  DiagnosticPrinterRawOStream DP(errs());
  errs() << getDiagnosticMessagePrefix(DI.getSeverity()) << ": ";
  DI.print(DP);
  errs() << "\n";
  if (DI.getSeverity() == DS_Error)
    exit(1);
}

I suspect you might have your own handler like in the below code

if (pImpl->DiagHandler &&
    (!pImpl->RespectDiagnosticFilters || isDiagnosticEnabled(DI)) &&
    pImpl->DiagHandler->handleDiagnostics(DI))
  return;

If this is the case, in your own handler, you should ensure that if
DI.getSeverity() == DS_Error you should do 'exit(1)`.

Yes, thank you. As mentioned earlier, we have fixed this bug. The purpose of this patch is to be defensive and avoid this footgun for future users of the LLVM C API.

Do you feel that this patch degrades the quality of the implementation?

Yes, thank you. As mentioned earlier, we have fixed this bug. The purpose of this patch is to be defensive and avoid this footgun for future users of the LLVM C API.

Do you feel that this patch degrades the quality of the implementation?

With current clang, could you show me the error message with 'constant 0' vs. 'undef'? We can then decide to pick the one which is more user friendly.

With current clang, could you show me the error message with 'constant 0' vs. 'undef'? We can then decide to pick the one which is more user friendly.

You're referring to verifier errors, right? With "constant 0" there is no verifier error - the program will just do the wrong thing (not what the programmer intended). I don't have the "undef" verifier error on hand, I can get that to you tomorrow.

With current clang, could you show me the error message with 'constant 0' vs. 'undef'? We can then decide to pick the one which is more user friendly.

You're referring to verifier errors, right? With "constant 0" there is no verifier error - the program will just do the wrong thing (not what the programmer intended). I don't have the "undef" verifier error on hand, I can get that to you tomorrow.

Not about verifier errors but errors at compilation time. The compilation should fail in two instances in this patch.

With current clang, could you show me the error message with 'constant 0' vs. 'undef'? We can then decide to pick the one which is more user friendly.

You're referring to verifier errors, right? With "constant 0" there is no verifier error - the program will just do the wrong thing (not what the programmer intended). I don't have the "undef" verifier error on hand, I can get that to you tomorrow.

Not about verifier errors but errors at compilation time. The compilation should fail in two instances in this patch.

We must be misunderstanding each other. This patch doesn't change the error messages presented to the user at compilation time.

The purpose of the tests in this patch are to show how it is trivially easy to incorrectly use the LLVM C API such that diagnostics "errors" do not halt compilation.

The purpose of the non-test changes in this patch are to cause the resulting program to fail to verify. Before this patch the resulting program would appear to be valid, but violate the user's intent at runtime.

We must be misunderstanding each other. This patch doesn't change the error messages presented to the user at compilation time.

The purpose of the tests in this patch are to show how it is trivially easy to incorrectly use the LLVM C API such that diagnostics "errors" do not halt compilation.

The purpose of the non-test changes in this patch are to cause the resulting program to fail to verify. Before this patch the resulting program would appear to be valid, but violate the user's intent at runtime.

Okay, I did a little more study by examining other backends by searching key word 'diagnose'.

For example, I found two cases for X86:

X86/X86InsertPrefetch.cpp:    Ctx.diagnose(DiagnosticInfoSampleProfile(Filename, Msg,
X86/X86ISelLoweringCall.cpp:  DAG.getContext()->diagnose(

For the second case (X86ISelLoweringCall.cpp), they added after diagnose

VA.convertToReg(X86::FP0); // Set reg to FP0, avoid hitting asserts.

to avoid hitting asserts. This is similar to BPF one to add InVals.push_back() to avoid asserts later.
For X86InsertPrefetch.cpp, they simply just return to the caller with error.

For RISCV,

RISCV/RISCVISelLowering.cpp:    MF.getFunction().getContext().diagnose(DiagnosticInfoUnsupported{
RISCV/RISCVISelLowering.cpp:        MF.getFunction().getContext().diagnose(DiagnosticInfoUnsupported{
RISCV/RISCVISelLowering.cpp:        MF.getFunction().getContext().diagnose(DiagnosticInfoUnsupported{
RISCV/RISCVISelLowering.cpp:    F.getContext().diagnose(DiagnosticInfoUnsupported{

it seems after diagnose() in error branch, they do not have any special processing and assuming everything is normal after diagnose.

AMGPU is another example

AMDGPU/SIISelLowering.cpp:    DAG.getContext()->diagnose(NoGraphicsHSA);
AMDGPU/SIISelLowering.cpp:    Ctx.diagnose(NoTrap);
AMDGPU/SIISelLowering.cpp:  DAG.getContext()->diagnose(InvalidAddrSpaceCast);
AMDGPU/SIISelLowering.cpp:  DAG.getContext()->diagnose(BadIntrin);
AMDGPU/SIISelLowering.cpp:  DAG.getContext()->diagnose(BadIntrin);
AMDGPU/SIISelLowering.cpp:      DAG.getContext()->diagnose(BadIntrin);
AMDGPU/SIISelLowering.cpp:      DAG.getContext()->diagnose(BadIntrin);
AMDGPU/SIInstrInfo.cpp:  C.diagnose(IllegalCopy);

It looks like for AMDGPU, it does use UNDEF:

...
DAG.getContext()->diagnose(NoGraphicsHSA);
return DAG.getEntryNode();
...
Ctx.diagnose(NoTrap);
return Chain;
...
DAG.getContext()->diagnose(InvalidAddrSpaceCast);
return DAG.getUNDEF(ASC->getValueType(0));
...
DAG.getContext()->diagnose(BadIntrin);  <=== 3 times
return DAG.getUNDEF(VT);
...
    if (!Subtarget->hasCompressedExport()) {
    DiagnosticInfoUnsupported BadIntrin(
        DAG.getMachineFunction().getFunction(),
        "intrinsic not supported on subtarget", DL.getDebugLoc());
    DAG.getContext()->diagnose(BadIntrin);   <=== no special handling after this
  } 
  SDValue Src0 = Op.getOperand(4);

So looks like the approach largely in three categories:

  • add change to avoid asserts later (bpf is doing InVals.push_back() due to the same reason)
  • do not need to anything speical (bpf has to add InVals.push_back() to avoid later assertion)
  • using UNDEF to proceed.

So I think using UNDEF is okay here for BPF. This does not affect clang but it may improve for other projects (rust or others) which use llvm APIs for bpf backend.

So please make the following changes to the diff:

  • remove bpf-unspported-emit-c. I think this is not needed. But please explain in the commit message how it is possible to have a custom diagnose handler which can avoid 'exit()' and proceed with compilation.
  • please describe why UNDEF is used instead of 'constant 0' based on analysis with the usage of other backends.
  • please describe with your custom diagnose handler, what kind of error message (or lack of error message) with and without this patch.
tamird updated this revision to Diff 550041.Aug 14 2023, 11:42 AM

Remove tests, rewrite commit message.

tamird edited the summary of this revision. (Show Details)Aug 14 2023, 11:43 AM

@yonghong-song done.

please describe with your custom diagnose handler, what kind of error message (or lack of error message) with and without this patch.

there is no change to diagnostics emitted to diagnostics handlers.

ast added a comment.Aug 17 2023, 6:05 PM

Sorry for the delay. I've been on PTO.
The change looks fine, but it's not clear to me what code will be generated with UNDEF.
Before the change it would be 'r0=0; exit' and now just 'exit' ?
Meaning that r0 will be left in whatever state that previous code did.
So instead of generating consistent return 0 the user would have to debug random return values ?
The chance of failing the verifier is slim. The functions are rarely completely empty.
r0 is likely used in the earlier code and the verifier won't complain that it's not initialized.
So the change contradicts the intent to remove the footgun.

Sorry for the delay. I've been on PTO.
The change looks fine, but it's not clear to me what code will be generated with UNDEF.
Before the change it would be 'r0=0; exit' and now just 'exit' ?
Meaning that r0 will be left in whatever state that previous code did.
So instead of generating consistent return 0 the user would have to debug random return values ?
The chance of failing the verifier is slim. The functions are rarely completely empty.
r0 is likely used in the earlier code and the verifier won't complain that it's not initialized.
So the change contradicts the intent to remove the footgun.

When I first wrote this patch, I tested it on real code and did see verifier errors. It's a deep stack to unwind to reconstruct the state of the world back then to be able to quote what those errors were and what the generated code looked like. I can do it, but I'd like us to converge on what the right thing to do here is.

It seems you now agree that there's a footgun here. I'm not married to the use of UNDEF, is there a better solution you can think of that would reliably produce errors? Would reconstructing the generated code and verifier errors from a month ago be useful?

Lastly: @yonghong-song suggested that the tests be removed. In light of your comments, should we restore the tests so that we can make claims about the behavior in the presence of more complicated functions?

ast added a comment.Aug 17 2023, 9:52 PM

Sorry for the delay. I've been on PTO.
The change looks fine, but it's not clear to me what code will be generated with UNDEF.
Before the change it would be 'r0=0; exit' and now just 'exit' ?
Meaning that r0 will be left in whatever state that previous code did.
So instead of generating consistent return 0 the user would have to debug random return values ?
The chance of failing the verifier is slim. The functions are rarely completely empty.
r0 is likely used in the earlier code and the verifier won't complain that it's not initialized.
So the change contradicts the intent to remove the footgun.

When I first wrote this patch, I tested it on real code and did see verifier errors. It's a deep stack to unwind to reconstruct the state of the world back then to be able to quote what those errors were and what the generated code looked like. I can do it, but I'd like us to converge on what the right thing to do here is.

It seems you now agree that there's a footgun here. I'm not married to the use of UNDEF, is there a better solution you can think of that would reliably produce errors? Would reconstructing the generated code and verifier errors from a month ago be useful?

I don't see a footgun tbh. Improper handling of diag is purely an aya issue.
But I agree with intent to generate something non-verifiable in case of error. We need to figure what it would be.
There is no trap instruction that we can generate.
We can generate 'ja -1', but it's worse than current return 0.

Lastly: @yonghong-song suggested that the tests be removed. In light of your comments, should we restore the tests so that we can make claims about the behavior in the presence of more complicated functions?

I didn't get the point of bpf-unsupported-emit.c. It just demonstrates that it can incorrectly ignore DS_Error. We cannot have such "test" in the repo, since it would encourage incorrect diag processing.
In general I would still like to have a test for every commit.
Right now I don't see how UNDEF helps here. It looks worse form debugging of compiled bpf code pov.
If we can come up with something better than return 0 we certainly need a test for that.

Sorry for the delay. I've been on PTO.
The change looks fine, but it's not clear to me what code will be generated with UNDEF.
Before the change it would be 'r0=0; exit' and now just 'exit' ?
Meaning that r0 will be left in whatever state that previous code did.
So instead of generating consistent return 0 the user would have to debug random return values ?
The chance of failing the verifier is slim. The functions are rarely completely empty.
r0 is likely used in the earlier code and the verifier won't complain that it's not initialized.
So the change contradicts the intent to remove the footgun.

When I first wrote this patch, I tested it on real code and did see verifier errors. It's a deep stack to unwind to reconstruct the state of the world back then to be able to quote what those errors were and what the generated code looked like. I can do it, but I'd like us to converge on what the right thing to do here is.

It seems you now agree that there's a footgun here. I'm not married to the use of UNDEF, is there a better solution you can think of that would reliably produce errors? Would reconstructing the generated code and verifier errors from a month ago be useful?

I don't see a footgun tbh. Improper handling of diag is purely an aya issue.
But I agree with intent to generate something non-verifiable in case of error. We need to figure what it would be.
There is no trap instruction that we can generate.
We can generate 'ja -1', but it's worse than current return 0.

Lastly: @yonghong-song suggested that the tests be removed. In light of your comments, should we restore the tests so that we can make claims about the behavior in the presence of more complicated functions?

I didn't get the point of bpf-unsupported-emit.c. It just demonstrates that it can incorrectly ignore DS_Error. We cannot have such "test" in the repo, since it would encourage incorrect diag processing.
In general I would still like to have a test for every commit.
Right now I don't see how UNDEF helps here. It looks worse form debugging of compiled bpf code pov.
If we can come up with something better than return 0 we certainly need a test for that.

How could we write such a test without bpf-unsupported-emit.c?

By the way, it is still possible to test this using e.g. llc or clang. As noted previously in this thread the default error handler for llc/clang is stateful, so it proceeds the code generation, produces assembly code and returns non-zero error code.

I checked how UNDEF is handled, short answer: it disappears and it could replace instructions that use it.
For example:

$ cat t.c
extern void bar(void);
int foo(int a, int b, int c, int d, int e, int f) {
  bar();
  return a + f;
}
$ clang -O2 --target=bpf -mllvm -print-after=verify,bpf-isel,processimpdefs -c t.c -o - | llvm-objdump -d -
*** IR Dump After Module Verifier (verify) ***
; Function Attrs: nounwind
define dso_local i32 @foo(i32 noundef %a, i32 noundef %b, i32 noundef %c, i32 noundef %d, i32 noundef %e, i32 noundef %f) local_unnamed_addr #0 {
entry:
  %call = tail call i32 @bar() #2
  %add = add nsw i32 %f, %a
  ret i32 %add
}
t.c:2:5: error: stack arguments are not supported
    2 | int foo(int a, int b, int c, int d, int e, int f) {
      |     ^
# *** IR Dump After BPF DAG->DAG Pattern Instruction Selection (bpf-isel) ***:
# Machine code for function foo: IsSSA, TracksLiveness

bb.0.entry:
  ADJCALLSTACKDOWN 0, 0, implicit-def dead $r11, implicit $r11
  JAL @bar, implicit-def $r0, implicit-def dead $r1, implicit-def dead $r2, implicit-def dead $r3, implicit-def dead $r4, implicit-def dead $r5, implicit $r11
  ADJCALLSTACKUP 0, 0, implicit-def dead $r11, implicit $r11
  %5:gpr = COPY $r0
  %6:gpr = IMPLICIT_DEF
  $r0 = COPY %6:gpr
  RET implicit $r0

# End machine code for function foo.

# *** IR Dump After Process Implicit Definitions (processimpdefs) ***:
# Machine code for function foo: IsSSA, TracksLiveness

bb.0.entry:
  ADJCALLSTACKDOWN 0, 0, implicit-def dead $r11, implicit $r11
  JAL @bar, implicit-def $r0, implicit-def dead $r1, implicit-def dead $r2, implicit-def dead $r3, implicit-def dead $r4, implicit-def dead $r5, implicit $r11
  ADJCALLSTACKUP 0, 0, implicit-def dead $r11, implicit $r11
  RET implicit undef $r0

# End machine code for function foo.

1 error generated.

0000000000000000 <foo>:
       0:	85 10 00 00 ff ff ff ff	call -0x1
       1:	95 00 00 00 00 00 00 00	exit

Note how add instruction disappeared, processimpdefs converted uses of IMPLICIT_DEF to undef, undef just peeks some physical register (implicitly defined r0 in this case). The final result looks confusing and will pass verification w/o issues. It looks like getUNDEF() would some times be confusing and sometimes lead to verifier errors (although one would need to track back these errors).

Thanks @eddyz87 for a detailed example. The following is my understanding.

  • In any case, there is a compilation error. The binary should not be used since it is invalid with large chance.
  • The continual optimization will be done with either existing code or this patch (Constant 0 -> UNDEF)
  • For existing code, for *extra* parameters, the compilation will proceed with those extra parameters as constant 0.
  • With this patch, for *extra* parameters, the compilation will proceed with those extra parameters as undefined.
  • 'constant 0' may simplify the code and largely preserve the code flow.
  • 'undefined' may also simplify the code as compiler may discard some codes involved with 'undefined' values.

I think either 'constant 0' or 'undefined' func proto argument should not impact subsequent compilation warnings.
The only possible difference is generated code and with 'constant 0' is surely more close to the original code.
Considering the buggy binary should not be really used by verifier, I think staying with existing 'constant 0'
seems more reasonable, unless a more stronger argument favors 'undef' approach.

Thanks @eddyz87 for a detailed example. The following is my understanding.

  • In any case, there is a compilation error. The binary should not be used since it is invalid with large chance.
  • The continual optimization will be done with either existing code or this patch (Constant 0 -> UNDEF)
  • For existing code, for *extra* parameters, the compilation will proceed with those extra parameters as constant 0.
  • With this patch, for *extra* parameters, the compilation will proceed with those extra parameters as undefined.
  • 'constant 0' may simplify the code and largely preserve the code flow.
  • 'undefined' may also simplify the code as compiler may discard some codes involved with 'undefined' values.

I think either 'constant 0' or 'undefined' func proto argument should not impact subsequent compilation warnings.
The only possible difference is generated code and with 'constant 0' is surely more close to the original code.
Considering the buggy binary should not be really used by verifier, I think staying with existing 'constant 0'
seems more reasonable, unless a more stronger argument favors 'undef' approach.

Perhaps we should draw inspiration from libbpf's behavior when it is unable to relocate: https://github.com/torvalds/linux/blob/d4ddefee5160dc477d0e30c9d7a10ce8861c3007/tools/lib/bpf/relo_core.c#L982

Rather than UNDEF or constant zero, we could use some marker values:

  • when the function reads arguments from the stack: 0xbad4265 "bad args"
  • when the function return is larger than a register: 0xbad2375 "bad rets"

WDYT?

@yonghong-song what do you think of that suggestion?

@yonghong-song what do you think of that suggestion?

I think existing error message should be okay. There is really no need to expose magic number like 0xbad4265.