This is an archive of the discontinued LLVM Phabricator instance.

Fix null pointer dereference in `ProfileSummaryInfo::getPSI()`
Needs ReviewPublic

Authored by fez on Jun 5 2020, 9:01 AM.

Details

Summary

This fixes a regression introduced by a cosmetic change
here: https://reviews.llvm.org/D54669

The cosmetic change in the link above introduces a scenario where
calling getPSI() may dereference a NULL pointer (namely if the PSI
member of ProfileSummaryInfo is NULL).
When compiling with clang-9 with -O2, calls to getPSI are inlined.
The compiler sees the nullptr dereference (which is undefined behavior)
and is entitled to remove any nullptr check.

This causes snippets like this to crash:

// The following call to getPSI is inlined, and it contains a pointer
// dereference.
auto *PSI = &Stuff.getPSI();

// This check is assumed to be always true, because if it was false
// the following line would summon undefined behavior.
// So the compiler emits the code for `if (true)`
if (PSI)
  PSI->call_any_method(); // This dereferences PSI which is nullptr

This commit changes the APIs of ProfileSummaryInfo to return naked
pointers instead of dereferencing a possibly null pointer.
In this way, the pointer (null or not) is never dereferenced, and the
compiler is not entitled to remove the if (PSI) check.

Diff Detail

Event Timeline

fez created this revision.Jun 5 2020, 9:01 AM
fez added a reviewer: vsk.Jun 5 2020, 9:03 AM

I added @vsk as a reviewer, because he's the one who accepted the patch set that introduced the code that causes the bug.

fez updated this revision to Diff 268871.Jun 5 2020, 10:27 AM

Updated the patch to build on master. The previous one did not apply cleanly to trunk.

fez edited the summary of this revision. (Show Details)Jun 5 2020, 10:28 AM
fez edited the summary of this revision. (Show Details)

When is a null PSI accessed? It's not clear to me how this happens (based on ProfileSummaryInfoWrapperPass::doInitialization).

Could you share a reduced regression test that reproduces the crash without this patch applied?

Can you revert D54669 and remove the pointer to reference change and recommit?

fez added a comment.Jun 5 2020, 4:43 PM
In D81269#2077071, @vsk wrote:

When is a null PSI accessed? It's not clear to me how this happens (based on ProfileSummaryInfoWrapperPass::doInitialization).

Could you share a reduced regression test that reproduces the crash without this patch applied?

At the moment, I can reliably reproduce it by compiling llvm itself with clang-9 -O2. In lib/Transforms/InstCombine/InstructionCombining.cpp the nullptr check on PSI gets removed. I checked in the compiled object file InstructionCombining.cpp.o and the nullptr check at line 3785 (PSI && PSI->hasProfileSummary()) is gone.

However, I cannot reproduce the crash directly inside llvm's codebase. I tracked down the offending code to a library we're using in our codebase. It was working until upgrading llvm. I have the source of the library and I have minimized the offending code to the following:

legacy::FunctionPassManager PM(MyModule);
PM.add(createInstructionCombiningPass());
PM.run(MyFunction);

In this case doInitialization is not called, so PSI is null.

On one hand, I agree that without calling PM.doInitialization() the API of the legacy::FunctionPassManager is being misused.
The contract of the API is violated, because without calling doInitialization one could end up using some uninitialized state.

On the other hand, the crash I got with the snipped above was surprising. I was debugging my code to figure out the problem, and I saw that getPSI was called in InstructionCombining.cpp, returning nullptr. After the call to getPSI I saw a null pointer check, so I assumed that the presence of PSI is optional, as the rest of the code seemed to suggest. So, I expected the nullptr check to fail, but inspecting stepping through it I jumped directly to the next instruction, and checking the underlying assembly I realized that the compiler had optimized away the check. Everything was even more confused by the fact that compiling llvm (not the snippet) with -O0 the bug went away, because the optimizer just left the nullptr check on PSI in its right place.
After all, if the new API after the cosmetic change returns a reference, why hasn't the call sites be updated to have references instead of pointers? And why are the returned values from getPSI even checked for null if they are actually references? You see what I mean?

So, it seems to me that there is some kind of ambiguity here. In llvm's codebase doInitialization is always properly called. Hence, whenever getPSI gets called, PSI is never null. This is good. And this is also the reason why I cannot reproduce the bug directly in llvm's codebase.

However nobody theoretically stops a user to do something similar to the small snippet above. And in that case what happens is that, before the cosmetic patch I referenced, the code would work just fine, simply skipping over the initialization of BFI (which is guarded by the nullptr check on PSI). Instead, after the cosmetic change I referenced, the code crashes. In principle, there's nothing wrong with the crash per se, because it happens when someone violates the contract (always call doInitialization before run). But the interaction of the cosmetic change with undefined behavior makes very hard for someone debugging the code to find out what the real root cause, because the optimizer gets in the way. It's not an API that is hard to misuse. And for a user may not be straightforward to figure out that a crash in runOnFunction depends on the fact that doInitialization was not called AND the compiler decided to optimize away the nullptr check.

I realize that strictly speaking this may not be a bug, because it gets triggered only when breaking the contract, but the fact that it depends on optimization levels makes it kinda surprising. So, I don't know where do you want to go from here. Do you think that this is worth discussing or I should just drop it?

fez added a comment.Jun 5 2020, 4:50 PM

Can you revert D54669 and remove the pointer to reference change and recommit?

Right now it does not revert cleanly.

And according to the additional discussion that I posted above it might not be necessary to revert the change after all, given that the crash is only triggered by someone misusing the API, forgetting to call doInitialize before run.

Anyway, I'll wait to hear what @vsk has to say, and if we decide to go on with the fix (or some variants of it), I will revert the D54669, erase the offending parts from the cosmetic change, and recommit it.

vsk added a comment.Jun 8 2020, 10:24 AM

@fez thanks for explaining. Changing the library to initialize passes properly seems like the way to go, otherwise we'd be advertising that the pass initialization API is in general optional to call (which it's not). Re: chasing down the UB, I realize that's not a great experience, but istm the way to fix that is to modernize getPSI usage to expect a reference and test with -fsanitize=null to qualify the change.

fez added a comment.Jun 8 2020, 9:19 PM

Ok then. I'll drop this patch. Is there a way to close the diff?

vsk added a comment.Jun 9 2020, 5:02 PM

Near the web UI for adding comments, you should be able to "Abandon Revision" from the "Add Action..." dropdown menu. If not, that's ok, the history here should be sufficient info for future readers.