Page MenuHomePhabricator

[OpaquePtr] Require matching signature in getCalledFunction()
AcceptedPublic

Authored by nikic on Jul 9 2021, 2:00 PM.

Details

Reviewers
aeubanks
Group Reviewers
Restricted Project
Summary

With opaque pointers, it's possible to directly call a function with a different signature, without an intermediate bitcast. However, lot's of code using getCalledFunction() reasonably assumes that the signatures match (which is always true without opaque pointers). Add an explicit check to that effect.

The test case is from D105313, where I ran into the problem, but on further investigation this also affects lots of other code, we just have little coverage with mismatching signatures. The change from D105313 is still desirable for other reasons, but this patch addresses (I believe) the root problem when it comes to opaque pointers.

Diff Detail

Unit TestsFailed

TimeTest
2,600 msx64 debian > libarcher.critical::critical.c
Script: -- : 'RUN: at line 15'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/critical.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/critical.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/critical.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/critical/Output/critical.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/critical/critical.c
3,040 msx64 debian > libarcher.races::critical-unrelated.c
Script: -- : 'RUN: at line 13'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/critical-unrelated.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/critical-unrelated.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/deflake.bash /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/critical-unrelated.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/critical-unrelated.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/critical-unrelated.c
3,050 msx64 debian > libarcher.races::lock-nested-unrelated.c
Script: -- : 'RUN: at line 13'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/lock-nested-unrelated.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/lock-nested-unrelated.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/deflake.bash /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/lock-nested-unrelated.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/lock-nested-unrelated.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/lock-nested-unrelated.c
3,190 msx64 debian > libarcher.races::lock-unrelated.c
Script: -- : 'RUN: at line 13'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/lock-unrelated.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/lock-unrelated.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/deflake.bash /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/lock-unrelated.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/lock-unrelated.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/lock-unrelated.c
2,990 msx64 debian > libarcher.races::parallel-simple.c
Script: -- : 'RUN: at line 13'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/parallel-simple.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/parallel-simple.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/deflake.bash /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/parallel-simple.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/parallel-simple.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/parallel-simple.c
View Full Test Results (17 Failed)

Event Timeline

nikic created this revision.Jul 9 2021, 2:00 PM
nikic requested review of this revision.Jul 9 2021, 2:00 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2021, 2:00 PM
aeubanks accepted this revision.Jul 9 2021, 2:26 PM
aeubanks added a subscriber: aeubanks.

this seems like a good compromise

This revision is now accepted and ready to land.Jul 9 2021, 2:26 PM

we should update the documentation to reflect this, and also perhaps make this UB and optimize it away to make it more obvious to a user that something is wrong

Probably a reasonable path forward - though I worry this is creating some subtle technical debt where callers will expect that getCalledFunction's result is null/non-null in the same places where the Function is a direct parameter. (but, yeah, no doubt there's already technical debt that assumes they're the same type - so it's a bit of a loss either way)

This is probably the lesser pain & we'll have to fix up whatever cases come up as it comes up.

I guess the only alternative might be to rename it (not suggesting doing anything in particular at call sites - just straight up find/replace) to "getExactCalledFunction" or something like that. But I don't have strong feelings.

we should update the documentation to reflect this and also perhaps make this UB and optimize it away to make it more obvious to a user that something is wrong

I understand that many function declarations that are different with typed pointers resolve to the same type with opaque pointers. But won't we still expect mismatched function types when calling variadic functions? (Are there other legitimate cases, I wonder?)

(Does removing code make UB obvious to users? If obviousness is the goal (not runtime performance) I wonder if something could be hooked up in UBSan instead? or a guaranteed crash?)

Probably a reasonable path forward - though I worry this is creating some subtle technical debt where callers will expect that getCalledFunction's result is null/non-null in the same places where the Function is a direct parameter. (but, yeah, no doubt there's already technical debt that assumes they're the same type - so it's a bit of a loss either way)

Scanning uses in lib/Analysis, a bunch of callers seem to trust the current header docs -- which say that getCalledFunction() returns the function unless it's an indirect call -- and continue as if bitcasts aren't a thing. It might be valuable (besides opaque pointer work) to update getCalledFunction() to see through the bitcasts even with typed pointers.

E.g., see this code in DevirtSCCRepeatedPass::run:

for (Instruction &I : instructions(N.getFunction()))
  if (auto *CB = dyn_cast<CallBase>(&I)) {
    if (CB->getCalledFunction()) {
      ++Count.Direct;
    } else {
      ++Count.Indirect;
      CallHandles.insert({CB, WeakTrackingVH(CB)});
    }
  }
nikic added a comment.Jul 9 2021, 3:16 PM

we should update the documentation to reflect this and also perhaps make this UB and optimize it away to make it more obvious to a user that something is wrong

I understand that many function declarations that are different with typed pointers resolve to the same type with opaque pointers. But won't we still expect mismatched function types when calling variadic functions? (Are there other legitimate cases, I wonder?)

Yeah, I'm not really sure whether this is always UB or what the exact rules would be. Aligning pointer types is clearly legal, but possibly any cast that is still compatible from an ABI perspective would be legal.

Probably a reasonable path forward - though I worry this is creating some subtle technical debt where callers will expect that getCalledFunction's result is null/non-null in the same places where the Function is a direct parameter. (but, yeah, no doubt there's already technical debt that assumes they're the same type - so it's a bit of a loss either way)

Scanning uses in lib/Analysis, a bunch of callers seem to trust the current header docs -- which say that getCalledFunction() returns the function unless it's an indirect call -- and continue as if bitcasts aren't a thing. It might be valuable (besides opaque pointer work) to update getCalledFunction() to see through the bitcasts even with typed pointers.

E.g., see this code in DevirtSCCRepeatedPass::run:

for (Instruction &I : instructions(N.getFunction()))
  if (auto *CB = dyn_cast<CallBase>(&I)) {
    if (CB->getCalledFunction()) {
      ++Count.Direct;
    } else {
      ++Count.Indirect;
      CallHandles.insert({CB, WeakTrackingVH(CB)});
    }
  }

I think that would be best served by a different function. This patch preserves the current behavior of getCalledFunction() with opaque pointers (no "bitcasts" allowed) -- we could add an additional function like getCalledFunctionStripCasts() that ignores bitcasts (or for opaque pointers, allows signature mismatch) and use that where profitable. Worth noting that we already have isIndirectCall(), which is not the same as getCalledFunction()!=nullptr, that one does consider bitcasts not to be indirect.

nikic updated this revision to Diff 357814.Jul 11 2021, 12:17 PM

Use dyn_cast_or_null to fix metadata printing test.

I think that would be best served by a different function. This patch preserves the current behavior of getCalledFunction() with opaque pointers (no "bitcasts" allowed) -- we could add an additional function like getCalledFunctionStripCasts() that ignores bitcasts (or for opaque pointers, allows signature mismatch) and use that where profitable. Worth noting that we already have isIndirectCall(), which is not the same as getCalledFunction()!=nullptr, that one does consider bitcasts not to be indirect.

I had a more through look at callers in llvm/. I think most callers would want to be updated to getCalledFunctionStripCasts(). There are many callers (especially in Analysis and Transforms) that use this as a shortcut to detect if the call is direct (existence of CallBase::isIndirectCall() notwithstanding). The only caller I could find that clearly knew that nullptr could mean something other than "not a direct call" was computeFunctionSummary in llvm/lib/Analysis/ModuleSummaryAnalysis.cpp (which then does does the strip itself).

There are a few callers that would be broken by adjusting the behaviour, mostly due to assuming that parameter attributes are valid. Maybe those callers could be moved to a new function, and the current function's behaviour changed to match most callers' expectations? (Even these probably want to strip pointer casts as long as the called function is compatible...)

I feel like this API is an opportunity to reduce the hard-to-reason-about optimization churn caused by flipping the ForceOpaquePtrs flag. With the current patch / status quo, --force-opaque-pointers will cause a bunch of Analysis/Transforms/etc. to learn about a bunch of direct calls they previously thought were indirect. Alternatively, stripping pointer casts in CallBase::getCalledFunction first would separate the direct-call-related churn from the opaque pointers feature.

What do others think?

nikic added a comment.Jul 12 2021, 2:24 PM

@dexonsmith While there are many calls of getCalledFunction() that could deal with bitcast stripping, there are also a significant number that rely on the current behavior for correctness. E.g. pretty much any code dealing with intrinsics assumes that the call matches the intrinsic signature. There are a few places that do things similar to the problematic ValueTracking usage, i.e. assume that the number of arguments on the function and call are correlated (FuncAttrs is another example that does this). Of course, it would be better if they didn't do that -- but then again, there are over 500 calls to getCalledFunction() in LLVM, and they all need to be audited for whether the new interpretation is safe or adjusted to make it safe. Failing to do so will result in assertion failures in an area where we have essentially zero test coverage.

Which is why I'm arguing that this should be a new method, which can be used after safety has been audited for individual callers.

I think that would be best served by a different function. This patch preserves the current behavior of getCalledFunction() with opaque pointers (no "bitcasts" allowed) -- we could add an additional function like getCalledFunctionStripCasts() that ignores bitcasts (or for opaque pointers, allows signature mismatch) and use that where profitable. Worth noting that we already have isIndirectCall(), which is not the same as getCalledFunction()!=nullptr, that one does consider bitcasts not to be indirect.

I had a more through look at callers in llvm/. I think most callers would want to be updated to getCalledFunctionStripCasts(). There are many callers (especially in Analysis and Transforms) that use this as a shortcut to detect if the call is direct (existence of CallBase::isIndirectCall() notwithstanding). The only caller I could find that clearly knew that nullptr could mean something other than "not a direct call" was computeFunctionSummary in llvm/lib/Analysis/ModuleSummaryAnalysis.cpp (which then does does the strip itself).

There are a few callers that would be broken by adjusting the behaviour, mostly due to assuming that parameter attributes are valid. Maybe those callers could be moved to a new function, and the current function's behaviour changed to match most callers' expectations? (Even these probably want to strip pointer casts as long as the called function is compatible...)

I feel like this API is an opportunity to reduce the hard-to-reason-about optimization churn caused by flipping the ForceOpaquePtrs flag. With the current patch / status quo, --force-opaque-pointers will cause a bunch of Analysis/Transforms/etc. to learn about a bunch of direct calls they previously thought were indirect. Alternatively, stripping pointer casts in CallBase::getCalledFunction first would separate the direct-call-related churn from the opaque pointers feature.

What do others think?

I'd generally agree with your thoughts there - especially if it means we can separate the direct-call-related churn from the opaque pointers feature.

@nikic, I think we're mostly on the same page about the current state; I agree with your point that a new name could be safer; IMO the existing name is the right one for the documented behaviour (the 500+ callers carve out a desire path), but there are ways to stage changes like that to minimize risk.

Regardless of how things are staged / named / etc., any thoughts on the right thing to do? IOW, it seems to me like the proposed audit would reduce differences between --force-opaque-pointers on vs. off; do you agree? enough of a reduction that it's worth doing the audit?

nikic added a comment.Jul 12 2021, 2:59 PM

@nikic, I think we're mostly on the same page about the current state; I agree with your point that a new name could be safer; IMO the existing name is the right one for the documented behaviour (the 500+ callers carve out a desire path), but there are ways to stage changes like that to minimize risk.

Regardless of how things are staged / named / etc., any thoughts on the right thing to do? IOW, it seems to me like the proposed audit would reduce differences between --force-opaque-pointers on vs. off; do you agree? enough of a reduction that it's worth doing the audit?

"Call to bitcasted function" is considered non-canonical in LLVM IR and InstCombine will try to move casts from the call to the arguments (see transformConstExprCastCall). I believe it should succeed for any cases where opaque pointers will make the signatures the same. Thanks to this, most passes will only see the canonicalized form that performs a direct call. For that reason, I wouldn't expect this to have a major impact during the opaque pointer switch, at least relative to other changes (much less bitcasts, more i8 GEPs, etc.)

intrinsics should always have matching arguments/signatures, we should perhaps have a verifier check for that?
is this going to return null for vararg function calls now?

perhaps we could see if getCalledFunction() now returns null where previously it wouldn't on anything in e.g. llvm-test-suite/spec/etc

nikic added a comment.Jul 14 2021, 1:47 PM

intrinsics should always have matching arguments/signatures, we should perhaps have a verifier check for that?

That sounds reasonable to me. Intrinsic calls are always required to be direct, so verifying the signature should be safe and will help prevent mistakes. Implemented in D106013.

is this going to return null for vararg function calls now?

perhaps we could see if getCalledFunction() now returns null where previously it wouldn't on anything in e.g. llvm-test-suite/spec/etc

Just to be sure I replaced the new check with an assertion, and as expected it doesn't fire on test-suite. If it did, that would imply broken IR (with typed pointers).

"Call to bitcasted function" is considered non-canonical in LLVM IR and InstCombine will try to move casts from the call to the arguments (see transformConstExprCastCall). I believe it should succeed for any cases where opaque pointers will make the signatures the same. Thanks to this, most passes will only see the canonicalized form that performs a direct call. For that reason, I wouldn't expect this to have a major impact during the opaque pointer switch, at least relative to other changes (much less bitcasts, more i8 GEPs, etc.)

Ah, thanks for explaining. I agree that makes it lower value to front-load.

intrinsics should always have matching arguments/signatures, we should perhaps have a verifier check for that?

That sounds reasonable to me. Intrinsic calls are always required to be direct, so verifying the signature should be safe and will help prevent mistakes. Implemented in D106013.

As per https://reviews.llvm.org/D106013#2878472, I wonder how hard it would be in practice to maintain this invariant for all functions (not just intrinsics), making this patch unnecessary. Besides an update to ValueMapper::remapInstruction to fix the inliner, what else is required?

@nikic reminded me that pointers-to-functions don't have bitcasts anymore (I was confused and thinking there were just fewer, from the collapse of pointer parameters). (Thanks for your patience while I worked through this!)

I still wonder if it'd be better to

  • rename this function first (maybe getCalledFunctionIfSameType() or similar) in a prep patch -- making its behaviour clear in the name, and making the logical holes in callers obvious
  • then land this patch to make it behave the same way for opaque pointers
  • ... but leave it for later/separate work to add back a getCalledFunction() that actually does what (most) callers expect and move callers over to it

WDYT?

(If others think it's better to just go forward with this patch as-is for now, I'm fine with that too.)

@dexonsmith I personally don't think it's worth the churn. We can split the users of getCalledFunction() into three categories: 1. Those that explicitly require that explicitly require a signature match (https://github.com/llvm/llvm-project/blob/0529e2e01888129b21becd1fe3a61d9cb07c6fcd/llvm/lib/Target/AMDGPU/AMDGPUFixFunctionBitcasts.cpp#L36 is the most pure example of that), 2. Those that don't care because InstCombine canonicalizes away bitcasts and 3. Those where it may be relevant for phase ordering reasons. I expect that group 2 is by far the largest, and group 3 is rather small, which means that there would be little practical benefit to doing this work.