This is an archive of the discontinued LLVM Phabricator instance.

[Function Specialisation] Fix use after free
ClosedPublic

Authored by labrinea on Jan 31 2022, 2:52 AM.

Details

Summary

This is a fix for a use-after-free found by the address sanitizer when compiling GCC: https://github.com/llvm/llvm-project/issues/52821

The Function Specialization pass may remove instructions, cached inside the PredicateBase class, which are later being dereferenced from the SCCPInstVisitor class. To prevent the dangling references I am lazily deleting the dead instructions after the Solver has run.

Diff Detail

Event Timeline

labrinea created this revision.Jan 31 2022, 2:52 AM
labrinea requested review of this revision.Jan 31 2022, 2:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 31 2022, 2:52 AM
nikic edited reviewers, added: fhahn, nikic; removed: llvm-commits.Jan 31 2022, 3:45 AM
nikic added a subscriber: nikic.

Can you please explain in more detail where exactly an instruction referenced by PredicateInfo gets removed, and why other passes using PredicateInfo (IPSCCP and NewGVN) do not encounter this issue?

Can you please explain in more detail where exactly an instruction referenced by PredicateInfo gets removed, and why other passes using PredicateInfo (IPSCCP and NewGVN) do not encounter this issue?

Hi. If you follow the github link there's a dump from the address sanitizer and some explanation on the description of the issue. The instruction is removed by the function specialization pass inside tryToReplaceWithConstant after all its uses have been replaced. As far as I understand, the dangling pointer is then being dereferenced inside PredicateBase::getConstraint() (the function I've patched), which is being called from SCCPInstVisitor::handleCallResult. Both FunctionSpecialization and IPSCCP are using the SCCPSolver, which in turn is using the SCCPInstVisitor. The bug was exposed with ASAN when compilig GCC, so if the NewGVN is using the same machinery as the other passes, then it is potentially affected. I made a bootstrap build of clang with ASAN and FunctionSpecialization enabled and it succeeded. Also the reproducer is not triggering the use-after-free anymore.

fhahn added a comment.Jan 31 2022, 9:25 AM

Can you please explain in more detail where exactly an instruction referenced by PredicateInfo gets removed, and why other passes using PredicateInfo (IPSCCP and NewGVN) do not encounter this issue?

Hi. If you follow the github link there's a dump from the address sanitizer and some explanation on the description of the issue. The instruction is removed by the function specialization pass inside tryToReplaceWithConstant after all its uses have been replaced. As far as I understand, the dangling pointer is then being dereferenced inside PredicateBase::getConstraint() (the function I've patched), which is being called from SCCPInstVisitor::handleCallResult. Both FunctionSpecialization and IPSCCP are using the SCCPSolver, which in turn is using the SCCPInstVisitor. The bug was exposed with ASAN when compilig GCC, so if the NewGVN is using the same machinery as the other passes, then it is potentially affected. I made a bootstrap build of clang with ASAN and FunctionSpecialization enabled and it succeeded. Also the reproducer is not triggering the use-after-free anymore.

It sounds like the real problem may be that FunctionSpecialization removes instructions before the solver is finished and this lead to issues if the removed instructions has an associated PredicateInfo. WeakVH seems like a very heavy hammer to fix the underlying issue. Could FunctionSpecialization just delay removing instructions until all solving is done? If that's not possible, we may have to add a way to remove stale predicate info entries.

labrinea updated this revision to Diff 404727.Jan 31 2022, 2:21 PM
labrinea edited the summary of this revision. (Show Details)

As suggested by Florian, instead of using a WeakVH I am lazily removing the replaced instructions after the Solver has run. None of the existing tests actually covers this code path I am afraid. Examining the debug output of the reproducer I found two PhiNode instructions being replaced with null.

Agreed, lazily removing the instructions in this way is what I was expecting.

But this is a bit odd, a.k.a. not what I was expecting:

None of the existing tests actually covers this code path I am afraid.

Do you mean this:

if (I->isSafeToRemove()) {
   I->eraseFromParent();
   Solver.removeLatticeValueFor(I);
 }

Only triggers in the reproducer attached to the bugreport and not in the regression tests? I think we should first try hard to get a test for that, which I hope is doable. Otherwise, we might as well remove this code and rely on later clean up passes to delete dead code which I think is a valid strategy and alternative if it makes things really complicated here. But again, better would be to see if we can trigger this and a test for it.

labrinea updated this revision to Diff 404871.Feb 1 2022, 3:20 AM

Added some debug output:

  • when replacing a value
  • when removing dead instructions

Agreed, lazily removing the instructions in this way is what I was expecting.

But this is a bit odd, a.k.a. not what I was expecting:

None of the existing tests actually covers this code path I am afraid.

Do you mean this:

if (I->isSafeToRemove()) {
   I->eraseFromParent();
   Solver.removeLatticeValueFor(I);
 }

Only triggers in the reproducer attached to the bugreport and not in the regression tests? I think we should first try hard to get a test for that, which I hope is doable. Otherwise, we might as well remove this code and rely on later clean up passes to delete dead code which I think is a valid strategy and alternative if it makes things really complicated here. But again, better would be to see if we can trigger this and a test for it.

I was wrong when I said we are not testing the removal of dead instructions. I was looking at the tests under the original commit D93838, but we have actually added more tests since. For example function-specialization-constant-expression.ll is exercising this code path.

I was wrong when I said we are not testing the removal of dead instructions. I was looking at the tests under the original commit D93838, but we have actually added more tests since. For example function-specialization-constant-expression.ll is exercising this code path.

Ok, very good, that's what I thought.

One more question about tests: does the sanitzer failure also happens/happened on these existing tests? If so, I think we are done. If not, we are possibly missing something and in this case, do we need to add a sanitizer test? I have never looked that, so don't even know if there are sanitizers tests that run opt.

fhahn added a comment.Feb 1 2022, 6:00 AM

I was wrong when I said we are not testing the removal of dead instructions. I was looking at the tests under the original commit D93838, but we have actually added more tests since. For example function-specialization-constant-expression.ll is exercising this code path.

Ok, very good, that's what I thought.

One more question about tests: does the sanitzer failure also happens/happened on these existing tests? If so, I think we are done. If not, we are possibly missing something and in this case, do we need to add a sanitizer test? I have never looked that, so don't even know if there are sanitizers tests that run opt.

To expose the issue, we might have to add a test where a compare that is used in a predicate gets simplified to a constant and removed and the solver needs to run again on the function.

ormris removed a subscriber: ormris.Feb 1 2022, 9:41 AM
labrinea updated this revision to Diff 405234.Feb 2 2022, 5:51 AM
labrinea edited the summary of this revision. (Show Details)

I have heavily modified the original reproducer, but the problem is that we don't have buildbots running the llvm regression tests with asan. I can only see two builders for libc with asan here https://lab.llvm.org/buildbot/#/builders. Do you think there's still value in adding this testcase?

fhahn added a comment.Feb 2 2022, 5:55 AM

I have heavily modified the original reproducer, but the problem is that we don't have buildbots running the llvm regression tests with asan. I can only see two builders for libc with asan here https://lab.llvm.org/buildbot/#/builders. Do you think there's still value in adding this testcase?

There should be multiple bots that run with various sanitizers, e.g. https://lab.llvm.org/buildbot/#/builders/168

SjoerdMeijer accepted this revision.Feb 2 2022, 6:04 AM

I have heavily modified the original reproducer, but the problem is that we don't have buildbots running the llvm regression tests with asan. I can only see two builders for libc with asan here https://lab.llvm.org/buildbot/#/builders. Do you think there's still value in adding this testcase?

There should be multiple bots that run with various sanitizers, e.g. https://lab.llvm.org/buildbot/#/builders/168

In that case, let's do it; we have got a fix that looks reasonable and we have a test case. Before committing, the test case could do with a comment and checking a bit more though. Let's at least check the statements that the solver is changing and/or perhaps check for absence of the statements that are deleted (if that is still the case here).

This revision is now accepted and ready to land.Feb 2 2022, 6:04 AM

Ah, right, I didn't notice there was another builder page. I'll generate some check lines and commit. Thanks both :)

This revision was landed with ongoing or failed builds.Feb 2 2022, 8:32 AM
This revision was automatically updated to reflect the committed changes.