This is an archive of the discontinued LLVM Phabricator instance.

StackColoring: better handling of statically unreachable code
ClosedPublic

Authored by thanm on Apr 30 2018, 7:01 AM.

Details

Summary

Avoid assert/crash during liveness calculation in situations where the
incoming machine function has statically unreachable BBs.

Fixes PR37130.

Diff Detail

Event Timeline

thanm created this revision.Apr 30 2018, 7:01 AM
thanm added a comment.Apr 30 2018, 7:07 AM

I would welcome suggestions on how to write a regression test / lit test for this issue. The tricky part seems to be insuring that a statically unreachable BB survives all the way through the llc pipeline to make it to stack coloring.

@thanm You can take the reproducer in the bug report, produce LLVM-IR from it (-S -emit-llvm), then run it through llc with "-stop-before stack-slot-coloring" to get MIR output. The test file would then be invoked with the same options as before except you'd add '-start-before stack-slot-coloring'.

test/CodeGen/X86/branch_instruction_and_target_split_perf_nops.mir is an example of such a test.

thanm added a comment.Apr 30 2018, 7:23 AM

@thanm You can take the reproducer in the bug report, produce LLVM-IR from it (-S -emit-llvm), then run it through llc with "-stop-before stack-slot-coloring" to get MIR output. The test file would then be invoked with the same options as before except you'd add '-start-before stack-slot-coloring'.

Thank you! I will look into doing that.

thanm added a comment.May 1 2018, 1:21 PM

@thanm You can take the reproducer in the bug report, produce LLVM-IR from it (-S -emit-llvm), then run it through llc with "-stop-before stack-slot-coloring" to get MIR output. The test file would then be invoked with the same options as before except you'd add '-start-before stack-slot-coloring'.

Thank you! I will look into doing that.

Well, I am not having much luck creating a runnable MIR test case using this recipe. I tried:

clang -S -emit-llvm ...
llc llvm-pr-37130.opt.ll -stop-before stack-coloring -o abc.mir

then edited the resulting file to introduce the statically unreachable block. When I run it using

llc ../llvm/test/CodeGen/X86/stack-coloring-unreachable-bb.mir -no-stack-coloring=false -start-before stack-coloring

I get a fatal error in the pass manager: "Unable to schedule 'Slot index numbering' required by 'Merge disjoint stack slots'".

I am probably making some sort of rookie mistake here somewhere, but I am not sure what.

thanm updated this revision to Diff 145056.May 3 2018, 11:08 AM

Add testcase.

thanm added a comment.May 3 2018, 11:10 AM

Ready for review at this point.

Friendly ping... let me know if I need to recruit additional reviews.

This looks pretty obviously right, i'd commit it.

sdardis added a comment.EditedMay 10 2018, 4:29 PM

Sorry for the delay, but this change looks ok, but the supplied test case is flaky and crashes sometimes on my machine:

Assertion failed: (!isa<Constant>(V) && "Can't get a constant or global slot with this!"), function getLocalSlot, file /Users/sdardis/dev/llvm/llvm/lib/IR/AsmWriter.cpp, line 1038.
Stack dump:
0. Program arguments: /Users/sdardis/dev/llvm/llvmbuild3/bin/llc -mcpu=corei7 -no-stack-coloring=false -stop-after stack-coloring -o /dev/null /Users/sdardis/dev/llvm/llvm/test/CodeGen/X86/stack-coloring-unreachable-bb.mir

  1. Running pass 'Function Pass Manager' on module '/Users/sdardis/dev/llvm/llvm/test/CodeGen/X86/stack-coloring-unreachable-bb.mir'.
  2. Running pass 'MIR Printing Pass' on function '@foo'

0 llc 0x0000000107cfdeff llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 63
1 llc 0x0000000107cfe3a9 PrintStackTraceSignalHandler(void*) + 25
2 llc 0x0000000107cfaf99 llvm::sys::RunSignalHandlers() + 425
3 llc 0x0000000107cfe629 SignalHandler(int) + 345
4 libsystem_platform.dylib 0x00007fff9ee7b52a _sigtramp + 26
5 libsystem_platform.dylib 0x00007fff63e0f5c8 _sigtramp + 3304669368
6 libsystem_c.dylib 0x00007fff9c3af6df abort + 129
7 libsystem_c.dylib 0x00007fff9c376dd8 basename + 0
8 llc 0x0000000106e07ff1 llvm::SlotTracker::getLocalSlot(llvm::Value const*) + 113
9 llc 0x0000000106e07f6e llvm::ModuleSlotTracker::getLocalSlot(llvm::Value const*) + 126
10 llc 0x0000000106b9bc2f llvm::MIPrinter::print(llvm::MachineBasicBlock const&) + 447
11 llc 0x0000000106b97d30 llvm::MIRPrinter::print(llvm::MachineFunction const&) + 944
12 llc 0x0000000106b9f12d llvm::printMIR(llvm::raw_ostream&, llvm::MachineFunction const&) + 45
13 llc 0x0000000106ba62fa (anonymous namespace)::MIRPrintingPass::runOnMachineFunction(llvm::MachineFunction&) + 298
14 llc 0x0000000106a1b2a6 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 454
15 llc 0x000000010708916d llvm::FPPassManager::runOnFunction(llvm::Function&) + 413
16 llc 0x00000001070894a5 llvm::FPPassManager::runOnModule(llvm::Module&) + 117
17 llc 0x000000010708a1fa (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) + 2010
18 llc 0x000000010708978b llvm::legacy::PassManagerImpl::run(llvm::Module&) + 347
19 llc 0x000000010708ae41 llvm::legacy::PassManager::run(llvm::Module&) + 33
20 llc 0x000000010474f991 compileModule(char**, llvm::LLVMContext&) + 19281
21 llc 0x000000010474ab3e main + 4062
22 libdyld.dylib 0x00007fff9b6d85ad start + 1
/Users/sdardis/dev/llvm/llvmbuild3/test/CodeGen/X86/Output/stack-coloring-unreachable-bb.mir.script: line 1: 84647 Abort trap: 6 /Users/sdardis/dev/llvm/llvmbuild3/bin/llc -mcpu=corei7 -no-stack-coloring=false -stop-after stack-coloring -o /dev/null /Users/sdardis/dev/llvm/llvm/test/CodeGen/X86/stack-coloring-unreachable-bb.mir

Digging in with valgrind shows several invalid reads and "Conditional jump or move depends on uninitialised value(s)".

Digging in with LLDB gives differing results (somewhat expected). Can you run valgrind on the run of llc for this test case and report your findings?

test/CodeGen/X86/stack-coloring-unreachable-bb.mir
1

Either rename this file to PR37310.mir or cite the PR number as a comment.

1

Add -mtriple=x86_64-unknown-linux-gnu to the command line here.

3

Explain that this is testing that the stack coloring pass handles statically unreachable blocks.

22–25

This can all be removed.

27

These "; Function Attrs.." lines can all be removed.

28

local_unnamed_addr can be removed.

69–72

As far as I can see, these can be removed as well, along with #<num>s in the function declarations.

73–78

This can be removed.

thanm added a comment.May 11 2018, 8:13 AM

Sorry for the delay, but this change looks ok, but the supplied test case is flaky and crashes sometimes on my machine:

Digging in with LLDB gives differing results (somewhat expected). Can you run valgrind on the run of llc for this test case and report your findings?

You are right, the testcase is indeed flaky. I left out the "-start-before stack-coloring" in the llc run rule -- as a result, what's happening is that the UnreachableBlockElimLegacyPass is running (prior to stack coloring), so it deletes the unreachable BB "unreachable:". The corresponding unreachable machine BB still exists, however, and it includes a reference to the BB, which becomes stale, leading to the crash.

If I add back in "-start-before stack-coloring" to the LLC command line (so as not preserve the BBs), I am back at my original problem of

Unable to schedule 'Slot index numbering' required by 'Merge disjoint stack slots'

for which I have no viable solution. I think what I will do is add a new knob to disable unreachable block elimination, then run llc as usual. Let me work on that.

test/CodeGen/X86/stack-coloring-unreachable-bb.mir
1

Will do.

3

Will do.

22–25

OK, will do. Regarding this and your other suggestions, is there a tool or programmatic way to do this? Or is this just an expected part of creating an MIR testcase? Please bear with me, I haven't written a lot of MIR tests.

28

Will do.

69–72

Will do.

73–78

Will do.

thanm updated this revision to Diff 146335.May 11 2018, 8:36 AM

Revised test case. New hidden option to turn off statically
unreachable block elimination.

I've tried the revised patch, but I am still seeing flakiness when running this on my darwin machine.

It appears CodeGenPrepare::eliminateMostlyEmptyBlock is leaving around stale information that is then read by the MIRPrintingPass.

I have a suggestion inline, and with that, the change to UnreachableBlockElim.cpp seems unnecessary. Can you also test on your side?

test/CodeGen/X86/PR37310.mir
1 ↗(On Diff #146335)

Can you add '-start-before isel' here?

test/CodeGen/X86/stack-coloring-unreachable-bb.mir
22–25

It's typical to reduce both LLVM IR and MIR to the smallest form that reproduces / tests the bug that a patch fixes. Unfortunately I don't believe we have a clang/llvm output mode which reduces output for test cases.

I've tried the revised patch, but I am still seeing flakiness when running this on my darwin machine.

OK, got it. Sounds like the test case still needs more work.

I am traveling this week and don't have access to a darwin machine, but I should be able to test that scenario when I get back.

It appears CodeGenPrepare::eliminateMostlyEmptyBlock is leaving around stale information that is then read by the MIRPrintingPass.

I have a suggestion inline, and with that, the change to UnreachableBlockElim.cpp seems unnecessary. Can you also test on your side?

Will do.

thanm updated this revision to Diff 147073.May 16 2018, 6:37 AM

Incorporate changes suggested by Simon:

  • remove option to disable statically unreachable block elim
  • modify MIR test to include "-start-before isel"
sdardis accepted this revision.May 16 2018, 8:01 AM
sdardis added a reviewer: sdardis.

LGTM. Testing this with valgrind on linux and darwin shows no issues now.

This revision is now accepted and ready to land.May 16 2018, 8:01 AM
thanm added a comment.May 16 2018, 8:59 AM

Thanks! Review comments and suggestions much appreciated. I will try to submit this later in the week.

This revision was automatically updated to reflect the committed changes.

Hi Simon,

Your suggested revision to the MIR testcase ("-start-before isel") doesn't seem to agree with some of the buildbots. Do you have any insight into why this might be the case?

I will go ahead and revert the change now, will try to iron out the test case issues later on.

Thanks, Than

Not immediately, can you forward me the links to some of the failed builds?

thakis added a subscriber: thakis.May 21 2018, 10:47 AM

Did you land the revert? I'm seeing the same failure running tests locally as on http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/4445/steps/ninja%20check%201/logs/FAIL%3A%20LLVM%3A%3APR37310.mir

If you haven't reverted yet, please do.

This was reverted by 332742. Sorry, I failed at syncing up my checkout.

Hi Than,

Sorry I'm not sure what to do in that case. Locally it runs fine on my machine, and I've tested it on another system and it's ok there. You're probably safe enough to re-land the change without the test case.

Thanks,
Simon

thanm added a comment.May 22 2018, 7:31 AM

Sorry I'm not sure what to do in that case. Locally it runs fine on my machine, and I've tested it on another system and it's ok there. You're probably safe enough to re-land the change without the test case.

I agree, given the nature of the change. I'll prepare another patch.

thanm added a comment.May 25 2018, 6:39 AM

I agree, given the nature of the change. I'll prepare another patch.

OK, I finally figured out the problem. It turns out that "isel" is not (as I had naively assumed) the instruction selection pass, it's a separate pass specific to AMDGPU (confusing name, IMHO). I changed the test to use "-start-before dwarfehprepare" (a real pass that is not target-specific) and it seems to work now. I will send another patch.