This is an archive of the discontinued LLVM Phabricator instance.

When picking a row from an UnwindPlan on a mid-stack frame, decr pc to get within bounds of the CALL instruction
ClosedPublic

Authored by jasonmolenda on May 4 2022, 1:08 PM.

Details

Summary

This is addressing a bit of a big problem we're hitting with the way optimized code is getting generated for kernel binaries today; we have a central function that has a function call with one set of unwind rules, but the return address is a different block with really different unwind rules (maybe it's a noreturn call, I didn't read through the source tbh).

It's the usual problem with this kind of thing - we need to decrement the return pc before looking up the scope for unwind rules. The patch is straightforward, but I haven't come up with any way to test it short of "have a kernel binary and corefile" and I can't use either of those (no small part of it being that they're enormous). Not thrilled about this, but I haven't come up with a synthetic repo case for it.

I don't know who would be a good reviewer given how little I can provide here - I'll add Pavel and Greg, but don't feel obligated to give an opinion if you don't have time to look at the code. I would let this simmer for longer trying to come up with some kind of test case, but it happens to impact a code sequence that is hit by a lot of internal people here, so I need to get going with a fix.

Diff Detail

Event Timeline

jasonmolenda created this revision.May 4 2022, 1:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2022, 1:08 PM
jasonmolenda requested review of this revision.May 4 2022, 1:08 PM

Ah, I can't share the binary/corefile that led me to this issue but I can show what it looks like. I have a function with a noreturn call, it's a crashing scenario where (as always) this is the most important bit to get right, to show the crash,

    0xfffffff02058d2c4 <+3404>: mov    x1, x19
    0xfffffff02058d2c8 <+3408>: bl     -0xfdf3b49dc              ; panic_with_thread_kernel_state 
->  0xfffffff02058d2cc <+3412>: mov    x0, x19
    0xfffffff02058d2d0 <+3416>: mov    x1, x24

and the unwind plan rules for these instructions look like

row[115]: 3396: CFA=fp+16 => x5= <same> x8= <same> x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] x23=[CFA-56] x24=[CFA-64] x25=[CFA-72] x26=[CFA-80] x27=[CFA-88] x28=[CFA-96] fp=[CFA-16] lr=[CFA-8] 
row[116]: 3412: CFA=fp+16 => x19= <same> x20= <same> x21= <same> x22= <same> x23= <same> x24= <same> x25= <same> x26= <same> x27= <same> x28= <same> fp= <same> lr= <same> 
row[117]: 3428: CFA=fp+16 => x8=[CFA-128] x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] x23=[CFA-56] x24=[CFA-64] x25=[CFA-72] x26=[CFA-80] x27=[CFA-88] x28=[CFA-96] fp=[CFA-16] lr=[CFA-8]

We're under the BL call here, and if we use

row[115]: 3396: CFA=fp+16 => x5= <same> x8= <same> x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] x23=[CFA-56] x24=[CFA-64] x25=[CFA-72] x26=[CFA-80] x27=[CFA-88] x28=[CFA-96] fp=[CFA-16] lr=[CFA-8]

we'll walk the stack correctly. But instead we're using the return address which has an unwind rule

row[116]: 3412: CFA=fp+16 => x19= <same> x20= <same> x21= <same> x22= <same> x23= <same> x24= <same> x25= <same> x26= <same> x27= <same> x28= <same> fp= <same> lr= <same>

(which is honestly real suspect in its own right and I'll be looking at what codepath lldb was following when it came up with that, it's a complex function and I don't have an opinion at first blush)

But the problem is obvious when you see this - the unwind state after a noreturn could be any old thing, and we can't use it reliably.

clayborg requested changes to this revision.May 5 2022, 11:44 AM

Inline comments point out where can can get rid of the second parameter to IsUnwindPlanValidForCurrentPC() as no one uses it anymore.

lldb/source/Target/RegisterContextUnwind.cpp
519–522

This can easily just be:

m_behaves_like_zeroth_frame = !decr_pc_and_recompute_addr_range;
642–643

remove "valid_offset" variable and also remove it from the second argument to IsUnwindPlanValidForCurrentPC as it isn't used anywhere now and is just dead code.

This revision now requires changes to proceed.May 5 2022, 11:44 AM
jasonmolenda added inline comments.May 5 2022, 1:34 PM
lldb/source/Target/RegisterContextUnwind.cpp
642–643

hm, this merits consideration more widely; I looked at all callers to IsUnwindPlanValidForCurrentPC and none of them use the valid_offset that it provides; they are all merely checking that the offset is covered by the UnwindPlan's byte size. I suspect none of these are actually necessary; we picked the unwind plan based on symbol name so you'd need a "symbol" that has a large byte size, but an UnwindPlan that was sourced from some input that limited the byte size range covered. idk, it might be possible tho, especially in a stripped binary.

I'll change all of the callers to not use the returned valid_offset for now.

clayborg added inline comments.May 5 2022, 1:51 PM
lldb/source/Target/RegisterContextUnwind.cpp
642–643

If it is actually being used somewhere, then that is fine. My search of the code showed 3 locations, and this location was the only one that was using it, but now it isn't. Are there other locations I missed?

Update to address Greg's feedback.

jasonmolenda added inline comments.May 5 2022, 1:55 PM
lldb/source/Target/RegisterContextUnwind.cpp
642–643

I wrote that poorly. I meant to say that no one was using the valid_offset returned value, so I would change the method to not return it. I uploaded an updated patch showing what I mean here.

Accidentally included an unrelated change in the last update.

This revision was not accepted when it landed; it landed in state Needs Review.May 5 2022, 2:18 PM
This revision was automatically updated to reflect the committed changes.
clayborg added inline comments.May 5 2022, 2:34 PM
lldb/source/Target/RegisterContextUnwind.cpp
90

Don't we want to subtract 1 here? Otherwise this function might return true incorrectly if we actually need to subtract at least 1 from the PC

jasonmolenda added inline comments.May 5 2022, 2:38 PM
lldb/source/Target/RegisterContextUnwind.cpp
90

honestly this method is a bit suspect, when we look at it today. It's first "is $pc within ByteSize", then "ok is pc-1 within ByteSize", with a bonus check if somehow we have a negative offset. Maybe it became suspect with revision over time. I don't think it's wrong, but it's not written with much clarity and I should look into how it's changed over time to see if there was a cleaner vision of its goals originally...