This is an archive of the discontinued LLVM Phabricator instance.

Clarify how watchpoint description in stop packets work, fix AArch64 unintended behavior
ClosedPublic

Authored by jasonmolenda on Apr 7 2023, 3:17 PM.

Details

Summary

Orig posted in https://reviews.llvm.org/D147674 as part of a debugserver patch, creating two phabs to track those separate patches so it's easier to review.

This patch fixes a problem with watchpoints on AArch targets using lldb-server (and soon debugserver), where a large write that begins before a watched region, and extends into the watched region, is reported as a trap address before the watched memory range. I can show this on an AArch64 Ubuntu install:

{
   uint8_t buf[8];
   uint64_t *u64_p = (uint64_t *) buf;
   *u64_p = 5; // break here
   (*u64_p)++;

At the break here line, if and do watch set variable buf[2] and c, the packets from lldb server look like

<  21> send packet: $Z2,fffffffff36a,1#75
<   6> read packet: $OK#9a
<   5> send packet: $c#63

<838> read packet: $T05thread:493;name:a.out;  [...] ;reason:watchpoint;description:323831343734393736373037343334203320323831343734393736373037343332;#db

The description is 281474976707434 3 281474976707432 aka 0x0000fffffffff36a 3 0x0000fffffffff368. The first address is an address contained within the hit watchpoint. The second number is the hardware register index. The third number is the actual address that was accessed,

(lldb)  v &buf u64_p
(uint8_t (*)[8]) &buf = 0x0000fffffffff368
(uint64_t *) u64_p = 0x0000fffffffff368
(lldb)

Right now, lldb has a rule that "for MIPS an ARM, if the third number is NOT contained within any watchpoint region, silently continue past this watchpoint". From the user's perspective, lldb misses the watchpoint modification. We see this a lot on macOS with bzero()/memset() when you're watching an element of an object and it is overwritten.

For MIPS this behavior is correct because any access to an 8B granule triggers a watchpoint in that granule. If I watch 4 bytes at 0x1004, and someone accesses 1 byte at 0x1002, a watchpoint will trigger. Jaydeep Patil in 2015 ( https://reviews.llvm.org/D11672 ) added this third argument and the "silently step over it and don't tell the user" behavior; that patch decodes the instruction that caused the trap to determine the actual address that was accessed.

However, the intended use/meaning of this third field was very confusing (to me), and in 2016 Omair ( https://reviews.llvm.org/D21516 ) enabled the same behavior on ARM targets and reported the actual accessed address. I'm not exactly sure what issue motivated this, it may have simply been a misunderstanding of how StopInfo decodes and uses these three addresses. If nothing is done on ARM targets and an access outside a watched region, lldb stops execution and doesn't know how to disable the correct watchpoint & advance execution, my guess is that he was addressing that issue.

As the reason:watchpoint's description value has grown from "wp addr + hw reg index" to "wp addr + hw reg index + actual trap address", with implicit behavior on MIPS that an actual trap address silently continues if it's outside the range of a watchpoint, this is not the best. We should stop using description and add 2-4 keys in the stop info packet, e.g. wp-address:, wp-reg-index:, wp-hit-address:, silently-continue: that can be provided as needed. We can remove the "on MIPS do this" behavior from generic lldb -- the stub is probably in the best position to know if this is a false watchpoint tigger that should be ignored, and it can tell lldb. Looking at the latest ARM ARM A-Profile docs, this seems like a situation that can exist in some modes as well, we may need to flag this from AArch64 debugserver/lldb-server in the future.

We don't currently have MIPS supported in lldb - it was removed c. 2021 in anticipation of someone starting to actively maintain it again. But this "silently continue" behavior may be needed for AArch64, and it may be needed again for MIPS if that target is revived.

Diff Detail

Event Timeline

jasonmolenda created this revision.Apr 7 2023, 3:17 PM
jasonmolenda requested review of this revision.Apr 7 2023, 3:17 PM

I agree that silent continue was wrong and should be fixed. I tried for to remember what I was trying to do when I wrote that patch ... still dont remember much but digged out some information that may be useful for this patch review.

We had a bunch of funny behaving hardware mostly Nexus phones with different types of watchpoint behavior being implement by every vendor.

From our local record i found that stp issue was never fixed. Some vendor machines reported correct hit_address while some didnt. In LLVM we do have bug report for another of these issues in one of the cases where STP instruction can trigger multiple watchpoints located side by side. https://bugs.llvm.org/show_bug.cgi?id=30758

On Linux ptrace is responsible for reporting a watchpoint hit address and also responsible for setting/unsetting watchpoints. In case of Arm64 ptrace while reporting watchpoints performs some heuristic based calculations to exactly cater for the case you have mentioned where access reports a address out of range. See watchpoint_handler code here:https://elixir.bootlin.com/linux/latest/source/arch/arm64/kernel/hw_breakpoint.c#L754

And this comment copied from same file :
/*

Arm64 hardware does not always report a watchpoint hit address that matches
one of the watchpoints set. It can also report an address "near" the
watchpoint if a single instruction access both watched and unwatched
addresses. There is no straight-forward way, short of disassembling the
offending instruction, to map that address back to the watchpoint. This
function computes the distance of the memory access from the watchpoint as a
heuristic for the likelihood that a given access triggered the watchpoint. *
See Section D2.10.5 "Determining the memory location that caused a Watchpoint
exception" of ARMv8 Architecture Reference Manual for details. *
The function returns the distance of the address from the bytes watched by
the watchpoint. In case of an exact match, it returns 0. */

Thanks Omair! Yeah we have the same problem on Darwin - AArch64 doesn't require that the address we get on a watchpoint trap in the FAR register point within the address range we are watching. I never fixed this on Darwin so I know the failure mode - lldb says "there was a watchpoint hit, but I couldn't possibly figure out which watchpoint it is, I will stop execution now" and the user has to figure out that they need to manually disable the watchpoint, instruction step, and re-insert the watchpoint. I'm sure you were looking at the same issue and fixed it by using the MIPS silently-skip codepath in 2016 by accident. I misunderstood what those three fields in the description were doing for a few days before I finally figured it out myself, I'm not at all surprised other people misunderstood it too.

It looks like some future AArch64 cpus may give us a watchpoint number in the ESR, instead of an address that was accessed, that will be a nice improvement over having heuristics to figure out which watchpoint was likely tripped.

Thanks for looking into all this, it makes my head hurt thinking about it :)

It will likely take us (Linaro) a while to do the lldb-server changes, but I do want to see this fixed eventually. We have some work upcoming for memcopy operations that will need special care for watchpoints, so that is a good excuse to do it.

lldb/docs/lldb-gdb-remote.txt
1609

What does this "v." mean? Is it short for "see here " or some kind of citation?

lldb/source/Plugins/Process/Utility/NativeRegisterContextDBReg_arm64.h
62

Worth noting why here? (minimum watch size and alignment I guess)

64

Also does this need to use /// for the doxygen \a thing to work?

(I've no idea, Jonas is forever telling me to use /// though :) )

lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
2232–2235

Is this a note to self or is this code now actually doing that?

lldb/source/Target/StopInfo.cpp
1019

Please document this here as well.

I am not 100% whether silently skip true means always silently skip this watchpoint, or silently skip out of range hits attributed to this watchpoint.

jasonmolenda added inline comments.Apr 11 2023, 3:37 PM
lldb/docs/lldb-gdb-remote.txt
1609

Yeah, maybe not widely known. "v." short for "vide" latin, see. cf https://latinlexicon.org/LNS_abbreviations.php ;)

lldb/source/Plugins/Process/Utility/NativeRegisterContextDBReg_arm64.h
62

I was taking notes as I read the source to understand how arm linux behaved, and left those notes in the header, maybe I should just remove them. The DREG struct has three addresses in it, and it wasn't really clear to me what was stored in them.

lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
2232–2235

Sorry that wasn't clear. We can receive watchpoint notifications one of three ways: watch/awatch/rwatch (standard gdb RSP), reason:watchpoint with a description asciihex string of 1-to-3 numbers, and on Darwin systems, a mach exception. ProcessGDBRemote rewrites watch/awatch/rwatch to reason:watchpoint + description with only the one address, and that's the only thing decoded into a StopInfo object.

lldb/source/Target/StopInfo.cpp
1019

Good point.

Update the patch to incorporate David's first feedback. I know this is a bit of a tricky patch to review because it is 95% comments/documentation/renaming of variables, followed by the one key change of removing core >= ArchSpec::eCore_arm_generic && core <= ArchSpec::eCore_arm_aarch64) which means ARM systems won't follow this MIPS false-watchpoint-hit behavior any more, for a watchpoint trap address outside the address range being watched (but DOES access the watched range on ARM).

omjavaid accepted this revision.Apr 12 2023, 12:45 AM

Looks good to me as well

This revision is now accepted and ready to land.Apr 12 2023, 12:45 AM

FWIW to show the motivation for this patch (which, again comes down to removing core >= ArchSpec::eCore_arm_generic && core <= ArchSpec::eCore_arm_aarch64 when creating the StopInfo from the watchpoint description), the test case I wrote in https://reviews.llvm.org/D147820 shows the problem where I adopt the lldb-server mechanism of watchpoint reporting on Darwin systems and I hit this problem:

#include <stdint.h>
#include <stdio.h>
int main() {
  union {
	  uint8_t buf[8];
	  uint64_t val;
  } a;
  a.val = 0;  // break here
  for (int i = 0; i < 5; i++) {
    a.val = i;
    printf ("a.val is %lu\n", a.val);
  }
}

(lldb) br s -p break
(lldb) r
Process 2182 stopped
-> 8   	  a.val = 0;  // break here

(lldb) w s v a.buf[2]
Watchpoint created: Watchpoint 1: addr = 0xfffffffff3a2 size = 1 state = enabled type = w
    declare @ '/home/jason/a.c:7'
    watchpoint spec = 'a.buf[2]'
    new value: '\xff'
(lldb) c
Process 2182 resuming
a.val is 0
a.val is 1
a.val is 2
a.val is 3
a.val is 4
Process 2182 exited with status = 0 (0x00000000) 
(lldb)

These watchpoint hits are silently skipped over on AArch64. If you turn on the packet log, you'll see the watchpoint hits reported by debugserver, but lldb silently skips them. This is correct behavior on MIPS; ARM targets using lldb-server should not have opted into this behavior. When we set aside the comments and variable renaming, that's what we're fixing with this patch.