This is an archive of the discontinued LLVM Phabricator instance.

LLDB Arm Watchpoints: Use single hardware watchpoint slot to watch multiple bytes where possible
Abandoned · Public

Authored by omjavaid on Sep 15 2016, 6:41 AM.

Details

Reviewers
clayborg
labath
Summary

This patch enables LLDB to use a single hardware watchpoint slot to watch multiple bytes or half words.

For example:
instead of using 4 different hardware watchpoint slots to watch one byte each at 0x0040, 0x0041, 0x0042 and 0x0043, we are now able to use 1 hardware watchpoint slot to watch all 4 of these single-byte watchpoints. Similarly, two consecutive half-words can share a single slot.

Arm has 4 hardware watchpoint slots, and this patch optimizes the use of those slots wherever possible.
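
ARM implements this through the byte address select (BAS) field of the hardware watchpoint control register, which picks out individual bytes within a word-aligned watch address. Below is a minimal sketch of that idea, not the patch's actual code; the struct, the function name, and the assumed BAS position at bits [8:5] of the control value are illustrative:

#include <cstdint>

struct WatchSlot {
  uint32_t address = 0; // word-aligned watch address (DBGWVR)
  uint32_t control = 0; // control value (DBGWCR); bit 0 = enable
};

// Try to fold a 1- or 2-byte watch request into `slot`; returns false
// if the request cannot share it.
bool AddByteWatch(WatchSlot &slot, uint32_t addr, uint32_t size) {
  const uint32_t aligned = addr & ~3u;  // word containing the bytes
  if ((slot.control & 1) && slot.address != aligned)
    return false;                       // slot watches a different word
  if (addr + size > aligned + 4)
    return false;                       // request spans two words
  const uint32_t bas = ((1u << size) - 1) << (addr & 3);
  slot.address = aligned;
  slot.control |= (bas << 5) | 1;       // merge BAS bits, keep enabled
  return true;
}

Under this scheme, the four single-byte requests at 0x0040-0x0043 merge into one slot with all four BAS bits set.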

I have made the necessary changes to watchpointSizeTests to capture this change.

I am working on a similar patch for AArch64.

Diff Detail

Event Timeline

omjavaid updated this revision to Diff 71500. Sep 15 2016, 6:41 AM
omjavaid retitled this revision from to LLDB Arm Watchpoints: Use single hardware watchpoint slot to watch multiple bytes where possible.
omjavaid updated this object.
omjavaid added a reviewer: labath.
omjavaid added a subscriber: lldb-commits.
labath requested changes to this revision. Sep 15 2016, 7:47 AM
labath edited edge metadata.

I have some doubts about the validity of this patch. We should make sure those are cleared before putting this in.

packages/Python/lldbsuite/test/functionalities/watchpoint/watchpoint_size/TestWatchpointSizes.py
43 ↗(On Diff #71500)

It's not clear to me why you need to modify the existing test for this change. You are adding functionality, so all existing tests should pass as-is (which will also validate that your change did not introduce regressions).

154 ↗(On Diff #71500)

It looks like this test will test something completely different on arm than on other architectures. You would probably be better off writing a new test for this.

source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp
574

The logic here is extremely convoluted. Doesn't this code basically boil down to:

current_size = m_hwp_regs[wp_index].control & 1 ? GetWatchpointSize(wp_index) : 0;
new_size = llvm::NextPowerOf2(std::max(current_size, watch_mask));
// update the control value, write the debug registers...

Also watch_mask should probably be renamed to watch_size, as it doesn't appear to be a mask.

609

This looks a bit worrying.
What will happen after the following sequence of events:

  • client tells us to set a watchpoint at 0x1000
  • we set the watchpoint
  • client tells us to set a watchpoint at 0x1001
  • we extend the previous watchpoint to watch this address as well
  • client tells us to delete the watchpoint at 0x1000
  • ???

Will we remain watching the address 0x1001? I don't see how you will be able to do that without maintaining some info about the original watchpoints the client requested (and I have not seen that code). Please add a test for this.
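
For illustration, a hypothetical shape of that bookkeeping (not code from this patch): remember each client-requested watch separately, and recompute the slot's byte-select mask whenever one of them is removed.

#include <cstdint>
#include <vector>

struct ClientWatch { uint32_t addr; uint32_t size; };

struct SharedSlot {
  uint32_t aligned_addr;            // word-aligned hardware address
  std::vector<ClientWatch> watches; // original client requests in this word

  // Byte-select mask covering the watches that remain; zero means the
  // hardware slot can be freed.
  uint32_t RemainingByteSelect() const {
    uint32_t bas = 0;
    for (const ClientWatch &w : watches)
      bas |= ((1u << w.size) - 1) << (w.addr - aligned_addr);
    return bas;
  }
};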

This revision now requires changes to proceed. Sep 15 2016, 7:47 AM
clayborg requested changes to this revision. Sep 15 2016, 1:03 PM
clayborg edited edge metadata.

Great fix. Just fix the testing so that it isn't ARM specific. There shouldn't be any:

if self.getArchitecture() in ['arm']:
  do arm stuff
else:
  do non arm stuff

Also, we will need to be able to test: set a watch at 0x1000, then at 0x1001 (sharing a slot), clear 0x1000, make sure 0x1001 still works, etc.

comments inline.

packages/Python/lldbsuite/test/functionalities/watchpoint/watchpoint_size/TestWatchpointSizes.py
43 ↗(On Diff #71500)

As we keep adding these tests, our overall testing time increases. I just thought using the same test with some changes would cover the cases we want to test.

Anyway, we can write separate tests as well.

source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp
574

Seems legit. I'll update this in the next patch.

609

I just realized that I missed a crucial change in my last patch that we need to make all this work.

Let me get back with correction and desired test cases.

omjavaid updated this revision to Diff 71633. Sep 16 2016, 6:47 AM
omjavaid edited edge metadata.

I have added a new test case that tests the suggested scenario without changing any previous test cases.

Also, I have made sure we re-validate all watchpoints installed on thread resume, to make sure we have the latest values assigned to the hardware watchpoint registers.

This passes on ARM (Raspberry Pi 3, Samsung Chromebook). I have not yet tested on Android.

This will fail on targets which don't support multiple watchpoint slots.

Also, this should fail on AArch64, which I am currently working on.

zturner added inline comments.
packages/Python/lldbsuite/test/functionalities/watchpoint/multi_watchpoint_slots/main.c
24

What's up with all the double spaced source code? Is this intentional?

source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp
515–523

This block of code is a bit confusing to me. Is this equivalent to:

lldb::addr_t start = llvm::alignDown(addr, 4);
lldb::addr_t end = addr + size;
if (start == end || (end - start) > 4)
  return LLDB_INVALID_INDEX32;
labath added inline comments. Sep 18 2016, 1:14 PM
packages/Python/lldbsuite/test/functionalities/watchpoint/multi_watchpoint_slots/main.c
24

Indeed. The spaces seem superfluous.

source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp
515–523

I am not sure this is much clearer, especially as we will later need a separate variable for end-start anyway.

+1 for llvm::alignDown though.

571

Looks much better. Any reason for not using NextPowerOf2? Among other things, it is self-documenting, so you do not need the comment above it.

source/Plugins/Process/Linux/NativeThreadLinux.cpp
205

If you add this, then the comment below becomes obsolete.

Seems like a pretty elegant solution to the incremental watchpoint update problem. I am wondering whether we need to do it on every resume though. I think it should be enough to do it when a watchpoint gets deleted (NTL::RemoveWatchpoint). Also, we should throw out the implementation of NativeRegisterContextLinux_arm::ClearHardwareWatchpoint -- it's no longer necessary, and it's not even correct anymore.

Answers to comments. I will upload an updated patch after corrections and updates.

packages/Python/lldbsuite/test/functionalities/watchpoint/multi_watchpoint_slots/main.c
24

Agreed. Will be removed in next iteration.

source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp
515–523

There is a significant performance difference when we choose addr = addr & (~0x03); over llvm::alignDown.

It will eventually affect responsiveness if we keep increasing code size like this.

Even with -Os, alignDown is only squeezed down to 8-9 instructions. I haven't tried clang, though.

Instructions needed for addr = addr & (~0x03):

4008bd:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
4008c1:	48 83 e0 fc          	and    $0xfffffffffffffffc,%rax
4008c5:	48 89 45 e8          	mov    %rax,-0x18(%rbp)

Call penalty for llvm::alignDown
400918:	ba 00 00 00 00       	mov    $0x0,%edx
40091d:	48 89 ce             	mov    %rcx,%rsi
400920:	48 89 c7             	mov    %rax,%rdi
400923:	e8 ae 00 00 00       	callq  4009d6 <_Z9alignDownmmm>
400928:	49 89 c4             	mov    %rax,%r12
40092b:	48 8b 5d d8          	mov    -0x28(%rbp),%rbx

Disassembly for llvm::alignDown
00000000004009d6 <_Z9alignDownmmm>:

4009d6:	55                   	push   %rbp
4009d7:	48 89 e5             	mov    %rsp,%rbp
4009da:	48 89 7d f8          	mov    %rdi,-0x8(%rbp)
4009de:	48 89 75 f0          	mov    %rsi,-0x10(%rbp)
4009e2:	48 89 55 e8          	mov    %rdx,-0x18(%rbp)
4009e6:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
4009ea:	ba 00 00 00 00       	mov    $0x0,%edx
4009ef:	48 f7 75 f0          	divq   -0x10(%rbp)
4009f3:	48 89 55 e8          	mov    %rdx,-0x18(%rbp)
4009f7:	48 8b 45 f8          	mov    -0x8(%rbp),%rax
4009fb:	48 2b 45 e8          	sub    -0x18(%rbp),%rax
4009ff:	ba 00 00 00 00       	mov    $0x0,%edx
400a04:	48 f7 75 f0          	divq   -0x10(%rbp)
400a08:	48 0f af 45 f0       	imul   -0x10(%rbp),%rax
400a0d:	48 89 c2             	mov    %rax,%rdx
400a10:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
400a14:	48 01 d0             	add    %rdx,%rax
400a17:	5d                   	pop    %rbp
400a18:	c3                   	retq   
400a19:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)

Number of instructions generated for alignDown with gcc -Os

400892:	48 8b 6c 24 08       	mov    0x8(%rsp),%rbp
400897:	48 8b 4c 24 10       	mov    0x10(%rsp),%rcx
40089c:	31 d2                	xor    %edx,%edx
40089e:	be c2 0a 40 00       	mov    $0x400ac2,%esi
4008a3:	bf a0 11 60 00       	mov    $0x6011a0,%edi
4008a8:	48 89 e8             	mov    %rbp,%rax
4008ab:	48 f7 f1             	div    %rcx
4008ae:	48 0f af c1          	imul   %rcx,%rax
4008b2:	48 89 c3             	mov    %rax,%rbx
571

So llvm::NextPowerOf2 doesn't serve the intended behaviour.

llvm::NextPowerOf2 returns 2 for 1, 4 for 2 or 3, and 8 for 4.

We just want to make sure that new_size is a power of 2 if it isn't already, which will only be the case when the size turns out to be 3.
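
One way to get that behaviour with the helpers in llvm/Support/MathExtras.h -- a sketch, not the patch's code -- is to guard NextPowerOf2 with a power-of-two check:

#include "llvm/Support/MathExtras.h"
#include <cstdint>

// 1 -> 1, 2 -> 2, 3 -> 4, 4 -> 4; NextPowerOf2 alone would map 4 -> 8.
static uint32_t RoundUpWatchSize(uint32_t size) {
  return llvm::isPowerOf2_32(size)
             ? size
             : static_cast<uint32_t>(llvm::NextPowerOf2(size));
}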

source/Plugins/Process/Linux/NativeThreadLinux.cpp
205

This can be improved for performance; I intentionally didn't do it, to minimize changes to the generic component.

labath accepted this revision. Sep 22 2016, 5:57 AM
labath edited edge metadata.
labath added inline comments.
source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp
515–523

I tried this with clang-3.6 -O3. The entire alignDown function call compiled down to andq $-4, %rdi. I'll leave this up to you, as I think it is readable enough right now, but I don't think we should be afraid of using utility functions like this. I think LLVM cares a lot more about performance than we do, so I believe we can rely on them knowing what they're doing. Also, we have much bigger higher-level issues affecting performance, not the least of which is the change below, where you re-calculate all watchpoints on every resume.

571

OK, never mind then.

source/Plugins/Process/Linux/NativeThreadLinux.cpp
205

I don't care about the performance too much - the code isn't that hot.

What bothers me is that this leaves the code in an inconsistent state - NativeRegisterContextLinux_arm::ClearHardwareWatchpoint thinks it is doing the watchpoint removal "the old way", whereas what it actually does no longer matters, as we will nuke the watchpoint registers anyway. Ideally, I'd like to see this done in the opposite order - first switching the code to use the "nuking" approach to removing watchpoints, and only then adding the slot-reuse code (at which point it will not need to touch any generic code at all). If you want to do it the other way then go ahead, but I do expect to see a follow-up change to clean this up.

omjavaid updated this revision to Diff 72589. Sep 26 2016, 5:39 PM
omjavaid edited edge metadata.

This is a new version which, it seems to me, fully implements the functionality we intend to have here.

On second thought, nuking the ClearHardwareWatchpoint function seems to be the wrong approach here. I spent some time trying different approaches, and it turns out that if we do not ClearHardwareWatchpoint when the back-end asks us to remove it, then we won't be able to step over watchpoints. On ARM targets we have to first clear and then reinstall watchpoints in order to step over the instruction that triggered the watchpoint.

On the other hand, if we call NativeRegisterContextLinux_arm::ClearHardwareWatchpoint, then that watchpoint stands removed even if the call was just to delete the watch on one of the bytes. And if we follow up by creating a new watchpoint on a different word, the slot being used may appear vacant, which is actually inconsistent behavior.

So I have a new approach that does clear the watchpoint registers when NativeRegisterContextLinux_arm::ClearHardwareWatchpoint is called, but still tracks reference counts, by re-introducing the refcount that I removed in my last patch. This means that a follow-up create may fail just because there are still references to a disabled watchpoint and the watchpoint slot is not yet vacant. I have made changes to the test to reflect this behaviour.

Please comment if you have any reservation about this approach.

labath requested changes to this revision. Sep 27 2016, 3:27 AM
labath edited edge metadata.

Hmm.... I am indeed starting to have serious reservations about this.

The more I think about this, the more it starts to look like a big hack. So, now ClearHardwareWatchpoint still maintains a refcount on the number of users of the watchpoint slot, but it disables the slot every time the usage count is decremented (as opposed to when the refcount reaches zero)? And this is supposed to be the reason that stepping over a watchpoint (a pretty critical piece of functionality) works? And the reason why we are still able to do the watchpoints is that before a continue (but not before a single-step, because that would break the previous item) we nuke the watchpoint registers (and their reference counts) and start over?

I am not convinced that having watchpoint slot sharing is important enough to balance the amount of technical debt this introduces.

This revision now requires changes to proceed. Sep 27 2016, 3:27 AM
omjavaid updated this revision to Diff 72723. Sep 27 2016, 3:39 PM
omjavaid edited edge metadata.

Giving this approach a rethink, I don't see a lot of problems with this final implementation, unless it fails on other architectures.
We are already hacking our way to get these byte-selection watchpoints working in the existing code. The new code seems to be improving the hack, in my opinion.

Let me explain what I am doing, and maybe you can provide your suggestions and feedback.

Watchpoint Install => register the watchpoint in the hardware watchpoint register cache.
Watchpoint Enable => enable it in the cache and write the registers using the ptrace interface.

Ideally, we should be able to install/uninstall watchpoints in the cache and then enable them all on a resume.
In the case of ARM, if a watchpoint is hit we should be able to disable that watchpoint, step over the watched instruction, and then re-enable the watchpoint.
Our existing implementation would require a lot of changes to do that, so here is what I am doing instead.

SetHardwareWatchpoint => performs Install and Enable.

  • If a new watchpoint slot is going to be used, we Install and Enable.
  • For a new watchpoint we should be able to complete both Install and Enable, or we report an error.
  • If a duplicate slot is going to be used, we Install, and Enable if required.
  • Install means updating the size if needed, plus updating the ref count.
  • Enable means updating the registers if the size was updated.

ClearHardwareWatchpoint

  • Disabling and uninstalling a watchpoint means decrementing the ref count and clearing the hardware watchpoint registers.
  • The advantage of keeping ref counts is:
  • If the refcount is greater than zero, then SetHardwareWatchpoint cannot use this slot for a new watchpoint (new address).
  • But SetHardwareWatchpoint can use this slot to install duplicate watchpoints (same address but a different byte or word).

ClearAllHardwareWatchpoint => just clear the whole watchpoint cache and call SetHardwareWatchpoint for all available watchpoints.

NativeThreadLinux:

On watchpoint remove -> invalidate the watchpoint cache.
On resume -> re-validate watchpoints by creating a new cache and re-enabling all watchpoints.

So this fixes our step-over issue, and also preserves a watchpoint slot if it is being used by multiple watchpoints.
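
Condensed into code, the scheme above looks roughly like this (a sketch of the description, not the actual patch):

#include <cstdint>

struct HwSlot {
  uint32_t address = 0;
  uint32_t control = 0;  // bit 0 = enable
  uint32_t refcount = 0; // client watches sharing this slot
};

bool ClearHardwareWatchpoint(HwSlot &slot) {
  if (slot.refcount == 0)
    return false;
  --slot.refcount;
  slot.control &= ~1u; // always disabled in hardware, so step-over works
  return true;         // but the slot stays reserved while refcount > 0
}

// A watchpoint at a *new* address may only claim a slot whose refcount is
// zero; duplicates of slot.address may still share it.
bool AvailableForNewAddress(const HwSlot &slot) {
  return slot.refcount == 0;
}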

Can you think of any scenarios which might fail for this approach?

> Giving this approach a rethink, I don't see a lot of problems with this final implementation, unless it fails on other architectures.
> We are already hacking our way to get these byte-selection watchpoints working in the existing code. The new code seems to be improving the hack, in my opinion.

  • I don't believe that the presence of one hack justifies the addition of another. If anything, it's the opposite.
  • The fact that we have to fiddle with the byte selects to get the kernel to accept our watchpoints could be thought of as a "hack", but I think there it's justified, as otherwise we would be unable to watch anything that isn't dword-aligned. And most importantly, the issue is contained within the SetHardwareWatchpoint function -- no one outside of it needs to know that we are doing something funny with byte-selects. This is not the case here, as I explain below.

> Can you think of any scenarios which might fail for this approach?

I can. Try the following sequence of commands:

  • position the PC on an instruction that will write to address A (aligned)
  • set a byte watchpoint on A+0
  • set a byte watchpoint on A+1
  • clear the second watchpoint
  • single-step

What will happen with your patch applied? (*) Nothing, as the watchpoint will not get hit because you disabled it. Now, this is a pretty dubious example, but I think it demonstrates the problems I have with this solution very nicely.

You are lying to the client about which watchpoints are enabled. The client asked you to remove just one watchpoint, so as far as he is concerned the other one is still active. But you have disabled the other one as well. And now you make this a feature, as without it single-stepping a multi-watchpoint will not work. And you are relying here on the fact that the client implements the step-over-watchpoint as a simple $z2, $s, $Z2 packet sequence. If the client decides to do anything more complicated, say evaluate an expression to get some state before the update, then your feature will break, as an intermediate $c will reinstate your hidden watchpoint (in fact, this is maybe possible already, if a conditional breakpoint and a watchpoint are hit simultaneously). And you don't even fix the step-over-instruction-triggering-multiple-watchpoints problem definitively - this can still happen, e.g., if you have a watch on 0x1000 and 0x1008, and a single instruction triggers both. (**)

This is why I think this change is bad: it draws an invisible line between the lowest levels of the server and the highest levels of the client, which now have to be kept in sync, otherwise things will unexpectedly break. I don't believe that having the ability to watch more than 4 locations at once is worth it, particularly when it can be easily worked around by the user (just set one big watchpoint instead of multiple small ones). Do you have a specific use case where this would be necessary/useful? I personally have never used more than one watchpoint per debug session in my life, and I expect this to be true for most people (I think the typical use case is "who is corrupting my variable?"), so I believe that it is better to have support for fewer watchpoints, but have it working well, than the opposite.

(*) Now, when I tried this out on lldb without this change, it still did not work as expected - for some reason, after hitting and stepping over the watchpoint, lldb decided to issue $c, and lost control of the inferior. I think that tracking this issue down and fixing it would have more impact than adding the multi-watchpoint support (and it would reduce the number of corner cases where we are wrong, rather than increasing it).

(**) Another issue whose solving would have more impact than this.

Is there anything we need to do on this review?

omjavaid abandoned this revision. Nov 25 2016, 2:01 AM

There is no exact solution that satisfies all corner cases. Abandoning for now, until I come up with a solution that covers us from all corners.