This is an archive of the discontinued LLVM Phabricator instance.

[lldb][AArch64] Add type marker to ReadAll/WriteALLRegisterValues data
ClosedPublic

Authored by DavidSpickett on Jul 31 2023, 6:28 AM.

Details

Summary

While working in support for SME's ZA register, I found a circumstance
where restoring ZA after SVE, when the current SVE mode is non-streaming,
will kick the process back into FPSIMD mode. Meaning the SVE values that
you just wrote are now cut off at 128 bit.

The fix for that is to write ZA then SVE. Problem with that
is, the current ReadAll/WriteAll makes a lot of assumptions about the
saved data length.

This patch changes the format so there is a "type" written before
each data block. This tells WriteAllRegisterValues what it's looking at
without brittle checks on length, or assumptions about ordering.

If we want to change the order of restoration, all we now have to
do is change the order of saving.

This exposes a bug where the TLS registers are not restored.
This will be fixed by https://reviews.llvm.org/D156512 in some form,
depending on what lands first.

Existing SVE tests certainly check restoration and when I got this
wrong, many, many tests failed. So I think we have enough coverage
already, and more will be coming with future ZA changes.

Diff Detail

Event Timeline

DavidSpickett created this revision.Jul 31 2023, 6:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2023, 6:28 AM
DavidSpickett requested review of this revision.Jul 31 2023, 6:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2023, 6:28 AM

Even if this turns out not to be needed for Za handling, I think WriteAllRegisterValues not having to care about the layout is an improvement in itself.

Plus, this will give us a warning if we forget to save/restore new registers in future. It would have told me I had not done TLS.

And an important note, you don't need the size for each block because the scalable blocks have headers from ptrace embedded in them anyway (SVE/SSVE/ZA).

I did remove some checks that seemed redundant because all data read by WriteAll is going to have been written by ReadALL, so they were unlikely to fail.

I have some failing tests on Graviton to debug:

Failed Tests (2):
  lldb-api :: commands/register/register/aarch64_sve_registers/rw_access_static_config/TestSVERegisters.py
  lldb-api :: tools/lldb-server/TestGdbRemoteRegisterState.py

But the idea still stands.

Correct m_sve_state size to fix tests on SVE machines.

Remove misleading comment. We save the whole SVE context, FPSIMD or not.

DavidSpickett edited the summary of this revision. (Show Details)Jul 31 2023, 8:53 AM
DavidSpickett retitled this revision from [LLDB][AArch64] Add kind marker to ReadAll/WriteALLRegisterValues data to [lldb][AArch64] Add kind marker to ReadAll/WriteALLRegisterValues data.Jul 31 2023, 9:02 AM
Matt added a subscriber: Matt.Jul 31 2023, 12:12 PM

Use uint8_t for kind, fix some places doing sizeof uint32_t not of the kind type.

Put back memcopy of fpr registers, which is now tested by the parent patch.

Oh, and fixed the FIXME now that the TLS fixes have landed.

I have a feeling that expressions are becoming an expensive operation with amount of data we need to move back and forth between various buffers. Is there a way we can optimise this may be write register directly from source buffer.
Also how much minimum data we need to move when SME is enabled?

I have a feeling that expressions are becoming an expensive operation with amount of data we need to move back and forth between various buffers. Is there a way we can optimise this may be write register directly from source buffer.

I don't know about expensive literally but I did wonder if we could rely on the NativeRegisterContextLinux_arm64 buffers not being modified while the expression runs. Then do what I think you mean which is just flush them all to the process.

I'll need to work out exactly when SaveAll/WriteAll is and can be called. Part of me wonders why we even have such methods if you could just flush the existing buffers, but it wouldn't be the first time we've duplicated things.

Also how much minimum data we need to move when SME is enabled?

At minimum the SVE registers and the ZA header and tpidr2. So ~ (45*16) + 16 + 16 + 8. If ZA is on then you've got another 16*16 on top of that.

But to be clear, the amount of data is not the issue this change aims to address, it's purely the ordering of the restoration.

Is there a way we can optimise this may be write register directly from source buffer.

Though now I wonder if when you say source here you mean the member buffers in the register context, or literally the src pointer we use in the code.

I don't know about expensive literally but I did wonder if we could rely on the NativeRegisterContextLinux_arm64 buffers not being modified while the expression runs. Then do what I think you mean which is just flush them all to the process.

I'll need to work out exactly when SaveAll/WriteAll is and can be called. Part of me wonders why we even have such methods if you could just flush the existing buffers, but it wouldn't be the first time we've duplicated things.

I guess it looks like a lucrative solution but process/registers state between two consecutive stops can not be determined hence we are stuck with old fashioned way of save/restore.

At minimum the SVE registers and the ZA header and tpidr2. So ~ (45*16) + 16 + 16 + 8. If ZA is on then you've got another 16*16 on top of that.

But to be clear, the amount of data is not the issue this change aims to address, it's purely the ordering of the restoration.

Agreed, I am just curious if we can save some cycles by avoiding copying first from source to cache and then writing to registers. Specially when we are going to invalidate the cache anyway after the write.

I understand now. If we take WriteTLS for example, that ends with m_tls_is_valid = false;. So you're suggesting instead we have a WriteTLS whose source is not the usual TLS buffer, but the set of saved register data instead. I'll look into it.

clang-format some missing bits.

Replace a missing memcpy in the FPR case. This only works because the buffer happens to
still contain the previous state. If there is some route to restore arbitray states,
this would be broken, but I don't know how that might happen.

First thing to note is that WriteRegister also behaves this way, but there it is more appropriate because it updates only part of the buffer before writing it out in its entirety. Useful to know where the pattern came from though.

You would need roughly the following per WriteXYZ:

-      error = WriteTLS();
+      error = WriteTLS(src);

-Status NativeRegisterContextLinux_arm64::WriteTLS() {
+Status NativeRegisterContextLinux_arm64::WriteTLS(const void* src/*=nullptr*/) {

-  ioVec.iov_base = GetTLSBuffer();
+  ioVec.iov_base = src ? const_cast<void*>(src) : GetTLSBuffer();

-  Status WriteTLS();
+  Status WriteTLS(const void* src=nullptr);

We can assume that the buffer is the same as the data to be written back if it's something static like TLS. For SVE/SME, we would have resized our buffer first, so it holds there too.

The added complexity isn't that much but I think it adds more thinking time for a developer than it potentially saves in copying time. I already feel like the separate xyz_is_valid is enough to think about and having a potential second data source just adds to that load.

From the header docs it seems I was right that this is used primarily for expression evaluation:

// These two functions are used to implement "push" and "pop" of register
// states.  They are used primarily for expression evaluation, where we need
// to push a new state (storing the old one in data_sp) and then restoring
// the original state by passing the data_sp we got from ReadAllRegisters to
// WriteAllRegisterValues.

Which you would be doing a lot of in a formatter for example, but you'd get better savings implementing a more efficient packet format to do all that at once, I guess.

QSaveRegisterState / QRestoreRegisterState packets call it as part of expression evaluation, though in theory it's not always that. That's an lldb extension anyway so we're in control of it at least. In theory this could be used to restore state that is not just the previous state but I don't know how you'd trigger that from lldb.

The other use is NativeProcessLinux::Syscall which is sufficiently rare we can ignore that.

I did do a very rough benchmark where I printed the same expression 2000 times, so each one is doing a save/restore. Once with the code in this review right now, and again with this potential optimisation added to GPR/FPR/TLS (I'm on a Mountain Jade machine without SVE). Caveat shared machine, made up benchmark, etc. but all runs of both hovered between 16 and 17 seconds. Neither seemed to be consistently lower or higher than the other. Doesn't mean this isn't a speedup in isolation but if it is, it's dwarfed by the syscalls and packets sent back and forth.

First thing to note is that WriteRegister also behaves this way, but there it is more appropriate because it updates only part of the buffer before writing it out in its entirety. Useful to know where the pattern came from though.

You would need roughly the following per WriteXYZ:

-      error = WriteTLS();
+      error = WriteTLS(src);

-Status NativeRegisterContextLinux_arm64::WriteTLS() {
+Status NativeRegisterContextLinux_arm64::WriteTLS(const void* src/*=nullptr*/) {

-  ioVec.iov_base = GetTLSBuffer();
+  ioVec.iov_base = src ? const_cast<void*>(src) : GetTLSBuffer();

-  Status WriteTLS();
+  Status WriteTLS(const void* src=nullptr);

We can assume that the buffer is the same as the data to be written back if it's something static like TLS. For SVE/SME, we would have resized our buffer first, so it holds there too.

The added complexity isn't that much but I think it adds more thinking time for a developer than it potentially saves in copying time. I already feel like the separate xyz_is_valid is enough to think about and having a potential second data source just adds to that load.

From the header docs it seems I was right that this is used primarily for expression evaluation:

// These two functions are used to implement "push" and "pop" of register
// states.  They are used primarily for expression evaluation, where we need
// to push a new state (storing the old one in data_sp) and then restoring
// the original state by passing the data_sp we got from ReadAllRegisters to
// WriteAllRegisterValues.

Which you would be doing a lot of in a formatter for example, but you'd get better savings implementing a more efficient packet format to do all that at once, I guess.

QSaveRegisterState / QRestoreRegisterState packets call it as part of expression evaluation, though in theory it's not always that. That's an lldb extension anyway so we're in control of it at least. In theory this could be used to restore state that is not just the previous state but I don't know how you'd trigger that from lldb.

The other use is NativeProcessLinux::Syscall which is sufficiently rare we can ignore that.

I did do a very rough benchmark where I printed the same expression 2000 times, so each one is doing a save/restore. Once with the code in this review right now, and again with this potential optimisation added to GPR/FPR/TLS (I'm on a Mountain Jade machine without SVE). Caveat shared machine, made up benchmark, etc. but all runs of both hovered between 16 and 17 seconds. Neither seemed to be consistently lower or higher than the other. Doesn't mean this isn't a speedup in isolation but if it is, it's dwarfed by the syscalls and packets sent back and forth.

I mostly agree with what you have above. I was only thinking about ever increasing size of this buffer and thought if we can find a way around duplication. Most probably we ll never have SVE/SME enabled on a slow/small machine to bother us. The only area of concern could be a LLDB running on a resource constrained container but we can ignore that for now.

The only area of concern could be a LLDB running on a resource constrained container but we can ignore that for now.

Ironically we're the ones who are most likely to see that issue :)

omjavaid added inline comments.Sep 1 2023, 4:05 AM
lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp
573

I think we have lost some of code readability of ReadAllRegisterValues function by introducing AddSavedRegisters mechanism. It does remove some duplication of memcpy but it also introduces some differentiation between Read/WritellRegisterValues functions. I am inclined that we should put the memcpy into ReadAllRegisterValues to make it look similar to its Write counterpart.

DavidSpickett added inline comments.Sep 1 2023, 4:50 AM
lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp
573

I added it to avoid repeating:

Add kind whatever
::memcpy(dst, src, size);
dst = dst + size;

Or rather, not repeating it because it was very easy to forget to advance the pointer. The same applies to WriteAllRegisterValues, but there is some more logic in between the steps so I didn't do it there.

If memcpy returned a pointer one beyond the end of the buffer I wouldn't mind doing:

Add kind
dst = memcpy(...)

But it only returns the original destination pointer.

So I'd rather not inline all the memcopies here but I could try harder to add a convenience function for WriteAllRegisterValues, so they more closely match.

So this is almost the opposite of what you asked for, but I made WriteAllRegisterValues
similar to ReadAll by giving it it's own convenient wrapper function.

Which ensures you don't forget to set a valid bool, or increment the src pointer.

Also I realised I can dedupe a lot of if (error.Fail()). SVE is a bit special due
to writing header then header+register, so it does the header manually.

So hopefully it is more readable in both in the sense that Readall/writeall is almost declarative. The details are all in the helpers so you can't forget one of them.

omjavaid added inline comments.Sep 1 2023, 7:26 AM
lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp
539

Now that we are trying to make both Read and Write look similar lets pull out size calculation into a separate helper function.

577–578

Another point that just came to my mind is the kind terminology used here. In LLDB we differentiate registers into kinds example: LLDB kinds, DWARF kinds. For differentiating between register in this an appropraite term would be "register set". So lets rename this to SaveRegisterSet/RestoreRegisterSet.

omjavaid added inline comments.Sep 1 2023, 7:29 AM
lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp
503

Consider renaming this to RegisterSetName or RegisterSetType

Kind -> Type

Pull out the caching and size calculation into its own function.

DavidSpickett marked 2 inline comments as done.Sep 4 2023, 9:32 AM
DavidSpickett retitled this revision from [lldb][AArch64] Add kind marker to ReadAll/WriteALLRegisterValues data to [lldb][AArch64] Add type marker to ReadAll/WriteALLRegisterValues data.
DavidSpickett edited the summary of this revision. (Show Details)
omjavaid accepted this revision.Sep 10 2023, 7:05 PM
This revision is now accepted and ready to land.Sep 10 2023, 7:05 PM
This revision was landed with ongoing or failed builds.Sep 10 2023, 11:57 PM
This revision was automatically updated to reflect the committed changes.