This patch adds XRay support in compiler-rt for AArch64 targets.
This patch is one of a series:
Thanks for doing this @rSerge! Do you have a plan for supporting tail-exit sleds either now or at a later time?
I ask because the x86_64 code has already started differentiating tail-exit sleds by giving them their own trampoline. Maybe this distinction isn't necessary on ARM, but I was wondering whether something needed to be done here.
I saw your recent commit with tail-exit sleds, https://reviews.llvm.org/D26020, and I plan to implement this too, but not in the current series of commits. ARM32 doesn't have this either, so I would rather commit ARM32 and AArch64 tail-exit sled handling together than diverge between ARM32 and AArch64 in the current series. ARM32 also still doesn't support Thumb CPU mode, so some prioritizing is needed to decide which to do first, and definitely not all at once.
lib/xray/xray_interface.cc:29
Changing.
lib/xray/xray_trampoline_AArch64.S:32
This situation is rare. Usually, if the handler is set to nullptr, the user code is unpatched, so the trampolines are not called. So there is no performance issue here.
lib/xray/xray_trampoline_AArch64.S:32
Right, makes sense.
I'm not sure I follow. You have arranged for the first branch to be updated last, atomically, to make sure there are no race conditions. Is this about users calling the thunk while the update is ongoing, who can see the code but don't yet have a function pointer? In any case, if there is a race condition, it needs to be fixed, not left to work by chance.
lib/xray/xray_trampoline_AArch64.S:32
In x86_64 at least, there's no real "race" here, because the pointer is implemented as a std::atomic<function_ptr>. The sequence here, at a high level, should be:
This is at least the reason why the load of the pointer happens after the registers have been stashed onto the stack. Does that make sense?
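To make that ordering concrete, here is a minimal C++ sketch of the sequence described above, under the assumption that an acquire load is what the trampoline wants (the real trampolines are hand-written assembly; the enum and function names below are simplified stand-ins, not the actual declarations):

    #include <atomic>
    #include <cstdint>

    // Simplified stand-ins for the real declarations in XRay's headers.
    enum XRayEntryType { ENTRY = 0, EXIT = 1 };
    std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction{nullptr};

    void trampoline_sketch(int32_t FuncId) {
      // 1) Stash the registers first (done in assembly in the real code).
      // 2) Load the handler pointer once; nullptr is a legitimate value.
      auto Handler = XRayPatchedFunction.load(std::memory_order_acquire);
      // 3) Call through the local copy only if it was non-null at load time.
      if (Handler != nullptr)
        Handler(FuncId, ENTRY);
      // 4) Restore the registers and return (assembly again in the real code).
    }

Whether the load really needs acquire semantics, or a plain load suffices, is exactly the question this thread goes on to discuss.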
lib/xray/xray_trampoline_AArch64.S:32
Hmm, I wasn't expecting the pointer to move back to nullptr at any time. In this case, there's still the race condition that the value changes between the comparison (CMP X2, #0) and the jump (BEQ), which is, admittedly, very unlikely, but not impossible.
lib/xray/xray_trampoline_AArch64.S:32
Yeah, that's a feature (see __xray_remove_handler()) to allow turning off XRay logging at runtime in a sequence that guarantees things will continue to work. To disable XRay at runtime, users may:
Both of these are optional and could be performed in either order (so the operations need to be thread-safe).
The way we handle this in x86_64, at least, is to load the value from the global into a register, then perform the comparison with an immediate value (0). It looks like this code is already doing almost exactly the same thing -- maybe we need to ensure that the load is synchronised? My non-familiarity with ARM64 is showing here.
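A minimal sketch of one way to do this, assuming the two optional steps are removing the handler and unpatching the sleds (return types simplified for the sketch; see xray_interface.h for the real declarations):

    // Clears XRayPatchedFunction back to nullptr.
    extern "C" int __xray_remove_handler();
    // Restores the original (unpatched) sled code.
    extern "C" int __xray_unpatch();

    void stop_xray_logging() {
      // Either order works: patched sleds tolerate a nullptr handler, and an
      // installed handler is simply never reached once the sleds are unpatched.
      __xray_remove_handler();
      __xray_unpatch();
    }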
lib/xray/xray_trampoline_AArch64.S:32
There is a possibility that the handler is removed or changed (in another thread) after the current thread's BEQ but before the old handler is called, or even while the current thread is in the handler. I think we had this discussion for x86_64 XRay and decided that the handler code must be written to handle the situation where the handler function is called after it has been removed or changed via the XRay API. Completely eliminating the possibility that the old handler is called would require heavy synchronization on the XRay side, and would still not eliminate the possibility that the handler is executing in one thread while being removed in another. This heavy synchronization is undesirable for a tracing component (XRay), so it seems better to impose some restrictions on the handler code and allow spurious calls (much as condition variables allow spurious wakeups, because it is too costly to avoid them).
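As an illustration of that restriction, a handler written for this model might look like the following hypothetical sketch, which simply tolerates being invoked after logging has been turned off (all names here are illustrative, not part of the XRay API):

    #include <atomic>
    #include <cstdint>

    enum XRayEntryType { ENTRY = 0, EXIT = 1 };  // simplified stand-in

    std::atomic<bool> LoggingActive{false};  // the handler's own on/off switch

    void MyHandler(int32_t FuncId, XRayEntryType Type) {
      // A trampoline may still invoke us briefly after the handler has been
      // removed or changed, so treat such spurious calls as cheap no-ops.
      if (!LoggingActive.load(std::memory_order_acquire))
        return;
      // ... record FuncId and Type ...
    }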
lib/xray/xray_trampoline_AArch64.S:32
Here's the discussion we had earlier on this: https://groups.google.com/d/msg/llvm-dev/Ft1XUeiSKgw/iABdpOTSCAAJ
Right, so I think the current scenario is:
1. Disabled traces won't get there at all, as they'll branch +32 directly.
2. Enabled traces will make use of the push/pop sequence, so there is no slowdown.
3. In rare cases, the handler will be removed in the middle of an enabled call, which adds synchronisation problems.
To fix 3, there are three ways:
1. Make the handler an atomic pointer, complicating the thunk code and making it bigger and slower.
2. Make sure any update to the function pointer is atomic with regard to its use, which complicates the tracing logic.
3. Reduce the probability in the tracing code (some atomics/volatile) and "hope" that the distance between CMP and BEQ is small enough.
We seem to be going for 3.
Given that this only affects XRay, I don't mind this implementation, as long as users are aware of it. The XRay documentation [1] doesn't seem to cover that.
@dberris, it's up to you, really. :)
cheers,
--renato
For 3, there are a few things working together:
- When tracing is enabled, the sleds have been overwritten to jump/call into the trampolines. The code here defines the trampolines. Note that the sleds can be made to jump/call into the trampolines even if the global log handler is nullptr, because that handler is already defined to be a std::atomic<function_ptr>.
- The global XRayPatchedFunction is the installed "log handler". This may be nullptr, and the trampolines should load it in an atomic manner (potentially synchronised on some platforms) before calling it. In x86_64 we're almost assured that this is safe with a normal mov instruction. This is independent of whether XRay instrumentation is on or off (i.e., the sleds need not be patched or unpatched for this pointer to be defined).
- The trampolines (defined in assembly, for complete control) must check whether the function pointer in XRayPatchedFunction is valid (i.e., not nullptr) before invoking it. The trampolines still have to abide by the calling conventions of the platform, and therefore must not assume that the pointer is non-null (and must gracefully handle the case when it is nullptr).
> To fix 3, there are three ways:
> 1. Make the handler an atomic pointer, complicating the thunk code and making it bigger and slower.
We already do this for x86_64, where it amounts to a normal mov instruction. Maybe something else needs to be done for ARM64?
> 2. Make sure any update to the function pointer is atomic with regard to its use, which complicates the tracing logic.
This is already the case in the C++ code (xray_interface.cc; see __xray_set_handler(...)). I'm not sure whether the ARM64 assembly needs to do anything different to load the function pointer into a register with "sequentially consistent" or even acquire/release semantics.
> 3. Reduce the probability in the tracing code (some atomics/volatile) and "hope" that the distance between CMP and BEQ is small enough.
> We seem to be going for 3.
At least in x86_64, because we copy the pointer at one point into a register and compare the register, the original data may have already changed after the copy. Like Serge mentioned, we implicitly require that the handler function be able to handle spurious invocations.
> Given that this only affects XRay, I don't mind this implementation, as long as users are aware of it. The XRay documentation [1] doesn't seem to cover that.
> @dberris, it's up to you, really. :)
Thanks -- I'll fix the documentation to at least make it clear what the requirements/expectations on the log handler function ought to be.
It's the ARM64 assembly that I have no knowledge of -- in particular, whether there's a difference between a synchronised/atomic load of a pointer-sized value into a register and a normal load operation. If the way it's written now is already thread-safe (even if relaxed), then this is fine by me. :)
cheers,
--renato
Thanks Renato!
Just to affirm that I'm fine with this going in, as long as we're confident that the load on ARM64 through the atomic pointer doesn't need to be synchronised further.
Normal loads are not atomic on AArch64, so I still need to understand what the x86 code does and what the guarantee is.
Do you mean that LDR X2, [X1] may load part of the data pointed to by X1, then get interrupted so that another thread changes the other part of the data, and then LDR X2, [X1] continues reading, so that it reads combined/corrupted data?
Or do you mean the weak ordering of instructions executed on AArch64? The latter should not be a problem, because this sequence of instructions carries a dependency from each instruction to the next:
  LDR X1, =_ZN6__xray19XRayPatchedFunctionE
  LDR X2, [X1]
  CMP X2, #0
  BEQ FunctionEntry_restore
No. I mean in terms of loads and stores. I was trying to understand what guarantees x86 provides, to make sure they're the same on AArch64.
>   LDR X1, =_ZN6__xray19XRayPatchedFunctionE
>   LDR X2, [X1]
>   CMP X2, #0
>   BEQ FunctionEntry_restore
I'm not thinking of this sequence, but of multiple cores executing the same code.
If you guarantee that the symbol _ZN6__xray19XRayPatchedFunctionE will always be there (I think you can, as you're not patching that one) and that calling it is *always* safe, even if tracing has already been disabled (for the remaining threads that have already loaded it into X1 and will jump to it), then it should be fine.
If enabling/disabling the trace must be atomic (i.e., things will break horribly if _ZN6__xray19XRayPatchedFunctionE is called after being disabled), then you need a DMB, because the store happens in one thread and the load in another.
Makes sense?
We require that the handler function handle spurious calls well (i.e., calls made after it has been removed or changed to another handler). Perhaps this is not documented anywhere except our discussions on the mailing list, but it is so for all the currently supported CPUs.
_ZN6__xray19XRayPatchedFunctionE is a global variable, so it will always be there. Its definition and setting are in the xray_interface.cc file (not touched by my current changes).
Definition:
  // This is the function to call when we encounter the entry or exit sleds.
  std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction{nullptr};
Setting:
  __xray::XRayPatchedFunction.store(entry, std::memory_order_release);
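For reference, here is a hypothetical C++ rendering of the matching load on the trampoline side; the real load is the LDR sequence quoted earlier, and whether that LDR needs extra synchronisation is the open question in this thread (names are simplified stand-ins):

    #include <atomic>
    #include <cstdint>

    enum XRayEntryType { ENTRY = 0, EXIT = 1 };  // simplified stand-in
    extern std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction;

    void call_handler_sketch(int32_t FuncId) {
      // An acquire load here would pair with the release store above. The
      // handler observed may still be removed concurrently; the sketch assumes
      // the spurious-call contract, so no heavier barrier is added.
      auto Handler = XRayPatchedFunction.load(std::memory_order_acquire);
      if (Handler != nullptr)
        Handler(FuncId, ENTRY);
    }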
Added a comment about the requirement on the XRay user handler function.
Repeated the changes of https://reviews.llvm.org/D26597: Disable XRay instrumentation of the XRay runtime.
I'd rather have the repetition here. This is looking odd... :)