This is an archive of the discontinued LLVM Phabricator instance.

[XRay] Support AArch64 in compiler-rt
ClosedPublic

Authored by rSerge on Nov 8 2016, 12:29 PM.

Event Timeline

rSerge updated this revision to Diff 77237.Nov 8 2016, 12:29 PM
rSerge retitled this revision from to [XRay] Support AArch64 in compiler-rt.
rSerge updated this object.
rSerge added reviewers: dberris, rengolin.
rSerge added subscribers: iid_iunknown, llvm-commits.
rSerge updated this object.Nov 8 2016, 12:32 PM
dberris requested changes to this revision.Nov 8 2016, 7:07 PM
dberris edited edge metadata.

Thanks for doing this @rSerge! Do you have a plan for supporting tail-exit sleds either now or at a later time?

I ask because the x86_64 code has already started differentiating tail-exit sleds by having its own trampoline. Maybe this distinction isn't necessary in ARM, but was wondering whether there was something that needed to be done here.

This revision now requires changes to proceed.Nov 8 2016, 7:07 PM

Thanks for doing this @rSerge! Do you have a plan for supporting tail-exit sleds either now or at a later time?

I ask because the x86_64 code has already started differentiating tail-exit sleds by having its own trampoline. Maybe this distinction isn't necessary in ARM, but was wondering whether there was something that needed to be done here.

I saw your recent commit with tail-exit sleds, https://reviews.llvm.org/D26020, and I plan to implement this too, but not in the current series of commits. ARM32 doesn't have this either, so I would rather commit ARM32 and AArch64 tail-exit sled handling together than diverge between ARM32 and AArch64 in the current series. ARM32 also still doesn't support Thumb CPU mode, so some prioritization is needed as to which to do first; it's definitely not all doable at once.

I see. I'll defer to @rengolin for the ARM-specific bits.

rengolin added inline comments.Nov 11 2016, 7:13 AM
lib/xray/xray_interface.cc
29

I'd rather have the repetition here. This is looking odd... :)

lib/xray/xray_trampoline_AArch64.S
32

can't you compare the address before saving and restoring all those registers?

rSerge updated this revision to Diff 77674.Nov 11 2016, 2:42 PM
rSerge edited edge metadata.
rSerge marked an inline comment as done.
rSerge added inline comments.
lib/xray/xray_interface.cc
29

Changing.

lib/xray/xray_trampoline_AArch64.S
32

This situation is rare. Usually, if the handler is set to nullptr, the user code is unpatched, so the trampolines are not called, and there is no performance issue here.
However, there is a race condition issue. It is important to check for a nullptr handler as late as possible before the call. If we pushed the registers after checking for a nullptr handler, we would increase the chances that the handler is not nullptr when it is checked but nullptr when it is called.
In any case, we currently require that the handler function work correctly even if it is called after its removal via the XRay API.

rengolin added inline comments.Nov 14 2016, 3:53 AM
lib/xray/xray_trampoline_AArch64.S
32

This situation is rare. Usually, if the handler is set to nullptr, the user code is unpatched, so the trampolines are not called, and there is no performance issue here.

Right, makes sense.

However, there is a race condition issue. It is important to check for a nullptr handler as late as possible before the call. If we pushed the registers after checking for a nullptr handler, we would increase the chances that the handler is not nullptr when it is checked but nullptr when it is called.

I'm not sure I follow.

You have made the last update (the write of the initial branch) atomic, to make sure there are no race conditions.

Is this about users calling the thunk while the update is ongoing, who can see the patched code but still don't have a function pointer?

In any case, if there is a race condition, it needs to be fixed, not work by chance.

dberris added inline comments.Nov 14 2016, 4:00 AM
lib/xray/xray_trampoline_AArch64.S
32

I'm not sure I follow.

You have made the last update (the write of the initial branch) atomic, to make sure there are no race conditions.

Is this about users calling the thunk while the update is ongoing, who can see the patched code but still don't have a function pointer?

In any case, if there is a race condition, it needs to be fixed, not work by chance.

In x86_64 at least, there's no real "race" here because the pointer is implemented as a std::atomic<function_ptr>. The sequence here though at a high level should be:

  1. The code executing somehow knew to get to this trampoline.
  2. While some threads are running through this trampoline, another updates the global atomic pointer to nullptr.
  3. All threads that encounter the load for the function pointer should see the updated value of the pointer (since the store "happens before" the load of the pointer).

This is at least the reason why the load of the pointer happens after the registers have been stashed onto the stack.

Does that make sense?
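
For illustration, here is a rough C++ analogue of that sequence (a sketch only, not code from the patch; the real trampoline is written in assembly, and TrampolineEntrySketch and FuncId are placeholder names, with the header path assumed):

#include <atomic>
#include <cstdint>
#include "xray/xray_interface.h"  // for XRayEntryType (path assumed)

namespace __xray {
// The global handler pointer defined in xray_interface.cc.
extern std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction;
}

// Conceptual body of the entry trampoline, after the registers have been stashed.
static void TrampolineEntrySketch(int32_t FuncId) {
  auto Handler = __xray::XRayPatchedFunction.load();  // one load into a "register"
  if (Handler == nullptr)  // corresponds to CMP X2, #0 / BEQ in the assembly
    return;                // registers are restored and the trampoline returns
  // The handler may have been removed between the load and this call;
  // handlers are required to tolerate such spurious invocations.
  Handler(FuncId, XRayEntryType::ENTRY);
}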

rengolin added inline comments.Nov 14 2016, 4:22 AM
lib/xray/xray_trampoline_AArch64.S
32
  2. While some threads are running through this trampoline, another updates the global atomic pointer to nullptr.

Hum, I wasn't expecting the pointer to move back to nullptr at any time.

In this case, there's still the race condition that the value changes in between the comparison (CMP X2, #0) and the jump (BEQ), which is, admittedly, very unlikely but non-zero.

dberris added inline comments.Nov 14 2016, 4:42 AM
lib/xray/xray_trampoline_AArch64.S
32

Hum, I wasn't expecting the pointer to move back to nullptr at any time.

Yeah, that's a feature (see __xray_remove_handler()) to allow for turning off XRay logging at runtime in a sequence that guarantees things will continue to work. To disable XRay at runtime, users may:

  1. Call __xray_remove_handler() to ensure that the currently installed logging handler is removed atomically (in a cross-platform manner).
  2. Call __xray_unpatch() to return the state of the sleds to "neutral".

Both of these are optional and could be performed in any order (thus both operations need to be thread-safe); a usage sketch follows below.
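
A minimal sketch of that shutdown sequence, assuming the declarations from xray/xray_interface.h (StopXRayLogging is a hypothetical wrapper, not part of the API):

#include "xray/xray_interface.h"

// Hypothetical helper showing the order described above.
void StopXRayLogging() {
  // 1. Atomically remove the installed handler. Trampolines that have already
  //    loaded the old pointer may still invoke it once more.
  __xray_remove_handler();
  // 2. Optionally return the patched sleds to their "neutral" state.
  __xray_unpatch();
}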

In this case, there's still the race condition that the value changes in between the comparison (CMP X2, #0) and the jump (BEQ), which is, admittedly, very unlikely but non-zero.

The way we handle this in x86_64, at least, is to load the value from the global into a register, then perform the comparison against an immediate value (0). It looks like this code is already doing almost exactly the same thing -- maybe we need to ensure that the load is synchronised? My unfamiliarity with ARM64 is showing here.

rSerge marked an inline comment as done.Nov 14 2016, 7:52 AM
rSerge added inline comments.
lib/xray/xray_trampoline_AArch64.S
32

There is a possibility that the handler is removed or changed (in another thread) after the current thread's BEQ but before the old handler is called, or even while the current thread is inside the handler. I think we had this discussion for x86_64 XRay and decided that the handler code must be implemented so that it gracefully handles the situation where the handler function is called after it has been removed or changed via the XRay API. Completely eliminating the possibility that the old handler is called would require heavy synchronization on the XRay side, and it would still not eliminate the possibility that the handler is executing in one thread while being removed in another. Such heavy synchronization is undesirable for a tracing component (XRay), so it seems better to impose a restriction on the handler code to allow spurious calls (just as condition variables allow spurious wakeups, because avoiding them is too costly).
I just minimized the chances of a spurious handler call by moving the handler check as close as possible to the call.
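
A handler written under that restriction might look like the following sketch (ExampleHandler and ShuttingDown are hypothetical names; only the tolerance for late, spurious calls is the point):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include "xray/xray_interface.h"

static std::atomic<bool> ShuttingDown{false};  // set by the code that removes the handler

void ExampleHandler(int32_t FuncId, XRayEntryType Type) {
  // The trampoline may still call us shortly after __xray_remove_handler(),
  // so never assume that resources tied to the handler's lifetime are still valid.
  if (ShuttingDown.load(std::memory_order_acquire))
    return;
  std::fprintf(stderr, "xray: function %d, entry type %d\n",
               static_cast<int>(FuncId), static_cast<int>(Type));
}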

rSerge added inline comments.Nov 14 2016, 7:55 AM
lib/xray/xray_trampoline_AArch64.S
32

Here's the discussion we had earlier on this: https://groups.google.com/d/msg/llvm-dev/Ft1XUeiSKgw/iABdpOTSCAAJ

rengolin edited edge metadata.Nov 14 2016, 8:24 AM

Right, so I think the current scenario is:

  1. Disabled traces won't get there at all, as they'll branch +32 directly.
  2. Enabled traces will make use of the push/pop sequence, so no slow down.
  3. In rare cases, the handler will be removed in between an enabled call, which adds sync problems.

To fix 3, there are three ways:

  1. Make that an atomic pointer, complicating the thunk code and making it bigger and slower.
  2. Make sure any update to the function pointer is atomic with regard to its use, which complicates the tracing logic.
  3. Reduce the probability in the tracing code (some atomics/volatile) and "hope" that the distance between CMP and BEQ is small enough.

We seem to be going for 3.

Given that this only affects XRay, I don't mind this implementation, as long as the users are aware of it. The XRay documentation [1] doesn't seem to cover that.

@dberris, it's up to you, really. :)

cheers,
--renato

[1] http://llvm.org/docs/XRay.html

Right, so I think the current scenario is:

  1. Disabled traces won't get there at all, as they'll branch +32 directly.
  2. Enabled traces will make use of the push/pop sequence, so no slow down.
  3. In rare cases, the handler will be removed in between an enabled call, which adds sync problems.

For 3, there's a few things working together:

  • When tracing is enabled, the sleds have been overwritten to jump/call into the trampolines. Code here defines the trampolines. Note that the sleds can be made to jump/call into the trampolines even if the global log handler is nullptr, because that's already defined to be an std::atomic<function_ptr>.
  • The global XRayPatchedFunction is the installed "log handler". This may be nullptr and the trampolines should be loading this in an atomic manner (potentially synchronised on some platforms) before calling it. In x86_64 we're almost assured that this is safe with a normal mov instruction. This is independent of whether XRay instrumentation is on/off (i.e. it's not required that the sleds are patched or unpatched for this pointer to be defined).
  • The trampolines (defined in assembly for complete control) must check whether the function pointer in XRayPatchedFunction is valid (i.e. not nullptr) before invoking it. The trampolines still have to abide by the calling conventions of the platform, and therefore must not assume that the pointer is not nullptr (and must gracefully handle the case when it is nullptr).

To fix 3, there are three ways:

  1. Make that an atomic pointer, complicating the thunk code and making it bigger and slower.

We already do this for x86_64, where it amounts to a normal mov instruction. Maybe something else needs to be done for ARM64?

  2. Make sure any update to the function pointer is atomic with regard to its use, which complicates the tracing logic.

This is already the case in the C++ code (xray_interface.cc, see __xray_set_handler(...)). I'm not sure whether the ARM64 assembly should do anything different to load the function pointer into a register in a "sequentially consistent" manner, or even with acquire/release semantics.

  3. Reduce the probability in the tracing code (some atomics/volatile) and "hope" that the distance between CMP and BEQ is small enough.

We seem to be going for 3.

At least in x86_64, because we copy the pointer at one point into a register and compare the register, the original data may have already changed after the copy. Like Serge mentioned, we implicitly require that the handler function be able to handle spurious invocations.

Given that this only affects XRay, I don't mind this implementation, as long as the users are aware of it. The XRay documentation [1] doesn't seem to cover that.

@dberris, it's up to you, really. :)

Thanks -- I'll fix the documentation to at least make it clear what the requirements/expectations on the log handler function ought to be.

It's the ARM64 assembly that I don't know well enough to say whether there's a difference between a synchronised/atomic load of a pointer-sized value into a register and a normal load operation. If the way it's written now is already thread-safe (even if it's relaxed), then this is fine by me. :)

cheers,
--renato

Thanks Renato!

[1] http://llvm.org/docs/XRay.html

dberris accepted this revision.Nov 15 2016, 5:20 PM
dberris edited edge metadata.

Just to affirm that I'm fine with this going in, as long as we're confident that the load on ARM64 through the atomic pointer doesn't need to be synchronised further.

This revision is now accepted and ready to land.Nov 15 2016, 5:20 PM

Normal loads are not atomic on aarch64, so I still need to understand what the x86 code does and what the guarantee is.

rSerge added a comment.EditedNov 16 2016, 4:43 AM

Normal loads are not atomic on aarch64, so I still need to understand what the x86 code does and what the guarantee is.

Do you mean that LDR X2, [X1] may load part of the data pointed to by X1, then get interrupted so that another thread changes the other part of the data, and then LDR X2, [X1] continues reading, so that it ends up with combined/corrupted data?

Or do you mean weak ordering of the instructions being executed on AArch64? The latter should not be a problem, because each instruction in this sequence carries a dependency on the previous one:

LDR X1, =_ZN6__xray19XRayPatchedFunctionE
LDR X2, [X1]
CMP X2, #0
BEQ FunctionEntry_restore

Do you mean that LDR X2, [X1] may load part of the data pointed to by X1, then get interrupted so that another thread changes the other part of the data, and then LDR X2, [X1] continues reading, so that it ends up with combined/corrupted data?

No. I mean in terms of loads and stores. I was trying to understand what guarantees x86 provides, to make sure they're the same on AArch64.

LDR X1, =_ZN6__xray19XRayPatchedFunctionE
LDR X2, [X1]
CMP X2, #0
BEQ FunctionEntry_restore

I'm not thinking about this sequence, but about multiple cores executing the same code.

If you guarantee that the symbol _ZN6__xray19XRayPatchedFunctionE will always be there (I think you can, as you're not patching that one) and that calling it is *always* safe, even if tracing has already been disabled (for the case where you disable it and remaining threads that have already loaded the address into X1 still jump to it), then it should be fine.

If enabling/disabling the trace must be atomic (i.e., things will break horribly if _ZN6__xray19XRayPatchedFunctionE is called after being disabled), then you need a DMB, because the store is in one thread and the load in another.

Makes sense?

We require that the handler function handle spurious calls well (i.e. calls made after it has been removed or changed to another handler). Perhaps this is not documented anywhere except our discussions on the mailing list, but it is so for all the currently supported CPUs.
_ZN6__xray19XRayPatchedFunctionE is a global variable, so it should always be there. Its definition and setting are in the xray_interface.cc file (not touched by my current changes).
Definition:

// This is the function to call when we encounter the entry or exit sleds.
std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction{nullptr};

Setting:

__xray::XRayPatchedFunction.store(entry, std::memory_order_release);
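
In C++ terms, that release store pairs with an acquire (or stronger) load on the reader side; the fragment below is purely illustrative, not code from the patch (the actual trampoline performs the load in assembly, and FuncId stands for the function id supplied by the patched sled):

// Illustrative reader side of the release store above.
void (*Handler)(int32_t, XRayEntryType) =
    __xray::XRayPatchedFunction.load(std::memory_order_acquire);
if (Handler != nullptr)
  Handler(FuncId, XRayEntryType::ENTRY);
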
rengolin accepted this revision.Nov 16 2016, 6:11 AM
rengolin edited edge metadata.

It should be fine, then. Adding that comment somewhere would be nice, though.

rSerge updated this revision to Diff 78214.Nov 16 2016, 10:28 AM
rSerge edited edge metadata.

Added a comment about the requirement on the XRay user handler function.
Repeated the changes from https://reviews.llvm.org/D26597: Disable XRay instrumentation of the XRay runtime.

@dberris, @rengolin, could you check whether the latest changes (mentioned in the comment above) are OK and, if so, commit them to mainline?

The comment looks good to me, thanks!

This revision was automatically updated to reflect the committed changes.