This is an archive of the discontinued LLVM Phabricator instance.

Use C++11 atomics for ticket locks implementation
ClosedPublic

Authored by pawosm01 on May 3 2016, 11:13 AM.

Details

Summary

This patch replaces use of compiler builtin atomics with
C++11 atomics for ticket locks implementation. Ticket locks
are used in critical places of the runtime, e.g. in the tasking
mechanism.

The main reason this change was introduced is the problem
with work stealing function on ARM architecture which suffered
from nasty race condition. It turned out that the root cause of
the problem lies in the way ticket locks are implemented. Changing
compiler builtins into C++11 atomics solves the problem.

Two assertions were added into kmp_tasking.c which are useful
for detecting early symptoms of something wrong going on with
work stealing, which were among the possible outcomes of the
race condition.

Diff Detail

Repository
rL LLVM

Event Timeline

pawosm01 updated this revision to Diff 56030.May 3 2016, 11:13 AM
pawosm01 retitled this revision from to Use C++11 atomics for ticket locks implementation.
pawosm01 updated this object.
pawosm01 set the repository for this revision to rL LLVM.
pawosm01 added a subscriber: openmp-commits.
jcownie added a subscriber: jcownie.May 4 2016, 2:51 AM

I have only made a top-level scan, so this needs more review. However,
__kmp_baker_check should look like this

static kmp_uint32
__kmp_bakery_check(void *value, kmp_uint32 checker)
{
    return (std::atomic_load( (std::atomic<unsigned> *)value ) == checker);
}

Rather than having a test then returning TRUE or FALSE!

It would also be nice to reduce some of the code replication (e.g. kmp_wait_yield_4_ptr is identical to kmp_wait_yield_4 apart from the argument type [MIssing leading underscores since they cause underlining!] ). Either we should use a template (if we accept that this code is really C++ now), or even a macro for the whole function that is invoked with the function name and type is better than having the whole code duplicated.

I have only made a top-level scan, so this needs more review. However,
__kmp_baker_check should look like this

static kmp_uint32
__kmp_bakery_check(void *value, kmp_uint32 checker)
{
    return (std::atomic_load( (std::atomic<unsigned> *)value ) == checker);
}

Rather than having a test then returning TRUE or FALSE!

Yeah, I'd rather focused on removing this funny loop above return FALSE and didn't spot that loop removal caused whole 'if' became pointless. To be fixed.

It would also be nice to reduce some of the code replication (e.g. kmp_wait_yield_4_ptr is identical to kmp_wait_yield_4 apart from the argument type [MIssing leading underscores since they cause underlining!] ). Either we should use a template (if we accept that this code is really C++ now), or even a macro for the whole function that is invoked with the function name and type is better than having the whole code duplicated.

Note that all those functions are within 'extern "C" { .... }' block so templates aren't welcome there. The questioned slight code replication seems to be better solution for now than preparing ugly macro. This code yields for the further refactor but that is not the point of this patch which is to make tasking work on more architectures.

pawosm01 updated this revision to Diff 56506.May 7 2016, 11:37 AM

pointless 'if' removed

hbae edited edge metadata.May 13 2016, 1:18 PM
  • I don't think it is necessary to introduce a new member "self" since std::atomic should be working fine with pointer types. Is there any specific reason for separating the original variable " initialized" into "initialized" and "self"?
  • I don't know why KMP_FSYNC* calls were removed in the new code. It may not be meaningful in many cases, but it is not dead code as far as I know.
In D19878#429947, @hbae wrote:
  • I don't think it is necessary to introduce a new member "self" since std::atomic should be working fine with pointer types. Is there any specific reason for separating the original variable " initialized" into "initialized" and "self"?

I don't remember, looks redundant for me now.

  • I don't know why KMP_FSYNC* calls were removed in the new code. It may not be meaningful in many cases, but it is not dead code as far as I know.

Unfortunately, I don't see what you're pointing at. Can you quote some example?

hbae added a comment.May 13 2016, 2:33 PM
In D19878#429947, @hbae wrote:
  • I don't know why KMP_FSYNC* calls were removed in the new code. It may not be meaningful in many cases, but it is not dead code as far as I know.

Unfortunately, I don't see what you're pointing at. Can you quote some example?

I was just wondering if you are removing these calls on purpose:

KMP_FSYNC_ACQUIRED() - line 758, 762, 800
KMP_FSYNC_RELEASING() - line 833
pawosm01 added a comment.EditedMay 13 2016, 3:17 PM

Ok, now I see. If this code is not dead, is not live either. For example, KMP_FSYNC_RELEASING macro:

#if USE_ITT_BUILD
// Old stuff for reporting low-level internal synchronization.
#if USE_ITT_NOTIFY
#define KMP_FSYNC_RELEASING( obj )  __itt_fsync_releasing( (void *)( obj ) )
#else
# define KMP_FSYNC_RELEASING( obj )        ((void)0)
#endif
#else
# define KMP_FSYNC_RELEASING( obj )        ((void)0)
#endif

...where __itt_fsync_releasing() is declared:

/**
 * @ingroup legacy
 * @deprecated Legacy API
 * @brief Fast synchronization which does no require spinning.
 * - This special function is to be used by TBB and OpenMP libraries only when they know
 *   there is no spin but they need to suppress TC warnings about shared variable modifications.
 * - It only has corresponding pointers in static library and does not have corresponding function
 *   in dynamic library.
 * @see void __itt_sync_releasing(void* addr);
 */
void ITTAPI __itt_fsync_releasing(void* addr);

...but never defined!

IMHO the whole point of using C++11 atomics is to avoid necessity of using stuff like that. It is up to compiler (reinforced by memory order hints like std::memory_order_release) to generate proper binary code for targeted architecture, so no macros like KMP_FSYNC_RELEASING would be needed.

IMHO the whole point of using C++11 atomics is to avoid necessity of using stuff like that. It is up to compiler (reinforced by memory order hints like std::memory_order_release) to generate proper binary code for targeted architecture, so no macros like KMP_FSYNC_RELEASING would be needed.

Actually these macros were intended not to make proper code, but to send signals to tools. E.g. so that memory checker would not complain on accesses to shared memory inside critical sections. I consulted with the consumers of this code - Intel tools, - and the resolution is the *_fsync_* apis are not used by them nowadays. So the code looks more dead than live, and these stubs entries are never resolved to point to actual entries. As opposed to *_sync_* apis those can be dynamically changed from empty stubs to code with actual calls.

  • Andrey

I have a few unsolicited comments to offer:

(1) The atomic load in kmp_bakery_check should use memory_order_acquire rather than the more restrictive default memory_order_seq_cst. Since this condition is tested with inlined code in quite a few places (checking for == or != my_ticket), I recommend that this detail be implemented in an inlined function or macro that is used ubiquitously.
(2) The atomic loads in kmp_get_ticket_lock_owner and kmp_is_ticket_lock_nestable should use memory_order_relaxed since they aren't used to enforce ordering for synchronization. The owner is written with atomic_store, which could also be memory_order_relaxed. In kmp_release_ticket_lock_with_checks, owner_id is written as 0 without an atomic store. Either you need atomic store and load to manipulate this field or not. I dislike mixing the two. If you are concerned about atomicity, you should use an atomic_store with memory_order_relaxed here too.
(3) I recommend renaming the parameters of kmp_bakery_check to more clearly indicate their intent: current_value and target_value, or now_serving and my_ticket.
(4) I dislike the way gtid+1 is written into a ticket lock's owner_id throughout the file and then
kmp_get_ticket_lock_owner returns owner_id -1. I understand that the range of gtids and that unowned is indicated by 0. I recommend using macros like GTID_TO_LOCK_OWNER(gtid) and LOCK_OWNER_TO_GTID(owner_id) to replace the ugly -1 and +1 everywhere. Also, I think that kmp_get_ticket_lock_owner should return the owner_id and the subtraction should take place outside.
(5) The value for for non-nestable should be defined as something like LOCK_NOT_NESTABLE and not used with a raw -1.
(6) Why not correct the spelling of "checks" in the function prototype
kmp_test_ticket_lock_with_cheks?
(7) I dislike checking the lock owner against 0 to determine if it is unowned. Why not use a macro set to 0: LOCK_NO_OWNER?

I'm sure that there is more worth saying, but I think that I've said enough.

AndreyChurbanov accepted this revision.May 23 2016, 9:53 AM
AndreyChurbanov added a reviewer: AndreyChurbanov.

LGTM.

I think we will address the John's comments separately. As they sounds perfectly sane to me, but should not further delay current change.

  • Andrey
This revision is now accepted and ready to land.May 23 2016, 9:53 AM

Hi Andrey,

Although most of Johns suggestions are good subject of bigger refactor (e.g. GTID_TO_LOCK_OWNER affects other lock mechanisms too), I think his remark regarding lack of atomic store in __kmp_release_ticket_lock_with_checks() is relevant here. And also seems like I didn't update my patch with 'self' field removal. Now we got two options 1) land it as it is and prepare the next patch which should address 'self' field issue and John's issues number 1, 2, 3 and 6 (while 4, 5 and 7 are outside of the scope of this change) or 2) wait 2-3 days until I finish checking updated patch where 'self' is removed and missing atomic store is added.

Once you are going to work on it soon I'd prefer to wait for couple of days. Let's get better patch in.

I was concerned by many days age of the request with most comments been addressed; couple more days - not o problem, I think.

Thanks,
Andrey

AndreyChurbanov requested changes to this revision.May 23 2016, 11:11 AM
AndreyChurbanov edited edge metadata.

Cancel my LGTM basing on Pawel's comment.

This revision now requires changes to proceed.May 23 2016, 11:11 AM
pawosm01 added a comment.EditedMay 23 2016, 11:19 AM

Nah, I don't like this idea of making pointer to 'self' in 'initialized' field atomic - it leads to nasty casting of pointers (e.g. you can't do atomic store of NULL or nullptr to atomic pointer of certain type without ugly casting). I don't like the idea of removing self-pointer and leaving only boolean 'initialized' field either - checking whether self equals lck is a whole point of those sanity check functions. I think it should be left as it is and only missing atomic store in the __kmp_release_ticket_lock_with_checks() requires immediate attention.

I'll put updated patch in a few days.

pawosm01 updated this revision to Diff 58221.May 24 2016, 5:32 AM
pawosm01 edited edge metadata.

I'm placing updated patch here. Overnight executions of OpenUH taskbench tests didn't fail, so I assume relaxed memory ordering (as suggested by Jim) is ok. Missing atomic store also added, __kmp_bakery_check() params renamed.

(as suggested by Jim)

I believe the suggestion was from John, and I wouldn’t want to steal his thunder!

  • Jim

James Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438

I have a few more comments about the memory orderings. Not having too much order is important for performance. Fences that are inserted unless they are explicitly omitted hurt performance.

After you see my list below, you'll be wondering whether you should have ventured down this path :-)

Line 741 the fetch and add can be relaxed order. the acquire order is important on the load that tests vs. myticket
Line 744 this load isn't for synchronization, it can be relaxed order
Line 749 this atomic load should have acquire order, but does not need to be sequentially consistent
Line 767 this load isn't for synchronization, it can be relaxed order
Line 782 this store isn't for synchronization, the store can be relaxed order
Line 789 this load isn't for synchronization, it can be relaxed order
Line 791 this load can be relaxed because the CAS on line 793 enfoceces the acquire order
Line 806 this load isn't for synchronization, it can be relaxed order
Line 782 this store isn't for synchronization, the store can be relaxed order
Line 827 neither of these loads are for synchronization, so they can be relaxed
Line 840 this load isn't for synchronization, it can be relaxed order
Lines 865-869 all of these stores can be relaxed
Line 870 if this store is release order, all of the prior stores in this routine will complete before the initialized flag is set
Line 882 could be release order to ensure all prior operations on the lock are complete.
Line 883-888 these stores can be relaxed. I could argue that once you clear initialized, you could not bother writing the rest of the values.
Line 986 this load can be relaxed because it isn't for synchronization
Line 922 can be relaxed. all ordering will be enforced by synchronizing accesses for the outer lock
Line 927-928 can be relaxed. these aren't synchronizing.
Line 938 can be relaxed. it isn't synchronizing.
Line 958 can be relaxed. all ordering will be enforced by synchronizing accesses for the outer lock
Lines 964-965 can be relaxed.
Line 977 can be relaxed. it isn't synchronizing.
Lines 994-995 aren't synchronizing and can be relaxed.
Line 1007 isn't synchronizing and can be relaxed.
Line 1029 can be relaxed (or even omitted)
Line 1042 can be relaxed (or even omitted)
Line 1050 can be relaxed
Line 1073 can be relaxed

Hi John,

Thanks for all of your recommendations, I applied all of them and started taskbench in infinite loop on it for the night run. If it still works tomorrow, I'll place updated patch here.

I hope that you are testing it on a system with weak ordering rather than an Intel chip ;-)

John Mellor-Crummey

(sent from my phone)

pawosm01 updated this revision to Diff 58399.May 25 2016, 3:50 AM
pawosm01 edited edge metadata.

Here it is, updated version. Overnight tests on two different AArch64 machines using two different compilers (clang, gcc 5.3) passed.

AndreyChurbanov accepted this revision.May 31 2016, 11:58 AM
AndreyChurbanov edited edge metadata.

LGTM

This revision is now accepted and ready to land.May 31 2016, 11:58 AM
This revision was automatically updated to reflect the committed changes.