This is an archive of the discontinued LLVM Phabricator instance.

If a CAS failed spuriously, then retrying immediately is good. If it failed because another core wrote to the cache line, then we have established that said cache line is somewhat contended, in which case pause() may let the other threads progress faster.

This might be difficult to microbenchmark.

In D97079#2575663, @JonChesterfield wrote:

I think this is architecture specific.

If a CAS failed spuriously, then retrying immediately is good. If it failed because another core wrote to the cache line, then we have established that said cache line is somewhat contended, in which case pause() may let the other threads progress faster.

If one thread succeeds, then ALL other threads competing on the cache line fail, so they will not progress any faster by executing a pause instruction. The only gain would be for the succeeding thread, which would require it to execute nothing but atomics in a loop, and that is not a realistic code example...

Actually, these atomic functions are never called by clang, so this is a kind of cleanup change in any case.

Experiments calling an internal atomic function directly in a loop show either no difference (on processors where the pause instruction is fast enough) or a significant speedup (on processors where the pause instruction is slow).

One clear gain of this patch is a smaller runtime: I see the binary shrink by 4096 bytes with the patch applied.

If you still think the patch is not good, I am fine to abandon it. But to me it is pretty harmless, and as I mentioned, it reduces the size of the runtime.
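For readers who want to repeat that kind of experiment, here is a minimal sketch of a harness, not the one used for the measurements above. It uses std::atomic rather than the libomp entry points, and the names cpu_pause, cas_add, run_trial and TrialResult are hypothetical. It times a contended CAS loop with and without a pause on failure; results will depend heavily on the processor and thread count.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
static inline void cpu_pause() { _mm_pause(); }
#else
// Assumption: on other targets this sketch simply does nothing.
static inline void cpu_pause() {}
#endif

// CAS-loop add. UsePause selects whether a failed CAS pauses before retrying.
template <bool UsePause>
void cas_add(std::atomic<long> &target, long rhs) {
  long old_value = target.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak reloads old_value with the current value.
  while (!target.compare_exchange_weak(old_value, old_value + rhs,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
    if (UsePause)
      cpu_pause();
  }
}

struct TrialResult {
  double seconds;   // wall-clock time for all threads to finish
  long final_value; // should equal num_threads * iters
};

// Runs num_threads threads that each perform iters contended CAS-loop adds.
template <bool UsePause>
TrialResult run_trial(int num_threads, long iters) {
  std::atomic<long> counter{0};
  std::vector<std::thread> workers;
  auto start = std::chrono::steady_clock::now();
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counter, iters] {
      for (long i = 0; i < iters; ++i)
        cas_add<UsePause>(counter, 1);
    });
  }
  for (auto &w : workers)
    w.join();
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return TrialResult{elapsed.count(), counter.load()};
}
```

Comparing run_trial<true>(4, 1000000).seconds against run_trial<false>(4, 1000000).seconds gives a rough number. As noted earlier in the thread, a loop that does nothing but atomics is not representative of real code, so treat any difference as an upper bound.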

In D97079#2594807, @AndreyChurbanov wrote:

In D97079#2575663, @JonChesterfield wrote:

I think this is architecture specific.

If a CAS failed spuriously, then retrying immediately is good. If it failed because another core wrote to the cache line, then we have established that said cache line is somewhat contended, in which case pause() may let the other threads progress faster.

If one thread succeeds, then ALL other threads competing on the cache line fail, so they will not progress any faster by executing a pause instruction. The only gain would be for the succeeding thread, which would require it to execute nothing but atomics in a loop, and that is not a realistic code example...

Actually, these atomic functions are never called by clang, so this is a kind of cleanup change in any case.

Experiments calling an internal atomic function directly in a loop show either no difference (on processors where the pause instruction is fast enough) or a significant speedup (on processors where the pause instruction is slow).

One clear gain of this patch is a smaller runtime: I see the binary shrink by 4096 bytes with the patch applied.

If you still think the patch is not good, I am fine to abandon it. But to me it is pretty harmless, and as I mentioned, it reduces the size of the runtime.

I think we should keep this patch. These atomic functions are just trying to mimic what a hardware instruction would accomplish. In line with Andrey's reasoning, if a CAS fails here, there is nothing to wait for, since the other thread that caused the failure is already done with its atomic; i.e., the critical section is the atomic itself. Hence, the "spin wait" is already over and another retry should take place immediately.
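To make the "mimic a hardware instruction" point concrete, here is a rough sketch of an atomic float multiply built from a 32-bit CAS, retrying immediately on failure. It is an illustration in std::atomic terms, not the libomp code; to_bits, from_bits and atomic_mul_float are hypothetical names, and the float is assumed to be 32 bits wide.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer and back.
// memcpy is used so the conversion does not violate aliasing rules.
static inline std::uint32_t to_bits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}
static inline float from_bits(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Hypothetical helper: atomically multiplies the float stored (as raw bits)
// in `storage` by rhs and returns the new value.
inline float atomic_mul_float(std::atomic<std::uint32_t> &storage, float rhs) {
  std::uint32_t old_bits = storage.load(std::memory_order_relaxed);
  std::uint32_t new_bits = to_bits(from_bits(old_bits) * rhs);
  while (!storage.compare_exchange_strong(old_bits, new_bits,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
    // The CAS failed, so old_bits now holds the value another thread just
    // stored. That thread is already finished, so there is nothing to wait
    // for: recompute from the fresh value and retry right away, no pause.
    new_bits = to_bits(from_bits(old_bits) * rhs);
  }
  return from_bits(new_bits);
}
```

The only work between a failed CAS and the next attempt is recomputing the new value, which is what the loop bodies look like once the pause is removed.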

Someone measuring the performance on a few architectures soundly beats my speculation. No objection here, thanks!

LGTM

This revision is now accepted and ready to land. Mar 8 2021, 11:54 AM

Closed by commit rGaaf16b80dd4c: [OpenMP] libomp: eliminate pause from atomic CAS loops (authored by AndreyChurbanov). Mar 9 2021, 7:30 AM

This revision was automatically updated to reflect the committed changes.

AndreyChurbanov added a commit: rGaaf16b80dd4c: [OpenMP] libomp: eliminate pause from atomic CAS loops.

Revision Contents

Path: openmp/runtime/src/kmp_atomic.cpp
Size: 12 lines

Diff 329344

openmp/runtime/src/kmp_atomic.cpp

[773 lines hidden]
 #else
 #define OP_GOMP_CRITICAL(OP, FLAG)
 #define OP_UPDATE_GOMP_CRITICAL(TYPE, OP, FLAG)
 #endif /* KMP_GOMP_COMPAT */

 #if KMP_MIC
 #define KMP_DO_PAUSE _mm_delay_32(1)
 #else
-#define KMP_DO_PAUSE KMP_CPU_PAUSE()
+#define KMP_DO_PAUSE
 #endif /* KMP_MIC */

 // ------------------------------------------------------------------------
 // Operation on *lhs, rhs using "compare_and_store" routine
 // TYPE - operands' type
 // BITS - size in bits, used to distinguish low level calls
 // OP - operator
 #define OP_CMPXCHG(TYPE, BITS, OP) \
[336 lines hidden] context: #define MIN_MAX_CMPXCHG(TYPE, BITS, OP) \
     TYPE old_value; \
     temp_val = *lhs; \
     old_value = temp_val; \
     while (old_value OP rhs && /* still need actions? */ \
            !KMP_COMPARE_AND_STORE_ACQ##BITS( \
                (kmp_int##BITS *)lhs, \
                *VOLATILE_CAST(kmp_int##BITS *) & old_value, \
                *VOLATILE_CAST(kmp_int##BITS *) & rhs)) { \
-      KMP_CPU_PAUSE(); \
       temp_val = *lhs; \
       old_value = temp_val; \
     } \
   }

 // -------------------------------------------------------------------------
 // 1-byte, 2-byte operands - use critical section
 #define MIN_MAX_CRITICAL(TYPE_ID, OP_ID, TYPE, OP, LCK_ID, GOMP_FLAG) \
[938 lines hidden] context: #define OP_CMPXCHG_WR(TYPE, BITS, OP) \
     TYPE KMP_ATOMIC_VOLATILE temp_val; \
     TYPE old_value, new_value; \
     temp_val = *lhs; \
     old_value = temp_val; \
     new_value = rhs; \
     while (!KMP_COMPARE_AND_STORE_ACQ##BITS( \
         (kmp_int##BITS *)lhs, *VOLATILE_CAST(kmp_int##BITS *) & old_value, \
         *VOLATILE_CAST(kmp_int##BITS *) & new_value)) { \
-      KMP_CPU_PAUSE(); \
-      \
       temp_val = *lhs; \
       old_value = temp_val; \
       new_value = rhs; \
     } \
   }

 // -------------------------------------------------------------------------
 #define ATOMIC_CMPXCHG_WR(TYPE_ID, OP_ID, TYPE, BITS, OP, GOMP_FLAG) \
[132 lines hidden] context: #define OP_CMPXCHG_CPT(TYPE, BITS, OP) \
     TYPE KMP_ATOMIC_VOLATILE temp_val; \
     TYPE old_value, new_value; \
     temp_val = *lhs; \
     old_value = temp_val; \
     new_value = (TYPE)(old_value OP rhs); \
     while (!KMP_COMPARE_AND_STORE_ACQ##BITS( \
         (kmp_int##BITS *)lhs, *VOLATILE_CAST(kmp_int##BITS *) & old_value, \
         *VOLATILE_CAST(kmp_int##BITS *) & new_value)) { \
-      KMP_CPU_PAUSE(); \
-      \
       temp_val = *lhs; \
       old_value = temp_val; \
       new_value = (TYPE)(old_value OP rhs); \
     } \
     if (flag) { \
       return new_value; \
     } else \
       return old_value; \
[378 lines hidden] context: #define MIN_MAX_CMPXCHG_CPT(TYPE, BITS, OP) \
     /*TYPE old_value; */ \
     temp_val = *lhs; \
     old_value = temp_val; \
     while (old_value OP rhs && /* still need actions? */ \
            !KMP_COMPARE_AND_STORE_ACQ##BITS( \
                (kmp_int##BITS *)lhs, \
                *VOLATILE_CAST(kmp_int##BITS *) & old_value, \
                *VOLATILE_CAST(kmp_int##BITS *) & rhs)) { \
-      KMP_CPU_PAUSE(); \
       temp_val = *lhs; \
       old_value = temp_val; \
     } \
     if (flag) \
       return rhs; \
     else \
       return old_value; \
   }
[280 lines hidden] context: #define OP_CMPXCHG_CPT_REV(TYPE, BITS, OP) \
     TYPE KMP_ATOMIC_VOLATILE temp_val; \
     TYPE old_value, new_value; \
     temp_val = *lhs; \
     old_value = temp_val; \
     new_value = (TYPE)(rhs OP old_value); \
     while (!KMP_COMPARE_AND_STORE_ACQ##BITS( \
         (kmp_int##BITS *)lhs, *VOLATILE_CAST(kmp_int##BITS *) & old_value, \
         *VOLATILE_CAST(kmp_int##BITS *) & new_value)) { \
-      KMP_CPU_PAUSE(); \
-      \
       temp_val = *lhs; \
       old_value = temp_val; \
       new_value = (TYPE)(rhs OP old_value); \
     } \
     if (flag) { \
       return new_value; \
     } else \
       return old_value; \
[306 lines hidden] context: #define CMPXCHG_SWP(TYPE, BITS) \
     TYPE KMP_ATOMIC_VOLATILE temp_val; \
     TYPE old_value, new_value; \
     temp_val = *lhs; \
     old_value = temp_val; \
     new_value = rhs; \
     while (!KMP_COMPARE_AND_STORE_ACQ##BITS( \
         (kmp_int##BITS *)lhs, *VOLATILE_CAST(kmp_int##BITS *) & old_value, \
         *VOLATILE_CAST(kmp_int##BITS *) & new_value)) { \
-      KMP_CPU_PAUSE(); \
-      \
       temp_val = *lhs; \
       old_value = temp_val; \
       new_value = rhs; \
     } \
     return old_value; \
   }

 // -------------------------------------------------------------------------
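As a reading aid for the MIN_MAX_CMPXCHG hunks, the patched loop shape can be sketched as an ordinary function. This is an approximation in std::atomic terms, not an expansion of the macro, and atomic_min is a hypothetical name. It shows the "still need actions?" test being re-evaluated after every failed CAS, with nothing else between attempts.

```cpp
#include <atomic>

// Hypothetical function form of the min case: store rhs only while the
// current value is still greater than rhs.
// Returns the last value observed before the update (or the value that
// made the update unnecessary).
template <typename T>
T atomic_min(std::atomic<T> &lhs, T rhs) {
  T old_value = lhs.load(std::memory_order_relaxed);
  while (old_value > rhs && /* still need actions? */
         !lhs.compare_exchange_weak(old_value, rhs,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
    // Nothing to do here: a failed compare_exchange_weak has already
    // reloaded old_value, so the loop condition re-checks it immediately.
  }
  return old_value;
}
```

If another thread stores a value that is already less than or equal to rhs, the first half of the condition ends the loop without a further CAS, so a pause before that check would only delay the exit.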