This is an archive of the discontinued LLVM Phabricator instance.

These HLE variants of the usual atomic exchange intrinsics add the xacquire and xrelease prefixes. They use the same encoding as the repe and repne prefixes, which previously didn't do anything on Intel (and presumably) AMD hardware.

They are a hint to the processor that this is a short critical section, and it is likely that the entire critical section can be entered, run, and committed to memory before another thread needs to use the memory used by the critical section. Your implementation doesn't add these prefixes, but the code will execute correctly. This functionality relies on TSX, which it looks like Intel added some time in 2014, so it's probably widely available.

Personally, I don't want to implement these intrinsics this way. We've already implemented intrinsics "our way" instead of doing whatever Visual C does, and it just leads to developer confusion when they discover that the compiler didn't emit the instruction they want. For example, the rotate intrinsics often don't work (https://llvm.org/pr37387) and the bittestandset (bts) intrinsics are just broken (http://llvm.org/pr33188).

@chandlerc @craig.topper are there any plans for representing HLE hints on atomic instructions in LLVM IR? Alternatively, do you know who would be most interested in adding them? This seems like a reasonable place to use metadata or SubclassOptionalData on regular atomicrmw instructions, since it's a hint that could be dropped. As far as mid-level passes are concerned, these are vanilla acquire and release instructions.

In D47672#1121181, @rnk wrote:

They are a hint to the processor that this is a short critical section, and it is likely that the entire critical section can be entered, run, and committed to memory before another thread needs to use the memory used by the critical section. Your implementation doesn't add these prefixes, but the code will execute correctly. This functionality relies on TSX, which it looks like Intel added some time in 2014, so it's probably widely available.

Personally, I don't want to implement these intrinsics this way. We've already implemented intrinsics "our way" instead of doing whatever Visual C does, and it just leads to developer confusion when they discover that the compiler didn't emit the instruction they want. For example, the rotate intrinsics often don't work (https://llvm.org/pr37387) and the bittestandset (bts) intrinsics are just broken (http://llvm.org/pr33188).

I see. AIUI, the HLE versions just add optional hints, so there should not be any functional differences beyond using the hints or not (theses are optimization hints, yes?). Perhaps I misunderstood the documentation.

are there any plans for representing HLE hints on atomic instructions in LLVM IR?

This would be quite nice to have, the main reason I implemented these functions this way is I don't have the bandwidth at the moment to correctly implement HLE in LLVM/Clang, nor some of the requisite knowledge I'm sure. I also am keen to have Python build with clang-cl on Windows :) But I definitely can understand wanting to do things correctly from the start.

I am going to the LLVM Bay Area monthly social Thursday so if someone wants to discuss this there I'd be more than happy to.

In D47672#1122013, @ethanhs wrote:

In D47672#1121181, @rnk wrote:

They are a hint to the processor that this is a short critical section, and it is likely that the entire critical section can be entered, run, and committed to memory before another thread needs to use the memory used by the critical section. Your implementation doesn't add these prefixes, but the code will execute correctly. This functionality relies on TSX, which it looks like Intel added some time in 2014, so it's probably widely available.

Personally, I don't want to implement these intrinsics this way. We've already implemented intrinsics "our way" instead of doing whatever Visual C does, and it just leads to developer confusion when they discover that the compiler didn't emit the instruction they want. For example, the rotate intrinsics often don't work (https://llvm.org/pr37387) and the bittestandset (bts) intrinsics are just broken (http://llvm.org/pr33188).

I see. AIUI, the HLE versions just add optional hints, so there should not be any functional differences beyond using the hints or not (theses are optimization hints, yes?). Perhaps I misunderstood the documentation.

Yes, as far as I can tell.

are there any plans for representing HLE hints on atomic instructions in LLVM IR?

This would be quite nice to have, the main reason I implemented these functions this way is I don't have the bandwidth at the moment to correctly implement HLE in LLVM/Clang, nor some of the requisite knowledge I'm sure. I also am keen to have Python build with clang-cl on Windows :) But I definitely can understand wanting to do things correctly from the start.

Yeah, I was kind of hoping an HLE expert would appear and suggest a straightforward implementation path.

I am going to the LLVM Bay Area monthly social Thursday so if someone wants to discuss this there I'd be more than happy to.

I should also be able to make it this week, hope to see you there!

We (Intel) have discussed this a little internally. I'll be responding more shortly.

lib/Headers/immintrin.h
386	Shouldn't these still be in intrin.h? immintrin.h is for intrinsics defined by Intel which these arent'.

In D47672#1123953, @craig.topper wrote:

We (Intel) have discussed this a little internally. I'll be responding more shortly.

Great!

FWIW, re intrin.h vs immintrin.h, the documentation for these put them in immintrin.h.

https://docs.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions#requirements
https://docs.microsoft.com/en-us/cpp/intrinsics/interlockedexchange-intrinsic-functions#requirements

Fair enough, Then I think we should have a #ifdef _MSC_VER around them so they are only available when pretending to be MSVC. I believe intrin.h does that check very early in the file.

craig.topper added inline comments.Jun 6 2018, 12:56 PM

lib/Headers/immintrin.h
387	what is __DEFAULT_FN_ATTRS defined to here? Its not defined in this file and should have been undeffed before leaving any other file. If its still defined here, it's a bug.

Guard to be used only under MSVC, define default FN attrs

Harbormaster completed remote builds in B19005: Diff 150189.Jun 6 2018, 1:03 PM

ethanhs added inline comments.Jun 6 2018, 1:16 PM

lib/Headers/immintrin.h
387	When I added the _MSC_VER guard I realized this too. Not sure how this still compiled, as when I grepped [#\|#un]def.*__DEFAULT_FN_ATTRS, every define had a matching undef.

It looks like gcc implements additional bits that can be passed to _atomic_exchange and friends, ATOMIC_HLE_ACQUIRE(1 << 16) and ATOMIC_HLE_RELEASE(1 << 17). Basically they're using bits above bit 16 in the order/memory_model as target specific flags. These constants are only defined when targeting X86 and they are validated to ensure they are only paired with the appropriate ATOMIC_ACQUIRE or ATOMIC_RELEASE or a stronger memory model.

As Reid said, its technically safe to drop the hints sometimes so we could use SubClassOptiionalData or metadata. But losing them could have performance implications. If you lose an XACQUIRE, the lock won't be elided as the user expected. And if you keep an XACQUIRE, but lose an XRELEASE the processor will keep trying to speculate farther than it should until it eventually hits some random abort trigger and has to rollback to really acquiring the lock. Both of these would be surprising to the user so we should make an effort not to lose the information as much as possible.

Here's a start at an implementation proposal with some embedded questions.
-Add the X86 ATOMIC_HLE_ACQUIRE/ATOMIC_HLE_RELEASE matching the gcc encoding value.
-Write these intrinsics to pass these flags.
-Teach CGAtomic.cpp to lower those hints to whatever IR representation we choose. If we choose SubclassOptionalData, we'll also need to add bitcode, LL parsing, and printing support. Not sure what we would need for metadata.
-Add an HLE_ACQUIRE and HLE_RELEASE prefixed version of every instruction that can be prefixed to the X86Instr*.td files with appropriate isel patterns. This matches what we do for LOCK already. This is probably somewhere between 130-150 instructions after tblgen expansion for operand sizes, immediate vs register, etc. Ideally we'd devise some way to tag MachineInstr* with a lock, hle acquire, and hle release so that we didn't need separate instruction opcodes for each permutation. But this would just make things scale better is not required for functionality.
-Need a way to represent this in SelectionDAG so X86 specific code can create the right target specific nodes. Do we have a metadata infrastructure there? Or should we store it with the ordering MachineMemOperand? Or in SDNodeFlags?

Obviously a lot of that will take some time. I wonder if it makes sense to add the ATOMIC_HLE_ACQUIRE/ATOMIC_HLE_RELEASE constants, but ignore them in CGAtomics.cpp for now? We could then implement these intrinsics with the code we ultimately want to see there, but not implement the hints yet. Thoughts?

hans added a subscriber: hans.Jun 11 2018, 6:58 AM

It sounds like adding proper support for HLE prefixes is a largeish project.

ctopper, rnk: Do you think it would be worth adding inline asm versions (with the xacquire/release prefixes) of these intrinsics in the meantime? It would inhibit optimizations but be better than the current state of not having the intrinsics at all.

In D47672#1128308, @hans wrote:

It sounds like adding proper support for HLE prefixes is a largeish project.

ctopper, rnk: Do you think it would be worth adding inline asm versions (with the xacquire/release prefixes) of these intrinsics in the meantime? It would inhibit optimizations but be better than the current state of not having the intrinsics at all.

Yeah, let's do that. I'm a lot more comfortable ignoring bugs about missed optimizations with fancy intrinsics than bugs that say the intrinsic doesn't do what it says it does.

In D47672#1128863, @rnk wrote:

In D47672#1128308, @hans wrote:

It sounds like adding proper support for HLE prefixes is a largeish project.

ctopper, rnk: Do you think it would be worth adding inline asm versions (with the xacquire/release prefixes) of these intrinsics in the meantime? It would inhibit optimizations but be better than the current state of not having the intrinsics at all.

Yeah, let's do that. I'm a lot more comfortable ignoring bugs about missed optimizations with fancy intrinsics than bugs that say the intrinsic doesn't do what it says it does.

I'm afraid I've never actually written inline asm in C, but if no one else wants to take this, I'm willing to try to figure it out. I'd like to be able to use these in my CPython port :)

I'll give the inline assembly a shot.

Inline assembly implementations.

Nice! Looks good to me.

This revision is now accepted and ready to land.Jun 13 2018, 1:27 AM

Works great!

FWIW, I found a cfe-dev thread about adding HLE support http://lists.llvm.org/pipermail/cfe-dev/2013-February/028031.html And a bunch of dead patches in phabricator https://reviews.llvm.org/people/revisions/110/

I also spoke to Andi Kleen here at Intel to make sure I got these inline assembly versions correct. And he's not sure CPython should be using these the way it is. It looks like they try to use the HLE versions anytime the memory order is acquire/release. But HLE isn't suitable for every acquire/release.

Closed by commit rL334751: [X86] Add inline assembly versions of _InterlockedExchange_HLEAcquire/Release… (authored by ctopper). · Explain WhyJun 14 2018, 11:48 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. · View Herald TranscriptJun 14 2018, 11:48 AM

Revision Contents

Path

Size

lib/

Headers/

immintrin.h

72 lines

intrin.h

10 lines

Diff 150189

lib/Headers/immintrin.h

	Show First 20 Lines • Show All 374 Lines • ▼ Show 20 Lines
	#if !defined(_MSC_VER) \|\| __has_feature(modules) \|\| defined(__PTWRITE__)			#if !defined(_MSC_VER) \|\| __has_feature(modules) \|\| defined(__PTWRITE__)
	#include <ptwriteintrin.h>			#include <ptwriteintrin.h>
	#endif			#endif

	#if !defined(_MSC_VER) \|\| __has_feature(modules) \|\| defined(__INVPCID__)			#if !defined(_MSC_VER) \|\| __has_feature(modules) \|\| defined(__INVPCID__)
	#include <invpcidintrin.h>			#include <invpcidintrin.h>
	#endif			#endif

				#ifdef _MSC_VER
				/* Define the default attributes for these intrinsics */
				#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__))
				#ifdef __cplusplus
				craig.topperAuthorUnsubmitted Done Reply Inline Actions Shouldn't these still be in intrin.h? immintrin.h is for intrinsics defined by Intel which these arent'. craig.topper: Shouldn't these still be in intrin.h? immintrin.h is for intrinsics defined by Intel which…
				extern "C" {
				craig.topperAuthorUnsubmitted Not Done Reply Inline Actions what is __DEFAULT_FN_ATTRS defined to here? Its not defined in this file and should have been undeffed before leaving any other file. If its still defined here, it's a bug. craig.topper: what is __DEFAULT_FN_ATTRS defined to here? Its not defined in this file and should have been…
				ethanhsUnsubmitted Not Done Reply Inline Actions When I added the _MSC_VER guard I realized this too. Not sure how this still compiled, as when I grepped [#\|#un]def.__DEFAULT_FN_ATTRS, every define had a matching undef. ethanhs:* When I added the _MSC_VER guard I realized this too. Not sure how this still compiled, as when…
				#endif
				/----------------------------------------------------------------------------\
				\|* Interlocked Exchange HLE
				\----------------------------------------------------------------------------/
				#if defined(__i386__) \|\| defined(__x86_64__)
				static __inline__ long __DEFAULT_FN_ATTRS
				_InterlockedExchange_HLEAcquire(long volatile *_Target, long _Value) {
				__atomic_exchange(_Target, &_Value, &_Value, __ATOMIC_ACQUIRE);
				return _Value;
				}
				static __inline__ long __DEFAULT_FN_ATTRS
				_InterlockedExchange_HLERelease(long volatile *_Target, long _Value) {
				__atomic_exchange(_Target, &_Value, &_Value, __ATOMIC_RELEASE);
				return _Value;
				}
				static __inline__ __int64 __DEFAULT_FN_ATTRS
				_InterlockedExchange64_HLEAcquire(__int64 volatile *_Target, __int64 _Value) {
				__atomic_exchange(_Target, &_Value, &_Value, __ATOMIC_ACQUIRE);
				return _Value;
				}
				static __inline__ __int64 __DEFAULT_FN_ATTRS
				_InterlockedExchange64_HLERelease(__int64 volatile *_Target, __int64 _Value) {
				__atomic_exchange(_Target, &_Value, &_Value, __ATOMIC_RELEASE);
				return _Value;
				}
				#endif
				/----------------------------------------------------------------------------\
				\|* Interlocked Compare Exchange HLE
				\----------------------------------------------------------------------------/
				#if defined(__i386__) \|\| defined(__x86_64__)
				static __inline__ long __DEFAULT_FN_ATTRS
				_InterlockedCompareExchange_HLEAcquire(long volatile *_Destination,
				long _Exchange, long _Comparand) {
				__atomic_compare_exchange(_Destination, &_Comparand, &_Exchange, 0,
				__ATOMIC_SEQ_CST, __ATOMIC_ACQUIRE);
				return _Comparand;
				}
				static __inline__ long __DEFAULT_FN_ATTRS
				_InterlockedCompareExchange_HLERelease(long volatile *_Destination,
				long _Exchange, long _Comparand) {
				__atomic_compare_exchange(_Destination, &_Comparand, &_Exchange, 0,
				__ATOMIC_SEQ_CST, __ATOMIC_RELEASE);
				return _Comparand;
				}
				static __inline__ __int64 __DEFAULT_FN_ATTRS
				_InterlockedCompareExchange64_HLEAcquire(__int64 volatile *_Destination,
				__int64 _Exchange, __int64 _Comparand) {
				__atomic_compare_exchange(_Destination, &_Comparand, &_Exchange, 0,
				__ATOMIC_SEQ_CST, __ATOMIC_ACQUIRE);
				return _Comparand;
				}
				static __inline__ __int64 __DEFAULT_FN_ATTRS
				_InterlockedCompareExchange64_HLERelease(__int64 volatile *_Destination,
				__int64 _Exchange, __int64 _Comparand) {
				__atomic_compare_exchange(_Destination, &_Comparand, &_Exchange, 0,
				__ATOMIC_SEQ_CST, __ATOMIC_RELEASE);
				return _Comparand;
				}
				#endif
				#ifdef __cplusplus
				}
				#endif

				#undef __DEFAULT_FN_ATTRS

				#endif /* _MSC_VER */

	#endif /* __IMMINTRIN_H */			#endif /* __IMMINTRIN_H */

lib/Headers/intrin.h

Show First 20 Lines • Show All 168 Lines • ▼ Show 20 Lines
unsigned char _bittestandreset(long *, long);		unsigned char _bittestandreset(long *, long);
static __inline__		static __inline__
unsigned char _bittestandset(long *, long);		unsigned char _bittestandset(long *, long);
void __cdecl _disable(void);		void __cdecl _disable(void);
void __cdecl _enable(void);		void __cdecl _enable(void);
long _InterlockedAddLargeStatistic(__int64 volatile *_Addend, long _Value);		long _InterlockedAddLargeStatistic(__int64 volatile *_Addend, long _Value);
unsigned char _interlockedbittestandreset(long volatile *, long);		unsigned char _interlockedbittestandreset(long volatile *, long);
unsigned char _interlockedbittestandset(long volatile *, long);		unsigned char _interlockedbittestandset(long volatile *, long);
long _InterlockedCompareExchange_HLEAcquire(long volatile *, long, long);
long _InterlockedCompareExchange_HLERelease(long volatile *, long, long);
__int64 _InterlockedcompareExchange64_HLEAcquire(__int64 volatile *, __int64,
__int64);
__int64 _InterlockedCompareExchange64_HLERelease(__int64 volatile *, __int64,
__int64);
void _InterlockedCompareExchangePointer_HLEAcquire(void volatile , void ,		void _InterlockedCompareExchangePointer_HLEAcquire(void volatile , void ,
void *);		void *);
void _InterlockedCompareExchangePointer_HLERelease(void volatile , void ,		void _InterlockedCompareExchangePointer_HLERelease(void volatile , void ,
void *);		void *);
long _InterlockedExchangeAdd_HLEAcquire(long volatile *, long);		long _InterlockedExchangeAdd_HLEAcquire(long volatile *, long);
long _InterlockedExchangeAdd_HLERelease(long volatile *, long);		long _InterlockedExchangeAdd_HLERelease(long volatile *, long);
__int64 _InterlockedExchangeAdd64_HLEAcquire(__int64 volatile *, __int64);		__int64 _InterlockedExchangeAdd64_HLEAcquire(__int64 volatile *, __int64);
__int64 _InterlockedExchangeAdd64_HLERelease(__int64 volatile *, __int64);		__int64 _InterlockedExchangeAdd64_HLERelease(__int64 volatile *, __int64);
▲ Show 20 Lines • Show All 91 Lines • ▼ Show 20 Lines	unsigned char _InterlockedCompareExchange128(__int64 volatile *_Destination,
__int64 _ExchangeLow,		__int64 _ExchangeLow,
__int64 *_CompareandResult);		__int64 *_CompareandResult);
unsigned char _InterlockedCompareExchange128_np(__int64 volatile *_Destination,		unsigned char _InterlockedCompareExchange128_np(__int64 volatile *_Destination,
__int64 _ExchangeHigh,		__int64 _ExchangeHigh,
__int64 _ExchangeLow,		__int64 _ExchangeLow,
__int64 *_ComparandResult);		__int64 *_ComparandResult);
short _InterlockedCompareExchange16_np(short volatile *_Destination,		short _InterlockedCompareExchange16_np(short volatile *_Destination,
short _Exchange, short _Comparand);		short _Exchange, short _Comparand);
__int64 _InterlockedCompareExchange64_HLEAcquire(__int64 volatile *, __int64,
__int64);
__int64 _InterlockedCompareExchange64_HLERelease(__int64 volatile *, __int64,
__int64);
__int64 _InterlockedCompareExchange64_np(__int64 volatile *_Destination,		__int64 _InterlockedCompareExchange64_np(__int64 volatile *_Destination,
__int64 _Exchange, __int64 _Comparand);		__int64 _Exchange, __int64 _Comparand);
void _InterlockedCompareExchangePointer_np(void volatile *_Destination,		void _InterlockedCompareExchangePointer_np(void volatile *_Destination,
void _Exchange, void _Comparand);		void _Exchange, void _Comparand);
long _InterlockedOr_np(long volatile *_Value, long _Mask);		long _InterlockedOr_np(long volatile *_Value, long _Mask);
short _InterlockedOr16_np(short volatile *_Value, short _Mask);		short _InterlockedOr16_np(short volatile *_Value, short _Mask);
__int64 _InterlockedOr64_np(__int64 volatile *_Value, __int64 _Mask);		__int64 _InterlockedOr64_np(__int64 volatile *_Value, __int64 _Mask);
char _InterlockedOr8_np(char volatile *_Value, char _Mask);		char _InterlockedOr8_np(char volatile *_Value, char _Mask);
▲ Show 20 Lines • Show All 668 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[Headers] Add _Interlocked*_HLEAcquire/_HLEReleaseClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 150189

lib/Headers/immintrin.h

lib/Headers/intrin.h

[Headers] Add _Interlocked*_HLEAcquire/_HLERelease
ClosedPublic