This is an archive of the discontinued LLVM Phabricator instance.

[Headers] Add _Interlocked*_HLEAcquire/_HLERelease
ClosedPublic

Authored by craig.topper on Jun 1 2018, 10:56 PM.

Diff Detail

Event Timeline

ethanhs created this revision.Jun 1 2018, 10:56 PM
ethanhs edited the summary of this revision. (Show Details)Jun 1 2018, 10:59 PM
ethanhs edited the summary of this revision. (Show Details)Jun 3 2018, 9:35 PM
ethanhs added a reviewer: rnk.

I read up a little bit on TSX and HLE:
https://software.intel.com/en-us/node/524022
https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

These HLE variants of the usual atomic exchange intrinsics add the xacquire and xrelease prefixes. They use the same encoding as the repe and repne prefixes, which previously didn't do anything on Intel (and presumably) AMD hardware.

They are a hint to the processor that this is a short critical section, and it is likely that the entire critical section can be entered, run, and committed to memory before another thread needs to use the memory used by the critical section. Your implementation doesn't add these prefixes, but the code will execute correctly. This functionality relies on TSX, which it looks like Intel added some time in 2014, so it's probably widely available.

Personally, I don't want to implement these intrinsics this way. We've already implemented intrinsics "our way" instead of doing whatever Visual C does, and it just leads to developer confusion when they discover that the compiler didn't emit the instruction they want. For example, the rotate intrinsics often don't work (https://llvm.org/pr37387) and the bittestandset (bts) intrinsics are just broken (http://llvm.org/pr33188).

@chandlerc @craig.topper are there any plans for representing HLE hints on atomic instructions in LLVM IR? Alternatively, do you know who would be most interested in adding them? This seems like a reasonable place to use metadata or SubclassOptionalData on regular atomicrmw instructions, since it's a hint that could be dropped. As far as mid-level passes are concerned, these are vanilla acquire and release instructions.

In D47672#1121181, @rnk wrote:

They are a hint to the processor that this is a short critical section, and it is likely that the entire critical section can be entered, run, and committed to memory before another thread needs to use the memory used by the critical section. Your implementation doesn't add these prefixes, but the code will execute correctly. This functionality relies on TSX, which it looks like Intel added some time in 2014, so it's probably widely available.

Personally, I don't want to implement these intrinsics this way. We've already implemented intrinsics "our way" instead of doing whatever Visual C does, and it just leads to developer confusion when they discover that the compiler didn't emit the instruction they want. For example, the rotate intrinsics often don't work (https://llvm.org/pr37387) and the bittestandset (bts) intrinsics are just broken (http://llvm.org/pr33188).

I see. AIUI, the HLE versions just add optional hints, so there should not be any functional differences beyond using the hints or not (theses are optimization hints, yes?). Perhaps I misunderstood the documentation.

are there any plans for representing HLE hints on atomic instructions in LLVM IR?

This would be quite nice to have, the main reason I implemented these functions this way is I don't have the bandwidth at the moment to correctly implement HLE in LLVM/Clang, nor some of the requisite knowledge I'm sure. I also am keen to have Python build with clang-cl on Windows :) But I definitely can understand wanting to do things correctly from the start.

I am going to the LLVM Bay Area monthly social Thursday so if someone wants to discuss this there I'd be more than happy to.

rnk added a comment.Jun 5 2018, 6:53 PM
In D47672#1121181, @rnk wrote:

They are a hint to the processor that this is a short critical section, and it is likely that the entire critical section can be entered, run, and committed to memory before another thread needs to use the memory used by the critical section. Your implementation doesn't add these prefixes, but the code will execute correctly. This functionality relies on TSX, which it looks like Intel added some time in 2014, so it's probably widely available.

Personally, I don't want to implement these intrinsics this way. We've already implemented intrinsics "our way" instead of doing whatever Visual C does, and it just leads to developer confusion when they discover that the compiler didn't emit the instruction they want. For example, the rotate intrinsics often don't work (https://llvm.org/pr37387) and the bittestandset (bts) intrinsics are just broken (http://llvm.org/pr33188).

I see. AIUI, the HLE versions just add optional hints, so there should not be any functional differences beyond using the hints or not (theses are optimization hints, yes?). Perhaps I misunderstood the documentation.

Yes, as far as I can tell.

are there any plans for representing HLE hints on atomic instructions in LLVM IR?

This would be quite nice to have, the main reason I implemented these functions this way is I don't have the bandwidth at the moment to correctly implement HLE in LLVM/Clang, nor some of the requisite knowledge I'm sure. I also am keen to have Python build with clang-cl on Windows :) But I definitely can understand wanting to do things correctly from the start.

Yeah, I was kind of hoping an HLE expert would appear and suggest a straightforward implementation path.

I am going to the LLVM Bay Area monthly social Thursday so if someone wants to discuss this there I'd be more than happy to.

I should also be able to make it this week, hope to see you there!

We (Intel) have discussed this a little internally. I'll be responding more shortly.

lib/Headers/immintrin.h
386

Shouldn't these still be in intrin.h? immintrin.h is for intrinsics defined by Intel which these arent'.

ethanhs marked an inline comment as done.Jun 6 2018, 12:27 PM

We (Intel) have discussed this a little internally. I'll be responding more shortly.

Great!

FWIW, re intrin.h vs immintrin.h, the documentation for these put them in immintrin.h.

https://docs.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions#requirements
https://docs.microsoft.com/en-us/cpp/intrinsics/interlockedexchange-intrinsic-functions#requirements

Fair enough, Then I think we should have a #ifdef _MSC_VER around them so they are only available when pretending to be MSVC. I believe intrin.h does that check very early in the file.

craig.topper added inline comments.Jun 6 2018, 12:56 PM
lib/Headers/immintrin.h
387

what is __DEFAULT_FN_ATTRS defined to here? Its not defined in this file and should have been undeffed before leaving any other file. If its still defined here, it's a bug.

ethanhs updated this revision to Diff 150189.Jun 6 2018, 1:03 PM

Guard to be used only under MSVC, define default FN attrs

ethanhs added inline comments.Jun 6 2018, 1:16 PM
lib/Headers/immintrin.h
387

When I added the _MSC_VER guard I realized this too. Not sure how this still compiled, as when I grepped [#|#un]def.*__DEFAULT_FN_ATTRS, every define had a matching undef.

It looks like gcc implements additional bits that can be passed to _atomic_exchange and friends, ATOMIC_HLE_ACQUIRE(1 << 16) and ATOMIC_HLE_RELEASE(1 << 17). Basically they're using bits above bit 16 in the order/memory_model as target specific flags. These constants are only defined when targeting X86 and they are validated to ensure they are only paired with the appropriate ATOMIC_ACQUIRE or ATOMIC_RELEASE or a stronger memory model.

As Reid said, its technically safe to drop the hints sometimes so we could use SubClassOptiionalData or metadata. But losing them could have performance implications. If you lose an XACQUIRE, the lock won't be elided as the user expected. And if you keep an XACQUIRE, but lose an XRELEASE the processor will keep trying to speculate farther than it should until it eventually hits some random abort trigger and has to rollback to really acquiring the lock. Both of these would be surprising to the user so we should make an effort not to lose the information as much as possible.

Here's a start at an implementation proposal with some embedded questions.
-Add the X86 ATOMIC_HLE_ACQUIRE/ATOMIC_HLE_RELEASE matching the gcc encoding value.
-Write these intrinsics to pass these flags.
-Teach CGAtomic.cpp to lower those hints to whatever IR representation we choose. If we choose SubclassOptionalData, we'll also need to add bitcode, LL parsing, and printing support. Not sure what we would need for metadata.
-Add an HLE_ACQUIRE and HLE_RELEASE prefixed version of every instruction that can be prefixed to the X86Instr*.td files with appropriate isel patterns. This matches what we do for LOCK already. This is probably somewhere between 130-150 instructions after tblgen expansion for operand sizes, immediate vs register, etc. Ideally we'd devise some way to tag MachineInstr* with a lock, hle acquire, and hle release so that we didn't need separate instruction opcodes for each permutation. But this would just make things scale better is not required for functionality.
-Need a way to represent this in SelectionDAG so X86 specific code can create the right target specific nodes. Do we have a metadata infrastructure there? Or should we store it with the ordering MachineMemOperand? Or in SDNodeFlags?

Obviously a lot of that will take some time. I wonder if it makes sense to add the ATOMIC_HLE_ACQUIRE/ATOMIC_HLE_RELEASE constants, but ignore them in CGAtomics.cpp for now? We could then implement these intrinsics with the code we ultimately want to see there, but not implement the hints yet. Thoughts?

hans added a subscriber: hans.Jun 11 2018, 6:58 AM
hans added a comment.Jun 11 2018, 8:18 AM

It sounds like adding proper support for HLE prefixes is a largeish project.

ctopper, rnk: Do you think it would be worth adding inline asm versions (with the xacquire/release prefixes) of these intrinsics in the meantime? It would inhibit optimizations but be better than the current state of not having the intrinsics at all.

rnk added a comment.Jun 11 2018, 1:32 PM

It sounds like adding proper support for HLE prefixes is a largeish project.

ctopper, rnk: Do you think it would be worth adding inline asm versions (with the xacquire/release prefixes) of these intrinsics in the meantime? It would inhibit optimizations but be better than the current state of not having the intrinsics at all.

Yeah, let's do that. I'm a lot more comfortable ignoring bugs about missed optimizations with fancy intrinsics than bugs that say the intrinsic doesn't do what it says it does.

In D47672#1128863, @rnk wrote:

It sounds like adding proper support for HLE prefixes is a largeish project.

ctopper, rnk: Do you think it would be worth adding inline asm versions (with the xacquire/release prefixes) of these intrinsics in the meantime? It would inhibit optimizations but be better than the current state of not having the intrinsics at all.

Yeah, let's do that. I'm a lot more comfortable ignoring bugs about missed optimizations with fancy intrinsics than bugs that say the intrinsic doesn't do what it says it does.

I'm afraid I've never actually written inline asm in C, but if no one else wants to take this, I'm willing to try to figure it out. I'd like to be able to use these in my CPython port :)

I'll give the inline assembly a shot.

craig.topper commandeered this revision.Jun 12 2018, 9:57 PM
craig.topper updated this revision to Diff 151100.
craig.topper added a reviewer: ethanhs.

Inline assembly implementations.

hans accepted this revision.Jun 13 2018, 1:27 AM

Nice! Looks good to me.

This revision is now accepted and ready to land.Jun 13 2018, 1:27 AM
ethanhs accepted this revision.Jun 13 2018, 2:22 AM

Works great!

FWIW, I found a cfe-dev thread about adding HLE support http://lists.llvm.org/pipermail/cfe-dev/2013-February/028031.html And a bunch of dead patches in phabricator https://reviews.llvm.org/people/revisions/110/

I also spoke to Andi Kleen here at Intel to make sure I got these inline assembly versions correct. And he's not sure CPython should be using these the way it is. It looks like they try to use the HLE versions anytime the memory order is acquire/release. But HLE isn't suitable for every acquire/release.

This revision was automatically updated to reflect the committed changes.