Page MenuHomePhabricator

[X86][Btver2] Fix latency and throughput of CMPXCHG instructions.

Authored by andreadb on Aug 19 2019, 9:17 AM.



On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC. Throughput is superiorly limited to 0.33 because of the implicit in/out dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the same memory location, store-to-load forwarding occurs and values for sequent loads are quickly forwarded from the store buffer.

Interestingly, the functionality in LLVM that computes the reciprocal throughput doesn't seem to know about RMW instructions. That functionality only looks at the "consumed resource cycles" for the throughput computation. It should be fixed/improved by a future patch. In particular, for RMW instructions, that logic should also take into account for the write latency of in/out register operands.

An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to ~17cy/inst due to cache locking, which prevents other memory uOPs to start executing before the "lock releasing" store uOP.

CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less macro opcode. Their latency tend to be the same as the other RR/RM variants. RR variants are relatively fast 3cy (but still microcoded - 5 macro opcodes).

The two new hasLockPrefix() functions are used by the btver2 scheduling model check if a MCInst/MachineInst has a LOCK prefix. Calls to hasLockPrefix() have been encoded in predicates of variant scheduling classes that describe lat/thr of CMPXCHG.

Let me know if okay to commit.

Diff Detail


Event Timeline

andreadb created this revision.Aug 19 2019, 9:17 AM
lebedev.ri added inline comments.
8 ↗(On Diff #215917)

Can these test updates be propagated to every resources-cmpxchg.s please?

andreadb marked an inline comment as done.Aug 19 2019, 9:53 AM
andreadb added inline comments.
8 ↗(On Diff #215917)

Sure. I am going to do this as a NFC. Then I update this patch.

andreadb updated this revision to Diff 215942.Aug 19 2019, 10:16 AM

Address review comment.

Test files have been updated at r369279.

RKSimon added inline comments.Aug 19 2019, 10:47 AM
48 ↗(On Diff #215942)

Only the 8b/16b ops should be in resources-cmpxchg.s, the rest should be in resources-x86_64.s - the non-lock ops are already there. Ideally we'd have coverage for all the other lock ops (ADD, ADC, etc.) but I'm happy to deal with that later.

andreadb updated this revision to Diff 215965.Aug 19 2019, 12:21 PM

Patch updated.

CMPXCHG have been moved to resources-x86_64.s as requested by @RKSimon .

Added missing scheduling information for CMPXCHG8B and CMPXCHG16B.
CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load forwarding. That means, throughput is clearly limited by the in/out dependency on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs for the Integer pipes). I have reused the same mix of consumed resource from the other CMPXCHG instructions for CMPXCHG8B too.

LOCK CMPXCHG8B is instead 18cycles.

CMPXCHG16B is 32cycles. Up to 38cycles when the LOCK prefix is specified. Due to the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38) cycles dependeing on whether the LOCK prefix is specified or not.
I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x microcode from CMPXCHG8B. So, I have speculatively set the JALU01 consumption to 2x the resource cycles used for CMPXCHG8B.

RKSimon accepted this revision.Aug 19 2019, 12:58 PM

LGTM - thanks!

This revision is now accepted and ready to land.Aug 19 2019, 12:58 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2019, 3:22 AM