This is an archive of the discontinued LLVM Phabricator instance.

[X86][BtVer2] Fix latency and throughput of XCHG and XADD.
ClosedPublic

Authored by andreadb on Aug 21 2019, 7:15 AM.

Details

Summary

On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum throughput for XCHG is 1 IPC.
The byte exchange has worse latency and decodes to 1 extra uOP; maximum observed throughput is 0.5 IPC.

xchgb %cl, %dl           # Latency: 2cy    -  uOPs: 3  -  2 ALU
xchgw %cx, %dx           # Latency: 1cy    -  uOPs: 2  -  2 ALU
xchgl %ecx, %edx         # Latency: 1cy    -  uOPs: 2  -  2 ALU
xchgq %rcx, %rdx         # Latency: 1cy    -  uOPs: 2  -  2 ALU

The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy.
The resource usage is similar to the XCHGrr variants. The biggest difference is the bus locking, which prevents the LS from issuing other memory uOPs in parallel until the unlocking store uOP is executed.

xchgb %cl, (%rsp)        # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy
xchgw %cx, (%rsp)        # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy
xchgl %ecx, (%rsp)       # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy
xchgq %rcx, (%rsp)       # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy

The exchanged in/out register operand becomes available 11cy after the start of execution. Added test xchg.s to verify that the register write is seen as committed in 11cy (and not 16cy).
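
One way to picture the two figures above is as a pair of writes with different latencies attached to the same instruction, where the latency reported for the instruction is the largest of them. The sketch below is illustrative only: `WriteModel`, `instructionLatency`, `xchgRegMemWrites`, and the write labels are hypothetical names, not definitions from X86ScheduleBtVer2.td. Only the 11cy and 16cy values come from the measurements above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one write performed by an instruction.
struct WriteModel {
  std::string name;  // illustrative label, not a real TableGen def
  unsigned latency;  // cycles from start of execution until the write is done
};

// The latency of the whole instruction is that of its slowest write.
unsigned instructionLatency(const std::vector<WriteModel> &writes) {
  assert(!writes.empty() && "an instruction needs at least one write");
  unsigned maxLatency = 0;
  for (const WriteModel &w : writes)
    maxLatency = std::max(maxLatency, w.latency);
  return maxLatency;
}

// Reg-mem XCHG as measured above: the register operand is ready after 11cy,
// while the instruction as a whole takes 16cy. Splitting it into exactly
// these two writes is an assumption made for this sketch.
std::vector<WriteModel> xchgRegMemWrites() {
  return {{"register-operand", 11}, {"rest-of-instruction", 16}};
}
```

With this split, a consumer of the register only waits for the 11cy write, while `instructionLatency(xchgRegMemWrites())` still evaluates to the full 16cy.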

Reg-reg XADD instructions have the same latency/throughput as the byte exchange (register-register variant).

xaddb %cl, %dl           # latency: 2cy    -  uOPs: 3  -  3 ALU
xaddw %cx, %dx           # latency: 2cy    -  uOPs: 3  -  3 ALU
xaddl %ecx, %edx         # latency: 2cy    -  uOPs: 3  -  3 ALU
xaddq %rcx, %rdx         # latency: 2cy    -  uOPs: 3  -  3 ALU

The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchanged in/out register operand becomes available in 3cy (which matches the 'load-to-use latency').

xaddb %cl, (%rsp)        # latency: 11cy   -  uOPs: 4  -  3 ALU, 1 Ld, 1 St
xaddw %cx, (%rsp)        # latency: 11cy   -  uOPs: 4  -  3 ALU, 1 Ld, 1 St
xaddl %ecx, (%rsp)       # latency: 11cy   -  uOPs: 4  -  3 ALU, 1 Ld, 1 St
xaddq %rcx, (%rsp)       # latency: 11cy   -  uOPs: 4  -  3 ALU, 1 Ld, 1 St

The atomic XADD variants execute in 16cy. The in/out register operand is available 11cy after the start of execution.

lock xaddb %cl, (%rsp)   # latency: 16cy - uOPs: 4  - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddw %cx, (%rsp)   # latency: 16cy - uOPs: 4  - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddl %ecx, (%rsp)  # latency: 16cy - uOPs: 4  - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddq %rcx, (%rsp)  # latency: 16cy - uOPs: 4  - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
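
For context, the atomic forms above are what compilers typically select on x86-64 for `std::atomic` exchange and fetch-and-add. The exact lowering depends on the compiler and its flags, so the mapping noted in the comments should be read as typical rather than guaranteed. The function names `swapIn` and `addAndGetOld` are made up for this sketch.

```cpp
#include <atomic>

// Atomically store `value` into `slot` and return the previous contents.
// On x86-64 this is typically lowered to XCHG with a memory operand,
// which is locked implicitly.
int swapIn(std::atomic<int> &slot, int value) {
  return slot.exchange(value);
}

// Atomically add `delta` to `counter` and return the value it held before.
// When the returned value is used, this is typically lowered to LOCK XADD.
int addAndGetOld(std::atomic<int> &counter, int delta) {
  return counter.fetch_add(delta);
}
```

In both cases the caller depends on the value that comes back in the register, which is why the point at which that register becomes available matters separately from the latency of the whole instruction.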

Added test xadd.s to verify those latencies as well as read-advance values.

Please let me know if okay to commit.

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb created this revision. Aug 21 2019, 7:15 AM
andreadb edited the summary of this revision. Aug 21 2019, 7:49 AM
andreadb updated this revision to Diff 216405. Aug 21 2019, 8:07 AM
andreadb edited the summary of this revision.

Patch updated.

The observed maximum throughput for XADDrr and XCHG8rr is 0.5 IPC, and not 1 IPC - as previously reported.
Increased the number of ALU resource cycles consumed by XADD and XCHG8rr to better match the real throughput of those opcodes.
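
The link between resource cycles and throughput can be sketched with a little arithmetic. In a simplified view, an instruction that occupies a resource group of `units` identical pipes for `cycles` resource cycles cannot sustain more than `units / cycles` instructions per cycle on that group, and the tightest group sets the limit. The code below is a rough illustration of that idea, not LLVM's implementation. `ResourceUse` and `maxIPC` are hypothetical names, and the example of 4 cycles on a 2-pipe ALU group is an assumption chosen to reproduce the 0.5 IPC figure, not a value taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of how one instruction uses one resource group.
struct ResourceUse {
  unsigned units;   // number of identical pipes in the group
  unsigned cycles;  // resource cycles consumed by one instruction
};

// Upper bound on sustained instructions per cycle: the most constrained
// resource group determines the result.
double maxIPC(const std::vector<ResourceUse> &uses) {
  assert(!uses.empty() && "at least one resource group is required");
  double ipc = 0.0;
  bool first = true;
  for (const ResourceUse &u : uses) {
    assert(u.cycles > 0 && "a used resource must consume at least one cycle");
    double bound = static_cast<double>(u.units) / u.cycles;
    ipc = first ? bound : std::min(ipc, bound);
    first = false;
  }
  return ipc;
}
```

Under these assumptions, raising the ALU cycles from 2 to 4 on a group of 2 pipes moves the bound from 1 IPC to 0.5 IPC.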

RKSimon added inline comments. Aug 21 2019, 11:34 AM
lib/Target/X86/X86ScheduleBtVer2.td
406 (On Diff #216405)

It'd be useful if you could add an explanatory comment - all the '_Part' defs are confusing - it takes a while to understand the latency dependencies.

andreadb updated this revision to Diff 216447. Aug 21 2019, 12:52 PM

Address review comment.

Added a comment to explain why we need write pairs to model XADD and XCHG.

RKSimon accepted this revision. Aug 22 2019, 4:09 AM

LGTM - cheers for the explanations!

This revision is now accepted and ready to land. Aug 22 2019, 4:09 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Aug 22 2019, 4:32 AM