On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum throughput for XCHG is 1 IPC.
The byte exchange has worse latency and decodes to 1 extra uOP; maximum observed throughput is 0.5 IPC.
xchgb %cl, %dl     # Latency: 2cy - uOPs: 3 - 2 ALU
xchgw %cx, %dx     # Latency: 1cy - uOPs: 2 - 2 ALU
xchgl %ecx, %edx   # Latency: 1cy - uOPs: 2 - 2 ALU
xchgq %rcx, %rdx   # Latency: 1cy - uOPs: 2 - 2 ALU
The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy.
The resource usage is similar to that of the XCHGrr variants. The biggest difference is the bus locking, which prevents the LS from issuing other memory uOPs in parallel until the unlocking store uOP is executed.
xchgb %cl, (%rsp)    # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
xchgw %cx, (%rsp)    # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
xchgl %ecx, (%rsp)   # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
xchgq %rcx, (%rsp)   # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
The exchanged in/out register operand becomes available 11cy after the start of execution. Added test xchg.s to verify that the register write is committed after 11cy (and not 16cy).
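For readers less familiar with the reg-mem form, here is a minimal C sketch of what the instruction does architecturally. The helper name exchange_with_memory is hypothetical and not part of this patch. On x86-64, compilers commonly lower C11 atomic_exchange to an XCHG with a memory operand, but that lowering is an assumption here and the code does not check it. The value returned to the register is the old memory content, i.e. the result of the load half of the operation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): atomically stores `reg` to
   *mem and returns the value *mem held before the store. This mirrors
   the architectural effect of `xchgl %ecx, (%rsp)`: the register ends
   up with the old memory value, memory ends up with the old register. */
uint32_t exchange_with_memory(_Atomic uint32_t *mem, uint32_t reg) {
    return atomic_exchange(mem, reg);
}
```

Calling exchange_with_memory(&slot, 3) on a slot holding 10 returns 10 and leaves 3 in the slot.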
Reg-reg XADD instructions have the same latency and throughput as the byte exchange (register-register variant).
xaddb %cl, %dl     # latency: 2cy - uOPs: 3 - 3 ALU
xaddw %cx, %dx     # latency: 2cy - uOPs: 3 - 3 ALU
xaddl %ecx, %edx   # latency: 2cy - uOPs: 3 - 3 ALU
xaddq %rcx, %rdx   # latency: 2cy - uOPs: 3 - 3 ALU
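As a reminder of the semantics being modelled, the following C sketch shows the architectural effect of the 32-bit register-register forms. Both function names are hypothetical and exist only for illustration; they say nothing about how Jaguar splits the work into uOPs. In AT&T syntax the first operand is the source and the second is the destination.

```c
#include <stdint.h>

/* Hypothetical model of `xchgl %ecx, %edx`: a plain swap. */
void xchg32_model(uint32_t *src, uint32_t *dst) {
    uint32_t tmp = *src;
    *src = *dst;
    *dst = tmp;
}

/* Hypothetical model of `xaddl %ecx, %edx`: the destination receives
   the sum, and the source receives the old destination value. */
void xadd32_model(uint32_t *src, uint32_t *dst) {
    uint32_t old_dst = *dst;
    *dst = old_dst + *src;  /* unsigned add, wraps modulo 2^32 */
    *src = old_dst;
}
```

XADD is therefore an exchange plus an add, whereas XCHG only swaps its operands.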
The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register operand becomes available in 3cy (it matches the 'load-to-use latency').
xaddb %cl, (%rsp)    # latency: 11cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St
xaddw %cx, (%rsp)    # latency: 11cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St
xaddl %ecx, (%rsp)   # latency: 11cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St
xaddq %rcx, (%rsp)   # latency: 11cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St
The atomic XADD variants execute in 16cy. The in/out register operand is available after 11cy from the start of execution.
lock xaddb %cl, (%rsp)    # latency: 16cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddw %cx, (%rsp)    # latency: 16cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddl %ecx, (%rsp)   # latency: 16cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddq %rcx, (%rsp)   # latency: 16cy - uOPs: 4 - 3 ALU, 1 Ld, 1 St -- ECX available in 11cy
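The locked form is what a fetch-and-add typically turns into when its result is used. The sketch below is illustrative only: fetch_and_add is a hypothetical helper, and the claim that a compiler emits LOCK XADD for it on x86-64 is an assumption about code generation that the code itself does not verify.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): atomically adds `reg` to
   *mem and returns the value *mem held before the add. This mirrors
   the architectural effect of `lock xaddl %ecx, (%rsp)`: memory gets
   the sum, the register gets the old memory value. */
uint32_t fetch_and_add(_Atomic uint32_t *mem, uint32_t reg) {
    return atomic_fetch_add(mem, reg);
}
```

Calling fetch_and_add(&counter, 2) on a counter holding 40 returns 40 and leaves 42 in the counter.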
Added test xadd.s to verify those latencies as well as read-advance values.
Please let me know if okay to commit.