This is an archive of the discontinued LLVM Phabricator instance.

Differential D66424

[X86][Btver2] Fix latency and throughput of CMPXCHG instructions.
Closed Public

Authored by andreadb on Aug 19 2019, 9:17 AM.

Details

Reviewers

RKSimon
craig.topper

Commits

rGb1bdd97a2671: [X86][Btver2] Fix latency and throughput of CMPXCHG instructions.
rL369365: [X86][Btver2] Fix latency and throughput of CMPXCHG instructions.

Summary

On Jaguar, CMPXCHG has a latency of 11cy and a maximum throughput of 0.33 IPC. Throughput is capped at 0.33 IPC by the implicit in/out dependency on register EAX. When a non-atomic CMPXCHG is repeated on the same memory location, store-to-load forwarding occurs and values for subsequent loads are quickly forwarded from the store buffer.

Interestingly, the functionality in LLVM that computes the reciprocal throughput doesn't seem to know about RMW instructions. It only looks at the "consumed resource cycles" when computing throughput. This should be fixed/improved by a future patch. In particular, for RMW instructions, that logic should also take into account the write latency of in/out register operands.
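As a rough illustration of the limitation described above, the following sketch derives a reciprocal throughput purely from consumed resource cycles: for each resource, divide the consumed cycles by the number of units, then take the maximum. The struct and function names are hypothetical, and this is a simplified model rather than LLVM's actual implementation. It assumes JALU01 is a group of two units and that JLAGU and JSAGU are single units, which is consistent with the llvm-mca output in the tests of this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified view of one resource consumed by an instruction.
struct ResourceUse {
  unsigned Cycles; // resource cycles consumed by the instruction
  unsigned Units;  // number of identical units behind the resource
};

// Reciprocal throughput computed only from consumed resource cycles.
// Register dependencies (such as the in/out EAX operand of CMPXCHG) are
// deliberately ignored here; that is the gap the summary points out.
double reciprocalThroughput(const std::vector<ResourceUse> &Uses) {
  double RThroughput = 0.0;
  for (const ResourceUse &U : Uses) {
    assert(U.Units > 0 && "a resource needs at least one unit");
    RThroughput =
        std::max(RThroughput, static_cast<double>(U.Cycles) / U.Units);
  }
  return RThroughput;
}
```

With resource cycles [3, 1, 1] for a non-atomic register-memory CMPXCHG this yields 1.50, and with [3, 17, 17] for the LOCK variant it yields 17.00, in line with the RThroughput column of the updated tests. The 1.50 figure is optimistic compared with the measured 0.33 IPC because the EAX dependency is not part of the computation.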

An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to ~17cy/inst due to cache locking, which prevents other memory uOPs from starting execution before the "lock releasing" store uOP.

CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one fewer macro opcode. Their latency tends to be the same as the other RR/RM variants. RR variants are relatively fast at 3cy (but still microcoded: 5 macro opcodes).

The two new hasLockPrefix() functions are used by the btver2 scheduling model to check whether an MCInst/MachineInstr has a LOCK prefix. Calls to hasLockPrefix() have been encoded in the predicates of the variant scheduling classes that describe the lat/thr of CMPXCHG.
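The predicate-based selection can be pictured with the sketch below. It is a standalone model, not LLVM code: the `Inst` struct, the `kHasLock` bit value, and `selectWrite()` are hypothetical stand-ins, and it assumes that variants are tried in the order they are listed, so the LOCK-qualified cases have to come before the plain ones. The CMPXCHG8B/CMPXCHG16B cases are left out for brevity; the returned strings are the write names used in this patch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical flag bit; the real patch tests X86::IP_HAS_LOCK on an MCInst
// and X86II::LOCK on a MachineInstr.
constexpr uint32_t kHasLock = 1u << 1;

// Hypothetical, minimal description of a CMPXCHG instance.
struct Inst {
  bool IsRegMem;  // destination is a memory operand (RM form)
  bool Is8Bit;    // 8-bit operand size
  uint32_t Flags; // instruction flags, possibly containing kHasLock
};

bool hasLockPrefix(const Inst &I) { return (I.Flags & kHasLock) != 0; }

// First matching predicate wins, mirroring an ordered list of variants.
std::string selectWrite(const Inst &I) {
  if (hasLockPrefix(I) && I.IsRegMem && I.Is8Bit)
    return "JWriteLOCK_CMPXCHG8rm";
  if (hasLockPrefix(I) && I.IsRegMem)
    return "JWriteLOCK_CMPXCHGrm";
  if (I.IsRegMem && I.Is8Bit)
    return "JWriteCMPXCHG8rm";
  if (I.IsRegMem)
    return "WriteCMPXCHGRMW";
  if (I.Is8Bit)
    return "JWriteCMPXCHG8rr";
  return "WriteCMPXCHG"; // fallback, like NoSchedPred
}
```

If the plain register-memory check were listed before the atomic ones, a LOCK CMPXCHG would never reach its dedicated write, which is why the ordering matters in this model.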

Let me know if okay to commit.

Diff Detail

Event Timeline

andreadb created this revision. Aug 19 2019, 9:17 AM

Herald added subscribers: jfb, gbedwell. Aug 19 2019, 9:17 AM

lebedev.ri added a subscriber: lebedev.ri. Aug 19 2019, 9:27 AM

lebedev.ri added inline comments.

test/tools/llvm-mca/X86/BtVer2/resources-cmpxchg.s
8	Can these test updates be propagated to every `resources-cmpxchg.s` please?

andreadb marked an inline comment as done. Aug 19 2019, 9:53 AM

andreadb added inline comments.

test/tools/llvm-mca/X86/BtVer2/resources-cmpxchg.s
8	Sure. I am going to do this as an NFC change. Then I will update this patch.

andreadb mentioned this in rGecbaba672e18: [X86] Added extensive scheduling model tests for all the CMPXCHG variants. NFC. Aug 19 2019, 10:08 AM

Diffusion mentioned this in rL369279: [X86] Added extensive scheduling model tests for all the CMPXCHG variants. NFC. Aug 19 2019, 10:08 AM

Address review comment.

Test files have been updated at r369279.

RKSimon added inline comments. Aug 19 2019, 10:47 AM

test/tools/llvm-mca/X86/BtVer2/resources-cmpxchg.s
33	Only the 8b/16b ops should be in resources-cmpxchg.s, the rest should be in resources-x86_64.s - the non-lock ops are already there. Ideally we'd have coverage for all the other lock ops (ADD, ADC, etc.) but I'm happy to deal with that later.

Diffusion mentioned this in rL369288: [X86] Move scheduling tests for CMPXCHG to the corresponding resources-x86_64.s…. Aug 19 2019, 11:19 AM

andreadb mentioned this in rGbf989187c30f: [X86] Move scheduling tests for CMPXCHG to the corresponding resources-x86_64.s…. Aug 19 2019, 11:20 AM

Patch updated.

The CMPXCHG tests have been moved to resources-x86_64.s as requested by @RKSimon.

Added missing scheduling information for CMPXCHG8B and CMPXCHG16B.
CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load forwarding. That means throughput is clearly limited by the in/out dependency on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs for the Integer pipes). I have reused the same mix of consumed resources from the other CMPXCHG instructions for CMPXCHG8B too.

LOCK CMPXCHG8B is instead 18 cycles.

CMPXCHG16B is 32 cycles, and up to 38 cycles when the LOCK prefix is specified. Due to the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38) cycles, depending on whether the LOCK prefix is specified.
I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x the microcode of CMPXCHG8B. So, I have speculatively set the JALU01 consumption to 2x the resource cycles used for CMPXCHG8B.
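To make the dependency-bound reasoning concrete, here is a minimal sketch of how a throughput estimate could combine the resource-based figure with the in/out register dependency. The function name and the model itself are assumptions for illustration: it assumes back-to-back instances of the same instruction, each waiting for the previous one to write its in/out registers, so the cost per instruction is the larger of the two bounds.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: cycles per instruction for a chain of dependent
// instances. ResourceRThroughput is the resource-only reciprocal throughput;
// InOutLatency is the latency of the in/out register write.
double effectiveCyclesPerInst(double ResourceRThroughput,
                              unsigned InOutLatency) {
  assert(ResourceRThroughput >= 0.0 && "throughput cannot be negative");
  return std::max(ResourceRThroughput, static_cast<double>(InOutLatency));
}
```

For a non-atomic CMPXCHG16B the resource-only figure reported by the tests is 3.00, yet this model gives 32 cycles per instruction, which is the limit quoted above. For the LOCK variant both bounds are 38, so the result does not change.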

LGTM - thanks!

This revision is now accepted and ready to land. Aug 19 2019, 12:58 PM

Closed by commit rL369365: [X86][Btver2] Fix latency and throughput of CMPXCHG instructions. (authored by adibiagio). Aug 20 2019, 3:22 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Aug 20 2019, 3:22 AM

RKSimon mentioned this in rL369367: [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops. Aug 20 2019, 4:12 AM

RKSimon mentioned this in rG6a3dc3e15cb2: [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops. Aug 20 2019, 4:17 AM

Revision Contents

Path                                                 Size
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.h        4 lines
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp      4 lines
lib/Target/X86/X86InstrInfo.h                        4 lines
lib/Target/X86/X86SchedPredicates.td                 57 lines
lib/Target/X86/X86ScheduleBtVer2.td                  75 lines
test/tools/llvm-mca/X86/BtVer2/resources-cmpxchg.s   18 lines
test/tools/llvm-mca/X86/BtVer2/resources-x86_64.s    50 lines

Diff 215965

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.h

[...]
 namespace X86_MC {
 std::string ParseX86Triple(const Triple &TT);

 unsigned getDwarfRegFlavour(const Triple &TT, bool isEH);

 void initLLVMToSEHAndCVRegMapping(MCRegisterInfo *MRI);

+
+/// Returns true if this instruction has a LOCK prefix.
+bool hasLockPrefix(const MCInst &MI);
+
 /// Create a X86 MCSubtargetInfo instance. This is exposed so Asm parser, etc.
 /// do not need to go through TargetRegistry.
 MCSubtargetInfo *createX86MCSubtargetInfo(const Triple &TT, StringRef CPU,
                                           StringRef FS);
 }

 MCCodeEmitter *createX86MCCodeEmitter(const MCInstrInfo &MCII,
                                       const MCRegisterInfo &MRI,
[...]

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp

 unsigned X86_MC::getDwarfRegFlavour(const Triple &TT, bool isEH) {
[...]
   if (TT.isOSDarwin())
     return isEH ? DWARFFlavour::X86_32_DarwinEH : DWARFFlavour::X86_32_Generic;
   if (TT.isOSCygMing())
     // Unsupported by now, just quick fallback
     return DWARFFlavour::X86_32_Generic;
   return DWARFFlavour::X86_32_Generic;
 }

+bool X86_MC::hasLockPrefix(const MCInst &MI) {
+  return MI.getFlags() & X86::IP_HAS_LOCK;
+}
+
 void X86_MC::initLLVMToSEHAndCVRegMapping(MCRegisterInfo *MRI) {
   // FIXME: TableGen these.
   for (unsigned Reg = X86::NoRegister + 1; Reg < X86::NUM_TARGET_REGS; ++Reg) {
     unsigned SEH = MRI->getEncodingValue(Reg);
     MRI->mapLLVMRegToSEHReg(Reg, SEH);
   }

   // Mapping from CodeView to MC register id.
[...]

lib/Target/X86/X86InstrInfo.h

 public:
[...]
   MachineBasicBlock::iterator
   insertOutlinedCall(Module &M, MachineBasicBlock &MBB,
                      MachineBasicBlock::iterator &It, MachineFunction &MF,
                      const outliner::Candidate &C) const override;

 #define GET_INSTRINFO_HELPER_DECLS
 #include "X86GenInstrInfo.inc"

+  static bool hasLockPrefix(const MachineInstr &MI) {
+    return MI.getDesc().TSFlags & X86II::LOCK;
+  }
+
   Optional<ParamLoadedValue>
   describeLoadedValue(const MachineInstr &MI) const override;

 protected:
   /// Commutes the operands in the given instruction by changing the operands
   /// order and/or changing the instruction's opcode and/or the immediate value
   /// operand.
   ///
[...]

lib/Target/X86/X86SchedPredicates.td

[...]
 def IsSETAr_Or_SETBEr : CheckAny<[
   CheckImmOperand_s<1, "X86::COND_A">,
   CheckImmOperand_s<1, "X86::COND_BE">
 ]>;

 def IsSETAm_Or_SETBEm : CheckAny<[
   CheckImmOperand_s<5, "X86::COND_A">,
   CheckImmOperand_s<5, "X86::COND_BE">
 ]>;
+
+// A predicate used to check if an instruction has a LOCK prefix.
+def CheckLockPrefix : CheckFunctionPredicate<
+  "X86_MC::hasLockPrefix",
+  "X86InstrInfo::hasLockPrefix"
+>;
+
+def IsRegRegCompareAndSwap_8 : CheckOpcode<[ CMPXCHG8rr ]>;
+
+def IsRegMemCompareAndSwap_8 : CheckOpcode<[
+  LCMPXCHG8, CMPXCHG8rm
+]>;
+
+def IsRegRegCompareAndSwap_16_32_64 : CheckOpcode<[
+  CMPXCHG16rr, CMPXCHG32rr, CMPXCHG64rr
+]>;
+
+def IsRegMemCompareAndSwap_16_32_64 : CheckOpcode<[
+  CMPXCHG16rm, CMPXCHG32rm, CMPXCHG64rm,
+  LCMPXCHG16, LCMPXCHG32, LCMPXCHG64,
+  LCMPXCHG8B, LCMPXCHG16B
+]>;
+
+def IsCompareAndSwap8B : CheckOpcode<[ CMPXCHG8B, LCMPXCHG8B ]>;
+def IsCompareAndSwap16B : CheckOpcode<[ CMPXCHG16B, LCMPXCHG16B ]>;
+
+def IsRegMemCompareAndSwap : CheckOpcode<
+  !listconcat(
+    IsRegMemCompareAndSwap_8.ValidOpcodes,
+    IsRegMemCompareAndSwap_16_32_64.ValidOpcodes
+  )>;
+
+def IsRegRegCompareAndSwap : CheckOpcode<
+  !listconcat(
+    IsRegRegCompareAndSwap_8.ValidOpcodes,
+    IsRegRegCompareAndSwap_16_32_64.ValidOpcodes
+  )>;
+
+def IsAtomicCompareAndSwap_8 : CheckAll<[
+  CheckLockPrefix,
+  IsRegMemCompareAndSwap_8
+]>;
+
+def IsAtomicCompareAndSwap : CheckAll<[
+  CheckLockPrefix,
+  IsRegMemCompareAndSwap
+]>;
+
+def IsAtomicCompareAndSwap8B : CheckAll<[
+  CheckLockPrefix,
+  IsCompareAndSwap8B
+]>;
+
+def IsAtomicCompareAndSwap16B : CheckAll<[
+  CheckLockPrefix,
+  IsCompareAndSwap16B
+]>;

lib/Target/X86/X86ScheduleBtVer2.td

[...]
 ////////////////////////////////////////////////////////////////////////////////
 // Arithmetic.
 ////////////////////////////////////////////////////////////////////////////////

 defm : JWriteResIntPair<WriteALU, [JALU01], 1>;
 defm : JWriteResIntPair<WriteADC, [JALU01], 1, [2]>;

 defm : X86WriteRes<WriteBSWAP32, [JALU01], 1, [1], 1>;
 defm : X86WriteRes<WriteBSWAP64, [JALU01], 1, [1], 1>;
-defm : X86WriteRes<WriteCMPXCHG,[JALU01], 1, [1], 1>;
-defm : X86WriteRes<WriteCMPXCHGRMW,[JALU01, JSAGU, JLAGU], 4, [1, 1, 1], 2>;
+defm : X86WriteRes<WriteCMPXCHG, [JALU01], 3, [3], 5>;
+defm : X86WriteRes<WriteCMPXCHGRMW, [JALU01, JSAGU, JLAGU], 11, [3, 1, 1], 6>;
 defm : X86WriteRes<WriteXCHG, [JALU01], 1, [1], 1>;

 defm : JWriteResIntPair<WriteIMul8, [JALU1, JMul], 3, [1, 1], 2>;
 defm : JWriteResIntPair<WriteIMul16, [JALU1, JMul], 3, [1, 1], 2>;
 defm : JWriteResIntPair<WriteIMul16Imm, [JALU1, JMul], 3, [1, 1], 2>;
 defm : JWriteResIntPair<WriteIMul16Reg, [JALU1, JMul], 3, [1, 1], 2>;
 defm : JWriteResIntPair<WriteIMul32, [JALU1, JMul], 3, [1, 1], 2>;
 defm : JWriteResIntPair<WriteIMul32Imm, [JALU1, JMul], 3, [1, 1], 2>;
[...]
 def : WriteRes<WriteSystem, [JALU01]> { let Latency = 100; }
 def : WriteRes<WriteMicrocoded, [JALU01]> { let Latency = 100; }
 def : WriteRes<WriteFence, [JSAGU]>;

 // Nops don't have dependencies, so there's no actual latency, but we set this
 // to '1' to tell the scheduler that the nop uses an ALU slot for a cycle.
 def : WriteRes<WriteNop, [JALU01]> { let Latency = 1; }

+def JWriteCMPXCHG8rr : SchedWriteRes<[JALU01]> {
+  let Latency = 3;
+  let ResourceCycles = [3];
+  let NumMicroOps = 3;
+}
+
+def JWriteLOCK_CMPXCHG8rm : SchedWriteRes<[JALU01, JLAGU, JSAGU]> {
+  let Latency = 16;
+  let ResourceCycles = [3,16,16];
+  let NumMicroOps = 5;
+}
+
+def JWriteLOCK_CMPXCHGrm : SchedWriteRes<[JALU01, JLAGU, JSAGU]> {
+  let Latency = 17;
+  let ResourceCycles = [3,17,17];
+  let NumMicroOps = 6;
+}
+
+def JWriteCMPXCHG8rm : SchedWriteRes<[JALU01, JLAGU, JSAGU]> {
+  let Latency = 11;
+  let ResourceCycles = [3,1,1];
+  let NumMicroOps = 5;
+}
+
+def JWriteCMPXCHG8B : SchedWriteRes<[JALU01, JLAGU, JSAGU]> {
+  let Latency = 11;
+  let ResourceCycles = [3,1,1];
+  let NumMicroOps = 18;
+}
+
+def JWriteCMPXCHG16B : SchedWriteRes<[JALU01, JLAGU, JSAGU]> {
+  let Latency = 32;
+  let ResourceCycles = [6,1,1];
+  let NumMicroOps = 28;
+}
+
+def JWriteLOCK_CMPXCHG8B : SchedWriteRes<[JALU01, JLAGU, JSAGU]> {
+  let Latency = 19;
+  let ResourceCycles = [3,19,19];
+  let NumMicroOps = 18;
+}
+
+def JWriteLOCK_CMPXCHG16B : SchedWriteRes<[JALU01, JLAGU, JSAGU]> {
+  let Latency = 38;
+  let ResourceCycles = [6,38,38];
+  let NumMicroOps = 28;
+}
+
+def JWriteCMPXCHGVariant : SchedWriteVariant<[
+  SchedVar<MCSchedPredicate<IsAtomicCompareAndSwap8B>, [JWriteLOCK_CMPXCHG8B]>,
+  SchedVar<MCSchedPredicate<IsAtomicCompareAndSwap16B>, [JWriteLOCK_CMPXCHG16B]>,
+  SchedVar<MCSchedPredicate<IsAtomicCompareAndSwap_8>, [JWriteLOCK_CMPXCHG8rm]>,
+  SchedVar<MCSchedPredicate<IsAtomicCompareAndSwap>, [JWriteLOCK_CMPXCHGrm]>,
+  SchedVar<MCSchedPredicate<IsCompareAndSwap8B>, [JWriteCMPXCHG8B]>,
+  SchedVar<MCSchedPredicate<IsCompareAndSwap16B>, [JWriteCMPXCHG16B]>,
+  SchedVar<MCSchedPredicate<IsRegMemCompareAndSwap_8>, [JWriteCMPXCHG8rm]>,
+  SchedVar<MCSchedPredicate<IsRegMemCompareAndSwap>, [WriteCMPXCHGRMW]>,
+  SchedVar<MCSchedPredicate<IsRegRegCompareAndSwap_8>, [JWriteCMPXCHG8rr]>,
+  SchedVar<NoSchedPred, [WriteCMPXCHG]>
+]>;
+def : InstRW<[JWriteCMPXCHGVariant], (instrs CMPXCHG8rr, LCMPXCHG8, CMPXCHG8rm,
+                                             CMPXCHG16rm, CMPXCHG32rm, CMPXCHG64rm,
+                                             LCMPXCHG16, LCMPXCHG32, LCMPXCHG64,
+                                             CMPXCHG8B, CMPXCHG16B,
+                                             LCMPXCHG8B, LCMPXCHG16B)>;
+

 ////////////////////////////////////////////////////////////////////////////////
 // Floating point. This covers both scalar and vector operations.
 ////////////////////////////////////////////////////////////////////////////////

 defm : X86WriteRes<WriteFLD0, [JFPU1, JSTC], 3, [1,1], 1>;
 defm : X86WriteRes<WriteFLD1, [JFPU1, JSTC], 3, [1,1], 1>;
 defm : X86WriteRes<WriteFLDC, [JFPU1, JSTC], 3, [1,1], 1>;
 defm : X86WriteRes<WriteFLoad, [JLAGU, JFPU01, JFPX], 5, [1, 1, 1], 1>;
[...]

test/tools/llvm-mca/X86/BtVer2/resources-cmpxchg.s

 # NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
 # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -instruction-tables < %s | FileCheck %s

 cmpxchg8b (%rax)
 cmpxchg16b (%rax)
 lock cmpxchg8b (%rax)
 lock cmpxchg16b (%rax)

 # CHECK: Instruction Info:
 # CHECK-NEXT: [1]: #uOps
 # CHECK-NEXT: [2]: Latency
 # CHECK-NEXT: [3]: RThroughput
 # CHECK-NEXT: [4]: MayLoad
 # CHECK-NEXT: [5]: MayStore
 # CHECK-NEXT: [6]: HasSideEffects (U)

 # CHECK: [1] [2] [3] [4] [5] [6] Instructions:
-# CHECK-NEXT: 2 4 1.00 * * cmpxchg8b (%rax)
-# CHECK-NEXT: 2 4 1.00 * * cmpxchg16b (%rax)
-# CHECK-NEXT: 2 4 1.00 * * lock cmpxchg8b (%rax)
-# CHECK-NEXT: 2 4 1.00 * * lock cmpxchg16b (%rax)
+# CHECK-NEXT: 18 11 1.50 * * cmpxchg8b (%rax)
+# CHECK-NEXT: 28 32 3.00 * * cmpxchg16b (%rax)
+# CHECK-NEXT: 18 19 19.00 * * lock cmpxchg8b (%rax)
+# CHECK-NEXT: 28 38 38.00 * * lock cmpxchg16b (%rax)

 # CHECK: Resources:
 # CHECK-NEXT: [0] - JALU0
 # CHECK-NEXT: [1] - JALU1
 # CHECK-NEXT: [2] - JDiv
 # CHECK-NEXT: [3] - JFPA
 # CHECK-NEXT: [4] - JFPM
 # CHECK-NEXT: [5] - JFPU0
 # CHECK-NEXT: [6] - JFPU1
 # CHECK-NEXT: [7] - JLAGU
 # CHECK-NEXT: [8] - JMul
 # CHECK-NEXT: [9] - JSAGU
 # CHECK-NEXT: [10] - JSTC
 # CHECK-NEXT: [11] - JVALU0
 # CHECK-NEXT: [12] - JVALU1
 # CHECK-NEXT: [13] - JVIMUL

 # CHECK: Resource pressure per iteration:
 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
-# CHECK-NEXT: 2.00 2.00 - - - - - 4.00 - 4.00 - - - -
+# CHECK-NEXT: 9.00 9.00 - - - - - 59.00 - 59.00 - - - -

 # CHECK: Resource pressure by instruction:
 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - cmpxchg8b (%rax)
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - cmpxchg16b (%rax)
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - lock cmpxchg8b (%rax)
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - lock cmpxchg16b (%rax)
+# CHECK-NEXT: 1.50 1.50 - - - - - 1.00 - 1.00 - - - - cmpxchg8b (%rax)
+# CHECK-NEXT: 3.00 3.00 - - - - - 1.00 - 1.00 - - - - cmpxchg16b (%rax)
+# CHECK-NEXT: 1.50 1.50 - - - - - 19.00 - 19.00 - - - - lock cmpxchg8b (%rax)
+# CHECK-NEXT: 3.00 3.00 - - - - - 38.00 - 38.00 - - - - lock cmpxchg16b (%rax)

test/tools/llvm-mca/X86/BtVer2/resources-x86_64.s

[...]
 # CHECK-NEXT: 1 4 1.00 * cmpq $7, (%rax)
 # CHECK-NEXT: 1 1 0.50 cmpq %rsi, %rdi
 # CHECK-NEXT: 1 4 1.00 * cmpq %rsi, (%rax)
 # CHECK-NEXT: 1 4 1.00 * cmpq (%rax), %rdi
 # CHECK-NEXT: 1 100 0.50 U cmpsb %es:(%rdi), (%rsi)
 # CHECK-NEXT: 1 100 0.50 U cmpsw %es:(%rdi), (%rsi)
 # CHECK-NEXT: 1 100 0.50 U cmpsl %es:(%rdi), (%rsi)
 # CHECK-NEXT: 1 100 0.50 U cmpsq %es:(%rdi), (%rsi)
-# CHECK-NEXT: 1 1 0.50 cmpxchgb %cl, %bl
-# CHECK-NEXT: 2 4 1.00 * * cmpxchgb %cl, (%rbx)
-# CHECK-NEXT: 2 4 1.00 * * lock cmpxchgb %cl, (%rbx)
-# CHECK-NEXT: 1 1 0.50 cmpxchgw %cx, %bx
-# CHECK-NEXT: 2 4 1.00 * * cmpxchgw %cx, (%rbx)
-# CHECK-NEXT: 2 4 1.00 * * lock cmpxchgw %cx, (%rbx)
-# CHECK-NEXT: 1 1 0.50 cmpxchgl %ecx, %ebx
-# CHECK-NEXT: 2 4 1.00 * * cmpxchgl %ecx, (%rbx)
-# CHECK-NEXT: 2 4 1.00 * * lock cmpxchgl %ecx, (%rbx)
-# CHECK-NEXT: 1 1 0.50 cmpxchgq %rcx, %rbx
-# CHECK-NEXT: 2 4 1.00 * * cmpxchgq %rcx, (%rbx)
-# CHECK-NEXT: 2 4 1.00 * * lock cmpxchgq %rcx, (%rbx)
+# CHECK-NEXT: 3 3 1.50 cmpxchgb %cl, %bl
+# CHECK-NEXT: 5 11 1.50 * * cmpxchgb %cl, (%rbx)
+# CHECK-NEXT: 5 16 16.00 * * lock cmpxchgb %cl, (%rbx)
+# CHECK-NEXT: 5 3 1.50 cmpxchgw %cx, %bx
+# CHECK-NEXT: 6 11 1.50 * * cmpxchgw %cx, (%rbx)
+# CHECK-NEXT: 6 17 17.00 * * lock cmpxchgw %cx, (%rbx)
+# CHECK-NEXT: 5 3 1.50 cmpxchgl %ecx, %ebx
+# CHECK-NEXT: 6 11 1.50 * * cmpxchgl %ecx, (%rbx)
+# CHECK-NEXT: 6 17 17.00 * * lock cmpxchgl %ecx, (%rbx)
+# CHECK-NEXT: 5 3 1.50 cmpxchgq %rcx, %rbx
+# CHECK-NEXT: 6 11 1.50 * * cmpxchgq %rcx, (%rbx)
+# CHECK-NEXT: 6 17 17.00 * * lock cmpxchgq %rcx, (%rbx)
 # CHECK-NEXT: 1 100 0.50 U cpuid
 # CHECK-NEXT: 1 1 0.50 decb %dil
 # CHECK-NEXT: 1 5 1.00 * * decb (%rax)
 # CHECK-NEXT: 1 1 0.50 decw %di
 # CHECK-NEXT: 1 5 1.00 * * decw (%rax)
 # CHECK-NEXT: 1 1 0.50 decl %edi
 # CHECK-NEXT: 1 5 1.00 * * decl (%rax)
 # CHECK-NEXT: 1 1 0.50 decq %rdi
[...]
 # CHECK-NEXT: [9] - JSAGU
 # CHECK-NEXT: [10] - JSTC
 # CHECK-NEXT: [11] - JVALU0
 # CHECK-NEXT: [12] - JVALU1
 # CHECK-NEXT: [13] - JVIMUL

 # CHECK: Resource pressure per iteration:
 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
-# CHECK-NEXT: 612.00 662.00 380.00 - - - - 334.00 64.00 235.00 - - - -
+# CHECK-NEXT: 624.00 674.00 380.00 - - - - 397.00 64.00 298.00 - - - -

 # CHECK: Resource pressure by instruction:
 # CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
 # CHECK-NEXT: 1.00 1.00 - - - - - - - - - - - - adcb $0, %al
 # CHECK-NEXT: 1.00 1.00 - - - - - - - - - - - - adcb $0, %dil
 # CHECK-NEXT: 1.00 1.00 - - - - - 1.00 - 1.00 - - - - adcb $0, (%rax)
 # CHECK-NEXT: 1.00 1.00 - - - - - - - - - - - - adcb $7, %al
 # CHECK-NEXT: 1.00 1.00 - - - - - - - - - - - - adcb $7, %dil
[...]
 # CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - - - - - - cmpq $7, (%rax)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpq %rsi, %rdi
 # CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - - - - - - cmpq %rsi, (%rax)
 # CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - - - - - - cmpq (%rax), %rdi
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpsb %es:(%rdi), (%rsi)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpsw %es:(%rdi), (%rsi)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpsl %es:(%rdi), (%rsi)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpsq %es:(%rdi), (%rsi)
-# CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpxchgb %cl, %bl
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - cmpxchgb %cl, (%rbx)
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - lock cmpxchgb %cl, (%rbx)
-# CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpxchgw %cx, %bx
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - cmpxchgw %cx, (%rbx)
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - lock cmpxchgw %cx, (%rbx)
-# CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpxchgl %ecx, %ebx
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - cmpxchgl %ecx, (%rbx)
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - lock cmpxchgl %ecx, (%rbx)
-# CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cmpxchgq %rcx, %rbx
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - cmpxchgq %rcx, (%rbx)
-# CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - lock cmpxchgq %rcx, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - - - - - - - - cmpxchgb %cl, %bl
+# CHECK-NEXT: 1.50 1.50 - - - - - 1.00 - 1.00 - - - - cmpxchgb %cl, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - 16.00 - 16.00 - - - - lock cmpxchgb %cl, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - - - - - - - - cmpxchgw %cx, %bx
+# CHECK-NEXT: 1.50 1.50 - - - - - 1.00 - 1.00 - - - - cmpxchgw %cx, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - 17.00 - 17.00 - - - - lock cmpxchgw %cx, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - - - - - - - - cmpxchgl %ecx, %ebx
+# CHECK-NEXT: 1.50 1.50 - - - - - 1.00 - 1.00 - - - - cmpxchgl %ecx, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - 17.00 - 17.00 - - - - lock cmpxchgl %ecx, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - - - - - - - - cmpxchgq %rcx, %rbx
+# CHECK-NEXT: 1.50 1.50 - - - - - 1.00 - 1.00 - - - - cmpxchgq %rcx, (%rbx)
+# CHECK-NEXT: 1.50 1.50 - - - - - 17.00 - 17.00 - - - - lock cmpxchgq %rcx, (%rbx)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - cpuid
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - decb %dil
 # CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - decb (%rax)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - decw %di
 # CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - decw (%rax)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - decl %edi
 # CHECK-NEXT: 0.50 0.50 - - - - - 1.00 - 1.00 - - - - decl (%rax)
 # CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - decq %rdi
[...]
