This is an archive of the discontinued LLVM Phabricator instance.

Differential D57056

[MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.
Closed · Public

Authored by andreadb on Jan 22 2019, 7:41 AM.

Details

Reviewers

RKSimon
spatel
craig.topper
courbet
ab
atrick
lebedev.ri
mattd

Commits

rGd768d355158d: [MC][X86] Correctly model additional operand latency caused by transfer delays…
rL351965: [MC][X86] Correctly model additional operand latency caused by transfer delays…

Summary

This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all x86 models except BtVer2. This means that the patch is a functional change for the Jaguar CPU model only.

Tablegen definitions for instructions (V)PINSR* have been updated to account for the new ReadInt2Fpu. That read is mapped to the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch, that extra delay was added to the opcode latency. In practice, the insert opcode only executes for 1cy; most of the latency is contributed by the so-called operand latency. According to the AMD SOG for family 16h, (V)PINSR* latency is defined by the expression f+1, where f is the forwarding delay from the integer unit to the fpu.
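
To make the operand-latency argument concrete, here is a minimal C++ sketch of how a ReadAdvance value changes the latency that a consumer observes on one operand. This is a simplified model, not LLVM code: the function name effectiveOperandLatency is hypothetical, and the clamp-at-zero rule for positive advances is an assumption about how a scheduler would typically apply the value.

```cpp
#include <cassert>

// Simplified model (not LLVM code): latency observed by a consumer on one
// operand, given the producer's write latency and the consumer's ReadAdvance.
// A positive advance hides producer latency (clamped at zero here, which is
// an assumption); a negative advance adds a stall on top of it.
unsigned effectiveOperandLatency(unsigned WriteLatency, int ReadAdvanceCycles) {
  if (ReadAdvanceCycles > 0 &&
      static_cast<unsigned>(ReadAdvanceCycles) > WriteLatency)
    return 0;
  return static_cast<unsigned>(static_cast<int>(WriteLatency) -
                               ReadAdvanceCycles);
}
```

Assuming the +6cy Jaguar delay is encoded as an advance of -6, a 1cy producer feeding the GPR operand is observed after 1 - (-6) = 7 cycles, while the default advance of 0 leaves the producer latency unchanged.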

When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC (only when flag -print-schedule is specified), we now need to account for any extra forwarding delays. We do this by checking whether scheduling classes declare any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td: "A negative advance effectively increases latency, which may be used for cross-domain stalls". When computing the instruction latency for the purpose of our scheduling tests, we now add any extra delay to the formula. This avoids regressing existing codegen and mca schedule tests, at the cost of an extra (but very simple) hook in MCSchedModel.
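
The reporting rule described above can be sketched in standalone C++ as follows. The struct ReadAdvanceEntry and the functions forwardingDelayCycles and reportedLatency are hypothetical stand-ins written for illustration; they mirror the shape of the hook added by this patch but are not the LLVM types themselves, and the operand indices in the usage note are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for one ReadAdvance table entry.
struct ReadAdvanceEntry {
  unsigned UseIdx;          // operand index the entry applies to
  unsigned WriteResourceID; // 0 means "any write" in this sketch
  int Cycles;               // negative values model forwarding delays
};

// Returns the largest forwarding delay (most negative advance) declared for
// the given write resource, as a non-negative cycle count.
unsigned forwardingDelayCycles(const std::vector<ReadAdvanceEntry> &Entries,
                               unsigned WriteResourceID = 0) {
  int Delay = 0;
  for (const ReadAdvanceEntry &E : Entries) {
    if (E.WriteResourceID != WriteResourceID)
      continue;
    Delay = std::min(Delay, E.Cycles);
  }
  return static_cast<unsigned>(std::abs(Delay));
}

// Latency as it would be printed: opcode latency plus any forwarding delay.
unsigned reportedLatency(unsigned OpcodeLatency,
                         const std::vector<ReadAdvanceEntry> &Entries) {
  return OpcodeLatency + forwardingDelayCycles(Entries);
}
```

With a 1cy insert opcode and one entry carrying -6 cycles, reportedLatency returns 7, which matches the f+1 expression from the summary with f = 6. Positive advances and entries for other write resources do not contribute.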

As a side note: this patch would have been a bit simpler if we didn't have to support flag -print-schedule. Now that we have llvm-mca, we can just move all our latency/throughput test coverage to llvm-mca. If we do that, then we can get rid of flag -print-schedule, and all the latency/throughput reporting framework. It would also help to solve a long-standing layering violation (originally reported as PR37160).

Let me know if okay to commit.

Thanks,
Andrea

Event Timeline

andreadb created this revision. · Jan 22 2019, 7:41 AM

Herald added a reviewer: lebedev.ri. · Jan 22 2019, 7:41 AM

Herald added subscribers: lebedev.ri, gbedwell.

andreadb added a reviewer: mattd. · Jan 22 2019, 9:51 AM

@craig.topper Should any Intel model account for this in a similar way? Agner is a little vague on this.

lib/Target/X86/X86Schedule.td
20	Add a description comment

LGTM. I'd wait a bit to see if others have anything else to add.

include/llvm/MC/MCSchedule.h
375	You can probably constify Entries.
lib/Target/X86/X86Schedule.td
19–25	I think it would be helpful to have a comment here, describing ReadInt2Fpu. Something similar to what you described above in the Details.

andreadb marked an inline comment as done. · Jan 22 2019, 10:54 AM

andreadb added inline comments.

lib/Target/X86/X86Schedule.td
20	I will add a comment to this line.

Thanks for the feedback.

Patch has been rebased. I have added a small description of ReadInt2Fpu in the code.

andreadb marked 2 inline comments as done. · Jan 23 2019, 4:52 AM

LGTM - cheers

This revision is now accepted and ready to land. · Jan 23 2019, 7:18 AM

Closed by commit rL351965: [MC][X86] Correctly model additional operand latency caused by transfer delays… (authored by adibiagio). · Jan 23 2019, 8:35 AM

This revision was automatically updated to reflect the committed changes.

In D57056#1366590, @RKSimon wrote:

@craig.topper Should any Intel model account for this in a similar way? Agner is a little vague on this.

I tried to find out more about this internally. There are still bypass delays between integer and fp in some cases, and they change in every generation. The definition of what counts as a float and what counts as an integer has also changed over time. For example, logic ops and shuffles aren't really considered int or fp in the most recent CPUs.

andreadb mentioned this in D57148: [X86][Btver2] Improved latency/throughput model for scalar int-to-float conversions. · Jan 24 2019, 5:07 AM

Revision Contents

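Path	Size
include/llvm/MC/MCSchedule.h	6 lines
include/llvm/MC/MCSubtargetInfo.h	10 lines
include/llvm/MCA/Instruction.h	4 lines
lib/CodeGen/TargetSubtargetInfo.cpp	18 lines
lib/MC/MCSchedule.cpp	16 lines
lib/MCA/InstrBuilder.cpp	1 line
lib/Target/X86/X86InstrMMX.td	2 lines
lib/Target/X86/X86InstrSSE.td	8 lines
lib/Target/X86/X86SchedBroadwell.td	2 lines
lib/Target/X86/X86SchedHaswell.td	2 lines
lib/Target/X86/X86SchedSandyBridge.td	2 lines
lib/Target/X86/X86SchedSkylakeClient.td	2 lines
lib/Target/X86/X86SchedSkylakeServer.td	2 lines
lib/Target/X86/X86Schedule.td	6 lines
lib/Target/X86/X86ScheduleAtom.td	2 lines
lib/Target/X86/X86ScheduleBdVer2.td	2 lines
lib/Target/X86/X86ScheduleBtVer2.td	7 lines
lib/Target/X86/X86ScheduleSLM.td	2 lines
lib/Target/X86/X86ScheduleZnver1.td	2 lines
test/CodeGen/X86/mmx-schedule.ll	2 lines
test/CodeGen/X86/sse41-schedule.ll	4 lines
test/tools/llvm-mca/X86/BtVer2/int-to-fpu-forwarding-1.s	24 lines
test/tools/llvm-mca/X86/BtVer2/int-to-fpu-forwarding-3.s	34 lines
tools/llvm-mca/Views/InstructionInfoView.cpp	3 lines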

Diff 183088

include/llvm/MC/MCSchedule.h

  //===-- llvm/MC/MCSchedule.h - Scheduling -----------------------*- C++ -*-===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//
  //
  // This file defines the classes used to describe a subtarget's machine model
  // for scheduling and other instruction cost heuristics.
  //
  //===----------------------------------------------------------------------===//

  #ifndef LLVM_MC_MCSCHEDULE_H
  #define LLVM_MC_MCSCHEDULE_H

+ #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/Optional.h"
  #include "llvm/Config/llvm-config.h"
  #include "llvm/Support/DataTypes.h"
  #include <cassert>

  namespace llvm {

  struct InstrItinerary;
[... 339 lines not shown ...] struct MCSchedModel {

    static double
    getReciprocalThroughput(unsigned SchedClass, const InstrItineraryData &IID);

    double
    getReciprocalThroughput(const MCSubtargetInfo &STI, const MCInstrInfo &MCII,
                            const MCInst &Inst) const;

+   /// Returns the maximum forwarding delay for register reads dependent on
+   /// writes of scheduling class WriteResourceIdx.
+   static unsigned getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
+                                            unsigned WriteResourceIdx = 0);
    [inline comment, mattd: You can probably constify Entries.]

    /// Returns the default initialized model.
    static const MCSchedModel &GetDefaultSchedModel() { return Default; }
    static const MCSchedModel Default;
  };

  } // namespace llvm

  #endif

include/llvm/MC/MCSubtargetInfo.h

[... 146 lines not shown ...] for (const MCReadAdvanceEntry *I = &ReadAdvanceTable[SC->ReadAdvanceIdx],
        // Find the first WriteResIdx match, which has the highest cycle count.
        if (!I->WriteResourceID || I->WriteResourceID == WriteResID) {
          return I->Cycles;
        }
      }
      return 0;
    }

+   /// Return the set of ReadAdvance entries declared by the scheduling class
+   /// descriptor in input.
+   ArrayRef<MCReadAdvanceEntry>
+   getReadAdvanceEntries(const MCSchedClassDesc &SC) const {
+     if (!SC.NumReadAdvanceEntries)
+       return ArrayRef<MCReadAdvanceEntry>();
+     return ArrayRef<MCReadAdvanceEntry>(&ReadAdvanceTable[SC.ReadAdvanceIdx],
+                                         SC.NumReadAdvanceEntries);
+   }

    /// Get scheduling itinerary of a CPU.
    InstrItineraryData getInstrItineraryForCPU(StringRef CPU) const;

    /// Initialize an InstrItineraryData instance.
    void initInstrItins(InstrItineraryData &InstrItins) const;

    /// Resolve a variant scheduling class for the given MCInst and CPU.
    virtual unsigned
[... 20 lines not shown ...]
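
The accessor above hands out a non-owning view into a shared table, located by a start index and an entry count stored in the scheduling class descriptor. A minimal standalone sketch of that pattern follows. All names here (ReadAdvanceEntry, SchedClassDesc, EntryView, readAdvanceEntries) are hypothetical stand-ins, and EntryView is only a rough analogue of ArrayRef written so the example has no LLVM dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one ReadAdvance table entry.
struct ReadAdvanceEntry {
  unsigned UseIdx;
  unsigned WriteResourceID;
  int Cycles;
};

// Hypothetical descriptor: where a class's entries start, and how many.
struct SchedClassDesc {
  unsigned ReadAdvanceIdx;
  unsigned NumReadAdvanceEntries;
};

// Non-owning view over a contiguous run of entries.
struct EntryView {
  const ReadAdvanceEntry *Data;
  std::size_t Size;
  const ReadAdvanceEntry *begin() const { return Data; }
  const ReadAdvanceEntry *end() const { return Data + Size; }
  bool empty() const { return Size == 0; }
};

// Returns the slice of Table that belongs to SC, or an empty view if the
// class declares no ReadAdvance entries.
EntryView readAdvanceEntries(const std::vector<ReadAdvanceEntry> &Table,
                             const SchedClassDesc &SC) {
  if (SC.NumReadAdvanceEntries == 0)
    return EntryView{nullptr, 0};
  assert(SC.ReadAdvanceIdx + SC.NumReadAdvanceEntries <= Table.size());
  return EntryView{&Table[SC.ReadAdvanceIdx], SC.NumReadAdvanceEntries};
}
```

Returning a view rather than a copy keeps the query cheap, which matters when it runs once per printed instruction. The early return for an empty class avoids indexing the table with a start index that may be meaningless.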

include/llvm/MCA/Instruction.h

[... 326 lines not shown ...] struct InstrDesc {
    SmallVector<std::pair<uint64_t, ResourceUsage>, 4> Resources;

    // A list of buffered resources consumed by this instruction.
    SmallVector<uint64_t, 4> Buffers;

    unsigned MaxLatency;
    // Number of MicroOps for this instruction.
    unsigned NumMicroOps;
+   // SchedClassID used to construct this InstrDesc.
+   // This information is currently used by views to do fast queries on the
+   // subtarget when computing the reciprocal throughput.
+   unsigned SchedClassID;

    bool MayLoad;
    bool MayStore;
    bool HasSideEffects;
    bool BeginGroup;
    bool EndGroup;

    // True if all buffered resources are in-order, and there is at least one
[... 208 lines not shown ...]

lib/CodeGen/TargetSubtargetInfo.cpp

[... 82 lines not shown ...]
  std::string TargetSubtargetInfo::getSchedInfoStr(const MachineInstr &MI) const {
    if (MI.isPseudo() || MI.isTerminator())
      return std::string();
    // We don't cache TSchedModel because it depends on TargetInstrInfo
    // that could be changed during the compilation
    TargetSchedModel TSchedModel;
    TSchedModel.init(this);
    unsigned Latency = TSchedModel.computeInstrLatency(&MI);

+   // Add extra latency due to forwarding delays.
+   const MCSchedClassDesc &SCDesc = *TSchedModel.resolveSchedClass(&MI);
+   Latency +=
+       MCSchedModel::getForwardingDelayCycles(getReadAdvanceEntries(SCDesc));

    double RThroughput = TSchedModel.computeReciprocalThroughput(&MI);
    return createSchedInfoStr(Latency, RThroughput);
  }

  /// Returns string representation of scheduler comment
  std::string TargetSubtargetInfo::getSchedInfoStr(MCInst const &MCI) const {
    // We don't cache TSchedModel because it depends on TargetInstrInfo
    // that could be changed during the compilation
    TargetSchedModel TSchedModel;
    TSchedModel.init(this);
    unsigned Latency;
-   if (TSchedModel.hasInstrSchedModel())
+   if (TSchedModel.hasInstrSchedModel()) {
      Latency = TSchedModel.computeInstrLatency(MCI);
-   else if (TSchedModel.hasInstrItineraries()) {
+     // Add extra latency due to forwarding delays.
+     const MCSchedModel &SM = *TSchedModel.getMCSchedModel();
+     unsigned SClassID = getInstrInfo()->get(MCI.getOpcode()).getSchedClass();
+     while (SM.getSchedClassDesc(SClassID)->isVariant())
+       SClassID = resolveVariantSchedClass(SClassID, &MCI, SM.ProcID);
+     const MCSchedClassDesc &SCDesc = *SM.getSchedClassDesc(SClassID);
+     Latency +=
+         MCSchedModel::getForwardingDelayCycles(getReadAdvanceEntries(SCDesc));
+   } else if (TSchedModel.hasInstrItineraries()) {
      auto *ItinData = TSchedModel.getInstrItineraries();
      Latency = ItinData->getStageLatency(
          getInstrInfo()->get(MCI.getOpcode()).getSchedClass());
    } else
      return std::string();
    double RThroughput = TSchedModel.computeReciprocalThroughput(MCI);
    return createSchedInfoStr(Latency, RThroughput);
  }

  void TargetSubtargetInfo::mirFileLoaded(MachineFunction &MF) const {
  }
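
The MCInst path above first resolves variant scheduling classes in a loop, because a variant class may resolve to another variant. The following is a minimal sketch of that loop under a strong simplification: each hypothetical variant class resolves to one fixed target, whereas real resolution depends on the instruction and the processor. The names SchedClass and resolveSchedClass are illustrative only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical descriptor: a variant class names the class it resolves to.
// Real resolution depends on the instruction and the processor; a fixed
// target per class is an assumption made to keep the sketch short.
struct SchedClass {
  bool IsVariant;
  unsigned ResolvedID; // only meaningful when IsVariant is true
};

// Follows variant classes until a non-variant class is reached.
// Assumes the chain of variants is acyclic.
unsigned resolveSchedClass(const std::vector<SchedClass> &Classes,
                           unsigned ID) {
  while (Classes[ID].IsVariant)
    ID = Classes[ID].ResolvedID;
  return ID;
}
```

Only after the loop ends is it meaningful to read the ReadAdvance entries, since a variant class is a placeholder rather than a concrete description.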

lib/MC/MCSchedule.cpp

[... 143 lines not shown ...] MCSchedModel::getReciprocalThroughput(unsigned SchedClass,
    }
    if (Throughput.hasValue())
      return 1.0 / Throughput.getValue();

    // If there are no execution resources specified for this class, then assume
    // that it can execute at the maximum default issue width.
    return 1.0 / DefaultIssueWidth;
  }

+ unsigned
+ MCSchedModel::getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
+                                        unsigned WriteResourceID) {
+   if (Entries.empty())
+     return 0;
+
+   int DelayCycles = 0;
+   for (const MCReadAdvanceEntry &E : Entries) {
+     if (E.WriteResourceID != WriteResourceID)
+       continue;
+     DelayCycles = std::min(DelayCycles, E.Cycles);
+   }
+
+   return std::abs(DelayCycles);
+ }

lib/MCA/InstrBuilder.cpp

[... 526 lines not shown ...] InstrBuilder::createInstrDescImpl(const MCInst &MCI) {
    }

    LLVM_DEBUG(dbgs() << "\n\t\tOpcode Name= " << MCII.getName(Opcode) << '\n');
    LLVM_DEBUG(dbgs() << "\t\tSchedClassID=" << SchedClassID << '\n');

    // Create a new empty descriptor.
    std::unique_ptr<InstrDesc> ID = llvm::make_unique<InstrDesc>();
    ID->NumMicroOps = SCDesc.NumMicroOps;
+   ID->SchedClassID = SchedClassID;

    if (MCDesc.isCall() && FirstCallInst) {
      // We don't correctly model calls.
      WithColor::warning() << "found a call in the input assembly sequence.\n";
      WithColor::note() << "call instructions are not correctly modeled. "
                        << "Assume a latency of 100cy.\n";
      FirstCallInst = false;
    }
[... 155 lines not shown ...]

lib/Target/X86/X86InstrMMX.td

[... 537 lines not shown ...]
  let Constraints = "$src1 = $dst" in {
  let Predicates = [HasMMX, HasSSE1] in {
    def MMX_PINSRWrr : MMXIi8<0xC4, MRMSrcReg,
                      (outs VR64:$dst),
                      (ins VR64:$src1, GR32orGR64:$src2, i32u8imm:$src3),
                      "pinsrw\t{$src3, $src2, $dst|$dst, $src2, $src3}",
                      [(set VR64:$dst, (int_x86_mmx_pinsr_w VR64:$src1,
                                        GR32orGR64:$src2, imm:$src3))]>,
-                     Sched<[WriteVecInsert]>;
+                     Sched<[WriteVecInsert, ReadDefault, ReadInt2Fpu]>;

    def MMX_PINSRWrm : MMXIi8<0xC4, MRMSrcMem,
                     (outs VR64:$dst),
                     (ins VR64:$src1, i16mem:$src2, i32u8imm:$src3),
                     "pinsrw\t{$src3, $src2, $dst|$dst, $src2, $src3}",
                     [(set VR64:$dst, (int_x86_mmx_pinsr_w VR64:$src1,
                                       (i32 (anyext (loadi16 addr:$src2))),
                                       imm:$src3))]>,
[... 57 lines not shown ...]

lib/Target/X86/X86InstrSSE.td

[... 4,116 lines not shown ...] multiclass sse2_pinsrw<bit Is2Addr = 1> {
    def rr : Ii8<0xC4, MRMSrcReg,
         (outs VR128:$dst), (ins VR128:$src1,
          GR32orGR64:$src2, u8imm:$src3),
         !if(Is2Addr,
             "pinsrw\t{$src3, $src2, $dst|$dst, $src2, $src3}",
             "vpinsrw\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
         [(set VR128:$dst,
           (X86pinsrw VR128:$src1, GR32orGR64:$src2, imm:$src3))]>,
-        Sched<[WriteVecInsert]>;
+        Sched<[WriteVecInsert, ReadDefault, ReadInt2Fpu]>;
    def rm : Ii8<0xC4, MRMSrcMem,
         (outs VR128:$dst), (ins VR128:$src1,
          i16mem:$src2, u8imm:$src3),
         !if(Is2Addr,
             "pinsrw\t{$src3, $src2, $dst|$dst, $src2, $src3}",
             "vpinsrw\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
         [(set VR128:$dst,
           (X86pinsrw VR128:$src1, (extloadi16 addr:$src2),
[... 1,438 lines not shown ...] multiclass SS41I_insert8<bits<8> opc, string asm, bit Is2Addr = 1> {
    def rr : SS4AIi8<opc, MRMSrcReg, (outs VR128:$dst),
        (ins VR128:$src1, GR32orGR64:$src2, u8imm:$src3),
        !if(Is2Addr,
          !strconcat(asm, "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
          !strconcat(asm,
                     "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
        [(set VR128:$dst,
          (X86pinsrb VR128:$src1, GR32orGR64:$src2, imm:$src3))]>,
-       Sched<[WriteVecInsert]>;
+       Sched<[WriteVecInsert, ReadDefault, ReadInt2Fpu]>;
    def rm : SS4AIi8<opc, MRMSrcMem, (outs VR128:$dst),
        (ins VR128:$src1, i8mem:$src2, u8imm:$src3),
        !if(Is2Addr,
          !strconcat(asm, "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
          !strconcat(asm,
                     "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
        [(set VR128:$dst,
          (X86pinsrb VR128:$src1, (extloadi8 addr:$src2), imm:$src3))]>,
[... 9 lines not shown ...] multiclass SS41I_insert32<bits<8> opc, string asm, bit Is2Addr = 1> {
    def rr : SS4AIi8<opc, MRMSrcReg, (outs VR128:$dst),
        (ins VR128:$src1, GR32:$src2, u8imm:$src3),
        !if(Is2Addr,
          !strconcat(asm, "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
          !strconcat(asm,
                     "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
        [(set VR128:$dst,
          (v4i32 (insertelt VR128:$src1, GR32:$src2, imm:$src3)))]>,
-       Sched<[WriteVecInsert]>;
+       Sched<[WriteVecInsert, ReadDefault, ReadInt2Fpu]>;
    def rm : SS4AIi8<opc, MRMSrcMem, (outs VR128:$dst),
        (ins VR128:$src1, i32mem:$src2, u8imm:$src3),
        !if(Is2Addr,
          !strconcat(asm, "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
          !strconcat(asm,
                     "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
        [(set VR128:$dst,
          (v4i32 (insertelt VR128:$src1, (loadi32 addr:$src2), imm:$src3)))]>,
[... 9 lines not shown ...] multiclass SS41I_insert64<bits<8> opc, string asm, bit Is2Addr = 1> {
    def rr : SS4AIi8<opc, MRMSrcReg, (outs VR128:$dst),
        (ins VR128:$src1, GR64:$src2, u8imm:$src3),
        !if(Is2Addr,
          !strconcat(asm, "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
          !strconcat(asm,
                     "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
        [(set VR128:$dst,
          (v2i64 (insertelt VR128:$src1, GR64:$src2, imm:$src3)))]>,
-       Sched<[WriteVecInsert]>;
+       Sched<[WriteVecInsert, ReadDefault, ReadInt2Fpu]>;
    def rm : SS4AIi8<opc, MRMSrcMem, (outs VR128:$dst),
        (ins VR128:$src1, i64mem:$src2, u8imm:$src3),
        !if(Is2Addr,
          !strconcat(asm, "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
          !strconcat(asm,
                     "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
        [(set VR128:$dst,
          (v2i64 (insertelt VR128:$src1, (loadi64 addr:$src2), imm:$src3)))]>,
[... 2,837 lines not shown ...]

lib/Target/X86/X86SchedBroadwell.td

[... 75 lines not shown ...]
  def : ReadAdvance<ReadAfterLd, 5>;

  // Vector loads are 5/5/6 cycles, so ReadAfterVec*Ld registers needn't be available
  // until 5/5/6 cycles after the memory operand.
  def : ReadAdvance<ReadAfterVecLd, 5>;
  def : ReadAdvance<ReadAfterVecXLd, 5>;
  def : ReadAdvance<ReadAfterVecYLd, 6>;

+ def : ReadAdvance<ReadInt2Fpu, 0>;

  // Many SchedWrites are defined in pairs with and without a folded load.
  // Instructions with folded loads are usually micro-fused, so they only appear
  // as two micro-ops when queued in the reservation station.
  // This multiclass defines the resource usage for variants with and without
  // folded loads.
  multiclass BWWriteResPair<X86FoldableSchedWrite SchedRW,
                            list<ProcResourceKind> ExePorts,
                            int Lat, list<int> Res = [1], int UOps = 1,
[... 1,503 lines not shown ...]

lib/Target/X86/X86SchedHaswell.td

[... 80 lines not shown ...]
  def : ReadAdvance<ReadAfterLd, 5>;

  // Vector loads are 5/6/7 cycles, so ReadAfterVec*Ld registers needn't be available
  // until 5/6/7 cycles after the memory operand.
  def : ReadAdvance<ReadAfterVecLd, 5>;
  def : ReadAdvance<ReadAfterVecXLd, 6>;
  def : ReadAdvance<ReadAfterVecYLd, 7>;

+ def : ReadAdvance<ReadInt2Fpu, 0>;

  // Many SchedWrites are defined in pairs with and without a folded load.
  // Instructions with folded loads are usually micro-fused, so they only appear
  // as two micro-ops when queued in the reservation station.
  // This multiclass defines the resource usage for variants with and without
  // folded loads.
  multiclass HWWriteResPair<X86FoldableSchedWrite SchedRW,
                            list<ProcResourceKind> ExePorts,
                            int Lat, list<int> Res = [1], int UOps = 1,
[... 1,753 lines not shown ...]

lib/Target/X86/X86SchedSandyBridge.td

[... 70 lines not shown ...]
  def : ReadAdvance<ReadAfterLd, 5>;

  // Vector loads are 5/6/7 cycles, so ReadAfterVec*Ld registers needn't be available
  // until 5/6/7 cycles after the memory operand.
  def : ReadAdvance<ReadAfterVecLd, 5>;
  def : ReadAdvance<ReadAfterVecXLd, 6>;
  def : ReadAdvance<ReadAfterVecYLd, 7>;

+ def : ReadAdvance<ReadInt2Fpu, 0>;

  // Many SchedWrites are defined in pairs with and without a folded load.
  // Instructions with folded loads are usually micro-fused, so they only appear
  // as two micro-ops when queued in the reservation station.
  // This multiclass defines the resource usage for variants with and without
  // folded loads.
  multiclass SBWriteResPair<X86FoldableSchedWrite SchedRW,
                            list<ProcResourceKind> ExePorts,
                            int Lat, list<int> Res = [1], int UOps = 1,
[... 1,088 lines not shown ...]

lib/Target/X86/X86SchedSkylakeClient.td

[... 74 lines not shown ...]
  def : ReadAdvance<ReadAfterLd, 5>;

  // Vector loads are 5/6/7 cycles, so ReadAfterVec*Ld registers needn't be available
  // until 5/6/7 cycles after the memory operand.
  def : ReadAdvance<ReadAfterVecLd, 5>;
  def : ReadAdvance<ReadAfterVecXLd, 6>;
  def : ReadAdvance<ReadAfterVecYLd, 7>;

+ def : ReadAdvance<ReadInt2Fpu, 0>;

  // Many SchedWrites are defined in pairs with and without a folded load.
  // Instructions with folded loads are usually micro-fused, so they only appear
  // as two micro-ops when queued in the reservation station.
  // This multiclass defines the resource usage for variants with and without
  // folded loads.
  multiclass SKLWriteResPair<X86FoldableSchedWrite SchedRW,
                             list<ProcResourceKind> ExePorts,
                             int Lat, list<int> Res = [1], int UOps = 1,
[... 1,659 lines not shown ...]

lib/Target/X86/X86SchedSkylakeServer.td

	Show First 20 Lines • Show All 74 Lines • ▼ Show 20 Lines
	def : ReadAdvance<ReadAfterLd, 5>;			def : ReadAdvance<ReadAfterLd, 5>;

	// Vector loads are 5/6/7 cycles, so ReadAfterVec*Ld registers needn't be available			// Vector loads are 5/6/7 cycles, so ReadAfterVec*Ld registers needn't be available
	// until 5/6/7 cycles after the memory operand.			// until 5/6/7 cycles after the memory operand.
	def : ReadAdvance<ReadAfterVecLd, 5>;			def : ReadAdvance<ReadAfterVecLd, 5>;
	def : ReadAdvance<ReadAfterVecXLd, 6>;			def : ReadAdvance<ReadAfterVecXLd, 6>;
	def : ReadAdvance<ReadAfterVecYLd, 7>;			def : ReadAdvance<ReadAfterVecYLd, 7>;

				def : ReadAdvance<ReadInt2Fpu, 0>;

	// Many SchedWrites are defined in pairs with and without a folded load.			// Many SchedWrites are defined in pairs with and without a folded load.
	// Instructions with folded loads are usually micro-fused, so they only appear			// Instructions with folded loads are usually micro-fused, so they only appear
	// as two micro-ops when queued in the reservation station.			// as two micro-ops when queued in the reservation station.
	// This multiclass defines the resource usage for variants with and without			// This multiclass defines the resource usage for variants with and without
	// folded loads.			// folded loads.
	multiclass SKXWriteResPair<X86FoldableSchedWrite SchedRW,			multiclass SKXWriteResPair<X86FoldableSchedWrite SchedRW,
	list<ProcResourceKind> ExePorts,			list<ProcResourceKind> ExePorts,
	int Lat, list<int> Res = [1], int UOps = 1,			int Lat, list<int> Res = [1], int UOps = 1,
	▲ Show 20 Lines • Show All 2,375 Lines • Show Last 20 Lines

lib/Target/X86/X86Schedule.td

	Show All 10 Lines

	// Instructions with folded loads need to read the memory operand immediately,			// Instructions with folded loads need to read the memory operand immediately,
	// but other register operands don't have to be read until the load is ready.			// but other register operands don't have to be read until the load is ready.
	// These operands are marked with ReadAfterLd.			// These operands are marked with ReadAfterLd.
	def ReadAfterLd : SchedRead;			def ReadAfterLd : SchedRead;
	def ReadAfterVecLd : SchedRead;			def ReadAfterVecLd : SchedRead;
	def ReadAfterVecXLd : SchedRead;			def ReadAfterVecXLd : SchedRead;
	def ReadAfterVecYLd : SchedRead;			def ReadAfterVecYLd : SchedRead;

				// Instructions that move data between general purpose registers and vector
				// registers may be subject to extra latency due to data bypass delays.
				// This SchedRead describes a bypass delay caused by data being moved from the
				// integer unit to the floating point unit.
				def ReadInt2Fpu : SchedRead;

				Inline comments (all marked Done):
				RKSimon: Add a description comment.
				andreadb (author): I will add a comment to this line.
				mattd: I think it would be helpful to have a comment here, describing ReadInt2Fpu. Something similar to what you described above in the Details.
	// Instructions with both a load and a store folded are modeled as a folded			// Instructions with both a load and a store folded are modeled as a folded
	// load + WriteRMW.			// load + WriteRMW.
	def WriteRMW : SchedWrite;			def WriteRMW : SchedWrite;

	// Helper to set SchedWrite ExePorts/Latency/ResourceCycles/NumMicroOps.			// Helper to set SchedWrite ExePorts/Latency/ResourceCycles/NumMicroOps.
	multiclass X86WriteRes<SchedWrite SchedRW,			multiclass X86WriteRes<SchedWrite SchedRW,
	list<ProcResourceKind> ExePorts,			list<ProcResourceKind> ExePorts,
	int Lat, list<int> Res, int UOps> {			int Lat, list<int> Res, int UOps> {
	▲ Show 20 Lines • Show All 668 Lines • Show Last 20 Lines

lib/Target/X86/X86ScheduleAtom.td

	Show All 40 Lines

	// Loads are 3 cycles, so ReadAfterLd registers needn't be available until 3			// Loads are 3 cycles, so ReadAfterLd registers needn't be available until 3
	// cycles after the memory operand.			// cycles after the memory operand.
	def : ReadAdvance<ReadAfterLd, 3>;			def : ReadAdvance<ReadAfterLd, 3>;
	def : ReadAdvance<ReadAfterVecLd, 3>;			def : ReadAdvance<ReadAfterVecLd, 3>;
	def : ReadAdvance<ReadAfterVecXLd, 3>;			def : ReadAdvance<ReadAfterVecXLd, 3>;
	def : ReadAdvance<ReadAfterVecYLd, 3>;			def : ReadAdvance<ReadAfterVecYLd, 3>;

				def : ReadAdvance<ReadInt2Fpu, 0>;

	// Many SchedWrites are defined in pairs with and without a folded load.			// Many SchedWrites are defined in pairs with and without a folded load.
	// Instructions with folded loads are usually micro-fused, so they only appear			// Instructions with folded loads are usually micro-fused, so they only appear
	// as two micro-ops when dispatched by the schedulers.			// as two micro-ops when dispatched by the schedulers.
	// This multiclass defines the resource usage for variants with and without			// This multiclass defines the resource usage for variants with and without
	// folded loads.			// folded loads.
	multiclass AtomWriteResPair<X86FoldableSchedWrite SchedRW,			multiclass AtomWriteResPair<X86FoldableSchedWrite SchedRW,
	list<ProcResourceKind> RRPorts,			list<ProcResourceKind> RRPorts,
	list<ProcResourceKind> RMPorts,			list<ProcResourceKind> RMPorts,
	▲ Show 20 Lines • Show All 847 Lines • Show Last 20 Lines

lib/Target/X86/X86ScheduleBdVer2.td

	Show First 20 Lines • Show All 244 Lines • ▼ Show 20 Lines
	def : ReadAdvance<ReadAfterLd, 4>;			def : ReadAdvance<ReadAfterLd, 4>;

	// Vector loads are 5 cycles, so ReadAfterVec*Ld registers needn't be available			// Vector loads are 5 cycles, so ReadAfterVec*Ld registers needn't be available
	// until 5 cycles after the memory operand.			// until 5 cycles after the memory operand.
	def : ReadAdvance<ReadAfterVecLd, 5>;			def : ReadAdvance<ReadAfterVecLd, 5>;
	def : ReadAdvance<ReadAfterVecXLd, 5>;			def : ReadAdvance<ReadAfterVecXLd, 5>;
	def : ReadAdvance<ReadAfterVecYLd, 5>;			def : ReadAdvance<ReadAfterVecYLd, 5>;

				def : ReadAdvance<ReadInt2Fpu, 0>;

	// A folded store needs a cycle on the PdStore for the store data.			// A folded store needs a cycle on the PdStore for the store data.
	def : WriteRes<WriteRMW, [PdStore]>;			def : WriteRes<WriteRMW, [PdStore]>;

	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// Loads, stores, and moves, not folded with other operations.			// Loads, stores, and moves, not folded with other operations.
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////

	def : WriteRes<WriteLoad, [PdLoad]> { let Latency = 5; }			def : WriteRes<WriteLoad, [PdLoad]> { let Latency = 5; }
	▲ Show 20 Lines • Show All 1,021 Lines • Show Last 20 Lines

lib/Target/X86/X86ScheduleBtVer2.td

	Show First 20 Lines • Show All 102 Lines • ▼ Show 20 Lines
	def : ReadAdvance<ReadAfterLd, 3>;			def : ReadAdvance<ReadAfterLd, 3>;

	// Vector loads are 5 cycles, so ReadAfterVec*Ld registers needn't be available until 5			// Vector loads are 5 cycles, so ReadAfterVec*Ld registers needn't be available until 5
	// cycles after the memory operand.			// cycles after the memory operand.
	def : ReadAdvance<ReadAfterVecLd, 5>;			def : ReadAdvance<ReadAfterVecLd, 5>;
	def : ReadAdvance<ReadAfterVecXLd, 5>;			def : ReadAdvance<ReadAfterVecXLd, 5>;
	def : ReadAdvance<ReadAfterVecYLd, 5>;			def : ReadAdvance<ReadAfterVecYLd, 5>;

				/// "Additional 6 cycle transfer operation which moves a floating point
				/// operation input value from the integer unit to the floating point unit."
				/// Reference: AMDfam16h SOG (Appendix A "Instruction Latencies", Section A.2).
				def : ReadAdvance<ReadInt2Fpu, -6>;

	// Many SchedWrites are defined in pairs with and without a folded load.			// Many SchedWrites are defined in pairs with and without a folded load.
	// Instructions with folded loads are usually micro-fused, so they only appear			// Instructions with folded loads are usually micro-fused, so they only appear
	// as two micro-ops when dispatched by the schedulers.			// as two micro-ops when dispatched by the schedulers.
	// This multiclass defines the resource usage for variants with and without			// This multiclass defines the resource usage for variants with and without
	// folded loads.			// folded loads.
	multiclass JWriteResIntPair<X86FoldableSchedWrite SchedRW,			multiclass JWriteResIntPair<X86FoldableSchedWrite SchedRW,
	list<ProcResourceKind> ExePorts,			list<ProcResourceKind> ExePorts,
	int Lat, list<int> Res = [], int UOps = 1,			int Lat, list<int> Res = [], int UOps = 1,
	▲ Show 20 Lines • Show All 416 Lines • ▼ Show 20 Lines
	defm : X86WriteResPairUnsupported<WriteVecTestZ>;			defm : X86WriteResPairUnsupported<WriteVecTestZ>;
	defm : X86WriteResPairUnsupported<WriteShuffle256>;			defm : X86WriteResPairUnsupported<WriteShuffle256>;
	defm : X86WriteResPairUnsupported<WriteVarShuffle256>;			defm : X86WriteResPairUnsupported<WriteVarShuffle256>;

	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// Vector insert/extract operations.			// Vector insert/extract operations.
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////

	defm : X86WriteRes<WriteVecInsert, [JFPU01, JVALU], 7, [1,1], 2>;			defm : X86WriteRes<WriteVecInsert, [JFPU01, JVALU], 1, [1,1], 2>;
	defm : X86WriteRes<WriteVecInsertLd, [JFPU01, JVALU, JLAGU], 4, [1,1,1], 1>;			defm : X86WriteRes<WriteVecInsertLd, [JFPU01, JVALU, JLAGU], 4, [1,1,1], 1>;
	defm : X86WriteRes<WriteVecExtract, [JFPU0, JFPA, JALU0], 3, [1,1,1], 1>;			defm : X86WriteRes<WriteVecExtract, [JFPU0, JFPA, JALU0], 3, [1,1,1], 1>;
	defm : X86WriteRes<WriteVecExtractSt, [JFPU1, JSTC, JSAGU], 3, [1,1,1], 1>;			defm : X86WriteRes<WriteVecExtractSt, [JFPU1, JSTC, JSAGU], 3, [1,1,1], 1>;

	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// SSE42 String instructions.			// SSE42 String instructions.
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////

	▲ Show 20 Lines • Show All 287 Lines • Show Last 20 Lines
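
The BtVer2 change above splits the old 7-cycle WriteVecInsert latency into a 1-cycle write latency plus a ReadAdvance of -6 on the GPR operand. As a rough sketch of how a negative ReadAdvance lengthens operand latency (the helper name `effectiveOperandLatency` is hypothetical, and the clamping rule is an assumption about how a scheduler would consume these values, not code taken from this patch):

```cpp
#include <cassert>

// Hypothetical helper: cycles a consumer must wait, counted from the cycle the
// producer issues, before it can read the operand.
//   writeLatency: latency of the producing write (must be >= 0).
//   readAdvance:  ReadAdvance cycles of the consuming read. A positive value
//                 means the operand is not needed until that many cycles after
//                 issue (shorter wait); a negative value models a bypass delay
//                 (longer wait).
int effectiveOperandLatency(int writeLatency, int readAdvance) {
  assert(writeLatency >= 0 && "write latency cannot be negative");
  int latency = writeLatency - readAdvance;
  return latency < 0 ? 0 : latency; // Assumed: a wait is never negative.
}
```

With a 1-cycle integer producer and ReadInt2Fpu set to -6, this gives 1 - (-6) = 7 cycles between the producer and the insert. With the default of 0 used by the other models it gives 1.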

lib/Target/X86/X86ScheduleSLM.td

	Show First 20 Lines • Show All 46 Lines • ▼ Show 20 Lines

	// Loads are 3 cycles, so ReadAfterLd registers needn't be available until 3			// Loads are 3 cycles, so ReadAfterLd registers needn't be available until 3
	// cycles after the memory operand.			// cycles after the memory operand.
	def : ReadAdvance<ReadAfterLd, 3>;			def : ReadAdvance<ReadAfterLd, 3>;
	def : ReadAdvance<ReadAfterVecLd, 3>;			def : ReadAdvance<ReadAfterVecLd, 3>;
	def : ReadAdvance<ReadAfterVecXLd, 3>;			def : ReadAdvance<ReadAfterVecXLd, 3>;
	def : ReadAdvance<ReadAfterVecYLd, 3>;			def : ReadAdvance<ReadAfterVecYLd, 3>;

				def : ReadAdvance<ReadInt2Fpu, 0>;

	// Many SchedWrites are defined in pairs with and without a folded load.			// Many SchedWrites are defined in pairs with and without a folded load.
	// Instructions with folded loads are usually micro-fused, so they only appear			// Instructions with folded loads are usually micro-fused, so they only appear
	// as two micro-ops when queued in the reservation station.			// as two micro-ops when queued in the reservation station.
	// This multiclass defines the resource usage for variants with and without			// This multiclass defines the resource usage for variants with and without
	// folded loads.			// folded loads.
	multiclass SLMWriteResPair<X86FoldableSchedWrite SchedRW,			multiclass SLMWriteResPair<X86FoldableSchedWrite SchedRW,
	list<ProcResourceKind> ExePorts,			list<ProcResourceKind> ExePorts,
	int Lat, list<int> Res = [1], int UOps = 1,			int Lat, list<int> Res = [1], int UOps = 1,
	▲ Show 20 Lines • Show All 447 Lines • Show Last 20 Lines

lib/Target/X86/X86ScheduleZnver1.td

	Show First 20 Lines • Show All 88 Lines • ▼ Show 20 Lines
	// 4 Cycles integer load-to use Latency is captured			// 4 Cycles integer load-to use Latency is captured
	def : ReadAdvance<ReadAfterLd, 4>;			def : ReadAdvance<ReadAfterLd, 4>;

	// 8 Cycles vector load-to use Latency is captured			// 8 Cycles vector load-to use Latency is captured
	def : ReadAdvance<ReadAfterVecLd, 8>;			def : ReadAdvance<ReadAfterVecLd, 8>;
	def : ReadAdvance<ReadAfterVecXLd, 8>;			def : ReadAdvance<ReadAfterVecXLd, 8>;
	def : ReadAdvance<ReadAfterVecYLd, 8>;			def : ReadAdvance<ReadAfterVecYLd, 8>;

				def : ReadAdvance<ReadInt2Fpu, 0>;

	// The Integer PRF for Zen is 168 entries, and it holds the architectural and			// The Integer PRF for Zen is 168 entries, and it holds the architectural and
	// speculative version of the 64-bit integer registers.			// speculative version of the 64-bit integer registers.
	// Reference: "Software Optimization Guide for AMD Family 17h Processors"			// Reference: "Software Optimization Guide for AMD Family 17h Processors"
	def ZnIntegerPRF : RegisterFile<168, [GR64, CCR]>;			def ZnIntegerPRF : RegisterFile<168, [GR64, CCR]>;

	// 36 Entry (9x4 entries) floating-point Scheduler			// 36 Entry (9x4 entries) floating-point Scheduler
	def ZnFPU : ProcResGroup<[ZnFPU0, ZnFPU1, ZnFPU2, ZnFPU3]> {			def ZnFPU : ProcResGroup<[ZnFPU0, ZnFPU1, ZnFPU2, ZnFPU3]> {
	let BufferSize=36;			let BufferSize=36;
	▲ Show 20 Lines • Show All 1,449 Lines • Show Last 20 Lines

test/CodeGen/X86/mmx-schedule.ll


	Show First 20 Lines • Show All 3,881 Lines • ▼ Show 20 Lines
	; BDVER2-NEXT: movswl (%rsi), %eax # sched: [5:0.50]			; BDVER2-NEXT: movswl (%rsi), %eax # sched: [5:0.50]
	; BDVER2-NEXT: pinsrw $0, %edi, %mm0 # sched: [2:0.50]			; BDVER2-NEXT: pinsrw $0, %edi, %mm0 # sched: [2:0.50]
	; BDVER2-NEXT: pinsrw $1, %eax, %mm0 # sched: [2:0.50]			; BDVER2-NEXT: pinsrw $1, %eax, %mm0 # sched: [2:0.50]
	; BDVER2-NEXT: movq %mm0, %rax # sched: [10:1.00]			; BDVER2-NEXT: movq %mm0, %rax # sched: [10:1.00]
	; BDVER2-NEXT: retq # sched: [5:1.00]			; BDVER2-NEXT: retq # sched: [5:1.00]
	;			;
	; BTVER2-LABEL: test_pinsrw:			; BTVER2-LABEL: test_pinsrw:
	; BTVER2: # %bb.0:			; BTVER2: # %bb.0:
	; BTVER2-NEXT: pinsrw $0, %edi, %mm0 # sched: [7:0.50]
	; BTVER2-NEXT: movswl (%rsi), %eax # sched: [4:1.00]			; BTVER2-NEXT: movswl (%rsi), %eax # sched: [4:1.00]
				; BTVER2-NEXT: pinsrw $0, %edi, %mm0 # sched: [7:0.50]
	; BTVER2-NEXT: pinsrw $1, %eax, %mm0 # sched: [7:0.50]			; BTVER2-NEXT: pinsrw $1, %eax, %mm0 # sched: [7:0.50]
	; BTVER2-NEXT: movq %mm0, %rax # sched: [4:1.00]			; BTVER2-NEXT: movq %mm0, %rax # sched: [4:1.00]
	; BTVER2-NEXT: retq # sched: [4:1.00]			; BTVER2-NEXT: retq # sched: [4:1.00]
	;			;
	; ZNVER1-LABEL: test_pinsrw:			; ZNVER1-LABEL: test_pinsrw:
	; ZNVER1: # %bb.0:			; ZNVER1: # %bb.0:
	; ZNVER1-NEXT: movswl (%rsi), %eax # sched: [8:0.50]			; ZNVER1-NEXT: movswl (%rsi), %eax # sched: [8:0.50]
	; ZNVER1-NEXT: pinsrw $0, %edi, %mm0 # sched: [1:0.25]			; ZNVER1-NEXT: pinsrw $0, %edi, %mm0 # sched: [1:0.25]
	▲ Show 20 Lines • Show All 3,660 Lines • Show Last 20 Lines

test/CodeGen/X86/sse41-schedule.ll


	Show First 20 Lines • Show All 2,673 Lines • ▼ Show 20 Lines
	; BDVER2: # %bb.0:			; BDVER2: # %bb.0:
	; BDVER2-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [6:0.50]			; BDVER2-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [6:0.50]
	; BDVER2-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [2:0.50]			; BDVER2-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [2:0.50]
	; BDVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [2:0.50]			; BDVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [2:0.50]
	; BDVER2-NEXT: retq # sched: [5:1.00]			; BDVER2-NEXT: retq # sched: [5:1.00]
	;			;
	; BTVER2-SSE-LABEL: test_pinsrq:			; BTVER2-SSE-LABEL: test_pinsrq:
	; BTVER2-SSE: # %bb.0:			; BTVER2-SSE: # %bb.0:
	; BTVER2-SSE-NEXT: pinsrq $1, %rdi, %xmm0 # sched: [7:0.50]
	; BTVER2-SSE-NEXT: pinsrq $1, (%rsi), %xmm1 # sched: [4:1.00]			; BTVER2-SSE-NEXT: pinsrq $1, (%rsi), %xmm1 # sched: [4:1.00]
				; BTVER2-SSE-NEXT: pinsrq $1, %rdi, %xmm0 # sched: [7:0.50]
	; BTVER2-SSE-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]			; BTVER2-SSE-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
	; BTVER2-SSE-NEXT: retq # sched: [4:1.00]			; BTVER2-SSE-NEXT: retq # sched: [4:1.00]
	;			;
	; BTVER2-LABEL: test_pinsrq:			; BTVER2-LABEL: test_pinsrq:
	; BTVER2: # %bb.0:			; BTVER2: # %bb.0:
	; BTVER2-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [7:0.50]
	; BTVER2-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [4:1.00]			; BTVER2-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [4:1.00]
				; BTVER2-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [7:0.50]
	; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]			; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
	; BTVER2-NEXT: retq # sched: [4:1.00]			; BTVER2-NEXT: retq # sched: [4:1.00]
	;			;
	; ZNVER1-SSE-LABEL: test_pinsrq:			; ZNVER1-SSE-LABEL: test_pinsrq:
	; ZNVER1-SSE: # %bb.0:			; ZNVER1-SSE: # %bb.0:
	; ZNVER1-SSE-NEXT: pinsrq $1, (%rsi), %xmm1 # sched: [8:0.50]			; ZNVER1-SSE-NEXT: pinsrq $1, (%rsi), %xmm1 # sched: [8:0.50]
	; ZNVER1-SSE-NEXT: pinsrq $1, %rdi, %xmm0 # sched: [1:0.25]			; ZNVER1-SSE-NEXT: pinsrq $1, %rdi, %xmm0 # sched: [1:0.25]
	; ZNVER1-SSE-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.25]			; ZNVER1-SSE-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.25]
	▲ Show 20 Lines • Show All 3,550 Lines • Show Last 20 Lines

test/tools/llvm-mca/X86/BtVer2/int-to-fpu-forwarding-1.s

	Show All 21 Lines
	vpinsrq $0, %rax, %xmm0, %xmm0			vpinsrq $0, %rax, %xmm0, %xmm0
	vpinsrq $1, %rax, %xmm0, %xmm0			vpinsrq $1, %rax, %xmm0, %xmm0
	# LLVM-MCA-END			# LLVM-MCA-END

	# CHECK: [0] Code Region			# CHECK: [0] Code Region

	# CHECK: Iterations: 500			# CHECK: Iterations: 500
	# CHECK-NEXT: Instructions: 1000			# CHECK-NEXT: Instructions: 1000
	# CHECK-NEXT: Total Cycles: 7003			# CHECK-NEXT: Total Cycles: 1003
	# CHECK-NEXT: Total uOps: 2000			# CHECK-NEXT: Total uOps: 2000

	# CHECK: Dispatch Width: 2			# CHECK: Dispatch Width: 2
	# CHECK-NEXT: uOps Per Cycle: 0.29			# CHECK-NEXT: uOps Per Cycle: 1.99
	# CHECK-NEXT: IPC: 0.14			# CHECK-NEXT: IPC: 1.00
	# CHECK-NEXT: Block RThroughput: 2.0			# CHECK-NEXT: Block RThroughput: 2.0

	# CHECK: Instruction Info:			# CHECK: Instruction Info:
	# CHECK-NEXT: [1]: #uOps			# CHECK-NEXT: [1]: #uOps
	# CHECK-NEXT: [2]: Latency			# CHECK-NEXT: [2]: Latency
	# CHECK-NEXT: [3]: RThroughput			# CHECK-NEXT: [3]: RThroughput
	# CHECK-NEXT: [4]: MayLoad			# CHECK-NEXT: [4]: MayLoad
	# CHECK-NEXT: [5]: MayStore			# CHECK-NEXT: [5]: MayStore
	Show All 27 Lines
	# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:			# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrb $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrb $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrb $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrb $1, %eax, %xmm0, %xmm0

	# CHECK: [1] Code Region			# CHECK: [1] Code Region

	# CHECK: Iterations: 500			# CHECK: Iterations: 500
	# CHECK-NEXT: Instructions: 1000			# CHECK-NEXT: Instructions: 1000
	# CHECK-NEXT: Total Cycles: 7003			# CHECK-NEXT: Total Cycles: 1003
	# CHECK-NEXT: Total uOps: 2000			# CHECK-NEXT: Total uOps: 2000

	# CHECK: Dispatch Width: 2			# CHECK: Dispatch Width: 2
	# CHECK-NEXT: uOps Per Cycle: 0.29			# CHECK-NEXT: uOps Per Cycle: 1.99
	# CHECK-NEXT: IPC: 0.14			# CHECK-NEXT: IPC: 1.00
	# CHECK-NEXT: Block RThroughput: 2.0			# CHECK-NEXT: Block RThroughput: 2.0

	# CHECK: Instruction Info:			# CHECK: Instruction Info:
	# CHECK-NEXT: [1]: #uOps			# CHECK-NEXT: [1]: #uOps
	# CHECK-NEXT: [2]: Latency			# CHECK-NEXT: [2]: Latency
	# CHECK-NEXT: [3]: RThroughput			# CHECK-NEXT: [3]: RThroughput
	# CHECK-NEXT: [4]: MayLoad			# CHECK-NEXT: [4]: MayLoad
	# CHECK-NEXT: [5]: MayStore			# CHECK-NEXT: [5]: MayStore
	Show All 27 Lines
	# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:			# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrw $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrw $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrw $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrw $1, %eax, %xmm0, %xmm0

	# CHECK: [2] Code Region			# CHECK: [2] Code Region

	# CHECK: Iterations: 500			# CHECK: Iterations: 500
	# CHECK-NEXT: Instructions: 1000			# CHECK-NEXT: Instructions: 1000
	# CHECK-NEXT: Total Cycles: 7003			# CHECK-NEXT: Total Cycles: 1003
	# CHECK-NEXT: Total uOps: 2000			# CHECK-NEXT: Total uOps: 2000

	# CHECK: Dispatch Width: 2			# CHECK: Dispatch Width: 2
	# CHECK-NEXT: uOps Per Cycle: 0.29			# CHECK-NEXT: uOps Per Cycle: 1.99
	# CHECK-NEXT: IPC: 0.14			# CHECK-NEXT: IPC: 1.00
	# CHECK-NEXT: Block RThroughput: 2.0			# CHECK-NEXT: Block RThroughput: 2.0

	# CHECK: Instruction Info:			# CHECK: Instruction Info:
	# CHECK-NEXT: [1]: #uOps			# CHECK-NEXT: [1]: #uOps
	# CHECK-NEXT: [2]: Latency			# CHECK-NEXT: [2]: Latency
	# CHECK-NEXT: [3]: RThroughput			# CHECK-NEXT: [3]: RThroughput
	# CHECK-NEXT: [4]: MayLoad			# CHECK-NEXT: [4]: MayLoad
	# CHECK-NEXT: [5]: MayStore			# CHECK-NEXT: [5]: MayStore
	Show All 27 Lines
	# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:			# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrd $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrd $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrd $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrd $1, %eax, %xmm0, %xmm0

	# CHECK: [3] Code Region			# CHECK: [3] Code Region

	# CHECK: Iterations: 500			# CHECK: Iterations: 500
	# CHECK-NEXT: Instructions: 1000			# CHECK-NEXT: Instructions: 1000
	# CHECK-NEXT: Total Cycles: 7003			# CHECK-NEXT: Total Cycles: 1003
	# CHECK-NEXT: Total uOps: 2000			# CHECK-NEXT: Total uOps: 2000

	# CHECK: Dispatch Width: 2			# CHECK: Dispatch Width: 2
	# CHECK-NEXT: uOps Per Cycle: 0.29			# CHECK-NEXT: uOps Per Cycle: 1.99
	# CHECK-NEXT: IPC: 0.14			# CHECK-NEXT: IPC: 1.00
	# CHECK-NEXT: Block RThroughput: 2.0			# CHECK-NEXT: Block RThroughput: 2.0

	# CHECK: Instruction Info:			# CHECK: Instruction Info:
	# CHECK-NEXT: [1]: #uOps			# CHECK-NEXT: [1]: #uOps
	# CHECK-NEXT: [2]: Latency			# CHECK-NEXT: [2]: Latency
	# CHECK-NEXT: [3]: RThroughput			# CHECK-NEXT: [3]: RThroughput
	# CHECK-NEXT: [4]: MayLoad			# CHECK-NEXT: [4]: MayLoad
	# CHECK-NEXT: [5]: MayStore			# CHECK-NEXT: [5]: MayStore
	Show All 30 Lines
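
The drop from 7003 to 1003 total cycles in each region follows from the length of the %xmm0 dependency chain. Below is a minimal sketch of that chain, assuming a simplified model that ignores dispatch width and resource pressure. The `InsertModel` struct and `chainCycles` function are hypothetical names, and treating a GPR that is never written inside the loop as free of transfer delay is an assumption chosen to agree with the after-patch numbers above.

```cpp
#include <algorithm>
#include <cassert>

// Simplified, hypothetical model of one vpinsr* instruction.
struct InsertModel {
  int opcodeLatency; // cycles from issue until %xmm0 is written
  int gprReadDelay;  // extra cycles an in-flight GPR value needs to reach the FPU
};

// Length of the %xmm0 dependency chain for `count` back-to-back inserts that
// all read the same GPR. `gprWriteCycle` is the cycle in which the GPR value
// was produced, or -1 if the GPR is never written inside the loop (assumed
// here to mean that no transfer delay is charged).
int chainCycles(const InsertModel &m, int count, int gprWriteCycle) {
  assert(count >= 0 && m.opcodeLatency > 0);
  int xmmReady = 0;
  int gprReady = gprWriteCycle < 0 ? 0 : gprWriteCycle + m.gprReadDelay;
  for (int i = 0; i < count; ++i) {
    int issue = std::max(xmmReady, gprReady); // wait for both inputs
    xmmReady = issue + m.opcodeLatency;       // next insert depends on this one
  }
  return xmmReady;
}
```

`chainCycles({7, 0}, 1000, -1)` models the old behaviour (delay folded into the opcode latency) and yields 7000. `chainCycles({1, 6}, 1000, -1)` models the new one and yields 1000. Both sit 3 cycles below the reported totals; the difference is presumably pipeline overhead that this sketch does not model.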

test/tools/llvm-mca/X86/BtVer2/int-to-fpu-forwarding-3.s

	# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py			# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
	# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -timeline -timeline-max-iterations=3 < %s | FileCheck %s			# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -timeline -timeline-max-iterations=3 < %s | FileCheck %s

	# Throughput for the code snippet below should tend to 1.00 IPC.			# Throughput for the code snippet below should tend to 1.00 IPC.

	add %eax, %eax			add %eax, %eax
	vpinsrb $0, %eax, %xmm0, %xmm0			vpinsrb $0, %eax, %xmm0, %xmm0
	vpinsrb $1, %eax, %xmm0, %xmm0			vpinsrb $1, %eax, %xmm0, %xmm0

	# CHECK: Iterations: 500			# CHECK: Iterations: 500
	# CHECK-NEXT: Instructions: 1500			# CHECK-NEXT: Instructions: 1500
	# CHECK-NEXT: Total Cycles: 7004			# CHECK-NEXT: Total Cycles: 1509
	# CHECK-NEXT: Total uOps: 2500			# CHECK-NEXT: Total uOps: 2500

	# CHECK: Dispatch Width: 2			# CHECK: Dispatch Width: 2
	# CHECK-NEXT: uOps Per Cycle: 0.36			# CHECK-NEXT: uOps Per Cycle: 1.66
	# CHECK-NEXT: IPC: 0.21			# CHECK-NEXT: IPC: 0.99
	# CHECK-NEXT: Block RThroughput: 2.5			# CHECK-NEXT: Block RThroughput: 2.5

	# CHECK: Instruction Info:			# CHECK: Instruction Info:
	# CHECK-NEXT: [1]: #uOps			# CHECK-NEXT: [1]: #uOps
	# CHECK-NEXT: [2]: Latency			# CHECK-NEXT: [2]: Latency
	# CHECK-NEXT: [3]: RThroughput			# CHECK-NEXT: [3]: RThroughput
	# CHECK-NEXT: [4]: MayLoad			# CHECK-NEXT: [4]: MayLoad
	# CHECK-NEXT: [5]: MayStore			# CHECK-NEXT: [5]: MayStore
	Show All 26 Lines

	# CHECK: Resource pressure by instruction:			# CHECK: Resource pressure by instruction:
	# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:			# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
	# CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - addl %eax, %eax			# CHECK-NEXT: 0.50 0.50 - - - - - - - - - - - - addl %eax, %eax
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrb $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpinsrb $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrb $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpinsrb $1, %eax, %xmm0, %xmm0

	# CHECK: Timeline view:			# CHECK: Timeline view:
	# CHECK-NEXT: 0123456789 0123456789			# CHECK-NEXT: 01234567
	# CHECK-NEXT: Index 0123456789 0123456789 012345			# CHECK-NEXT: Index 0123456789

	# CHECK: [0,0] DeER . . . . . . . . . addl %eax, %eax			# CHECK: [0,0] DeER . . . . addl %eax, %eax
	# CHECK-NEXT: [0,1] .DeeeeeeeER . . . . . . . vpinsrb $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: [0,1] .D======eER . . vpinsrb $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: [0,2] . D======eeeeeeeER . . . . . . vpinsrb $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: [0,2] . D======eER . . vpinsrb $1, %eax, %xmm0, %xmm0
	# CHECK-NEXT: [1,0] . DeE-----------R . . . . . . addl %eax, %eax			# CHECK-NEXT: [1,0] . DeE-----R . . addl %eax, %eax
	# CHECK-NEXT: [1,1] . D===========eeeeeeeER. . . . . vpinsrb $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: [1,1] . D======eER . . vpinsrb $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: [1,2] . D=================eeeeeeeER . . . vpinsrb $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: [1,2] . D======eER. . vpinsrb $1, %eax, %xmm0, %xmm0
	# CHECK-NEXT: [2,0] . .DeE----------------------R . . . addl %eax, %eax			# CHECK-NEXT: [2,0] . .DeE-----R. . addl %eax, %eax
	# CHECK-NEXT: [2,1] . . D======================eeeeeeeER . . vpinsrb $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: [2,1] . . D======eER. vpinsrb $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: [2,2] . . D============================eeeeeeeER vpinsrb $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: [2,2] . . D======eER vpinsrb $1, %eax, %xmm0, %xmm0

	# CHECK: Average Wait times (based on the timeline view):			# CHECK: Average Wait times (based on the timeline view):
	# CHECK-NEXT: [0]: Executions			# CHECK-NEXT: [0]: Executions
	# CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue			# CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
	# CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready			# CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
	# CHECK-NEXT: [3]: Average time elapsed from WB until retire stage			# CHECK-NEXT: [3]: Average time elapsed from WB until retire stage

	# CHECK: [0] [1] [2] [3]			# CHECK: [0] [1] [2] [3]
	# CHECK-NEXT: 0. 3 1.0 1.0 11.0 addl %eax, %eax			# CHECK-NEXT: 0. 3 1.0 1.0 3.3 addl %eax, %eax
	# CHECK-NEXT: 1. 3 12.0 0.0 0.0 vpinsrb $0, %eax, %xmm0, %xmm0			# CHECK-NEXT: 1. 3 7.0 0.0 0.0 vpinsrb $0, %eax, %xmm0, %xmm0
	# CHECK-NEXT: 2. 3 18.0 0.0 0.0 vpinsrb $1, %eax, %xmm0, %xmm0			# CHECK-NEXT: 2. 3 7.0 0.0 0.0 vpinsrb $1, %eax, %xmm0, %xmm0
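
The IPC and uOps Per Cycle lines in both llvm-mca tests can be re-derived from the instruction, uOp, and cycle totals. The sketch below shows the arithmetic; `perCycle` is a hypothetical helper, and rounding to two decimals is an assumption that happens to reproduce every summary figure shown in this diff.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: a per-cycle rate rounded to two decimals, which is how
// the summary figures above appear to be printed.
double perCycle(unsigned count, unsigned totalCycles) {
  assert(totalCycles != 0 && "a run always takes at least one cycle");
  return std::round(100.0 * count / totalCycles) / 100.0;
}
```

For int-to-fpu-forwarding-3.s, `perCycle(1500, 7004)` and `perCycle(2500, 7004)` give the old IPC of 0.21 and 0.36 uOps per cycle, while `perCycle(1500, 1509)` and `perCycle(2500, 1509)` give the new 0.99 and 1.66.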

tools/llvm-mca/Views/InstructionInfoView.cpp

Show All 37 Lines
for (const MCInst &Inst : Source) {

// Try to solve variant scheduling classes.		// Try to solve variant scheduling classes.
while (SchedClassID && SM.getSchedClassDesc(SchedClassID)->isVariant())		while (SchedClassID && SM.getSchedClassDesc(SchedClassID)->isVariant())
SchedClassID = STI.resolveVariantSchedClass(SchedClassID, &Inst, CPUID);		SchedClassID = STI.resolveVariantSchedClass(SchedClassID, &Inst, CPUID);

const MCSchedClassDesc &SCDesc = *SM.getSchedClassDesc(SchedClassID);		const MCSchedClassDesc &SCDesc = *SM.getSchedClassDesc(SchedClassID);
unsigned NumMicroOpcodes = SCDesc.NumMicroOps;		unsigned NumMicroOpcodes = SCDesc.NumMicroOps;
unsigned Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);		unsigned Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);
		// Add extra latency due to delays in the forwarding data paths.
		Latency += MCSchedModel::getForwardingDelayCycles(
		STI.getReadAdvanceEntries(SCDesc));
Optional<double> RThroughput =		Optional<double> RThroughput =
MCSchedModel::getReciprocalThroughput(STI, SCDesc);		MCSchedModel::getReciprocalThroughput(STI, SCDesc);

TempStream << ' ' << NumMicroOpcodes << " ";		TempStream << ' ' << NumMicroOpcodes << " ";
if (NumMicroOpcodes < 10)		if (NumMicroOpcodes < 10)
TempStream << " ";		TempStream << " ";
else if (NumMicroOpcodes < 100)		else if (NumMicroOpcodes < 100)
TempStream << ' ';		TempStream << ' ';
Show All 35 Lines
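
The hunk above adds a forwarding delay to the latency shown in the instruction info view. The body of `getForwardingDelayCycles` lives in lib/MC/MCSchedule.cpp and is not shown in this section, so the following is only a self-contained sketch of the idea. It assumes the delay is the magnitude of the most negative ReadAdvance value; `ReadAdvanceEntry`, `forwardingDelayCycles`, and `reportedLatency` are hypothetical stand-ins, not the LLVM types or signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one ReadAdvance table entry.
struct ReadAdvanceEntry {
  unsigned useIdx; // operand index the entry applies to
  int cycles;      // negative values model a forwarding delay
};

// Largest forwarding delay (in cycles) implied by the entries, or 0 when no
// entry is negative.
unsigned forwardingDelayCycles(const std::vector<ReadAdvanceEntry> &entries) {
  int mostNegative = 0;
  for (const ReadAdvanceEntry &e : entries)
    mostNegative = std::min(mostNegative, e.cycles);
  assert(mostNegative <= 0);
  return static_cast<unsigned>(-mostNegative);
}

// Latency as a report would print it: opcode latency plus forwarding delay.
unsigned reportedLatency(unsigned opcodeLatency,
                         const std::vector<ReadAdvanceEntry> &entries) {
  return opcodeLatency + forwardingDelayCycles(entries);
}
```

Under these assumptions, a 1-cycle WriteVecInsert with a ReadAdvance of -6 is reported as 7 cycles, which is consistent with the 7-cycle figure still printed for (v)pinsr* on BTVER2 in the tests above.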

Diff 183088
