This is an archive of the discontinued LLVM Phabricator instance.

[MCA] Use LSU for the in-order pipeline
ClosedPublic

Authored by asavonic on Jun 9 2021, 4:52 AM.

Details

Summary

The Load/Store unit is used to enforce the order of loads and stores if they
may alias (controlled by the --noalias=false option).
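
For example (a sketch invocation; test.s is a placeholder for an input file such as the new A55 test):

  $ llvm-mca -mtriple=aarch64 -mcpu=cortex-a55 --noalias=false test.s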

This model is not very accurate though: Cortex-A55 hardware still
shows quite different results compared to MCA.

See PR50483 - [MCA] In-order pipeline doesn't track memory load/store dependencies.

Diff Detail

Event Timeline

asavonic created this revision.Jun 9 2021, 4:52 AM
asavonic requested review of this revision.Jun 9 2021, 4:52 AM
Herald added a project: Restricted Project.Jun 9 2021, 4:52 AM
asavonic edited the summary of this revision.Jun 9 2021, 4:53 AM

Thanks Andrew.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
77–78

This check is harmless, but completely redundant for an in-order processor.
It should never fail in practice, because load/store queues are not really modelled for in-order processors.
In an in-order processor, the dispatch event coincides with the issue event, so there is no need to queue loads/stores.
So we should always ignore the presence of queues, and I think you can safely get rid of that check.

130–134

I don't think that this should be an else-if.
It is better to always test this condition at the end of the if-then-else chain, provided that StallCycles is still zero.

I suspect this is the reason why you get some out-of-order execution in the test that you have added.
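
Something along these lines (a minimal, self-contained sketch with hypothetical names; not the actual InOrderIssueStage code):

  struct Hazards {
    unsigned RegisterStall; // cycles to wait for register dependencies
    unsigned ResourceStall; // cycles to wait for busy hardware resources
    unsigned MemoryStall;   // cycles reported by the load/store unit
  };

  unsigned computeStallCycles(const Hazards &H) {
    unsigned StallCycles = 0;
    if (H.RegisterStall)
      StallCycles = H.RegisterStall;
    else if (H.ResourceStall)
      StallCycles = H.ResourceStall;

    // Not an else-if: the memory-ordering check runs at the end of the
    // chain, and only if nothing else has already stalled the instruction.
    if (StallCycles == 0 && H.MemoryStall)
      StallCycles = H.MemoryStall;

    return StallCycles;
  }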

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
98–99

This doesn't look correct.
Any idea why these instructions are executed out of order?
Edit: I think it might be due to the LSU check you have added (see my other comment).

asavonic added inline comments.Jun 9 2021, 6:59 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
98–99

This is because the store instruction has no writes (Instruction::getDefs returns an empty list), so findLastWriteBackCycle returns 0.
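
In other words (a minimal sketch with illustrative types, not the actual llvm-mca classes):

  #include <algorithm>
  #include <vector>

  // Returns the last cycle in which any of the instruction's register
  // writes completes. A store has no defs, so the loop never runs and
  // the function returns 0.
  unsigned findLastWriteBackCycle(const std::vector<unsigned> &DefWriteBackCycles) {
    unsigned LastWB = 0;
    for (unsigned Cycle : DefWriteBackCycles)
      LastWB = std::max(LastWB, Cycle);
    return LastWB;
  }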

This model is not very accurate though: Cortex-A55 hardware still
shows quite different results compared to MCA.

Accurate static simulation of memory operations is very difficult to achieve in practice.
Predicting whether a load effectively aliases another store, or even whether a memory access
hits a specific cache level, is hard to do ahead of time. Sometimes it is simply not possible due to the lack of
information (which can only be obtained at runtime). So the inability to accurately predict the latency and aliasing of
memory accesses will always be a big source of inaccuracy.

In the case of llvm-mca, there are several limiting factors. Most of those fall under the following two categories:

  1. The llvm scheduling model simply doesn't provide enough information for llvm-mca to simulate the memory subsystem.
  2. MCInst is a (too) flat/simple representation, and it doesn't provide enough information about memory operations.

About 1.
There is no knowledge about which caches are available in hardware (i.e. the memory cache hierarchy, store buffers, TLB caches, etc.).
Since there is no cache (at least from the llvm-mca point of view), there is only one possible "latency" value for every write.
For loads, most models tend to encode an optimistic "load-to-use latency" in the write latency itself.
There is no way to use a different latency value if the load is believed to miss the L1. Most of the time, the
"optimistic load-to-use latency" assumes a HIT in the L1.

We could introduce special annotations (like metadata, or llvm-mca comments) to describe the
"probability of hitting a different cache level". We could then use that knowledge in conjunction with a more accurate tablegen
description of the memory hierarchy.
This is just an idea: it might improve the simulation, at the cost of adding more complex abstractions. There may already be a PR for this.

More generally: llvm-mca doesn't know about memory types. It assumes that all memory is cacheable. The LSU rules work quite well for write-back (WB)
(and even write-through) memory. Non-cacheable memory would be subject to different latencies, and stores might be
subject to so-called "write combining" (WC). For simplicity, llvm-mca assumes that all stores are cacheable, so there is no
attempt at modelling the WC logic in hardware.

For in-order processors, not being able to model store buffers may still be fine.
After all (at least in theory) there is no reason why stores should be delayed; I expect stores to be committed immediately.
It also means that we don't need to worry about modelling things like store-to-load forwarding (STLF).
The lack of STLF prediction is one of the bigger sources of inaccuracy when simulating memory-intensive kernels
on OoO processors.

About 2.
One big difference between MCInst and MachineInstr is that MCInst doesn't carry any information about memory accesses.
MCInst was designed as a simpler intermediate representation for integrated assemblers and disassemblers. It was not meant to be used
to implement complex data-flow analyses, so its structure is pretty flat by design.

For MachineInstr, MachineMemOperand instances can be used to infer aliasing properties of loads/stores, etc.
We don't have those operands for MCInst, so (even if we wanted to) we cannot implement a greedy symbolic alias analysis to infer
which loads may alias which stores.
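
For illustration, a rough sketch of the kind of query that MachineMemOperand enables (heavily simplified; real code would need more checks, and MCInst has no equivalent of memoperands() at all):

  #include "llvm/Analysis/AliasAnalysis.h"
  #include "llvm/CodeGen/MachineInstr.h"

  using namespace llvm;

  // May the memory accessed by MIa alias the memory accessed by MIb?
  static bool mayAliasEachOther(AAResults &AA, const MachineInstr &MIa,
                                const MachineInstr &MIb) {
    // Without memory operands we cannot prove anything: be conservative.
    if (MIa.memoperands_empty() || MIb.memoperands_empty())
      return true;
    for (const MachineMemOperand *MMOa : MIa.memoperands())
      for (const MachineMemOperand *MMOb : MIb.memoperands()) {
        const Value *Va = MMOa->getValue();
        const Value *Vb = MMOb->getValue();
        if (!Va || !Vb)
          return true; // no underlying IR value: assume may-alias
        if (!AA.isNoAlias(MemoryLocation::getBeforeOrAfter(Va),
                          MemoryLocation::getBeforeOrAfter(Vb)))
          return true;
      }
    return false;
  }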

Depending on the value of the --noalias flag, we always assume either "may-alias" or "no-alias".
The default (i.e. --noalias=true) is what is optimistically used by llvm-mca. It may also be the main reason why you see a lot
of errors in your measurements. Although, keep in mind that this is just one of the (many) sources of
inaccuracy (as already described in my point 1.).
Let's say that --noalias is a good "default" for things like memcpy-like patterns.

Hi Andrew,
are there any updates on this code review?

Thanks,
-Andrea

Hi Andrew,
are there any updates on this code review?

Sorry for the delay. I plan to update the patch later this week.

Hi Andrew,
are there any updates on this code review?

Sorry for the delay. I plan to update the patch later this week.

No problem, take your time :-)

dmgreen added a subscriber: NickGuy.Jul 1 2021, 3:06 AM
dmgreen added inline comments.
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
96–97

I think I would expect most CPUs to work like this, whether the addresses alias or not :)

andreadb added inline comments.Jul 1 2021, 3:11 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
96–97

You mean the store sequence. Of course.

My concern was related to instructions that appear to commit out of order, like the load and the nop after it.
We have flag RetireOOO for cases where we want to allow it.

andreadb added inline comments.Jul 1 2021, 3:21 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
96–97

If instead you are concerned about whether this patch might end up delaying the second store, then don't worry. That's not how flag -noalias should work: it only affects interactions between loads and stores. It is about whether a younger load is allowed to pass an older store. It should not affect pairs of adjacent stores.

dmgreen added inline comments.Jul 7 2021, 2:51 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
96–97

Sorry, I was hoping to look into the schedule over the weekend to see what is going on, but I didn't get the chance to dig into the relevant part yet.

I believe there are 2 different optimizations that can happen here:

  • Do two stores to the same address have some penalty.
  • Do loads from the same address as a load have a penalty.

The first sounds to me like it should almost always be no, and the second requires store->load forwarding, which I believe is very common in most CPUs of sufficient complexity.

It comes down to what the latency of a store means. I was under the impression that it didn't mean anything in normal llvm scheduling, but it appears that it does have some effect on the latency of a store at the end of the block (I think). In llvm-mca, does it mean the latency of the write into the L1 cache?
The Cortex-A55 optimization guide specifies the latency of stores as 1, and that would probably be a better value to use in the A55 schedule model. I've put together a patch to do that in D105541.

andreadb added inline comments.Jul 7 2021, 4:25 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
96–97

Just to be clear: the noalias flag does NOT affect store pairs.

Regarding your point 2.
I guess you wanted to say: "Do loads from the same address as a STORE have a penalty.".

STLF assumes the presence of a store buffer, and that memory stores are not immediately propagated to the underlying caches. I don't know how common this is for in-order processors. However, I take from your comment that modern in-order cores may do a lot of out-of-order commit for store operations too.

That being said, llvm-mca doesn't know if the simulated target implements a store buffer, nor does it know how to predict whether a younger load aliases an older store. Without that knowledge, it is not possible to correctly predict which loads are valid STLF candidates.

When "noalias=true", we assume that there is no aliasing at all for loads and stores. There is no need to model STLF for this case, because - under that assumption - younger loads will never alias older stores.

When "noalias=false", we conservatively assume that younger loads may alias older stores. However, we don't know if they would partially overlap, or if operations are for misaligned addresses.
So we cannot always optimistically assume that STLF will eventually occur. STLF is subject to a number of constraints in hardware, and different subtargets might impose different restrictions.

In the future, we could introduce code annotations/metadata to pass "hints" to llvm-mca. Something like: "assume no-alias" / "assume perfect-alias" / "assume aligned", etc. We could then extend the scheduling model to provide extra information about store buffers. That would allow us to model STLF.

For now, noalias=false is just a "worst-case scenario" where aliasing always occurs between loads and stores, and STLF is not simulated (so it implicitly fails, for "reasons" that we don't provide).

About the store latency:

In the presence of a store buffer, I'd expect the latency of a store to be 1.
It is literally the cost of placing the value in the store buffer (which I expect to be 1 for most targets).

Strictly speaking, llvm-mca doesn't specially handle latency of loads and stores.
llvm-mca literally ONLY uses whatever latency value is declared by each write.

In upstream scheduling models, the latency of loads is typically defined according to the "load-to-use latency" specified by the vendor. But that's it; there is no special handling in llvm-mca. In the future (at least for addressing modes that allow folded loads), it would be nice to distinguish the load contribution (i.e. the load-to-use latency) from the total latency.

asavonic updated this revision to Diff 361740.Jul 26 2021, 11:59 AM
  • Rebased the patch and fixed CR comments.

Note that stores retire out-of-order on Cortex-A55 after D105541.

  • Rebased the patch and fixed CR comments.

Note that stores retire out-of-order on Cortex-A55 after D105541.

LGTM

Thanks Andrew!

andreadb accepted this revision.Jul 27 2021, 5:30 AM
This revision is now accepted and ready to land.Jul 27 2021, 5:30 AM
This revision was landed with ongoing or failed builds.Jul 29 2021, 4:42 AM
This revision was automatically updated to reflect the committed changes.