This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Add support for store clustering
ClosedPublic

Authored by junbuml on Mar 22 2016, 2:11 PM.

Details

Reviewers

t.p.northover
• tstellarAMD
jmolloy
atrick
arsenm
mcrosier

Summary

Perform store clustering just like load clustering. This change adds
StoreClusterMutation to the machine scheduler. To control StoreClusterMutation,
enableClusterStores() is added to TargetInstrInfo.h. This is enabled only on AArch64
for now.
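
The clustering idea described above can be sketched outside of LLVM as follows. This is a minimal standalone model, not the code from the patch: `MemOp`, `clusterNeighbors`, and `MaxClusterLength` are hypothetical names, and offsets are assumed to be in scaled element units so that adjacent accesses differ by exactly 1. It sorts memory operations by (base register, offset) and links neighboring pairs, limiting each cluster to a single pair in the way the AArch64 hook does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for a scheduling unit that accesses memory.
struct MemOp {
  unsigned Id;      // node number in the toy DAG
  unsigned BaseReg; // base register of the address
  int64_t Offset;   // immediate offset, assumed to be in scaled units
};

// Returns the (first, second) node-id pairs that would get a cluster edge.
// MaxClusterLength = 1 imitates "only cluster up to a single pair".
std::vector<std::pair<unsigned, unsigned>>
clusterNeighbors(std::vector<MemOp> Ops, unsigned MaxClusterLength = 1) {
  assert(MaxClusterLength >= 1 && "limit must allow at least one pair");
  std::vector<std::pair<unsigned, unsigned>> Edges;
  if (Ops.size() < 2)
    return Edges;

  // Sort by base register first, then by offset.
  std::stable_sort(Ops.begin(), Ops.end(),
                   [](const MemOp &A, const MemOp &B) {
                     return std::tie(A.BaseReg, A.Offset) <
                            std::tie(B.BaseReg, B.Offset);
                   });

  unsigned ClusterLength = 1;
  for (std::size_t I = 0; I + 1 < Ops.size(); ++I) {
    bool SameBase = Ops[I].BaseReg == Ops[I + 1].BaseReg;
    bool Adjacent = Ops[I].Offset + 1 == Ops[I + 1].Offset;
    if (SameBase && Adjacent && ClusterLength <= MaxClusterLength) {
      Edges.emplace_back(Ops[I].Id, Ops[I + 1].Id);
      ++ClusterLength;
    } else {
      ClusterLength = 1; // start a fresh cluster at the next element
    }
  }
  return Edges;
}
```

With four stores to the same base at offsets 3, 2, 1, 4 (node ids 2, 3, 4, 5), this model pairs nodes 4 and 3, then nodes 2 and 5, which is the same pairing the `stp_i64_scale` test in this revision checks for.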

This change also adds support for unscaled stores, which were not handled in getMemOpBaseRegImmOfs().
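
As a rough illustration of what handling unscaled stores involves, the sketch below converts a byte offset into a scaled offset using the access size. It is a standalone model under stated assumptions: `ToyOpc` and `scaleToyOffset` are hypothetical names, the strides follow the store cases added to `scaleOffset` in this revision, and the final division is an assumption based on the function's comment rather than on lines shown in the diff.

```cpp
#include <cstdint>

// Hypothetical opcode enum; the names mirror the AArch64 unscaled stores.
enum class ToyOpc { STURQi, STURXi, STURDi, STURWi, STURSi, Other };

// Converts a byte offset to a scaled offset for the given opcode.
// Returns false (leaving Offset untouched) if the opcode is unknown or
// the byte offset is not a multiple of the access size.
bool scaleToyOffset(ToyOpc Opc, int64_t &Offset) {
  int64_t Stride = 1;
  switch (Opc) {
  case ToyOpc::STURQi:
    Stride = 16; // 128-bit store
    break;
  case ToyOpc::STURXi:
  case ToyOpc::STURDi:
    Stride = 8; // 64-bit store
    break;
  case ToyOpc::STURWi:
  case ToyOpc::STURSi:
    Stride = 4; // 32-bit store
    break;
  default:
    return false;
  }
  if (Offset % Stride != 0)
    return false;
  Offset /= Stride; // assumed final step: express the offset in elements
  return true;
}
```

For example, a 64-bit unscaled store at byte offset 16 scales to offset 2, while byte offset 12 cannot be scaled and is rejected.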

Diff Detail

Event Timeline

junbuml updated this revision to Diff 51337.Mar 22 2016, 2:11 PM

junbuml retitled this revision from to [AArch64]Add support for store clustering in machine-scheduler.

junbuml updated this object.

junbuml added reviewers: mcrosier, t.p.northover, jmolloy.

junbuml added a subscriber: llvm-commits.

Herald added subscribers: mcrosier, MatzeB, rengolin, aemerson.Mar 22 2016, 2:11 PM

Looks like an obvious change to me. It is so obvious that I wonder why it wasn't added in the first place; maybe there is a reason for it?

include/llvm/Target/TargetInstrInfo.h
998–1006	Would it make sense to simply rename "xxxClusterLoads" to "xxxClusterMemOps" instead of adding an extra interface? The target can still control whether loads or stores are clustered in getMemOpBaseRegImmOfs(). The same is true for a lot of the following code, which would no longer need to differentiate between loads and stores.

MatzeB added a reviewer: atrick.Mar 23 2016, 10:53 PM

junbuml retitled this revision from [AArch64]Add support for store clustering in machine-scheduler to [MachineScheduler] Add support for store clustering.Mar 24 2016, 10:15 AM

junbuml added a reviewer: arsenm.

junbuml added inline comments.

include/llvm/Target/TargetInstrInfo.h
998–1006	I agree with using MemOps instead of mixing load/store names. Removing enableClusterStores() means that we enable this StoreMutation by default in other targets if getMemOpBaseRegImmOfs already supports stores. We should extend the reviewers and tests to other targets. As of now, I can see enableClusterLoads() is also enabled in AMDGPUInstrInfo.cpp.

I compared llvm stats with/without this change for Spec2006. Overall, this patch increases the number of ld/st pairs by about 3% for Spec2006. I didn't see any serious regression in other statistics (see below).

Please see the full summary of llvm stats diff collected for spec2006 without and with this patch. For example, the third stat indicates 1495 (or ~3%) more ld/st pairs are generated with this patch :

    2 (0.06%) aarch64-copyelim - Number of copies removed.                        
   38 (0.81%) aarch64-ldst-opt - Number of load/store from unscaled generated     
 1495 (3.00%) aarch64-ldst-opt - Number of load/store pair instructions generated 
   28 (1.14%) aarch64-ldst-opt - Number of post-index updates folded              
-1661 (-0.06%) asm-printer - Number of machine instrs printed                      
-6056 (-0.01%) assembler - Number of emitted object file bytes                     
   67 (0.01%) assembler - Number of evaluated fixups                              
  -42 (-0.13%) branchfolding - Number of block tails merged                        
    3 (0.00%) branchfolding - Number of branches optimized                        
  -32 (-0.055%) branchfolding - Number of dead blocks removed                       
   -7 (-0.69%) branchfolding - Number of times common instructions are hoisted     
    2 (0.02%) codegen-cp - Number of dead copies deleted                          
    3 (1.69%) machine-licm - Number of machine instructions hoisted out of loops post regalloc
   67 (0.01%) mccodeemitter - Number of MC fixups created.                        
-1661 (-0.06%) mccodeemitter - Number of MC instructions emitted.                  
   93 (0.00%) mcexpr - Number of MCExpr evaluations                               
  112 (0.00%) pei - Number of bytes used for stack in all functions               
 -491 (-0.54%) regalloc - Number of copies inserted for splitting                  
   15 (0.19%) regalloc - Number of hoisted spills                                 
 -103 (-0.03%) regalloc - Number of identity moves eliminated after rewriting      
    5 (0.02%) regalloc - Number of interferences evicted                          
   -2 (-0.23%) regalloc - Number of live ranges fractured by DCE                   
 -127 (-0.07%) regalloc - Number of new live ranges queued                         
   -7 (-0.58%) regalloc - Number of omitted spills of reloads                      
 -128 (-0.01%) regalloc - Number of registers assigned                             
  -21 (-0.05%) regalloc - Number of registers unassigned                           
   46 (0.12%) regalloc - Number of reloads inserted                               
   -3 (-0.41%) regalloc - Number of reloads removed                                
   -2 (-0.01%) regalloc - Number of rematerialized defs for spilling               
   -6 (-0.02%) regalloc - Number of spilled live ranges                            
   -7 (-1.23%) regalloc - Number of spilled snippets                               
  -31 (-0.13%) regalloc - Number of spills inserted                                
   -3 (-0.07%) regalloc - Number of spills removed                                 
   -5 (-0.04%) regalloc - Number of split global live ranges                       
   -5 (-0.03%) regalloc - Number of splits finished                                
   -3 (-0.02%) regalloc - Number of splits that were simple                        
  264 (0.61%) slotindexes - Number of local renumberings                          
   -2 (-0.07%) stackslotcoloring - Number of stack slots eliminated due to coloring
  -13 (-0.00%) tailduplication - Additional instructions due to tail duplication   
   -7 (-0.29%) tailduplication - Number of dead blocks removed                     
    3 (0.05%) tailduplication - Number of tail duplicated blocks                  
   -5 (-0.13%) tailduplication - Number of tails duplicated

flyingforyou added a subscriber: flyingforyou.Mar 25 2016, 2:48 AM

I also think this is a reasonable change.

lib/CodeGen/MachineScheduler.cpp
1376	IsLoad?

Addressed comments from Mattias and Junmo.

Herald added a reviewer: • tstellarAMD.Mar 25 2016, 11:00 AM

Herald added a subscriber: arsenm.

junbuml updated this object.Mar 25 2016, 11:02 AM

junbuml edited edge metadata.

LGTM under the assumption this does not change codegen for the AMDGPU targets.

This revision is now accepted and ready to land.Mar 31 2016, 6:29 AM

Just minor renaming and clean up.

As enableClusterMemOps() returns true in AMDGPUInstrInfo.cpp, I still want to wait until this patch is reviewed by reviewers from AMDGPU.

Kindly ping.
We need to confirm that this change doesn't have any negative impact on AMDGPU. Otherwise, I can enable this only on AArch64 by adding enableClusterStores() in TargetInstrInfo.h.

Enabled this only on AArch64 for now by adding enableClusterStores().

Tom,
I believe you can easily enable this for AMDGPU later with proper tests as a separate patch.

junbuml updated this object.Apr 11 2016, 10:55 AM

Ping..
Please let me know if this is still LGTM.

This is fine with me.

If there was some reason I didn't do this it was specific to the microarchitecture and downstream optimizations at that time. As long as we get good results from this change I have no objection.

In D18376#399964, @atrick wrote:

This is fine with me.

If there was some reason I didn't do this it was specific to the microarchitecture and downstream optimizations at that time. As long as we get good results from this change I have no objection.

Thanks, Andy.

I think the last bit that is needed is for @arsenm or @tstellarAMD to approve the patch as this may impact the AMD target.

Please note that in my last change, I turned it on only in AArch64 by separating enableClusterStores() and enableClusterLoads(). enableClusterStores() is off in AMD by default.
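
The effect of separating the two hooks can be sketched with a small standalone model. This is not LLVM code: `ToyInstrInfo`, its subclasses, and `selectClusterMutations` are hypothetical names, and the per-target return values follow what is stated in this review (AArch64 enables both, AMDGPU enables load clustering only).

```cpp
#include <string>
#include <vector>

// Hypothetical base interface; both hooks default to off.
struct ToyInstrInfo {
  virtual ~ToyInstrInfo() = default;
  virtual bool enableClusterLoads() const { return false; }
  virtual bool enableClusterStores() const { return false; }
};

// Toy AArch64 target: opts in to both load and store clustering.
struct ToyAArch64InstrInfo : ToyInstrInfo {
  bool enableClusterLoads() const override { return true; }
  bool enableClusterStores() const override { return true; }
};

// Toy AMDGPU target: keeps load clustering, inherits stores = off.
struct ToyAMDGPUInstrInfo : ToyInstrInfo {
  bool enableClusterLoads() const override { return true; }
};

// Returns the names of the mutations a scheduler would register.
// EnableMemOpCluster models the global command-line switch.
std::vector<std::string>
selectClusterMutations(const ToyInstrInfo &TII,
                       bool EnableMemOpCluster = true) {
  std::vector<std::string> Mutations;
  if (EnableMemOpCluster) {
    if (TII.enableClusterLoads())
      Mutations.push_back("LoadClusterMutation");
    if (TII.enableClusterStores())
      Mutations.push_back("StoreClusterMutation");
  }
  return Mutations;
}
```

Because the store hook defaults to false, a target that only overrides the load hook sees no change in behavior, which is the property the AMDGPU reviewers asked for.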

In D18376#400164, @junbuml wrote:

Please note that in my last change, I turned it on only in AArch64 by separating enableClusterStores() and enableClusterLoads(). enableClusterStores() is off in AMD by default.

In that case I think this is fine to commit. Thanks, Jun.

In D18376#399964, @atrick wrote:
This is fine with me.

If there was some reason I didn't do this it was specific to the microarchitecture and downstream optimizations at that time. As long as we get good results from this change I have no objection.

I didn't see clear performance gains in score with this change in my spec2006 test. However, with this change I observed overall better llvm stats; please see my previous comments showing differences of llvm stats for spec2006.

I'm going to commit this if there is no objection by tomorrow.

Landed in r266437.
Thanks for the reviews !

hfinkel mentioned this in D37698: Allow target to decide when to cluster loads/stores in misched.Sep 13 2017, 2:05 AM

Revision Contents

Path                                                 Size
include/llvm/Target/TargetInstrInfo.h                8 lines
lib/CodeGen/MachineScheduler.cpp                     95 lines
lib/Target/AArch64/AArch64InstrInfo.h                4 lines
lib/Target/AArch64/AArch64InstrInfo.cpp              16 lines
lib/Target/AMDGPU/SIInstrInfo.h                      6 lines
lib/Target/AMDGPU/SIInstrInfo.cpp                    6 lines
test/CodeGen/AArch64/aarch64-stp-cluster.ll          149 lines
test/CodeGen/AArch64/arm64-ldp-cluster.ll            12 lines
test/CodeGen/AArch64/arm64-stp.ll                    4 lines
test/CodeGen/AArch64/global-merge-group-by-use.ll    4 lines

Diff 53278

include/llvm/Target/TargetInstrInfo.h

Show First 20 Lines • Show All 989 Lines • ▼ Show 20 Lines	public:
/// Get the base register and byte offset of an instruction that reads/writes		/// Get the base register and byte offset of an instruction that reads/writes
/// memory.		/// memory.
virtual bool getMemOpBaseRegImmOfs(MachineInstr *MemOp, unsigned &BaseReg,		virtual bool getMemOpBaseRegImmOfs(MachineInstr *MemOp, unsigned &BaseReg,
int64_t &Offset,		int64_t &Offset,
const TargetRegisterInfo *TRI) const {		const TargetRegisterInfo *TRI) const {
return false;		return false;
}		}

virtual bool enableClusterLoads() const { return false; }		virtual bool enableClusterLoads() const { return false; }

virtual bool shouldClusterLoads(MachineInstr *FirstLdSt,		virtual bool enableClusterStores() const { return false; }

		virtual bool shouldClusterMemOps(MachineInstr *FirstLdSt,
MachineInstr *SecondLdSt,		MachineInstr *SecondLdSt,
unsigned NumLoads) const {		unsigned NumLoads) const {
return false;		return false;
}		}

/// Can this target fuse the given instructions if they are scheduled		/// Can this target fuse the given instructions if they are scheduled
/// adjacent.		/// adjacent.
virtual bool shouldScheduleAdjacent(MachineInstr* First,		virtual bool shouldScheduleAdjacent(MachineInstr* First,
MachineInstr *Second) const {		MachineInstr *Second) const {
return false;		return false;
}		}

▲ Show 20 Lines • Show All 441 Lines • Show Last 20 Lines

lib/CodeGen/MachineScheduler.cpp

Show First 20 Lines • Show All 65 Lines • ▼ Show 20 Lines
#endif // NDEBUG		#endif // NDEBUG

static cl::opt<bool> EnableRegPressure("misched-regpressure", cl::Hidden,		static cl::opt<bool> EnableRegPressure("misched-regpressure", cl::Hidden,
cl::desc("Enable register pressure scheduling."), cl::init(true));		cl::desc("Enable register pressure scheduling."), cl::init(true));

static cl::opt<bool> EnableCyclicPath("misched-cyclicpath", cl::Hidden,		static cl::opt<bool> EnableCyclicPath("misched-cyclicpath", cl::Hidden,
cl::desc("Enable cyclic critical path analysis."), cl::init(true));		cl::desc("Enable cyclic critical path analysis."), cl::init(true));

static cl::opt<bool> EnableLoadCluster("misched-cluster", cl::Hidden,		static cl::opt<bool> EnableMemOpCluster("misched-cluster", cl::Hidden,
cl::desc("Enable load clustering."), cl::init(true));		cl::desc("Enable memop clustering."),
		cl::init(true));

// Experimental heuristics		// Experimental heuristics
static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,		static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,
cl::desc("Enable scheduling for macro fusion."), cl::init(true));		cl::desc("Enable scheduling for macro fusion."), cl::init(true));

static cl::opt<bool> VerifyScheduling("verify-misched", cl::Hidden,		static cl::opt<bool> VerifyScheduling("verify-misched", cl::Hidden,
cl::desc("Verify machine instrs before and after machine scheduling"));		cl::desc("Verify machine instrs before and after machine scheduling"));

▲ Show 20 Lines • Show All 1,262 Lines • ▼ Show 20 Lines	if (ShouldTrackPressure) {

updateScheduledPressure(SU, BotRPTracker.getPressure().MaxSetPressure);		updateScheduledPressure(SU, BotRPTracker.getPressure().MaxSetPressure);
updatePressureDiffs(LiveUses);		updatePressureDiffs(LiveUses);
}		}
}		}
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// LoadClusterMutation - DAG post-processing to cluster loads.		// BaseMemOpClusterMutation - DAG post-processing to cluster loads or stores.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
/// \brief Post-process the DAG to create cluster edges between neighboring		/// \brief Post-process the DAG to create cluster edges between neighboring
/// loads.		/// loads or between neighboring stores.
class LoadClusterMutation : public ScheduleDAGMutation {		class BaseMemOpClusterMutation : public ScheduleDAGMutation {
struct LoadInfo {		struct MemOpInfo {
SUnit *SU;		SUnit *SU;
unsigned BaseReg;		unsigned BaseReg;
int64_t Offset;		int64_t Offset;
LoadInfo(SUnit *su, unsigned reg, int64_t ofs)		MemOpInfo(SUnit *su, unsigned reg, int64_t ofs)
: SU(su), BaseReg(reg), Offset(ofs) {}		: SU(su), BaseReg(reg), Offset(ofs) {}

bool operator<(const LoadInfo &RHS) const {		bool operator<(const MemOpInfo&RHS) const {
return std::tie(BaseReg, Offset) < std::tie(RHS.BaseReg, RHS.Offset);		return std::tie(BaseReg, Offset) < std::tie(RHS.BaseReg, RHS.Offset);
}		}
};		};

const TargetInstrInfo *TII;		const TargetInstrInfo *TII;
const TargetRegisterInfo *TRI;		const TargetRegisterInfo *TRI;
		bool IsLoad;

public:		public:
LoadClusterMutation(const TargetInstrInfo *tii,		BaseMemOpClusterMutation(const TargetInstrInfo *tii,
const TargetRegisterInfo *tri)		const TargetRegisterInfo *tri, bool IsLoad)
: TII(tii), TRI(tri) {}		: TII(tii), TRI(tri), IsLoad(IsLoad) {}

void apply(ScheduleDAGInstrs *DAGInstrs) override;		void apply(ScheduleDAGInstrs *DAGInstrs) override;

protected:		protected:
void clusterNeighboringLoads(ArrayRef<SUnit*> Loads, ScheduleDAGMI *DAG);		void clusterNeighboringMemOps(ArrayRef<SUnit *> MemOps, ScheduleDAGMI *DAG);
		};

		class StoreClusterMutation : public BaseMemOpClusterMutation {
		public:
		StoreClusterMutation(const TargetInstrInfo *tii,
		const TargetRegisterInfo *tri)
		: BaseMemOpClusterMutation(tii, tri, false) {}
		};

		class LoadClusterMutation : public BaseMemOpClusterMutation {
		public:
		LoadClusterMutation(const TargetInstrInfo *tii, const TargetRegisterInfo *tri)
		: BaseMemOpClusterMutation(tii, tri, true) {}
};		};
} // anonymous		} // anonymous

void LoadClusterMutation::clusterNeighboringLoads(ArrayRef<SUnit*> Loads,		void BaseMemOpClusterMutation::clusterNeighboringMemOps(
ScheduleDAGMI *DAG) {		ArrayRef<SUnit *> MemOps, ScheduleDAGMI *DAG) {
SmallVector<LoadClusterMutation::LoadInfo,32> LoadRecords;		SmallVector<MemOpInfo, 32> MemOpRecords;
for (unsigned Idx = 0, End = Loads.size(); Idx != End; ++Idx) {		for (unsigned Idx = 0, End = MemOps.size(); Idx != End; ++Idx) {
SUnit *SU = Loads[Idx];		SUnit *SU = MemOps[Idx];
unsigned BaseReg;		unsigned BaseReg;
int64_t Offset;		int64_t Offset;
if (TII->getMemOpBaseRegImmOfs(SU->getInstr(), BaseReg, Offset, TRI))		if (TII->getMemOpBaseRegImmOfs(SU->getInstr(), BaseReg, Offset, TRI))
LoadRecords.push_back(LoadInfo(SU, BaseReg, Offset));		MemOpRecords.push_back(MemOpInfo(SU, BaseReg, Offset));
}		}
if (LoadRecords.size() < 2)		if (MemOpRecords.size() < 2)
return;		return;
std::sort(LoadRecords.begin(), LoadRecords.end());
		std::sort(MemOpRecords.begin(), MemOpRecords.end());
unsigned ClusterLength = 1;		unsigned ClusterLength = 1;
for (unsigned Idx = 0, End = LoadRecords.size(); Idx < (End - 1); ++Idx) {		for (unsigned Idx = 0, End = MemOpRecords.size(); Idx < (End - 1); ++Idx) {
if (LoadRecords[Idx].BaseReg != LoadRecords[Idx+1].BaseReg) {		if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
ClusterLength = 1;		ClusterLength = 1;
continue;		continue;
}		}

SUnit *SUa = LoadRecords[Idx].SU;		SUnit *SUa = MemOpRecords[Idx].SU;
SUnit *SUb = LoadRecords[Idx+1].SU;		SUnit *SUb = MemOpRecords[Idx+1].SU;
if (TII->shouldClusterLoads(SUa->getInstr(), SUb->getInstr(), ClusterLength)		if (TII->shouldClusterMemOps(SUa->getInstr(), SUb->getInstr(), ClusterLength)
&& DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {		&& DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
		DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
DEBUG(dbgs() << "Cluster loads SU(" << SUa->NodeNum << ") - SU("
<< SUb->NodeNum << ")\n");		<< SUb->NodeNum << ")\n");
// Copy successor edges from SUa to SUb. Interleaving computation		// Copy successor edges from SUa to SUb. Interleaving computation
// dependent on SUa can prevent load combining due to register reuse.		// dependent on SUa can prevent load combining due to register reuse.
// Predecessor edges do not need to be copied from SUb to SUa since nearby		// Predecessor edges do not need to be copied from SUb to SUa since nearby
// loads should have effectively the same inputs.		// loads should have effectively the same inputs.
for (SUnit::const_succ_iterator		for (SUnit::const_succ_iterator
SI = SUa->Succs.begin(), SE = SUa->Succs.end(); SI != SE; ++SI) {		SI = SUa->Succs.begin(), SE = SUa->Succs.end(); SI != SE; ++SI) {
if (SI->getSUnit() == SUb)		if (SI->getSUnit() == SUb)
continue;		continue;
DEBUG(dbgs() << " Copy Succ SU(" << SI->getSUnit()->NodeNum << ")\n");		DEBUG(dbgs() << " Copy Succ SU(" << SI->getSUnit()->NodeNum << ")\n");
DAG->addEdge(SI->getSUnit(), SDep(SUb, SDep::Artificial));		DAG->addEdge(SI->getSUnit(), SDep(SUb, SDep::Artificial));
}		}
++ClusterLength;		++ClusterLength;
}		}
else		else
ClusterLength = 1;		ClusterLength = 1;
}		}
}		}

/// \brief Callback from DAG postProcessing to create cluster edges for loads.		/// \brief Callback from DAG postProcessing to create cluster edges for loads.
void LoadClusterMutation::apply(ScheduleDAGInstrs *DAGInstrs) {		void BaseMemOpClusterMutation::apply(ScheduleDAGInstrs *DAGInstrs) {

ScheduleDAGMI *DAG = static_cast<ScheduleDAGMI*>(DAGInstrs);		ScheduleDAGMI *DAG = static_cast<ScheduleDAGMI*>(DAGInstrs);

// Map DAG NodeNum to store chain ID.		// Map DAG NodeNum to store chain ID.
DenseMap<unsigned, unsigned> StoreChainIDs;		DenseMap<unsigned, unsigned> StoreChainIDs;
// Map each store chain to a set of dependent loads.		// Map each store chain to a set of dependent MemOps.
SmallVector<SmallVector<SUnit*,4>, 32> StoreChainDependents;		SmallVector<SmallVector<SUnit*,4>, 32> StoreChainDependents;
for (unsigned Idx = 0, End = DAG->SUnits.size(); Idx != End; ++Idx) {		for (unsigned Idx = 0, End = DAG->SUnits.size(); Idx != End; ++Idx) {
SUnit *SU = &DAG->SUnits[Idx];		SUnit *SU = &DAG->SUnits[Idx];
if (!SU->getInstr()->mayLoad())		if ((IsLoad && !SU->getInstr()->mayLoad()) ||
		(!IsLoad && !SU->getInstr()->mayStore()))
continue;		continue;

unsigned ChainPredID = DAG->SUnits.size();		unsigned ChainPredID = DAG->SUnits.size();
for (SUnit::const_pred_iterator		for (SUnit::const_pred_iterator
PI = SU->Preds.begin(), PE = SU->Preds.end(); PI != PE; ++PI) {		PI = SU->Preds.begin(), PE = SU->Preds.end(); PI != PE; ++PI) {
if (PI->isCtrl()) {		if (PI->isCtrl()) {
ChainPredID = PI->getSUnit()->NodeNum;		ChainPredID = PI->getSUnit()->NodeNum;
break;		break;
}		}
}		}
// Check if this chain-like pred has been seen		// Check if this chain-like pred has been seen
// before. ChainPredID==MaxNodeID for loads at the top of the schedule.		// before. ChainPredID==MaxNodeID at the top of the schedule.
unsigned NumChains = StoreChainDependents.size();		unsigned NumChains = StoreChainDependents.size();
std::pair<DenseMap<unsigned, unsigned>::iterator, bool> Result =		std::pair<DenseMap<unsigned, unsigned>::iterator, bool> Result =
StoreChainIDs.insert(std::make_pair(ChainPredID, NumChains));		StoreChainIDs.insert(std::make_pair(ChainPredID, NumChains));
if (Result.second)		if (Result.second)
StoreChainDependents.resize(NumChains + 1);		StoreChainDependents.resize(NumChains + 1);
StoreChainDependents[Result.first->second].push_back(SU);		StoreChainDependents[Result.first->second].push_back(SU);
}		}

// Iterate over the store chains.		// Iterate over the store chains.
for (unsigned Idx = 0, End = StoreChainDependents.size(); Idx != End; ++Idx)		for (unsigned Idx = 0, End = StoreChainDependents.size(); Idx != End; ++Idx)
clusterNeighboringLoads(StoreChainDependents[Idx], DAG);		clusterNeighboringMemOps(StoreChainDependents[Idx], DAG);
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// MacroFusion - DAG post-processing to encourage fusion of macro ops.		// MacroFusion - DAG post-processing to encourage fusion of macro ops.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
/// \brief Post-process the DAG to create cluster edges between instructions		/// \brief Post-process the DAG to create cluster edges between instructions
▲ Show 20 Lines • Show All 1,578 Lines • ▼ Show 20 Lines
static ScheduleDAGInstrs *createGenericSchedLive(MachineSchedContext *C) {		static ScheduleDAGInstrs *createGenericSchedLive(MachineSchedContext *C) {
ScheduleDAGMILive *DAG = new ScheduleDAGMILive(C, make_unique<GenericScheduler>(C));		ScheduleDAGMILive *DAG = new ScheduleDAGMILive(C, make_unique<GenericScheduler>(C));
// Register DAG post-processors.		// Register DAG post-processors.
//		//
// FIXME: extend the mutation API to allow earlier mutations to instantiate		// FIXME: extend the mutation API to allow earlier mutations to instantiate
// data and pass it to later mutations. Have a single mutation that gathers		// data and pass it to later mutations. Have a single mutation that gathers
// the interesting nodes in one pass.		// the interesting nodes in one pass.
DAG->addMutation(make_unique<CopyConstrain>(DAG->TII, DAG->TRI));		DAG->addMutation(make_unique<CopyConstrain>(DAG->TII, DAG->TRI));
if (EnableLoadCluster && DAG->TII->enableClusterLoads())		if (EnableMemOpCluster) {
		if (DAG->TII->enableClusterLoads())
DAG->addMutation(make_unique<LoadClusterMutation>(DAG->TII, DAG->TRI));		DAG->addMutation(make_unique<LoadClusterMutation>(DAG->TII, DAG->TRI));
		if (DAG->TII->enableClusterStores())
		DAG->addMutation(make_unique<StoreClusterMutation>(DAG->TII, DAG->TRI));
		}
if (EnableMacroFusion)		if (EnableMacroFusion)
DAG->addMutation(make_unique<MacroFusion>(DAG->TII, DAG->TRI));		DAG->addMutation(make_unique<MacroFusion>(DAG->TII, DAG->TRI));
return DAG;		return DAG;
}		}

static MachineSchedRegistry		static MachineSchedRegistry
GenericSchedRegistry("converge", "Standard converging scheduler.",		GenericSchedRegistry("converge", "Standard converging scheduler.",
createGenericSchedLive);		createGenericSchedLive);
▲ Show 20 Lines • Show All 425 Lines • Show Last 20 Lines

lib/Target/AArch64/AArch64InstrInfo.h

Show First 20 Lines • Show All 103 Lines • ▼ Show 20 Lines	bool getMemOpBaseRegImmOfs(MachineInstr *LdSt, unsigned &BaseReg,
const TargetRegisterInfo *TRI) const override;		const TargetRegisterInfo *TRI) const override;

bool getMemOpBaseRegImmOfsWidth(MachineInstr *LdSt, unsigned &BaseReg,		bool getMemOpBaseRegImmOfsWidth(MachineInstr *LdSt, unsigned &BaseReg,
int64_t &Offset, unsigned &Width,		int64_t &Offset, unsigned &Width,
const TargetRegisterInfo *TRI) const;		const TargetRegisterInfo *TRI) const;

bool enableClusterLoads() const override { return true; }		bool enableClusterLoads() const override { return true; }

bool shouldClusterLoads(MachineInstr *FirstLdSt, MachineInstr *SecondLdSt,		bool enableClusterStores() const override { return true; }

		bool shouldClusterMemOps(MachineInstr *FirstLdSt, MachineInstr *SecondLdSt,
unsigned NumLoads) const override;		unsigned NumLoads) const override;

bool shouldScheduleAdjacent(MachineInstr *First,		bool shouldScheduleAdjacent(MachineInstr *First,
MachineInstr *Second) const override;		MachineInstr *Second) const override;

MachineInstr *emitFrameIndexDebugValue(MachineFunction &MF, int FrameIx,		MachineInstr *emitFrameIndexDebugValue(MachineFunction &MF, int FrameIx,
uint64_t Offset, const MDNode *Var,		uint64_t Offset, const MDNode *Var,
const MDNode *Expr, DebugLoc DL) const;		const MDNode *Expr, DebugLoc DL) const;
▲ Show 20 Lines • Show All 157 Lines • Show Last 20 Lines

lib/Target/AArch64/AArch64InstrInfo.cpp

Show First 20 Lines • Show All 1,425 Lines • ▼ Show 20 Lines	bool AArch64InstrInfo::getMemOpBaseRegImmOfs(
case AArch64::STRWui:		case AArch64::STRWui:
case AArch64::LDRSui:		case AArch64::LDRSui:
case AArch64::LDRDui:		case AArch64::LDRDui:
case AArch64::LDRQui:		case AArch64::LDRQui:
case AArch64::LDRXui:		case AArch64::LDRXui:
case AArch64::LDRWui:		case AArch64::LDRWui:
case AArch64::LDRSWui:		case AArch64::LDRSWui:
// Unscaled instructions.		// Unscaled instructions.
		case AArch64::STURSi:
		case AArch64::STURDi:
		case AArch64::STURQi:
		case AArch64::STURXi:
		case AArch64::STURWi:
case AArch64::LDURSi:		case AArch64::LDURSi:
case AArch64::LDURDi:		case AArch64::LDURDi:
case AArch64::LDURQi:		case AArch64::LDURQi:
case AArch64::LDURWi:		case AArch64::LDURWi:
case AArch64::LDURXi:		case AArch64::LDURXi:
case AArch64::LDURSWi:		case AArch64::LDURSWi:
unsigned Width;		unsigned Width;
return getMemOpBaseRegImmOfsWidth(LdSt, BaseReg, Offset, Width, TRI);		return getMemOpBaseRegImmOfsWidth(LdSt, BaseReg, Offset, Width, TRI);
▲ Show 20 Lines • Show All 93 Lines • ▼ Show 20 Lines
// Scale the unscaled offsets. Returns false if the unscaled offset can't be		// Scale the unscaled offsets. Returns false if the unscaled offset can't be
// scaled.		// scaled.
static bool scaleOffset(unsigned Opc, int64_t &Offset) {		static bool scaleOffset(unsigned Opc, int64_t &Offset) {
unsigned OffsetStride = 1;		unsigned OffsetStride = 1;
switch (Opc) {		switch (Opc) {
default:		default:
return false;		return false;
case AArch64::LDURQi:		case AArch64::LDURQi:
		case AArch64::STURQi:
OffsetStride = 16;		OffsetStride = 16;
break;		break;
case AArch64::LDURXi:		case AArch64::LDURXi:
case AArch64::LDURDi:		case AArch64::LDURDi:
		case AArch64::STURXi:
		case AArch64::STURDi:
OffsetStride = 8;		OffsetStride = 8;
break;		break;
case AArch64::LDURWi:		case AArch64::LDURWi:
case AArch64::LDURSi:		case AArch64::LDURSi:
case AArch64::LDURSWi:		case AArch64::LDURSWi:
		case AArch64::STURWi:
		case AArch64::STURSi:
OffsetStride = 4;		OffsetStride = 4;
break;		break;
}		}
// If the byte-offset isn't a multiple of the stride, we can't scale this		// If the byte-offset isn't a multiple of the stride, we can't scale this
// offset.		// offset.
if (Offset % OffsetStride != 0)		if (Offset % OffsetStride != 0)
return false;		return false;

Show All 19 Lines	static bool canPairLdStOpc(unsigned FirstOpc, unsigned SecondOpc) {
}		}
// These instructions can't be paired based on their opcodes.		// These instructions can't be paired based on their opcodes.
return false;		return false;
}		}

/// Detect opportunities for ldp/stp formation.		/// Detect opportunities for ldp/stp formation.
///		///
/// Only called for LdSt for which getMemOpBaseRegImmOfs returns true.		/// Only called for LdSt for which getMemOpBaseRegImmOfs returns true.
bool AArch64InstrInfo::shouldClusterLoads(MachineInstr *FirstLdSt,		bool AArch64InstrInfo::shouldClusterMemOps(MachineInstr *FirstLdSt,
MachineInstr *SecondLdSt,		MachineInstr *SecondLdSt,
unsigned NumLoads) const {		unsigned NumLoads) const {
// Only cluster up to a single pair.		// Only cluster up to a single pair.
if (NumLoads > 1)		if (NumLoads > 1)
return false;		return false;

// Can we pair these instructions based on their opcodes?		// Can we pair these instructions based on their opcodes?
unsigned FirstOpc = FirstLdSt->getOpcode();		unsigned FirstOpc = FirstLdSt->getOpcode();
unsigned SecondOpc = SecondLdSt->getOpcode();		unsigned SecondOpc = SecondLdSt->getOpcode();
if (!canPairLdStOpc(FirstOpc, SecondOpc))		if (!canPairLdStOpc(FirstOpc, SecondOpc))
▲ Show 20 Lines • Show All 1,696 Lines • Show Last 20 Lines

lib/Target/AMDGPU/SIInstrInfo.h

Show First 20 Lines • Show All 88 Lines • ▼ Show 20 Lines	public:
bool areLoadsFromSameBasePtr(SDNode *Load1, SDNode *Load2,		bool areLoadsFromSameBasePtr(SDNode *Load1, SDNode *Load2,
int64_t &Offset1,		int64_t &Offset1,
int64_t &Offset2) const override;		int64_t &Offset2) const override;

bool getMemOpBaseRegImmOfs(MachineInstr *LdSt, unsigned &BaseReg,		bool getMemOpBaseRegImmOfs(MachineInstr *LdSt, unsigned &BaseReg,
int64_t &Offset,		int64_t &Offset,
const TargetRegisterInfo *TRI) const final;		const TargetRegisterInfo *TRI) const final;

bool shouldClusterLoads(MachineInstr *FirstLdSt,		bool shouldClusterMemOps(MachineInstr *FirstLdSt,
MachineInstr *SecondLdSt,		MachineInstr *SecondLdSt,
unsigned NumLoads) const final;		unsigned NumLoads) const final;

void copyPhysReg(MachineBasicBlock &MBB,		void copyPhysReg(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MI, DebugLoc DL,		MachineBasicBlock::iterator MI, DebugLoc DL,
unsigned DestReg, unsigned SrcReg,		unsigned DestReg, unsigned SrcReg,
bool KillSrc) const override;		bool KillSrc) const override;

unsigned calculateLDSSpillAddress(MachineBasicBlock &MBB,		unsigned calculateLDSSpillAddress(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MI,		MachineBasicBlock::iterator MI,
▲ Show 20 Lines • Show All 418 Lines • Show Last 20 Lines

lib/Target/AMDGPU/SIInstrInfo.cpp

Show First 20 Lines • Show All 286 Lines • ▼ Show 20 Lines	if (isSMRD(*LdSt)) {
BaseReg = SBaseReg->getReg();		BaseReg = SBaseReg->getReg();
Offset = OffsetImm->getImm();		Offset = OffsetImm->getImm();
return true;		return true;
}		}

return false;		return false;
}		}

bool SIInstrInfo::shouldClusterLoads(MachineInstr *FirstLdSt,		bool SIInstrInfo::shouldClusterMemOps(MachineInstr *FirstLdSt,
MachineInstr *SecondLdSt,		MachineInstr *SecondLdSt,
unsigned NumLoads) const {		unsigned NumLoads) const {
const MachineOperand *FirstDst = nullptr;		const MachineOperand *FirstDst = nullptr;
const MachineOperand *SecondDst = nullptr;		const MachineOperand *SecondDst = nullptr;

if (isDS(*FirstLdSt) && isDS(*SecondLdSt)) {		if (isDS(*FirstLdSt) && isDS(*SecondLdSt)) {
FirstDst = getNamedOperand(*FirstLdSt, AMDGPU::OpName::vdst);		FirstDst = getNamedOperand(*FirstLdSt, AMDGPU::OpName::vdst);
SecondDst = getNamedOperand(*SecondLdSt, AMDGPU::OpName::vdst);		SecondDst = getNamedOperand(*SecondLdSt, AMDGPU::OpName::vdst);
}		}


test/CodeGen/AArch64/aarch64-stp-cluster.ll

This file was added.

				; REQUIRES: asserts
				; RUN: llc < %s -mtriple=arm64-linux-gnu -mcpu=cortex-a57 -verify-misched -debug-only=misched -aarch64-stp-suppress=false -o - 2>&1 > /dev/null | FileCheck %s

				; CHECK: ******** MI Scheduling ********
				; CHECK-LABEL: stp_i64_scale:BB#0
				; CHECK:Cluster ld/st SU(4) - SU(3)
				; CHECK:Cluster ld/st SU(2) - SU(5)
				; CHECK:SU(4): STRXui %vreg1, %vreg0, 1
				; CHECK:SU(3): STRXui %vreg1, %vreg0, 2
				; CHECK:SU(2): STRXui %vreg1, %vreg0, 3
				; CHECK:SU(5): STRXui %vreg1, %vreg0, 4
				define i64 @stp_i64_scale(i64* nocapture %P, i64 %v) {
				entry:
				%arrayidx = getelementptr inbounds i64, i64* %P, i64 3
				store i64 %v, i64* %arrayidx
				%arrayidx1 = getelementptr inbounds i64, i64* %P, i64 2
				store i64 %v, i64* %arrayidx1
				%arrayidx2 = getelementptr inbounds i64, i64* %P, i64 1
				store i64 %v, i64* %arrayidx2
				%arrayidx3 = getelementptr inbounds i64, i64* %P, i64 4
				store i64 %v, i64* %arrayidx3
				ret i64 %v
				}

				; CHECK: ******** MI Scheduling ********
				; CHECK-LABEL: stp_i32_scale:BB#0
				; CHECK:Cluster ld/st SU(4) - SU(3)
				; CHECK:Cluster ld/st SU(2) - SU(5)
				; CHECK:SU(4): STRWui %vreg1, %vreg0, 1
				; CHECK:SU(3): STRWui %vreg1, %vreg0, 2
				; CHECK:SU(2): STRWui %vreg1, %vreg0, 3
				; CHECK:SU(5): STRWui %vreg1, %vreg0, 4
				define i32 @stp_i32_scale(i32* nocapture %P, i32 %v) {
				entry:
				%arrayidx = getelementptr inbounds i32, i32* %P, i32 3
				store i32 %v, i32* %arrayidx
				%arrayidx1 = getelementptr inbounds i32, i32* %P, i32 2
				store i32 %v, i32* %arrayidx1
				%arrayidx2 = getelementptr inbounds i32, i32* %P, i32 1
				store i32 %v, i32* %arrayidx2
				%arrayidx3 = getelementptr inbounds i32, i32* %P, i32 4
				store i32 %v, i32* %arrayidx3
				ret i32 %v
				}

				; CHECK:******** MI Scheduling ********
				; CHECK-LABEL:stp_i64_unscale:BB#0 entry
				; CHECK:Cluster ld/st SU(5) - SU(2)
				; CHECK:Cluster ld/st SU(4) - SU(3)
				; CHECK:SU(5): STURXi %vreg1, %vreg0, -32
				; CHECK:SU(2): STURXi %vreg1, %vreg0, -24
				; CHECK:SU(4): STURXi %vreg1, %vreg0, -16
				; CHECK:SU(3): STURXi %vreg1, %vreg0, -8
				define void @stp_i64_unscale(i64* nocapture %P, i64 %v) #0 {
				entry:
				%arrayidx = getelementptr inbounds i64, i64* %P, i64 -3
				store i64 %v, i64* %arrayidx
				%arrayidx1 = getelementptr inbounds i64, i64* %P, i64 -1
				store i64 %v, i64* %arrayidx1
				%arrayidx2 = getelementptr inbounds i64, i64* %P, i64 -2
				store i64 %v, i64* %arrayidx2
				%arrayidx3 = getelementptr inbounds i64, i64* %P, i64 -4
				store i64 %v, i64* %arrayidx3
				ret void
				}

				; CHECK:******** MI Scheduling ********
				; CHECK-LABEL:stp_i32_unscale:BB#0 entry
				; CHECK:Cluster ld/st SU(5) - SU(2)
				; CHECK:Cluster ld/st SU(4) - SU(3)
				; CHECK:SU(5): STURWi %vreg1, %vreg0, -16
				; CHECK:SU(2): STURWi %vreg1, %vreg0, -12
				; CHECK:SU(4): STURWi %vreg1, %vreg0, -8
				; CHECK:SU(3): STURWi %vreg1, %vreg0, -4
				define void @stp_i32_unscale(i32* nocapture %P, i32 %v) #0 {
				entry:
				%arrayidx = getelementptr inbounds i32, i32* %P, i32 -3
				store i32 %v, i32* %arrayidx
				%arrayidx1 = getelementptr inbounds i32, i32* %P, i32 -1
				store i32 %v, i32* %arrayidx1
				%arrayidx2 = getelementptr inbounds i32, i32* %P, i32 -2
				store i32 %v, i32* %arrayidx2
				%arrayidx3 = getelementptr inbounds i32, i32* %P, i32 -4
				store i32 %v, i32* %arrayidx3
				ret void
				}

				; CHECK:******** MI Scheduling ********
				; CHECK-LABEL:stp_double:BB#0
				; CHECK:Cluster ld/st SU(3) - SU(4)
				; CHECK:Cluster ld/st SU(2) - SU(5)
				; CHECK:SU(3): STRDui %vreg1, %vreg0, 1
				; CHECK:SU(4): STRDui %vreg1, %vreg0, 2
				; CHECK:SU(2): STRDui %vreg1, %vreg0, 3
				; CHECK:SU(5): STRDui %vreg1, %vreg0, 4
				define void @stp_double(double* nocapture %P, double %v) {
				entry:
				%arrayidx = getelementptr inbounds double, double* %P, i64 3
				store double %v, double* %arrayidx
				%arrayidx1 = getelementptr inbounds double, double* %P, i64 1
				store double %v, double* %arrayidx1
				%arrayidx2 = getelementptr inbounds double, double* %P, i64 2
				store double %v, double* %arrayidx2
				%arrayidx3 = getelementptr inbounds double, double* %P, i64 4
				store double %v, double* %arrayidx3
				ret void
				}

				; CHECK:******** MI Scheduling ********
				; CHECK-LABEL:stp_float:BB#0
				; CHECK:Cluster ld/st SU(3) - SU(4)
				; CHECK:Cluster ld/st SU(2) - SU(5)
				; CHECK:SU(3): STRSui %vreg1, %vreg0, 1
				; CHECK:SU(4): STRSui %vreg1, %vreg0, 2
				; CHECK:SU(2): STRSui %vreg1, %vreg0, 3
				; CHECK:SU(5): STRSui %vreg1, %vreg0, 4
				define void @stp_float(float* nocapture %P, float %v) {
				entry:
				%arrayidx = getelementptr inbounds float, float* %P, i64 3
				store float %v, float* %arrayidx
				%arrayidx1 = getelementptr inbounds float, float* %P, i64 1
				store float %v, float* %arrayidx1
				%arrayidx2 = getelementptr inbounds float, float* %P, i64 2
				store float %v, float* %arrayidx2
				%arrayidx3 = getelementptr inbounds float, float* %P, i64 4
				store float %v, float* %arrayidx3
				ret void
				}

				; CHECK: ******** MI Scheduling ********
				; CHECK-LABEL: stp_volatile:BB#0
				; CHECK-NOT: Cluster ld/st
				; CHECK:SU(2): STRXui %vreg1, %vreg0, 3; mem:Volatile
				; CHECK:SU(3): STRXui %vreg1, %vreg0, 2; mem:Volatile
				; CHECK:SU(4): STRXui %vreg1, %vreg0, 1; mem:Volatile
				; CHECK:SU(5): STRXui %vreg1, %vreg0, 4; mem:Volatile
				define i64 @stp_volatile(i64* nocapture %P, i64 %v) {
				entry:
				%arrayidx = getelementptr inbounds i64, i64* %P, i64 3
				store volatile i64 %v, i64* %arrayidx
				%arrayidx1 = getelementptr inbounds i64, i64* %P, i64 2
				store volatile i64 %v, i64* %arrayidx1
				%arrayidx2 = getelementptr inbounds i64, i64* %P, i64 1
				store volatile i64 %v, i64* %arrayidx2
				%arrayidx3 = getelementptr inbounds i64, i64* %P, i64 4
				store volatile i64 %v, i64* %arrayidx3
				ret i64 %v
				}
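
The CHECK lines in this test can be read against a simplified model of clustering. The following is only a sketch, not the MachineScheduler implementation: `StoreOp` and `clusterStores` are hypothetical names introduced for illustration, and the pairing rule (same base register, same access size, offsets exactly one access apart, at most two stores per cluster, volatile stores skipped) is an assumption inferred from the expected output above. Offsets are given in bytes, so the scaled immediates 1 to 4 of `STRXui` correspond to 8 to 32.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one store seen by the scheduler; not an LLVM type.
struct StoreOp {
  unsigned SUNum;     // scheduling-unit number, as in "SU(4)"
  unsigned BaseReg;   // identifier of the base register
  int64_t ByteOffset; // immediate offset in bytes
  int64_t Size;       // access size in bytes
  bool IsVolatile;    // assumed: volatile accesses are never clustered
};

// Returns (first, second) SU pairs for stores that sit next to each other
// in memory off the same base register.
std::vector<std::pair<unsigned, unsigned>>
clusterStores(std::vector<StoreOp> Ops) {
  // Drop volatile accesses before looking for neighbours.
  Ops.erase(std::remove_if(Ops.begin(), Ops.end(),
                           [](const StoreOp &Op) { return Op.IsVolatile; }),
            Ops.end());

  // Order by base register, then by ascending offset.
  std::sort(Ops.begin(), Ops.end(), [](const StoreOp &A, const StoreOp &B) {
    if (A.BaseReg != B.BaseReg)
      return A.BaseReg < B.BaseReg;
    return A.ByteOffset < B.ByteOffset;
  });

  // Greedily pair neighbours whose offsets differ by one access size.
  std::vector<std::pair<unsigned, unsigned>> Pairs;
  std::size_t I = 0;
  while (I + 1 < Ops.size()) {
    const StoreOp &A = Ops[I];
    const StoreOp &B = Ops[I + 1];
    bool Adjacent = A.BaseReg == B.BaseReg && A.Size == B.Size &&
                    B.ByteOffset - A.ByteOffset == A.Size;
    if (Adjacent) {
      Pairs.emplace_back(A.SUNum, B.SUNum);
      I += 2; // both stores are consumed by this cluster
    } else {
      ++I;
    }
  }
  return Pairs;
}
```

Feeding in the four stores of `stp_i64_unscale` (SU(2) at -24, SU(3) at -8, SU(4) at -16, SU(5) at -32, all 8 bytes wide) yields the pairs SU(5)/SU(2) and SU(4)/SU(3), the same grouping the `Cluster ld/st` CHECK lines expect. The real scheduler delegates the decision to the target's `shouldClusterMemOps`, so its conditions may differ from this model.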

test/CodeGen/AArch64/arm64-ldp-cluster.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: llc < %s -mtriple=arm64-linux-gnu -mcpu=cortex-a57 -verify-misched -debug-only=misched -o - 2>&1 > /dev/null | FileCheck %s			; RUN: llc < %s -mtriple=arm64-linux-gnu -mcpu=cortex-a57 -verify-misched -debug-only=misched -o - 2>&1 > /dev/null | FileCheck %s

	; Test ldr clustering.			; Test ldr clustering.
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK-LABEL: ldr_int:BB#0			; CHECK-LABEL: ldr_int:BB#0
	; CHECK: Cluster loads SU(1) - SU(2)			; CHECK: Cluster ld/st SU(1) - SU(2)
	; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDRWui			; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDRWui
	; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDRWui			; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDRWui
	define i32 @ldr_int(i32* %a) nounwind {			define i32 @ldr_int(i32* %a) nounwind {
	%p1 = getelementptr inbounds i32, i32* %a, i32 1			%p1 = getelementptr inbounds i32, i32* %a, i32 1
	%tmp1 = load i32, i32* %p1, align 2			%tmp1 = load i32, i32* %p1, align 2
	%p2 = getelementptr inbounds i32, i32* %a, i32 2			%p2 = getelementptr inbounds i32, i32* %a, i32 2
	%tmp2 = load i32, i32* %p2, align 2			%tmp2 = load i32, i32* %p2, align 2
	%tmp3 = add i32 %tmp1, %tmp2			%tmp3 = add i32 %tmp1, %tmp2
	ret i32 %tmp3			ret i32 %tmp3
	}			}

	; Test ldpsw clustering			; Test ldpsw clustering
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK-LABEL: ldp_sext_int:BB#0			; CHECK-LABEL: ldp_sext_int:BB#0
	; CHECK: Cluster loads SU(1) - SU(2)			; CHECK: Cluster ld/st SU(1) - SU(2)
	; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDRSWui			; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDRSWui
	; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDRSWui			; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDRSWui
	define i64 @ldp_sext_int(i32* %p) nounwind {			define i64 @ldp_sext_int(i32* %p) nounwind {
	%tmp = load i32, i32* %p, align 4			%tmp = load i32, i32* %p, align 4
	%add.ptr = getelementptr inbounds i32, i32* %p, i64 1			%add.ptr = getelementptr inbounds i32, i32* %p, i64 1
	%tmp1 = load i32, i32* %add.ptr, align 4			%tmp1 = load i32, i32* %add.ptr, align 4
	%sexttmp = sext i32 %tmp to i64			%sexttmp = sext i32 %tmp to i64
	%sexttmp1 = sext i32 %tmp1 to i64			%sexttmp1 = sext i32 %tmp1 to i64
	%add = add nsw i64 %sexttmp1, %sexttmp			%add = add nsw i64 %sexttmp1, %sexttmp
	ret i64 %add			ret i64 %add
	}			}

	; Test ldur clustering.			; Test ldur clustering.
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK-LABEL: ldur_int:BB#0			; CHECK-LABEL: ldur_int:BB#0
	; CHECK: Cluster loads SU(2) - SU(1)			; CHECK: Cluster ld/st SU(2) - SU(1)
	; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDURWi			; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDURWi
	; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDURWi			; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDURWi
	define i32 @ldur_int(i32* %a) nounwind {			define i32 @ldur_int(i32* %a) nounwind {
	%p1 = getelementptr inbounds i32, i32* %a, i32 -1			%p1 = getelementptr inbounds i32, i32* %a, i32 -1
	%tmp1 = load i32, i32* %p1, align 2			%tmp1 = load i32, i32* %p1, align 2
	%p2 = getelementptr inbounds i32, i32* %a, i32 -2			%p2 = getelementptr inbounds i32, i32* %a, i32 -2
	%tmp2 = load i32, i32* %p2, align 2			%tmp2 = load i32, i32* %p2, align 2
	%tmp3 = add i32 %tmp1, %tmp2			%tmp3 = add i32 %tmp1, %tmp2
	ret i32 %tmp3			ret i32 %tmp3
	}			}

	; Test sext + zext clustering.			; Test sext + zext clustering.
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK-LABEL: ldp_half_sext_zext_int:BB#0			; CHECK-LABEL: ldp_half_sext_zext_int:BB#0
	; CHECK: Cluster loads SU(3) - SU(4)			; CHECK: Cluster ld/st SU(3) - SU(4)
	; CHECK: SU(3): %vreg{{[0-9]+}}<def> = LDRSWui			; CHECK: SU(3): %vreg{{[0-9]+}}<def> = LDRSWui
	; CHECK: SU(4): %vreg{{[0-9]+}}:sub_32<def,read-undef> = LDRWui			; CHECK: SU(4): %vreg{{[0-9]+}}:sub_32<def,read-undef> = LDRWui
	define i64 @ldp_half_sext_zext_int(i64* %q, i32* %p) nounwind {			define i64 @ldp_half_sext_zext_int(i64* %q, i32* %p) nounwind {
	%tmp0 = load i64, i64* %q, align 4			%tmp0 = load i64, i64* %q, align 4
	%tmp = load i32, i32* %p, align 4			%tmp = load i32, i32* %p, align 4
	%add.ptr = getelementptr inbounds i32, i32* %p, i64 1			%add.ptr = getelementptr inbounds i32, i32* %p, i64 1
	%tmp1 = load i32, i32* %add.ptr, align 4			%tmp1 = load i32, i32* %add.ptr, align 4
	%sexttmp = sext i32 %tmp to i64			%sexttmp = sext i32 %tmp to i64
	%sexttmp1 = zext i32 %tmp1 to i64			%sexttmp1 = zext i32 %tmp1 to i64
	%add = add nsw i64 %sexttmp1, %sexttmp			%add = add nsw i64 %sexttmp1, %sexttmp
	%add1 = add nsw i64 %add, %tmp0			%add1 = add nsw i64 %add, %tmp0
	ret i64 %add1			ret i64 %add1
	}			}

	; Test zext + sext clustering.			; Test zext + sext clustering.
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK-LABEL: ldp_half_zext_sext_int:BB#0			; CHECK-LABEL: ldp_half_zext_sext_int:BB#0
	; CHECK: Cluster loads SU(3) - SU(4)			; CHECK: Cluster ld/st SU(3) - SU(4)
	; CHECK: SU(3): %vreg{{[0-9]+}}:sub_32<def,read-undef> = LDRWui			; CHECK: SU(3): %vreg{{[0-9]+}}:sub_32<def,read-undef> = LDRWui
	; CHECK: SU(4): %vreg{{[0-9]+}}<def> = LDRSWui			; CHECK: SU(4): %vreg{{[0-9]+}}<def> = LDRSWui
	define i64 @ldp_half_zext_sext_int(i64* %q, i32* %p) nounwind {			define i64 @ldp_half_zext_sext_int(i64* %q, i32* %p) nounwind {
	%tmp0 = load i64, i64* %q, align 4			%tmp0 = load i64, i64* %q, align 4
	%tmp = load i32, i32* %p, align 4			%tmp = load i32, i32* %p, align 4
	%add.ptr = getelementptr inbounds i32, i32* %p, i64 1			%add.ptr = getelementptr inbounds i32, i32* %p, i64 1
	%tmp1 = load i32, i32* %add.ptr, align 4			%tmp1 = load i32, i32* %add.ptr, align 4
	%sexttmp = zext i32 %tmp to i64			%sexttmp = zext i32 %tmp to i64
	%sexttmp1 = sext i32 %tmp1 to i64			%sexttmp1 = sext i32 %tmp1 to i64
	%add = add nsw i64 %sexttmp1, %sexttmp			%add = add nsw i64 %sexttmp1, %sexttmp
	%add1 = add nsw i64 %add, %tmp0			%add1 = add nsw i64 %add, %tmp0
	ret i64 %add1			ret i64 %add1
	}			}

	; Verify we don't cluster volatile loads.			; Verify we don't cluster volatile loads.
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK-LABEL: ldr_int_volatile:BB#0			; CHECK-LABEL: ldr_int_volatile:BB#0
	; CHECK-NOT: Cluster loads			; CHECK-NOT: Cluster ld/st
	; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDRWui			; CHECK: SU(1): %vreg{{[0-9]+}}<def> = LDRWui
	; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDRWui			; CHECK: SU(2): %vreg{{[0-9]+}}<def> = LDRWui
	define i32 @ldr_int_volatile(i32* %a) nounwind {			define i32 @ldr_int_volatile(i32* %a) nounwind {
	%p1 = getelementptr inbounds i32, i32* %a, i32 1			%p1 = getelementptr inbounds i32, i32* %a, i32 1
	%tmp1 = load volatile i32, i32* %p1, align 2			%tmp1 = load volatile i32, i32* %p1, align 2
	%p2 = getelementptr inbounds i32, i32* %a, i32 2			%p2 = getelementptr inbounds i32, i32* %a, i32 2
	%tmp2 = load volatile i32, i32* %p2, align 2			%tmp2 = load volatile i32, i32* %p2, align 2
	%tmp3 = add i32 %tmp1, %tmp2			%tmp3 = add i32 %tmp1, %tmp2
	ret i32 %tmp3			ret i32 %tmp3
	}			}

test/CodeGen/AArch64/arm64-stp.ll

; CHECK: ret		; CHECK: ret
%p20 = insertelement <4 x i32> %p19, i32 %v, i32 3		%p20 = insertelement <4 x i32> %p19, i32 %v, i32 3
%p21 = bitcast i32* %p to <4 x i32>*		%p21 = bitcast i32* %p to <4 x i32>*
store <4 x i32> %p20, <4 x i32>* %p21, align 4		store <4 x i32> %p20, <4 x i32>* %p21, align 4
ret void		ret void
}		}

; Read of %b to compute %tmp2 shouldn't prevent formation of stp		; Read of %b to compute %tmp2 shouldn't prevent formation of stp
; CHECK-LABEL: stp_int_rar_hazard		; CHECK-LABEL: stp_int_rar_hazard
; CHECK: stp w0, w1, [x2]
; CHECK: ldr [[REG:w[0-9]+]], [x2, #8]		; CHECK: ldr [[REG:w[0-9]+]], [x2, #8]
; CHECK: add w0, [[REG]], w1		; CHECK: add w8, [[REG]], w1
		; CHECK: stp w0, w1, [x2]
; CHECK: ret		; CHECK: ret
define i32 @stp_int_rar_hazard(i32 %a, i32 %b, i32* nocapture %p) nounwind {		define i32 @stp_int_rar_hazard(i32 %a, i32 %b, i32* nocapture %p) nounwind {
store i32 %a, i32* %p, align 4		store i32 %a, i32* %p, align 4
%ld.ptr = getelementptr inbounds i32, i32* %p, i64 2		%ld.ptr = getelementptr inbounds i32, i32* %p, i64 2
%tmp = load i32, i32* %ld.ptr, align 4		%tmp = load i32, i32* %ld.ptr, align 4
%tmp2 = add i32 %tmp, %b		%tmp2 = add i32 %tmp, %b
%add.ptr = getelementptr inbounds i32, i32* %p, i64 1		%add.ptr = getelementptr inbounds i32, i32* %p, i64 1
store i32 %b, i32* %add.ptr, align 4		store i32 %b, i32* %add.ptr, align 4

test/CodeGen/AArch64/global-merge-group-by-use.ll


	@m4 = internal global i32 0, align 4			@m4 = internal global i32 0, align 4
	@n4 = internal global i32 0, align 4			@n4 = internal global i32 0, align 4

	; CHECK-LABEL: f4:			; CHECK-LABEL: f4:
	define void @f4(i32 %a1, i32 %a2, i32 %a3) #0 {			define void @f4(i32 %a1, i32 %a2, i32 %a3) #0 {
	; CHECK-NEXT: adrp x8, [[SET3]]@PAGE			; CHECK-NEXT: adrp x8, [[SET3]]@PAGE
	; CHECK-NEXT: add x8, x8, [[SET3]]@PAGEOFF			; CHECK-NEXT: add x8, x8, [[SET3]]@PAGEOFF
	; CHECK-NEXT: stp w0, w1, [x8, #4]			; CHECK-NEXT: stp w2, w0, [x8]
	; CHECK-NEXT: str w2, [x8]			; CHECK-NEXT: str w1, [x8, #8]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	store i32 %a1, i32* @m4, align 4			store i32 %a1, i32* @m4, align 4
	store i32 %a2, i32* @n4, align 4			store i32 %a2, i32* @n4, align 4
	store i32 %a3, i32* @n3, align 4			store i32 %a3, i32* @n3, align 4
	ret void			ret void
	}			}

	; Finally, check that we don't do anything with one-element global sets.			; Finally, check that we don't do anything with one-element global sets.

Revision Contents

Diff 53278

include/llvm/Target/TargetInstrInfo.h

lib/CodeGen/MachineScheduler.cpp

lib/Target/AArch64/AArch64InstrInfo.h

lib/Target/AArch64/AArch64InstrInfo.cpp

lib/Target/AMDGPU/SIInstrInfo.h

lib/Target/AMDGPU/SIInstrInfo.cpp

test/CodeGen/AArch64/aarch64-stp-cluster.ll

test/CodeGen/AArch64/arm64-ldp-cluster.ll

test/CodeGen/AArch64/arm64-stp.ll

test/CodeGen/AArch64/global-merge-group-by-use.ll