This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] MachineScheduler: schedule execution metric added for the UnclusteredHighRPStage
ClosedPublic

Authored by alex-t on Dec 9 2022, 5:58 AM.

Details

Summary

Since the divergence-driven ISel was fully enabled, we have more VGPRs available.

The MachineScheduler, trying to take advantage of that, bumps up the occupancy at the
cost of memory-latency hiding, which spoils an initially good schedule. A new metric
reflecting the latency-hiding quality of a schedule has been added to balance occupancy
against latency. The metric is based on a latency model that computes the ratio of bubble
cycles to working cycles. We then use this ratio to decide whether the higher-occupancy
schedule is profitable as follows:

    Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric
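For illustration only, a minimal sketch of this decision in scaled-integer arithmetic (the names `profit`, `keepNewSchedule`, and the `ScaleFactor` value are hypothetical, not the patch's actual identifiers; the scaling mirrors the later review discussion about avoiding floats):

```cpp
#include <cassert>

// Illustrative only: decide whether a higher-occupancy schedule pays off.
// Profit = (NewOcc/OldOcc) * (OldMetric/NewMetric), kept in integer
// arithmetic by scaling each ratio before dividing.
constexpr unsigned ScaleFactor = 100;

unsigned profit(unsigned OldOcc, unsigned NewOcc, unsigned OldMetric,
                unsigned NewMetric) {
  assert(OldOcc && NewMetric && "divisors must be non-zero");
  // Scale each ratio so integer division does not discard the fractional
  // part prematurely.
  unsigned OccRatio = NewOcc * ScaleFactor / OldOcc;
  unsigned MetricRatio = OldMetric * ScaleFactor / NewMetric;
  return OccRatio * MetricRatio / ScaleFactor; // result scaled by ScaleFactor
}

// The new schedule is kept only when Profit >= 1.0, i.e. >= ScaleFactor.
bool keepNewSchedule(unsigned P) { return P >= ScaleFactor; }
```

For example, raising occupancy from 2 to 3 while the metric worsens from 10 to 20 yields a scaled profit of 75 (&lt; 100), so the new schedule would be reverted.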

Diff Detail

Event Timeline

alex-t created this revision.Dec 9 2022, 5:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2022, 5:58 AM
alex-t requested review of this revision.Dec 9 2022, 5:58 AM
Herald added a subscriber: wdng. · View Herald Transcript
alex-t updated this revision to Diff 481614.Dec 9 2022, 6:05 AM

odd changes removed

alex-t updated this revision to Diff 481618.Dec 9 2022, 6:13 AM

one more minor code cleanup

Do you have any performance measurements?
Also guard it with an option to turn it off.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

You probably do not need to compute it always, just in UnclusteredHighRPStage?

1079

No else after return.

1088

Avoid using float. Use scaled integers.

vpykhtin added inline comments.Dec 10 2022, 1:56 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
949

Model -> ReadyCycles?

964

max

969

Model[&MI] = ReadyCycle;

vpykhtin added inline comments.Dec 10 2022, 2:04 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
956

if (!SU) continue;

what is it BTW? Debug instruction?

alex-t updated this revision to Diff 482123.Dec 12 2022, 7:36 AM
alex-t marked 6 inline comments as done.

changes as requested

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

I only need this at the stage preceding the UnclusteredHighRPStage because it is the "MetricBefore" computation. The UnclusteredHighRPStage only runs for regions that satisfy the condition:

bool UnclusteredHighRPStage::initGCNRegion() {
  // Only reschedule regions with the minimum occupancy or regions that may have
  // spilling (excess register pressure).
  if ((!DAG.RegionsWithMinOcc[RegionIdx] ||
       DAG.MinOccupancy <= InitialOccupancy) &&
      !DAG.RegionsWithExcessRP[RegionIdx])
    return false;

  return GCNSchedStage::initGCNRegion();
}

What I should have done here is avoid running this for the GCNSchedStageID::ClusteredLowOccupancyReschedule stage.

949

What do you mean? I don't understand the question.

956

Not only debug instructions. A copy, for example.

1088

What is wrong with float? I would agree if we were targeting an embedded platform with no or very expensive floating-point support. Could you explain where the overhead is on an x86-like host, for example?

vpykhtin added inline comments.Dec 12 2022, 7:43 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
949

Sorry, I mean rename Model to ReadyCycles

alex-t updated this revision to Diff 482135.Dec 12 2022, 8:10 AM

Model map renamed to ReadyCycles

alex-t marked an inline comment as done.Dec 12 2022, 8:10 AM
rampitec added inline comments.Dec 12 2022, 10:58 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

I believe it is better to place it into UnclusteredHighRPStage::initGCNSchedStage(). This will allow rearranging stages and also skipping it if the UnclusteredHighRPStage itself is skipped.

1088

We do not run only on x86; for instance, our lit tests run on a variety of platforms that llvm supports. Besides, you do not want to get different results depending on the host platform or even the host compiler due to different rounding. It is also good to avoid float so that llvm itself does not need to link in soft-float support on platforms whose compiler merely supports codegen for amdgpu.

alex-t updated this revision to Diff 482244.Dec 12 2022, 1:17 PM

floating points changed to scaled integers

alex-t marked 2 inline comments as done.Dec 12 2022, 1:20 PM
alex-t added inline comments.Dec 12 2022, 2:11 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

I compute and store metrics per region, and only for regions with minimal occupancy. initGCNSchedStage() has no regions yet. initGCNRegion() is also not a proper place, as the previous stage's schedule must be complete before we can compute its metrics.

rampitec added inline comments.Dec 12 2022, 2:36 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

Regions are created by the first stage. By the time UnclusteredHighRPStage::initGCNSchedStage() runs, they should already be initialized. The schedule at initialization is what the previous stage has left.

alex-t added inline comments.Dec 12 2022, 3:04 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

We don't need metrics for the whole function, just for several regions.
To compute and store metrics for the regions with min occupancy (for which the UnclusteredHighRPStage will actually run), we would need a loop over all regions in initGCNSchedStage().
I put it into finalizeGCNRegion() because there I can easily check whether metrics are needed for the concrete region and compute them if necessary.

kerbowa added inline comments.Dec 12 2022, 10:37 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

It might be better in UnclusteredHighRPStage::initGCNRegion(), but I guess the problem is that you need the DAG to be built already? Maybe we even need a virtual getSchedulerMetrics(), since I can imagine this being used for more than the UnclusteredHighRPStage eventually, and for now it only does something in the initOccStage. If it stays in the initOccStage, can you add an assert that the next stage is UnclusteredHighRP?

938

Is this debug only?

951

Can't this just iterate over SUnits?

963

What about unbuffered resources like MFMA? I guess maybe it should be considered in future patches.

964

Should this be TargetSchedModel::computeInstrLatency? We don't use itineraries so I think it is just falling back here.

1089

What happens if the original schedule has no stalls, i.e. a metric of 0? Does that mean that no amount of occupancy increase can result in a profitable tradeoff?

This seems heavily biased towards increases in ILP. A single-stall cycle can be weighted as more important than an increase in occupancy from 2 to 3. I don't think this was the intention, was it? I think we need some sort of Occupancy scaling factor in the Profit calculation to have some way of tuning the importance of occupancy.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
129

Could this be moved into the class below so it is not in the llvm namespace?

134

NIT: LLVM formatting.

alex-t marked an inline comment as done.Dec 13 2022, 12:13 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

Your guess is quite correct. My first intention was exactly UnclusteredHighRPStage::initGCNRegion(). Then I found that unfortunately, I have no DAG yet :)

1089

I agree with you. And a scaling factor was planned. The reason it doesn't exist yet is that we still don't have access to HW other than Navi and we can't run enough tests to determine a reasonable scaling factor.
The idea behind this initial heuristic was: “This fixes the regression. Since we don't have an HW to test with, let's fix it and let QA see what happens."
But your comment is a good catch. Indeed, I should have added a compiler option for the scaling factor with a default value of 1.

alex-t updated this revision to Diff 483134.Dec 15 2022, 4:57 AM
alex-t marked an inline comment as done.
  1. Metric calculation calls localized to UnclusteredHighRPStage::shouldRevertScheduling. Now getScheduleMetrics is only called when it is really necessary.
  2. The -amdgpu-schedule-metric-bias=<unsigned value> compiler option was added to ease further testing and tuning. It defaults to 10, which means a schedule without bubbles gets a 10-point reward.
  3. Several other changes according to reviewer requests.
alex-t marked 7 inline comments as done.Dec 15 2022, 5:06 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

What I really need is a metric for the schedule before the UnclusteredHighRPStage and after it. The only place I need it is UnclusteredHighRPStage::shouldRevertScheduling,
where I already have SUnits, which reflects the order before, and the BB, which reflects the current order. So, making getScheduleMetrics accept a vector of SUnit looks like a perfect solution. I don't have to care about any stages except the UnclusteredHighRPStage.

951

It now does, as I pass the SUnits vector to getScheduleMetrics.

alex-t marked an inline comment as done.Dec 15 2022, 5:15 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

This is the very first, draft implementation, which aims to find out what it yields.
Specifics like HW hazards and unbuffered resources may be considered in future patches if we observe that a more precise model is needed.
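The draft latency model under discussion can be sketched as a toy (the structure, the one-instruction-per-cycle issue assumption, and all names here are illustrative, not the patch's actual code): each instruction becomes ready when all of its predecessors' results are ready, and cycles spent waiting for operands count as bubbles.

```cpp
#include <algorithm>
#include <vector>

// Toy instruction: predecessor indices within the schedule plus a latency.
struct ToyInstr {
  std::vector<unsigned> Preds; // indices of predecessor instructions
  unsigned Latency;
};

// Count stall (bubble) cycles for a given linear schedule, assuming one
// instruction issues per cycle once its operands are ready.
unsigned countBubbles(const std::vector<ToyInstr> &Schedule) {
  std::vector<unsigned> ReadyCycles(Schedule.size(), 0);
  unsigned CurrCycle = 0, Bubbles = 0;
  for (unsigned I = 0; I < Schedule.size(); ++I) {
    unsigned ReadyCycle = CurrCycle;
    for (unsigned P : Schedule[I].Preds)
      ReadyCycle = std::max(ReadyCycle, ReadyCycles[P]);
    Bubbles += ReadyCycle - CurrCycle;              // stall waiting for operands
    ReadyCycles[I] = ReadyCycle + Schedule[I].Latency; // result available here
    CurrCycle = ReadyCycle + 1;                     // issue the instruction
  }
  return Bubbles;
}
```

Placing an independent instruction between a long-latency producer and its consumer hides part of the latency, which is exactly the effect the metric rewards.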

I still want to have an option to disable this heuristic.
Also are there any performance measurements done?

rampitec added inline comments.Dec 15 2022, 10:55 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1105

If NewMetric is 0 you will divide by 0.

alex-t marked an inline comment as done.Dec 15 2022, 12:45 PM

We still have no HW available to run the benchmarks.
Since we've been waiting too long for that, the idea behind this patch is:
let's commit it, and then QA will have to run all the benchmarks for me :)

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1105

It is never 0. All metrics with a bubble amount of less than 1% are clamped to 1.

unsigned getMetric() const {
  unsigned Metric = (BubbleCycles * ScaleFactor) / ScheduleLength;
  // Metric is zero if the amount of bubbles is less than 1% which is too
  // small. So, return 1.
  return Metric ? Metric : 1;
}
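As a standalone, hedged re-implementation of the snippet above (the ScaleFactor value of 100 is an assumption consistent with the "less than 1%" comment; the function name is illustrative):

```cpp
// Re-implementation of getMetric() for illustration. With ScaleFactor = 100
// the metric is the bubble percentage of the schedule, clamped to at least 1
// so later ratio computations never divide by zero.
constexpr unsigned ScaleFactor = 100;

unsigned getMetricSketch(unsigned BubbleCycles, unsigned ScheduleLength) {
  unsigned Metric = BubbleCycles * ScaleFactor / ScheduleLength;
  return Metric ? Metric : 1; // clamp: never return 0
}
```

So a 400-cycle schedule with 3 bubble cycles (0.75%) still reports a metric of 1 rather than 0.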
alex-t marked an inline comment as done.Dec 15 2022, 12:54 PM

@rampitec

amdgpu-schedule-metric-bias=100

makes the scheduler always prefer occupancy over latency,
but the schedule metrics are still computed.
Did you mean the option to completely switch the getScheduleMetrics OFF?

OK, thanks. Document it in the help text of the option.

alex-t updated this revision to Diff 483319.Dec 15 2022, 1:27 PM

amdgpu-schedule-metric-bias description updated

vpykhtin added inline comments.Dec 16 2022, 2:24 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1087

Unneeded copy. Add a second version of getScheduleMetrics that can operate on MachineInstr*; you can translate an instruction to an SUnit inside.

1106

This Profit formula is very hard to read; could you split it?
Instead of the last division by SF, maybe compare Profit < (SF * SF)?
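A minimal sketch of this suggestion (the function and parameter names are hypothetical): both ratios already carry a factor of SF, so their product is scaled by SF * SF and the final lossy division can be dropped by moving the scale to the other side of the comparison.

```cpp
// OccRatio and MetricRatio are each pre-scaled by SF, so their product is
// scaled by SF * SF. Comparing against SF * SF avoids a final integer
// division that would discard precision.
bool lessProfitable(unsigned OccRatio, unsigned MetricRatio, unsigned SF) {
  return OccRatio * MetricRatio < SF * SF;
}
```

For instance, with SF = 100, an occupancy ratio of 1.5 (150) and a metric ratio of 0.5 (50) multiply to 7500 &lt; 10000, so the new schedule is not profitable.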

vpykhtin added inline comments.Dec 16 2022, 2:51 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
953

auto &SU

1087

Sorry, I meant not a second version but a template, something like:

// shim accessors
SUnit &getSUnit(ScheduleDAGInstrs &DAG, MachineInstr *MI) { return *DAG.getSUnit(MI); }
SUnit &getSUnit(ScheduleDAGInstrs &, SUnit &SU) { return SU; }

template <typename Range>
ScheduleMetrics GCNSchedStage::getScheduleMetrics(Range &&S) {
  ...
  for (auto &X : S) {
    SUnit &SU = getSUnit(DAG, X);
    ...
  }
}

alex-t updated this revision to Diff 483993.Dec 19 2022, 9:50 AM

refactored getScheduleMetrics to avoid copying

alex-t marked 3 inline comments as done.Dec 19 2022, 10:01 AM
alex-t updated this revision to Diff 484202.EditedDec 20 2022, 3:14 AM

Schedule printing function moved out of the GCNSchedStage class.
Objects passed by reference are marked as const.

rampitec added inline comments.Jan 4 2023, 12:19 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
30

Do you still need it?

848

Keep this. It is unrelated to the patch.

955

Could you please use a more specific name for the struct?

972

auto &I.

991

auto &SU.

1006

Any chance for CurrCycle to be zero?

1040

Any chance for CurrCycle to be zero?

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
19

You do not seem to need it here.

290

SUnit is not a small class, how about SUnit*?

alex-t updated this revision to Diff 486640.Jan 5 2023, 11:31 AM
alex-t marked 10 inline comments as done.

minor changes according to the reviewer request

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1006

Since we have

CurrCycle = ++ReadyCycle;

the only chance is an empty input. Since the input is DAG.SUnits, we'd never get there, as it is never empty.

1040

No chance, for the same reason as at line 1006.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
290

It benefits from passing llvm::ScheduleDAG::SUnits by reference.
The llvm::ScheduleDAG member SUnits already has the type std::vector<llvm::SUnit>.
To pass a vector of pointers, I would need a loop that fills it in.

rampitec accepted this revision.Jan 5 2023, 11:45 AM

LGTM with a nit.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

Indent is off.

This revision is now accepted and ready to land.Jan 5 2023, 11:45 AM
alex-t marked an inline comment as done.Jan 5 2023, 12:04 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

clang-format does this! :)
What indent would you like to see?

rampitec added inline comments.Jan 5 2023, 12:05 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

Really?! OK then.

alex-t marked 2 inline comments as done.Jan 5 2023, 12:08 PM