This is an archive of the discontinued LLVM Phabricator instance.

[ScheduleDAGRRList] Use std::*_heap() to keep candidate queue a heap.
Changes Planned · Public

Authored by fhahn on Jul 7 2020, 12:18 PM.

Details

Summary

Currently popFromQueueImpl iterates over all candidates to find the best
one. While the candidate queue is small, this is not a problem. But it
becomes a problem once the queue gets larger. For example, the snippet
below takes 330s to compile with llc -O0, but completes in 3s with this
patch.

define void @test(i4000000* %ptr) {
entry:
  store i4000000 0, i4000000* %ptr, align 4
  ret void
}

On backends that use the MachineScheduler, there should be no changes in
the generated code (e.g. for X86 there are no binary changes with this
patch when building MultiSource, SPEC2000, SPEC2006 with -O3 -lto).

On backends that are not using the MachineScheduler, there is a slight
change in behavior: previously, the first candidate in the list would be
picked when multiple candidates had the same score; with the heap, a
different candidate among those with equal scores may be chosen.

For small worklists, maintaining the heap can be more expensive than it
is worth, so the new approach is only used for candidate lists with more
than 100 candidates. See
http://llvm-compile-time-tracker.com/compare.php?from=058af835063ff9afc39fc53279fa660e075564ed&to=390d055bee63860be70caf57515b3f29f7728d91&stat=instructions
where the first commit is with a limit and the second one always uses
a heap. The first commit is slightly faster.

Overall on CTMark, the change is mostly neutral
(http://llvm-compile-time-tracker.com/compare.php?from=f7522a5823d66303edfd8d872232dd6b07190f42&to=058af835063ff9afc39fc53279fa660e075564ed&stat=instructions)
but it is beneficial for very large inputs.

Diff Detail

Event Timeline

fhahn created this revision. Jul 7 2020, 12:18 PM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 7 2020, 12:18 PM

I'm concerned that the behavior of queues with multiple candidates with the same score might not be consistent across compilers. (This is similar to using llvm::sort when you really need std::stable_sort.)

fhahn planned changes to this revision. Jul 7 2020, 1:24 PM

I'm concerned that the behavior of queues with multiple candidates with the same score might not be consistent across compilers. (This is similar to using llvm::sort when you really need std::stable_sort.)

It looks like the comparators try hard to break ties between candidates with the same score (via an increasing NodeQueueId), but I think I noticed cases where we still visit candidates in a slightly different order. I'll take a closer look.

fhahn updated this revision to Diff 276450. Jul 8 2020, 8:48 AM

I updated the patch to limit the use of the heap to the src order comparator (which is used in combination with the MachineScheduler) and added extra verification to ensure the heap remains properly ordered and that we pick the same candidate as with the existing heuristic.

With those verifications enabled, I managed to do a bootstrap build on X86 and built SPEC2000, SPEC2006 and MultiSource on X86 and AArch64 without a crash.

After taking a closer look at the source order comparator, it looks like it guarantees a total ordering (it falls back to the NodeQueueId if all other criteria are equal, and this ID is unique), so I don't think we would run into problems with multiple SUs having the same score.

There is a different potential issue though. For some comparators, the scoring of a candidate can change when a different node is scheduled. This does not seem to be an issue in practice for the source order comparator, but it is for some of the other comparators, which is why this patch now limits the change to the source order comparator. That said, after looking at the code for the source order comparator, it looks like the score could still change after units are scheduled in some edge cases. This is not a big problem: it would happen in a deterministic way and should only have a very minor impact on the generated code, as the MachineScheduler has the main responsibility for scheduling.

What do you think?

We might even go further and limit the source order comparator to just the IR ordering and the queue IDs, because the real scheduling should happen in the MachineScheduler.

After looking at the code for the source order comparator, it looks like the score could change after units are scheduled as well in some edge cases.

So AssertisHeap might fail? I'm not really comfortable with that...

We might even go further and limit the source order comparator to just the IR ordering and the queue IDs, because the real scheduling should happen in the MachineScheduler.

Make this a separate patch, in case it has some unexpected side-effect, but sure, that makes sense.

Also, maybe we could change the way we compute scheduling priority based on the size of the queue: keep the current scheduling for common cases, but switch to a simpler heuristic if the queue gets too large.

fhahn added a comment. Jul 8 2020, 1:56 PM

After looking at the code for the source order comparator, it looks like the score could change after units are scheduled as well in some edge cases.

So AssertisHeap might fail? I'm not really comfortable with that...

I am not sure if we want to leave them in either. The main reason to include them in the patch was to show how I tried to verify that things behave sanely for a wide range of inputs. As mentioned earlier, not picking the best candidate here should not be a big deal, and it should happen very rarely (it did not happen during a bootstrap on X86 and various SPEC & MultiSource benchmarks). The selection should be deterministic across different compilers/C++ STLs because the comparator enforces a total order. Does that make sense?

We might even go further and limit the source order comparator to just the IR ordering and the queue IDs, because the real scheduling should happen in the MachineScheduler.

Make this a separate patch, in case it has some unexpected side-effect, but sure, that makes sense.

Yes, that definitely needs to be separate. I'll need to do a more careful evaluation there, as changing the heuristic unfortunately impacts a bunch of test cases in small ways.

Also, maybe we could change the way we compute scheduling priority based on the size of the queue. So keep the current scheduling for common cases, but switch to a simpler heuristic if the queue gets too large.

Are you referring to using the heap only once the queue grows larger than a threshold, or to deciding which scheduling heuristics to enable based on the size? I'll add the original threshold back to the patch; I removed it to ensure the heap & assertions are applied as broadly as possible for verification.

Are you referring to using the heap only once the queue grows larger than a threshold, or to deciding which scheduling heuristics to enable based on the size?

The scheduling heuristics.

The selection should be deterministic across different compilers/C++ STLs because the comparator enforces a total order.

It's undefined behavior to call std::push_heap/std::pop_heap on an array that isn't a heap. If the total order changes, that can break the heap property. Not sure what the practical consequence would be on common STL implementations, but that seems scary enough that we want to ensure that can't happen.

fhahn added a comment. Jul 10 2020, 2:37 PM

Are you referring to using the heap only once the queue grows larger than a threshold, or to deciding which scheduling heuristics to enable based on the size?

The scheduling heuristics.

The selection should be deterministic across different compilers/C++ STLs because the comparator enforces a total order.

It's undefined behavior to call std::push_heap/std::pop_heap on an array that isn't a heap. If the total order changes, that can break the heap property. Not sure what the practical consequence would be on common STL implementations, but that seems scary enough that we want to ensure that can't happen.

Yeah, we should avoid that. I'll take another look at the source order comparator, but I don't think we can rule out changing costs as of right now. Potentially changing the comparator for backends using the MachineScheduler is a somewhat bigger task. In the meantime I think I'll put up a patch that limits the number of candidates to scan linearly, to avoid a nasty quadratic compile-time case.

fhahn planned changes to this revision. Jul 22 2020, 8:00 AM

I'll put this on hold for now, until I have more time to investigate the heuristics. In the meantime I've put up D84328, which adds a cut-off for the number of candidates to check and avoids the huge compile-time increases caused by the linear scan here.