This is an archive of the discontinued LLVM Phabricator instance.

[ScheduleDAGRRList] Use std::*_heap() to keep candidate queue a heap.
Changes Planned · Public

Authored by fhahn on Jul 7 2020, 12:18 PM.

Details

Summary

Currently popFromQueueImpl iterates over all candidates to find the best
one. While the candidate queue is small, this is not a problem. But it
becomes a problem once the queue gets larger. For example, the snippet
below takes 330s to compile with llc -O0, but completes in 3s with this
patch.

define void @test(i4000000* %ptr) {
entry:
  store i4000000 0, i4000000* %ptr, align 4
  ret void
}

On backends that use the MachineScheduler, there should be no changes in
the generated code (e.g. for X86 there are no binary changes with this
patch when building MultiSource, SPEC2000, SPEC2006 with -O3 -lto).

On backends that are not using the MachineScheduler, there is a slight
change in behavior: previously, the first candidate in the list would be
picked when multiple candidates had the same score; with the heap, a
different candidate among those with equal scores may be chosen.

For small worklists, maintaining the heap can be more expensive than it
is worth, so the new approach is only used for candidate lists with more
than 100 candidates. See
http://llvm-compile-time-tracker.com/compare.php?from=058af835063ff9afc39fc53279fa660e075564ed&to=390d055bee63860be70caf57515b3f29f7728d91&stat=instructions
where the first commit is with a limit and the second one always uses
a heap. The first commit is slightly faster.

Overall on CTMark, the change is mostly neutral
(http://llvm-compile-time-tracker.com/compare.php?from=f7522a5823d66303edfd8d872232dd6b07190f42&to=058af835063ff9afc39fc53279fa660e075564ed&stat=instructions)
but it is beneficial for very large inputs.

Diff Detail

Event Timeline

fhahn created this revision. Jul 7 2020, 12:18 PM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 7 2020, 12:18 PM

I'm concerned that the behavior of queues with multiple candidates with the same score might not be consistent across compilers. (This is similar to using llvm::sort when you really need std::stable_sort.)

fhahn planned changes to this revision. Jul 7 2020, 1:24 PM

I'm concerned that the behavior of queues with multiple candidates with the same score might not be consistent across compilers. (This is similar to using llvm::sort when you really need std::stable_sort.)

It looks like the comparators try hard to break ties between candidates with the same score (via an increasing NodeQueueId), but I think I noticed cases where we still visit candidates in a slightly different order. I'll take a closer look.

fhahn updated this revision to Diff 276450. Jul 8 2020, 8:48 AM

I updated the patch to limit the use of the heap to the src order comparator (which is used in combination with the MachineScheduler) and added extra verification to ensure the heap remains properly ordered and that we pick the same candidate as with the existing heuristic.

With those verifications enabled, I managed to do a bootstrap build on X86 and built SPEC2000, SPEC2006 and MultiSource on X86 and AArch64 without a crash.

After taking a closer look at the source order comparator, it looks like it guarantees a total ordering (it falls back to the NodeQueueId if all other criteria are equal, and this ID is unique), so I don't think we would run into problems with multiple SUs having the same score.

There is a different potential issue though. For some comparators, the scoring of a candidate can change when a different node is scheduled. This does not seem to be an issue in practice for the source order comparator, but it is for some of the other comparators, which is why this patch now limits the change to the source order comparator. That said, after looking at the code for the source order comparator, it looks like the score could still change after units are scheduled in some edge cases. This is not a big problem: it would happen in a deterministic way and should only have a very minor impact on the generated code, as the MachineScheduler has the main responsibility for scheduling.

What do you think?

We might even go further and limit the source order comparator to just the IR ordering and the queue IDs, because the real scheduling should happen in the MachineScheduler.

After looking at the code for the source order comparator, it looks like the score could change after units are scheduled as well in some edge cases.

So AssertisHeap might fail? I'm not really comfortable with that...

We might even go further and limit the source order comparator to just the IR ordering and the queue IDs, because the real scheduling should happen in the MachineScheduler.

Make this a separate patch, in case it has some unexpected side-effect, but sure, that makes sense.

Also, maybe we could change the way we compute scheduling priority based on the size of the queue: keep the current scheduling for common cases, but switch to a simpler heuristic if the queue gets too large.

fhahn added a comment. Jul 8 2020, 1:56 PM

After looking at the code for the source order comparator, it looks like the score could change after units are scheduled as well in some edge cases.

So AssertisHeap might fail? I'm not really comfortable with that...

I am not sure if we want to leave them in either. The main reason to include them in the patch was to show how I tried to verify that things behave sanely for a wide range of inputs. As mentioned earlier, not picking the best candidate here should not be a big deal, and it should happen very rarely (it did not happen during a bootstrap on X86 and various SPEC & MultiSource benchmarks). The selection should be deterministic across different compilers/C++ STLs because the comparator enforces a total order. Does that make sense?

We might even go further and limit the source order comparator to just the IR ordering and the queue IDs, because the real scheduling should happen in the MachineScheduler.

Make this a separate patch, in case it has some unexpected side-effect, but sure, that makes sense.

Yes, that definitely needs to be separate. I'll need to do a more careful evaluation there, as changing the heuristic unfortunately impacts a bunch of test cases in small ways.

Also, maybe we could change the way we compute scheduling priority based on the size of the queue. So keep the current scheduling for common cases, but switch to a simpler heuristic if the queue gets too large.

Are you referring to using the heap only once the queue grows larger than a threshold, or to deciding which scheduling heuristics to enable based on the size? I'll add the original threshold back to the patch; I removed it to ensure the heap & assertions are applied as broadly as possible for verification.

Are you referring to using the heap only once the queue grows larger than a threshold, or to deciding which scheduling heuristics to enable based on the size?

The scheduling heuristics.

The selection should be deterministic across different compilers/C++ STLs because the comparator enforces a total order.

It's undefined behavior to call std::push_heap/std::pop_heap on an array that isn't a heap. If the total order changes, that can break the heap property. Not sure what the practical consequence would be on common STL implementations, but that seems scary enough that we want to ensure that can't happen.

fhahn added a comment. Jul 10 2020, 2:37 PM

Are you referring to using the heap only once the queue grows larger than a threshold, or to deciding which scheduling heuristics to enable based on the size?

The scheduling heuristics.

The selection should be deterministic across different compilers/C++ STLs because the comparator enforces a total order.

It's undefined behavior to call std::push_heap/std::pop_heap on an array that isn't a heap. If the total order changes, that can break the heap property. Not sure what the practical consequence would be on common STL implementations, but that seems scary enough that we want to ensure that can't happen.

Yeah, we should avoid that. I'll take another look at the source order comparator, but I don't think we can rule out changing costs as of right now. Potentially changing the comparator for backends using the MachineScheduler is a somewhat bigger task. In the meantime I think I'll put up a patch that limits the number of candidates to scan linearly, to avoid a nasty quadratic compile-time case.

fhahn planned changes to this revision. Jul 22 2020, 8:00 AM

I'll put this on hold for now, until I have more time to investigate the heuristics. In the meantime I've put up D84328, which adds a cut-off for the number of candidates to check and avoids the huge compile-time increases caused by the linear scan here.