This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Do not schedule instructions with constants/argument/phi operands and external users.
ClosedPublic

Authored by ABataev on Mar 7 2022, 7:42 AM.

Details

Summary

No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.

Diff Detail

Event Timeline

ABataev created this revision.Mar 7 2022, 7:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 7:42 AM
ABataev requested review of this revision.Mar 7 2022, 7:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 7:42 AM

I think this needs some text in the code to describe the high-level design changes in the scheduler introduced by this patch.
There are a couple of things to explain like:

  • When an instruction is not scheduled (which is also explained in the comments of isUsedOutsideBlock() and areAllOperandsNonInsts()).
  • Changes in the scheduler's design and data structures: If I understand correctly the instructions that are skipped don't get a ScheduleData object assigned to them.

Regarding the last point, couldn't we assign a ScheduleData object even to the instructions that are not scheduled and still save compilation time? I would assume that most of the compile-time overhead comes from calculateDependencies(), so as long as we don't calculate dependencies beyond the instructions that are marked to be skipped, then we should still save compilation time. I think this would make this change a bit less intrusive as it would not change the design of the scheduler much. What do you think?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
803

We may need to limit the number of users to a small integer because in some pathological cases we may have thousands of them.

2366

Why are we removing Lane?

2412–2413

This is a bit hard to read. I think it could be simplified by iterating across lanes in the for loop for (unsigned Lane = 0, Lanes = VL.size(); Lane != Lanes; ++Lane) and then setting BundleMember inside the for loop.

2415
  1. I think we need a function like mustBeScheduled(Value *V) that calls areAllOperandsNonInsts() and isUsedOutsideBlock().
  1. Also these checks show up in more than one place. Do you think we could check once and and cache the outcome of the check in a map? Perhaps add a NeedsScheduling field in ScheduleData ?
2603

Do we need the checks NextInBundle != nullptr || FirstInBundle != this ? Shouldn't the check TE != nullptr be sufficient?

3933

perhaps rename it to NeedsScheduling?

4137

I don't understand why needToScheduleSingleInstruction(VL) needs to be here. If I understand correctly this assertion checks that if we have failed to schedule VL, then cacnelScheduling has worked correctly and VL0 (and probably the rest of the values?) are not marked as part of a bundle. So needToScheduleSingleInstruction(VL) executes if VL0 is part of a bundle.

6490–6491

Perhaps move these checks in a method`TreeEntry::doesNotNeedScheduling()` ?

I think this needs some text in the code to describe the high-level design changes in the scheduler introduced by this patch.
There are a couple of things to explain like:

  • When an instruction is not scheduled (which is also explained in the comments of isUsedOutsideBlock() and areAllOperandsNonInsts()).
  • Changes in the scheduler's design and data structures: If I understand correctly the instructions that are skipped don't get a ScheduleData object assigned to them.

Where do you want to see this info?

Regarding the last point, couldn't we assign a ScheduleData object even to the instructions that are not scheduled and still save compilation time? I would assume that most of the compile-time overhead comes from calculateDependencies(), so as long as we don't calculate dependencies beyond the instructions that are marked to be skipped, then we should still save compilation time. I think this would make this change a bit less intrusive as it would not change the design of the scheduler much. What do you think?

I would like to save some memory too, not only compile time. Why do we need to allocate ScheduleData for the instructions that should not be scheduled?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
803

Good point, will do this.

2366

It is no needed anymore (actually used only in one place). Plus, it gets invalidated after graph reordering and it should be recalculated. Instead, better to find the corresponding instruction directly rather than keep this Lane and then recalculate it after graph reordering. Plus, saves some memory.

2415

Rather doubt we need a cache. The checks are pretty simple and should not take much time to perform.

2603

Yes, still need these checks since this member function is used before we actually assigning TE.

4137

isPartOfBundle is not powerfull enough to check if we have just a single instruction that requires scheduling. We don't have assigned TE yet and need an extra check here for a single schedulable instruction.

6490–6491

doesNotNeedScheduling works with a list of values, not with TreeEntry. And in some cases we need to call it for just a list. This check is needed only here, other cases do not require anything like this.

ABataev updated this revision to Diff 414661.Mar 11 2022, 7:57 AM

Address comments

Where do you want to see this info?

Perhaps right above struct BlockScheduling ?

I would like to save some memory too, not only compile time. Why do we need to allocate ScheduleData for the instructions that should not be scheduled?

Makes sense. Please mention this in the design description text too.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
800

Nit: Please add a one line comment saying that this limit is to save compilation time when instrs have too many uses.

4137

I am probably missing something about the design here.

If the check is false for !BS.getScheduleData(VL0) || !BS.getScheduleData(VL0)->isPartOfBundle() then it means that VL0 is part of a real bundle with more than one instruction in the bundle. In other words, how can we end up in such a situation where VL0 is in a multi-instruction bundle and still have a single instruction in VL that needs scheduling?

Let me explain. If we had only a single instruction that needs scheduling, then this could be either:
(i) VL0, in which case it should be a single-instruction bundle so isPartOfBundle() would be false, or
(ii) some other instruction in VL, in which case VL0 should not require scheduling so getScheduleData(VL0) should be false because we no longer assign ScheduleData objects to instructions that don't need scheduling.

ABataev added inline comments.Mar 11 2022, 12:53 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4137

We allocate ScheduleData not only for schedulable instruction, but also for all instruction in between, but since they don't have next elements, they are not considered to be part of bundle. Function isPartOfBundle() is not enough here, because if we have only single schedulable instruction, isPartOfBundle() will still return false here, because this instruction doers not have next element in bundle and TE is not set yet.
Here we have a corner case, where only single instruction from the VL must be scheduled and isPartOfBundle() still returns false.

ABataev updated this revision to Diff 414734.Mar 11 2022, 1:22 PM

Address comments

vporpo added inline comments.Mar 11 2022, 3:23 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2725–2731

Some of this text repeats multiple times in the patch, it is probably better to avoid too much repetition.

What would also be nice to have here is some brief explanation about the design, like when exactly a ScheduleData entry is assigned to an Instruction.

4137

We allocate ScheduleData not only for schedulable instruction, but also for all instruction in between,

I see, initScheduleData() won't allocate ScheduleData unless it is schedulable, but extendSchedulingRegion() will do so without checking. Why is this done this way, and not for example always skip ScheduleData?

Here we have a corner case, where only single instruction from the VL must be scheduled and isPartOfBundle() still returns false.

I don't follow. If isPartOfBundle() returns false then !BS.getScheduleData(VL0) || !BS.getScheduleData(VL0)->isPartOfBundle() is true, and we don't need needToScheduleSingelInstruction(VL). The assertion would pass anyway.

7991–7994

move this under if (doesNotNeedToSchedule(I)) ?

ABataev marked an inline comment as done.Mar 14 2022, 7:00 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4137

We allocate ScheduleData not only for schedulable instruction, but also for all instruction in between,

I see, initScheduleData() won't allocate ScheduleData unless it is schedulable, but extendSchedulingRegion() will do so without checking. Why is this done this way, and not for example always skip ScheduleData?

Actually, the check in extendSchedulingRegion() is not required, there is an assert instead. Plus, extendSchedulingRegion() calls initScheduleData(), which then checks if actually need to schedule the given instruction.

Here we have a corner case, where only single instruction from the VL must be scheduled and isPartOfBundle() still returns false.

I don't follow. If isPartOfBundle() returns false then !BS.getScheduleData(VL0) || !BS.getScheduleData(VL0)->isPartOfBundle() is true, and we don't need needToScheduleSingelInstruction(VL). The assertion would pass anyway.

Ah, sorry, did not check logic correctly. Actually, it is safe to remove this check, was part of the development process, forgot to remove.

ABataev updated this revision to Diff 415089.Mar 14 2022, 7:33 AM

Address comments + rebase

vporpo accepted this revision.Mar 14 2022, 12:00 PM
vporpo added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
817

nit: perhaps rename it to doesNotNeedToBeScheduled? If it was a member function of the scheduler I think it would sound fine being called like scheduler.doesNotNeedToSchedule(V).

This revision is now accepted and ready to land.Mar 14 2022, 12:00 PM
This revision was landed with ongoing or failed builds.Mar 16 2022, 6:07 AM
This revision was automatically updated to reflect the committed changes.

Hi Alexey
with this patch, i noticed an assert building one of our runtime files , the test case .c produced is around 24000 lines
would you like it as is? or reduced ?

Instruction does not dominate all uses!

%39 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %18, <8 x i16> %38, <8 x i16> <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>)
%23 = shufflevector <8 x i16> %22, <8 x i16> %39, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>

Instruction does not dominate all uses!

%67 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %44, <8 x i16> %66, <8 x i16> <i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15>)
%51 = shufflevector <8 x i16> %50, <8 x i16> %67, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>

in function eshift
fatal error: error in backend: Broken function found, compilation aborted!

Hi Alexey
with this patch, i noticed an assert building one of our runtime files , the test case .c produced is around 24000 lines
would you like it as is? or reduced ?

Instruction does not dominate all uses!

%39 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %18, <8 x i16> %38, <8 x i16> <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>)
%23 = shufflevector <8 x i16> %22, <8 x i16> %39, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>

Instruction does not dominate all uses!

%67 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %44, <8 x i16> %66, <8 x i16> <i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15>)
%51 = shufflevector <8 x i16> %50, <8 x i16> %67, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>

in function eshift
fatal error: error in backend: Broken function found, compilation aborted!

Hi Ron, it would be good if you can provide reduced case. I'll revert the patch meanwhile.

haowei added a subscriber: haowei.Mar 16 2022, 4:09 PM

Hi Alexey
with this patch, i noticed an assert building one of our runtime files , the test case .c produced is around 24000 lines
would you like it as is? or reduced ?

Instruction does not dominate all uses!

%39 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %18, <8 x i16> %38, <8 x i16> <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>)
%23 = shufflevector <8 x i16> %22, <8 x i16> %39, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>

Instruction does not dominate all uses!

%67 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %44, <8 x i16> %66, <8 x i16> <i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15>)
%51 = shufflevector <8 x i16> %50, <8 x i16> %67, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>

in function eshift
fatal error: error in backend: Broken function found, compilation aborted!

Hi Ron, it would be good if you can provide reduced case. I'll revert the patch meanwhile.

Hi we are seeing clang front end errors:

Instruction does not dominate all uses!
  %shuffle419 = shufflevector <2 x i64> %359, <2 x i64> poison, <2 x i32> <i32 1, i32 0>
  %385 = xor <2 x i64> %shuffle419, %360, !dbg !1753
Instruction does not dominate all uses!
  %shuffle418 = shufflevector <2 x i64> %357, <2 x i64> poison, <2 x i32> <i32 1, i32 0>
  %386 = xor <2 x i64> %shuffle418, %358, !dbg !1753
Instruction does not dominate all uses!
  %shuffle417 = shufflevector <2 x i64> %355, <2 x i64> poison, <2 x i32> <i32 1, i32 0>
  %387 = xor <2 x i64> %shuffle417, %356, !dbg !1753
Instruction does not dominate all uses!
  %shuffle = shufflevector <2 x i64> %353, <2 x i64> poison, <2 x i32> <i32 1, i32 0>
  %388 = xor <2 x i64> %shuffle, %354, !dbg !1753
Instruction does not dominate all uses!
  %463 = call <2 x i64> @llvm.fshl.v2i64(<2 x i64> %416, <2 x i64> %462, <2 x i64> <i64 63, i64 63>), !dbg !1845
  %460 = shufflevector <2 x i64> %432, <2 x i64> %463, <2 x i32> <i32 0, i32 3>
in function HRSS_generate_key

When building boringssl and we found out this patch through git bisect. We have a reproducer available. I will file a github issue and post the reproducer there.