No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I think this needs some text in the code to describe the high-level design changes in the scheduler introduced by this patch.
There are a couple of things to explain like:
- When an instruction is not scheduled (which is also explained in the comments of isUsedOutsideBlock() and areAllOperandsNonInsts()).
- Changes in the scheduler's design and data structures: If I understand correctly the instructions that are skipped don't get a ScheduleData object assigned to them.
Regarding the last point, couldn't we assign a ScheduleData object even to the instructions that are not scheduled and still save compilation time? I would assume that most of the compile-time overhead comes from calculateDependencies(), so as long as we don't calculate dependencies beyond the instructions that are marked to be skipped, then we should still save compilation time. I think this would make this change a bit less intrusive as it would not change the design of the scheduler much. What do you think?
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
803 | We may need to limit the number of users to a small integer because in some pathological cases we may have thousands of them. | |
2366 | Why are we removing Lane? | |
2412–2413 | This is a bit hard to read. I think it could be simplified by iterating across lanes in the for loop for (unsigned Lane = 0, Lanes = VL.size(); Lane != Lanes; ++Lane) and then setting BundleMember inside the for loop. | |
2415 |
| |
2603 | Do we need the checks NextInBundle != nullptr || FirstInBundle != this ? Shouldn't the check TE != nullptr be sufficient? | |
3933 | perhaps rename it to NeedsScheduling? | |
4137 | I don't understand why needToScheduleSingleInstruction(VL) needs to be here. If I understand correctly this assertion checks that if we have failed to schedule VL, then cacnelScheduling has worked correctly and VL0 (and probably the rest of the values?) are not marked as part of a bundle. So needToScheduleSingleInstruction(VL) executes if VL0 is part of a bundle. | |
6490–6491 | Perhaps move these checks in a method`TreeEntry::doesNotNeedScheduling()` ? |
Where do you want to see this info?
Regarding the last point, couldn't we assign a ScheduleData object even to the instructions that are not scheduled and still save compilation time? I would assume that most of the compile-time overhead comes from calculateDependencies(), so as long as we don't calculate dependencies beyond the instructions that are marked to be skipped, then we should still save compilation time. I think this would make this change a bit less intrusive as it would not change the design of the scheduler much. What do you think?
I would like to save some memory too, not only compile time. Why do we need to allocate ScheduleData for the instructions that should not be scheduled?
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
803 | Good point, will do this. | |
2366 | It is no needed anymore (actually used only in one place). Plus, it gets invalidated after graph reordering and it should be recalculated. Instead, better to find the corresponding instruction directly rather than keep this Lane and then recalculate it after graph reordering. Plus, saves some memory. | |
2415 | Rather doubt we need a cache. The checks are pretty simple and should not take much time to perform. | |
2603 | Yes, still need these checks since this member function is used before we actually assigning TE. | |
4137 | isPartOfBundle is not powerfull enough to check if we have just a single instruction that requires scheduling. We don't have assigned TE yet and need an extra check here for a single schedulable instruction. | |
6490–6491 | doesNotNeedScheduling works with a list of values, not with TreeEntry. And in some cases we need to call it for just a list. This check is needed only here, other cases do not require anything like this. |
Where do you want to see this info?
Perhaps right above struct BlockScheduling ?
I would like to save some memory too, not only compile time. Why do we need to allocate ScheduleData for the instructions that should not be scheduled?
Makes sense. Please mention this in the design description text too.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
800 | Nit: Please add a one line comment saying that this limit is to save compilation time when instrs have too many uses. | |
4137 | I am probably missing something about the design here. If the check is false for !BS.getScheduleData(VL0) || !BS.getScheduleData(VL0)->isPartOfBundle() then it means that VL0 is part of a real bundle with more than one instruction in the bundle. In other words, how can we end up in such a situation where VL0 is in a multi-instruction bundle and still have a single instruction in VL that needs scheduling? Let me explain. If we had only a single instruction that needs scheduling, then this could be either: |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4137 | We allocate ScheduleData not only for schedulable instruction, but also for all instruction in between, but since they don't have next elements, they are not considered to be part of bundle. Function isPartOfBundle() is not enough here, because if we have only single schedulable instruction, isPartOfBundle() will still return false here, because this instruction doers not have next element in bundle and TE is not set yet. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2725–2731 | Some of this text repeats multiple times in the patch, it is probably better to avoid too much repetition. What would also be nice to have here is some brief explanation about the design, like when exactly a ScheduleData entry is assigned to an Instruction. | |
4137 |
I see, initScheduleData() won't allocate ScheduleData unless it is schedulable, but extendSchedulingRegion() will do so without checking. Why is this done this way, and not for example always skip ScheduleData?
I don't follow. If isPartOfBundle() returns false then !BS.getScheduleData(VL0) || !BS.getScheduleData(VL0)->isPartOfBundle() is true, and we don't need needToScheduleSingelInstruction(VL). The assertion would pass anyway. | |
7991–7994 | move this under if (doesNotNeedToSchedule(I)) ? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4137 |
Actually, the check in extendSchedulingRegion() is not required, there is an assert instead. Plus, extendSchedulingRegion() calls initScheduleData(), which then checks if actually need to schedule the given instruction.
Ah, sorry, did not check logic correctly. Actually, it is safe to remove this check, was part of the development process, forgot to remove. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
817 | nit: perhaps rename it to doesNotNeedToBeScheduled? If it was a member function of the scheduler I think it would sound fine being called like scheduler.doesNotNeedToSchedule(V). |
Hi Alexey
with this patch, i noticed an assert building one of our runtime files , the test case .c produced is around 24000 lines
would you like it as is? or reduced ?
Instruction does not dominate all uses!
%39 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %18, <8 x i16> %38, <8 x i16> <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>) %23 = shufflevector <8 x i16> %22, <8 x i16> %39, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
Instruction does not dominate all uses!
%67 = call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %44, <8 x i16> %66, <8 x i16> <i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15>) %51 = shufflevector <8 x i16> %50, <8 x i16> %67, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
in function eshift
fatal error: error in backend: Broken function found, compilation aborted!
Hi Ron, it would be good if you can provide reduced case. I'll revert the patch meanwhile.
Hi we are seeing clang front end errors:
Instruction does not dominate all uses! %shuffle419 = shufflevector <2 x i64> %359, <2 x i64> poison, <2 x i32> <i32 1, i32 0> %385 = xor <2 x i64> %shuffle419, %360, !dbg !1753 Instruction does not dominate all uses! %shuffle418 = shufflevector <2 x i64> %357, <2 x i64> poison, <2 x i32> <i32 1, i32 0> %386 = xor <2 x i64> %shuffle418, %358, !dbg !1753 Instruction does not dominate all uses! %shuffle417 = shufflevector <2 x i64> %355, <2 x i64> poison, <2 x i32> <i32 1, i32 0> %387 = xor <2 x i64> %shuffle417, %356, !dbg !1753 Instruction does not dominate all uses! %shuffle = shufflevector <2 x i64> %353, <2 x i64> poison, <2 x i32> <i32 1, i32 0> %388 = xor <2 x i64> %shuffle, %354, !dbg !1753 Instruction does not dominate all uses! %463 = call <2 x i64> @llvm.fshl.v2i64(<2 x i64> %416, <2 x i64> %462, <2 x i64> <i64 63, i64 63>), !dbg !1845 %460 = shufflevector <2 x i64> %432, <2 x i64> %463, <2 x i32> <i32 0, i32 3> in function HRSS_generate_key
When building boringssl and we found out this patch through git bisect. We have a reproducer available. I will file a github issue and post the reproducer there.
Nit: Please add a one line comment saying that this limit is to save compilation time when instrs have too many uses.