This is the continue work for https://reviews.llvm.org/D36704. And I have described the problems of allowing more than two instructions fused together in https://reviews.llvm.org/D69998 . We will create unnecessary dependency edges that hurt the scheduler and break the fusion. Therefore, we should change the interface to leave it to the target to determine the max fuse instruction number it wants.
Details
Diff Detail
Event Timeline
I am not sure if this example best illustrates the issue. Is your main concern here potentially longer dependency chains?
The original test case explicitly checks the 2 fused pairs and without knowing details about the micro architecture, there is no way of telling if it is beneficial to schedule add; add; sub adjacent. The reason 2 pairs are created is because for fuse-arith-logic patterns, the opcode of the first MI can also match the second MI. I think fuse-arith-logic is the only fusion pattern on AARch64 that has this property. And only the ExynosM4 has it enabled. I think @evandro originally added those patterns, maybe he can shed some light on the benefits?
Alternatively, are there similar cases for other targets?
Can you please clarify how the existing implementation hurts the scheduler and when fused pairs are then broken?
See the example I gave in https://reviews.llvm.org/D69998,
int foo(int a, int b, int c, int d) { return a + b + c +d; }
And this is the Dependency Graph:
+------+ +------+ +------+ +------+ | A | | B | | C | | D | +--+--++ +---+--+ +--+---+ +--+---+ ^ ^ ^ ^ ^ ^ | | | | | | | | | |New1 +--------------+ | | | | | | | | | +--+---+ | |New2 | +-------+ ADD1 | | | | +--+---+ | | | Fuse ^ | | +-------------+ | +------------+ | | | Fuse +--+---+ +----------->+ ADD2 | | +------+ +--+---+ | ADD3 | +------+
And we need also create an artificial edge from ADD1 to A if https://reviews.llvm.org/D69998 is landed. That will force the Node A scheduled before the ADD1 and ADD2. But in fact, it is ok to schedule the Node A in-between ADD3 and ADD2, as ADD3 and ADD2 are NOT a fusion pair because ADD2 has been matched to ADD1. We are creating these unnecessary dependency edges that override the heuristics.
My main concern is that, some node (such as Node A in my below example) lose some freedom to do the schedule. We are creating unnecessary dependencies for it. Things become more worse if
ADD ADD // expected some instructions being scheduled here. ADD ADD
Current implementation will force all these 4 ADD scheduled together and no instructions is allowed to insert in-between {ADD,ADD} {ADD,ADD}
The original test case explicitly checks the 2 fused pairs and without knowing details about the micro architecture, there is no way of telling if it is beneficial to schedule add; add; sub adjacent. The reason 2 pairs are created is because for fuse-arith-logic patterns, the opcode of the first MI can also match the second MI. I think fuse-arith-logic is the only fusion pattern on AARch64 that has this property. And only the ExynosM4 has it enabled. I think @evandro originally added those patterns, maybe he can shed some light on the benefits?
Yeah, my point is that, as you said, we didn't see the benefit to schedule add;add;sub adjacent from micro architecture aspect, but we see the bad things happening as I gave below.
Alternatively, are there similar cases for other targets?
No, as far as I know currently. But no guarantee for the future, as hw is moving forward you know.
That's what I thought, thanks for clarifying.
Alternatively, are there similar cases for other targets?
No, as far as I know currently. But no guarantee for the future, as hw is moving forward you know.
Sure, but then someone has to add new fusion patterns to the matcher, otherwise the case can not happen (as I said, I am not what targets beside AArch64 match) and it is not possible to add a test case.
I would recommend moving the logic to count the number of fused instructions into the target specific shouldScheduleAdjacent as suggested inline. On targets that have no problematic patterns, we can just assert that we never fuse an instruction with one that's already fused. So once we have problematic patterns there, we know because of the assertion. For targets like AArch64 with problematic patterns, we can add the bail out based on the number of fused instructions. It probably makes sense to add the limit as sub-target feature. It would also be good to put up patches for each target separately, as different people might be interested in taking a look.
Thanks for working on this! With the suggestions, I think this would be good to go in, independently of D69998
llvm/lib/Target/AArch64/AArch64MacroFusion.cpp | ||
---|---|---|
381 | If we make the decision here, I think we should also just count the predecessors here, instead of doing so at the call site. If the sub target does not have a limit or does not require one, we should not iterate over the clusters unnecessarily. We could add a helper like 'hasLessThanNumFused` in MacroFusion.h which would do something like the function below, to stop when the limit is reached: bool HasLessThanNumFused(const SUnit &SU, unsigned FuseLimit) { unsigned Num = 0; const SUnit *CurrentSU = &SU; while ((CurrentSU = getPredClusterSU(*CurrentSU)) && Num < FuseLimt) Num ++; return Num < FuseLimit; } We would have to pass in the SUnit of course. |
llvm/lib/Target/AArch64/AArch64MacroFusion.cpp | ||
---|---|---|
381 | If I understand correctly, we have to change the interface of shouldScheduleAdjacent that contains the SUnit instead of the MachineInstr. That is really not clean, as the target should be aware of the SUnit, to count the pred. Can we do it another way ? |
llvm/lib/Target/AArch64/AArch64MacroFusion.cpp | ||
---|---|---|
381 |
I am not sure if passing in the SU directly is a bit deal here. Do you have a specific concern? The macro fusion implementations deal specifically with scheduling issues, so IMO passing the SU in directly makes sense to me, so they can make better decisions (instead of passing bits of information computed from the SU). IIRC the main reason we pass in MAchineInstrs instead of SUs is that there was no need for information from the SUs initially. Also, I think there might be cases where better fusion heuristics might want to use additional info from the SU (although the current ones are mostly 'good enough') I think it would be great to keep the API flexible here and to not impose unnecessary checks/computation for some targets.
With the current structure, all fusion related flags are controlled by the sub target, which is the right level of granularity IMO. I think it would be better to not add additional fusion related flags to TTI, if that's what you mean with infrastructure. |
llvm/lib/Target/AArch64/AArch64MacroFusion.cpp | ||
---|---|---|
381 | Oh, sorry about the confusion. The infrastructure I mean is inside the MacroFusion implementation, not TTI. i.e. bool MacroFusion::schedueAdjacentImpl(...) { ... if (!HasLessThanNumFused(DepSU) || ShouldScheduleAdjacent(...)) continue; } The target is unaware of this, so that, we don't need to provide a helper function or pass the SUnit. And we are basing on all the macro-fusion is implemented by hw as back-to-back. And it will benefit all the targets if they add some pattern that could fuse together in case. However, I am also ok with your suggestion, as it also makes sense. |
I have updated the patch for easy to review. Personally, passing the SUnit and add one helper function for target to check the fused SU number is also ok for me. The only concern is that, other target might not know that, they need to check it manually if they happens to add some fusion pattern that could be fused more than once. As the hardware only supports the back-2-back macro-fusion as far as I see now, it makes sense to assume it in MacroFusion. And we can change it in the future if some target want it.
The penalty for this patch is to check the SU's number for all targets. However, as the limit is 2 and we need to walk the list once at most. It is cheap I think.
Right, that's why I originally suggested to pass the SU and add an assertions making sure that exactly 2 instructions are chained together to all targets (except AArch64 with +fuse-arith-logic).
I still think that's slightly preferable to avoid any unnecessary checks in release builds. The impact on compile-time will likely be very small, but the backend is already notorious for being slow. So if there are any unnecessary checks we can skip without too much effort, IMO we should try to do so. And in this case we also give the backend specific heuristics more flexibility. But if you think the benefit of not passing the SU trumps the compile-time concern, I think that would be OK as well.
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
35 | nit: no need to put the static functions into an anonymous namespace. | |
196 | I think it is not too clear what back-2-back means here. IIUC it means more that we do not chain more than 2 instructions together. In a chain of 3 instructions, all instructions are still back-to-back. |
Adding the assertion in each target will solve my concern, thank you for the comments. As the check here is really cheap, and I won't expose the SUnit and dependency stuff to the target except that we have some strong need. (We have to check the number of chained SUnits for FirstSU not SecondSU and target has to understand why) Anyway, it is not a problem.
Update the patch according to comments.
I realized that I had a small un-submitted comment here.....Please also update the documentation for shouldScheduleAdjacent to clarify that it won't be called if DepMI is already part of a pair.
Err.. But it is a callback... I didn't find a good place to give this kind of message for shouldScheduleAdjacent. Any suggestions ?
nit: no need to put the static functions into an anonymous namespace.