This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/CodeGen/
-
CodeGen/
2
MacroFusion.cpp
-
test/CodeGen/AArch64/
-
CodeGen/
-
AArch64/
-
macro-fusion.ll

Differential D70066

[MacroFusion] Limit the max fused number as 2 to reduce the dependency
ClosedPublic

Authored by steven.zhang on Nov 10 2019, 11:51 PM.

Download Raw Diff

Details

Reviewers

fhahn
evandro
MatzeB
arsenm
jsji
hfinkel

Group Reviewers

Restricted Project

Commits

rGd84b320dfd0a: [MacroFusion] Limit the max fused number as 2 to reduce the dependency

Summary

This is the continue work for https://reviews.llvm.org/D36704. And I have described the problems of allowing more than two instructions fused together in https://reviews.llvm.org/D69998 . We will create unnecessary dependency edges that hurt the scheduler and break the fusion. Therefore, we should change the interface to leave it to the target to determine the max fuse instruction number it wants.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

steven.zhang created this revision.Nov 10 2019, 11:51 PM

Herald added a project: Restricted Project. · View Herald TranscriptNov 10 2019, 11:52 PM

Herald added subscribers: hiraditya, nhaehnle, wdng, jvesely. · View Herald Transcript

steven.zhang added a parent revision: D69998: [MacroFusion] Create the missing artificial edges if there are more than 2 SU fused..Nov 10 2019, 11:53 PM

Gentle ping.

I am not sure if this example best illustrates the issue. Is your main concern here potentially longer dependency chains?

The original test case explicitly checks the 2 fused pairs and without knowing details about the micro architecture, there is no way of telling if it is beneficial to schedule add; add; sub adjacent. The reason 2 pairs are created is because for fuse-arith-logic patterns, the opcode of the first MI can also match the second MI. I think fuse-arith-logic is the only fusion pattern on AARch64 that has this property. And only the ExynosM4 has it enabled. I think @evandro originally added those patterns, maybe he can shed some light on the benefits?

Alternatively, are there similar cases for other targets?

Can you please clarify how the existing implementation hurts the scheduler and when fused pairs are then broken?

See the example I gave in https://reviews.llvm.org/D69998,

int foo(int a, int b, int c, int d) {
  return a + b + c +d;
}

And this is the Dependency Graph:

+------+       +------+       +------+       +------+
|  A   |       |  B   |       |  C   |       |  D   |
+--+--++       +---+--+       +--+---+       +--+---+
   ^  ^            ^  ^          ^              ^
   |  |            |  |          |              |
   |  |            |  |New1      +--------------+
   |  |            |  |          |
   |  |            |  |       +--+---+
   |  |New2        |  +-------+ ADD1 |
   |  |            |          +--+---+
   |  |            |    Fuse     ^
   |  |            +-------------+
   |  +------------+
   |               |
   |   Fuse     +--+---+
   +----------->+ ADD2 |
   |            +------+
+--+---+
| ADD3 |
+------+

And we need also create an artificial edge from ADD1 to A if https://reviews.llvm.org/D69998 is landed. That will force the Node A scheduled before the ADD1 and ADD2. But in fact, it is ok to schedule the Node A in-between ADD3 and ADD2, as ADD3 and ADD2 are NOT a fusion pair because ADD2 has been matched to ADD1. We are creating these unnecessary dependency edges that override the heuristics.

In D70066#1756949, @fhahn wrote:

I am not sure if this example best illustrates the issue. Is your main concern here potentially longer dependency chains?

My main concern is that, some node (such as Node A in my below example) lose some freedom to do the schedule. We are creating unnecessary dependencies for it. Things become more worse if

ADD
ADD
// expected some instructions being scheduled here.
ADD
ADD

Current implementation will force all these 4 ADD scheduled together and no instructions is allowed to insert in-between {ADD,ADD} {ADD,ADD}

The original test case explicitly checks the 2 fused pairs and without knowing details about the micro architecture, there is no way of telling if it is beneficial to schedule add; add; sub adjacent. The reason 2 pairs are created is because for fuse-arith-logic patterns, the opcode of the first MI can also match the second MI. I think fuse-arith-logic is the only fusion pattern on AARch64 that has this property. And only the ExynosM4 has it enabled. I think @evandro originally added those patterns, maybe he can shed some light on the benefits?

Yeah, my point is that, as you said, we didn't see the benefit to schedule add;add;sub adjacent from micro architecture aspect, but we see the bad things happening as I gave below.

Alternatively, are there similar cases for other targets?

No, as far as I know currently. But no guarantee for the future, as hw is moving forward you know.

In D70066#1758249, @steven.zhang wrote:

In D70066#1756949, @fhahn wrote:

I am not sure if this example best illustrates the issue. Is your main concern here potentially longer dependency chains?

My main concern is that, some node (such as Node A in my below example) lose some freedom to do the schedule.

That's what I thought, thanks for clarifying.

Alternatively, are there similar cases for other targets?

No, as far as I know currently. But no guarantee for the future, as hw is moving forward you know.

Sure, but then someone has to add new fusion patterns to the matcher, otherwise the case can not happen (as I said, I am not what targets beside AArch64 match) and it is not possible to add a test case.

I would recommend moving the logic to count the number of fused instructions into the target specific shouldScheduleAdjacent as suggested inline. On targets that have no problematic patterns, we can just assert that we never fuse an instruction with one that's already fused. So once we have problematic patterns there, we know because of the assertion. For targets like AArch64 with problematic patterns, we can add the bail out based on the number of fused instructions. It probably makes sense to add the limit as sub-target feature. It would also be good to put up patches for each target separately, as different people might be interested in taking a look.

Thanks for working on this! With the suggestions, I think this would be good to go in, independently of D69998

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381 ↗	(On Diff #228636)	If we make the decision here, I think we should also just count the predecessors here, instead of doing so at the call site. If the sub target does not have a limit or does not require one, we should not iterate over the clusters unnecessarily. We could add a helper like 'hasLessThanNumFused` in MacroFusion.h which would do something like the function below, to stop when the limit is reached: bool HasLessThanNumFused(const SUnit &SU, unsigned FuseLimit) { unsigned Num = 0; const SUnit CurrentSU = &SU; while ((CurrentSU = getPredClusterSU(CurrentSU)) && Num < FuseLimt) Num ++; return Num < FuseLimit; } We would have to pass in the SUnit of course.

steven.zhang marked an inline comment as done.Nov 25 2019, 9:02 PM

steven.zhang added inline comments.

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381 ↗	(On Diff #228636)	If I understand correctly, we have to change the interface of shouldScheduleAdjacent that contains the SUnit instead of the MachineInstr. That is really not clean, as the target should be aware of the SUnit, to count the pred. Can we do it another way ? Let's assume that, the macro fusion must be back to back. That is, check the SU numbers using HasLessThanNumFused () in the infrastructure instead of the target hook and bail out if it reach the limit. Does it make sense ?

fhahn added inline comments.Nov 26 2019, 12:20 PM

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381 ↗	(On Diff #228636)	If I understand correctly, we have to change the interface of shouldScheduleAdjacent that contains the SUnit instead of the MachineInstr. That is really not clean, as the target should be aware of the SUnit, to count the pred. Can we do it another way ? I am not sure if passing in the SU directly is a bit deal here. Do you have a specific concern? The macro fusion implementations deal specifically with scheduling issues, so IMO passing the SU in directly makes sense to me, so they can make better decisions (instead of passing bits of information computed from the SU). IIRC the main reason we pass in MAchineInstrs instead of SUs is that there was no need for information from the SUs initially. Also, I think there might be cases where better fusion heuristics might want to use additional info from the SU (although the current ones are mostly 'good enough') I think it would be great to keep the API flexible here and to not impose unnecessary checks/computation for some targets. Let's assume that, the macro fusion must be back to back. That is, check the SU numbers using HasLessThanNumFused () in the infrastructure instead of the target hook and bail out if it reach the limit. Does it make sense ? With the current structure, all fusion related flags are controlled by the sub target, which is the right level of granularity IMO. I think it would be better to not add additional fusion related flags to TTI, if that's what you mean with infrastructure.

steven.zhang marked an inline comment as done.Nov 26 2019, 7:50 PM

steven.zhang added inline comments.

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381 ↗	(On Diff #228636)	Oh, sorry about the confusion. The infrastructure I mean is inside the MacroFusion implementation, not TTI. i.e. bool MacroFusion::schedueAdjacentImpl(...) { ... if (!HasLessThanNumFused(DepSU) \|\| ShouldScheduleAdjacent(...)) continue; } The target is unaware of this, so that, we don't need to provide a helper function or pass the SUnit. And we are basing on all the macro-fusion is implemented by hw as back-to-back. And it will benefit all the targets if they add some pattern that could fuse together in case. However, I am also ok with your suggestion, as it also makes sense.

steven.zhang updated this revision to Diff 231175.Nov 26 2019, 9:11 PM

steven.zhang removed a parent revision: D69998: [MacroFusion] Create the missing artificial edges if there are more than 2 SU fused..

I have updated the patch for easy to review. Personally, passing the SUnit and add one helper function for target to check the fused SU number is also ok for me. The only concern is that, other target might not know that, they need to check it manually if they happens to add some fusion pattern that could be fused more than once. As the hardware only supports the back-2-back macro-fusion as far as I see now, it makes sense to assume it in MacroFusion. And we can change it in the future if some target want it.

The penalty for this patch is to check the SU's number for all targets. However, as the limit is 2 and we need to walk the list once at most. It is cheap I think.

In D70066#1761217, @steven.zhang wrote:

I have updated the patch for easy to review. Personally, passing the SUnit and add one helper function for target to check the fused SU number is also ok for me. The only concern is that, other target might not know that, they need to check it manually if they happens to add some fusion pattern that could be fused more than once. As the hardware only supports the back-2-back macro-fusion as far as I see now, it makes sense to assume it in MacroFusion. And we can change it in the future if some target want it.

Right, that's why I originally suggested to pass the SU and add an assertions making sure that exactly 2 instructions are chained together to all targets (except AArch64 with +fuse-arith-logic).

I still think that's slightly preferable to avoid any unnecessary checks in release builds. The impact on compile-time will likely be very small, but the backend is already notorious for being slow. So if there are any unnecessary checks we can skip without too much effort, IMO we should try to do so. And in this case we also give the backend specific heuristics more flexibility. But if you think the benefit of not passing the SU trumps the compile-time concern, I think that would be OK as well.

llvm/lib/CodeGen/MacroFusion.cpp
35	nit: no need to put the static functions into an anonymous namespace.
179	I think it is not too clear what back-2-back means here. IIUC it means more that we do not chain more than 2 instructions together. In a chain of 3 instructions, all instructions are still back-to-back.

Adding the assertion in each target will solve my concern, thank you for the comments. As the check here is really cheap, and I won't expose the SUnit and dependency stuff to the target except that we have some strong need. (We have to check the number of chained SUnits for FirstSU not SecondSU and target has to understand why) Anyway, it is not a problem.

Update the patch according to comments.

LGTM

This revision is now accepted and ready to land.Dec 3 2019, 12:46 AM

Closed by commit rGd84b320dfd0a: [MacroFusion] Limit the max fused number as 2 to reduce the dependency (authored by steven.zhang). · Explain WhyDec 3 2019, 9:12 PM

This revision was automatically updated to reflect the committed changes.

I realized that I had a small un-submitted comment here.....Please also update the documentation for shouldScheduleAdjacent to clarify that it won't be called if DepMI is already part of a pair.

In D70066#1921272, @fhahn wrote:

I realized that I had a small un-submitted comment here.....Please also update the documentation for shouldScheduleAdjacent to clarify that it won't be called if DepMI is already part of a pair.

Err.. But it is a callback... I didn't find a good place to give this kind of message for shouldScheduleAdjacent. Any suggestions ?

Revision Contents

Path

Size

llvm/

lib/

CodeGen/

MacroFusion.cpp

19 lines

test/

CodeGen/

AArch64/

macro-fusion.ll

9 lines

Diff 232033

llvm/lib/CodeGen/MacroFusion.cpp

Show All 26 Lines

STATISTIC(NumFused, "Number of instr pairs fused");		STATISTIC(NumFused, "Number of instr pairs fused");

using namespace llvm;		using namespace llvm;

static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,		static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,
cl::desc("Enable scheduling for macro fusion."), cl::init(true));		cl::desc("Enable scheduling for macro fusion."), cl::init(true));

static bool isHazard(const SDep &Dep) {		static bool isHazard(const SDep &Dep) {
		fhahnUnsubmitted Not Done Reply Inline Actions nit: no need to put the static functions into an anonymous namespace. fhahn: nit: no need to put the static functions into an anonymous namespace.
return Dep.getKind() == SDep::Anti \|\| Dep.getKind() == SDep::Output;		return Dep.getKind() == SDep::Anti \|\| Dep.getKind() == SDep::Output;
}		}

		static SUnit *getPredClusterSU(const SUnit &SU) {
		for (const SDep &SI : SU.Preds)
		if (SI.isCluster())
		return SI.getSUnit();

		return nullptr;
		}

		static bool hasLessThanNumFused(const SUnit &SU, unsigned FuseLimit) {
		unsigned Num = 1;
		const SUnit *CurrentSU = &SU;
		while ((CurrentSU = getPredClusterSU(*CurrentSU)) && Num < FuseLimit) Num ++;
		return Num < FuseLimit;
		}

static bool fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,		static bool fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
SUnit &SecondSU) {		SUnit &SecondSU) {
// Check that neither instr is already paired with another along the edge		// Check that neither instr is already paired with another along the edge
// between them.		// between them.
for (SDep &SI : FirstSU.Succs)		for (SDep &SI : FirstSU.Succs)
if (SI.isCluster())		if (SI.isCluster())
return false;		return false;

▲ Show 20 Lines • Show All 109 Lines • ▼ Show 20 Lines	for (SDep &Dep : AnchorSU.Preds) {
// Ignore dependencies other than data or strong ordering.		// Ignore dependencies other than data or strong ordering.
if (Dep.isWeak() \|\| isHazard(Dep))		if (Dep.isWeak() \|\| isHazard(Dep))
continue;		continue;

SUnit &DepSU = *Dep.getSUnit();		SUnit &DepSU = *Dep.getSUnit();
if (DepSU.isBoundaryNode())		if (DepSU.isBoundaryNode())
continue;		continue;

		// Only chain two instructions together at most.
		fhahnUnsubmitted Not Done Reply Inline Actions I think it is not too clear what back-2-back means here. IIUC it means more that we do not chain more than 2 instructions together. In a chain of 3 instructions, all instructions are still back-to-back. fhahn: I think it is not too clear what back-2-back means here. IIUC it means more that we do not…
const MachineInstr *DepMI = DepSU.getInstr();		const MachineInstr *DepMI = DepSU.getInstr();
if (!shouldScheduleAdjacent(TII, ST, DepMI, AnchorMI))		if (!hasLessThanNumFused(DepSU, 2) \|\|
		!shouldScheduleAdjacent(TII, ST, DepMI, AnchorMI))
continue;		continue;

if (fuseInstructionPair(DAG, DepSU, AnchorSU))		if (fuseInstructionPair(DAG, DepSU, AnchorSU))
return true;		return true;
}		}

return false;		return false;
}		}
Show All 16 Lines

llvm/test/CodeGen/AArch64/macro-fusion.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: llc < %s -mtriple=aarch64-linux-gnu -mattr=+fuse-arith-logic -verify-misched -debug-only=machine-scheduler 2>&1 > /dev/null \| FileCheck %s			; RUN: llc < %s -mtriple=aarch64-linux-gnu -mattr=+fuse-arith-logic -verify-misched -debug-only=machine-scheduler 2>&1 > /dev/null \| FileCheck %s

	; Verify that, the macro-fusion creates the necessary dependencies between SUs.			; Verify that, the macro-fusion creates the necessary dependencies between SUs and
				; only 2 SU's are fused at most.
	define signext i32 @test(i32 signext %a, i32 signext %b, i32 signext %c, i32 signext %d) {			define signext i32 @test(i32 signext %a, i32 signext %b, i32 signext %c, i32 signext %d) {
	entry:			entry:
	; CHECK: ******** MI Scheduling ********			; CHECK: ******** MI Scheduling ********
	; CHECK-LABEL: %bb.0 entry			; CHECK-LABEL: %bb.0 entry
	; CHECK: Macro fuse: SU([[SU4:[0-9]+]]) - SU([[SU5:[0-9]+]])			; CHECK: Macro fuse: SU([[SU4:[0-9]+]]) - SU([[SU5:[0-9]+]])
	; CHECK: Bind SU([[SU1:[0-9]+]]) - SU([[SU4]])			; CHECK: Bind SU([[SU1:[0-9]+]]) - SU([[SU4]])
	; CHECK: Macro fuse: SU([[SU5]]) - SU([[SU6:[0-9]+]])			; CHECK-NOT: Macro fuse:
	; CHECK: Bind SU([[SU0:[0-9]+]]) - SU([[SU5]])
	; CHECK: SU([[SU0]]): %{{[0-9]+}}:gpr32 = COPY $w3
	; CHECK: SU([[SU1]]): %{{[0-9]+}}:gpr32 = COPY $w2			; CHECK: SU([[SU1]]): %{{[0-9]+}}:gpr32 = COPY $w2
	; CHECK: SU([[SU4]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr			; CHECK: SU([[SU4]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr
	; CHECK: SU([[SU5]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr			; CHECK: SU([[SU5]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr
	; CHECK: SU([[SU6]]): %{{[0-9]+}}:gpr32 = nsw SUBWrr

	%add = add nsw i32 %b, %a			%add = add nsw i32 %b, %a
	%add1 = add nsw i32 %add, %c			%add1 = add nsw i32 %add, %c
	%sub = sub nsw i32 %add1, %d			%sub = sub nsw i32 %add1, %d
	ret i32 %sub			ret i32 %sub
	}			}