This is an archive of the discontinued LLVM Phabricator instance.

Differential D69998

[MacroFusion] Create the missing artificial edges if there are more than 2 SU fused.
Abandoned, Public

Authored by steven.zhang on Nov 8 2019, 2:16 AM.

Details

Reviewers

jsji
nemanjai
hfinkel
fhahn
evandro
MatzeB
arsenm

Group Reviewers

Restricted Project

Summary

Currently, LLVM's MacroFusion fuses adjacent instructions regardless of whether one of them has already been fused. However, it fails to create some of the required edges, which causes problems.

Assume that we have the code:

int foo(int a, int b, int c, int d) {
  return a + b + c + d;
}

Assume that ADD followed by ADD is a fusion pair. This is the dependency graph:

+------+       +------+       +------+       +------+
|  A   |       |  B   |       |  C   |       |  D   |
+--+--++       +---+--+       +--+---+       +--+---+
   ^  ^            ^  ^          ^              ^
   |  |            |  |          |              |
   |  |            |  |New1      +--------------+
   |  |            |  |          |
   |  |            |  |       +--+---+
   |  |New2        |  +-------+ ADD1 |
   |  |            |          +--+---+
   |  |            |    Fuse     ^
   |  |            +-------------+
   |  +------------+
   |               |
   |   Fuse     +--+---+
   +----------->+ ADD2 |
   |            +------+
+--+---+
| ADD3 |
+------+

When ADD1 and ADD2 are fused, we create an artificial edge New1 to make sure that B is scheduled before ADD1. When ADD3 and ADD2 are fused, another artificial edge New2 is created to make sure that A is scheduled before ADD2. However, this is NOT enough: we also need an artificial edge from ADD1 to A to make sure that A is scheduled before ADD1 as well.
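The missing edge can be illustrated with a toy model. The following is only a sketch under stated assumptions: `Node`, `fuse`, and `add1WaitsForA` are hypothetical names invented for illustration and are not LLVM's `SUnit`, `SDep`, or `fuseInstructionPair`. It models just the "predecessors of the second node must precede the first node" half of the problem.

```cpp
#include <cassert>
#include <set>

// Toy stand-in for a scheduling unit. This is NOT llvm::SUnit.
struct Node {
  std::set<Node *> Preds;      // nodes that must be scheduled before this one
  Node *ClusterPred = nullptr; // node fused directly in front of this one
};

// Fuse First -> Second. Every predecessor of Second must also come before
// First, otherwise it could be scheduled between the two fused nodes.
// With WalkChain set, the same is done for each node already fused in front
// of First. Returns the number of artificial edges added.
static int fuse(Node &First, Node &Second, bool WalkChain) {
  int Added = 0;
  for (Node *Cur = &First; Cur; Cur = WalkChain ? Cur->ClusterPred : nullptr) {
    for (Node *P : Second.Preds) {
      // Skip the node itself and nodes that already depend on it; an edge
      // to those would create a cycle.
      if (P == Cur || P->Preds.count(Cur))
        continue;
      if (Cur->Preds.insert(P).second)
        ++Added;
    }
  }
  Second.ClusterPred = &First;
  return Added;
}

// Builds the graph from the summary and reports whether ADD1 ends up
// depending on A after both fusions.
static bool add1WaitsForA(bool WalkChain) {
  Node A, B, C, D, ADD1, ADD2, ADD3;
  ADD1.Preds = {&C, &D};
  ADD2.Preds = {&ADD1, &B};
  ADD3.Preds = {&ADD2, &A};
  fuse(ADD1, ADD2, WalkChain); // adds New1: B before ADD1
  fuse(ADD2, ADD3, WalkChain); // adds New2: A before ADD2
  return ADD1.Preds.count(&A) != 0;
}
```

Calling `add1WaitsForA(false)` models the pairwise behaviour, where only New1 and New2 are created; `add1WaitsForA(true)` walks the already-fused chain and also makes ADD1 wait for A.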

Event Timeline

steven.zhang created this revision. Nov 8 2019, 2:16 AM

Herald added a project: Restricted Project. Nov 8 2019, 2:16 AM

Herald added subscribers: wuzish, hiraditya, nhaehnle and 2 others.

My understanding is that you are doing two fixes in this patch:

1. Extend the API of shouldScheduleAdjacent to avoid fusing an instruction more than once along the same dependency chain.
2. Extend fuseInstructionPair to add artificial data dependencies for chained fusion.

To me, yes, we should do #1 first, as https://reviews.llvm.org/D36704 was already trying to do so.

#2 might not be necessary, as you mentioned, all existing targets only support back-to-back fusion.
If we want to extend it for chained-fusion (3 or more in same dependency chain),
then I believe we need more changes than adding data dependencies.
We would also need additional tests for a target that supports it (maybe RISC-V?)

llvm/include/llvm/CodeGen/MacroFusion.h
34 ↗	(On Diff #228375)	New parameters, should be documented in above comment as well.
llvm/test/CodeGen/AArch64/macro-fusion-verify.ll
4 ↗	(On Diff #228375)	This comment is confusing. I believe the goal of this patch is to avoid chained-fusion, hence reducing unnecessary dependency, so you would like to verify that there is no `extra dependency`?
25 ↗	(On Diff #228375)	Maybe we should check that we only fuse `SU4 SU5`, not `SU5 SU6` too.

steven.zhang added a reviewer: arsenm. Nov 10 2019, 6:21 PM

Herald added a subscriber: wdng. Nov 10 2019, 6:21 PM

In D69998#1739692, @jsji wrote:

My understanding is that you are doing two fixes in this patch:

1. Extend the API of shouldScheduleAdjacent to avoid fusing an instruction more than once along the same dependency chain.
2. Extend fuseInstructionPair to add artificial data dependencies for chained fusion.

To me, yes, we should do #1 first, as https://reviews.llvm.org/D36704 was already trying to do so.

Yes, but there are still problems.

#2 might not be necessary, as you mentioned, all existing targets only support back-to-back fusion.
If we want to extend it for chained-fusion (3 or more in same dependency chain),
then I believe we need more changes than adding data dependencies.
We would also need additional tests for a target that supports it (maybe RISC-V?)

The current implementation already supports fusing more than 2 SUs. However, it fails to create some dependency edges. This patch fixes that bug in the macro fusion infrastructure. From my understanding, adding these missing edges is enough. I didn't see LLVM support macro fusion for the RISC-V target. AMDGPU supports clusters of more than 2 SUs for loads. @arsenm, would you please help me confirm whether the AMDGPU target supports fusing more than 2 SUs?

qiucf added a subscriber: qiucf. Nov 10 2019, 7:29 PM

This comment was removed by qiucf.

In D69998#1740279, @steven.zhang wrote:

The current implementation already supports fusing more than 2 SUs. However, it fails to create some dependency edges. This patch fixes that bug in the macro fusion infrastructure. From my understanding, adding these missing edges is enough. I didn't see LLVM support macro fusion for the RISC-V target. AMDGPU supports clusters of more than 2 SUs for loads. @arsenm, would you please help me confirm whether the AMDGPU target supports fusing more than 2 SUs?

Load/store clustering should produce > 2 sections of clusters, but I don't remember the details of how the DAG mutation is implemented. Specifically for the MacroFusion mutation, I'm not sure. It may be useful to combine one def with multiple uses, but I'm not sure if that actually happens now.

In D69998#1740295, @arsenm wrote:

Load/store clustering should produce > 2 sections of clusters, but I don't remember the details of how the DAG mutation is implemented. Specifically for the MacroFusion mutation, I'm not sure. It may be useful to combine one def with multiple uses, but I'm not sure if that actually happens now.

Yeah, from my investigation, the MacroFusion implementation should support it. Do you know whether the AMDGPU hardware supports macro fusion of more than 2 SUs, as with the Load/Store cluster, or only back-to-back fusion like other targets? I guess it is also back-to-back, but I want to confirm it.

In D69998#1740313, @steven.zhang wrote:

Yeah, from my investigation, the MacroFusion implementation should support it. Do you know whether the AMDGPU hardware supports macro fusion of more than 2 SUs, as with the Load/Store cluster, or only back-to-back fusion like other targets? I guess it is also back-to-back, but I want to confirm it.

Load/Store does benefit from multiple instructions back to back.

The MacroFusion doesn't need back to back scheduling. We just want the use of the condition register to follow the def because it usually means we can use a smaller instruction encoding. It doesn't need to be the next instruction, it's just helpful to avoid another condition register def between the two instructions.

I get it. Thank you!

I will split this patch into two in response to Jinsong's comments:

1. Fix the missing edges.
2. Extend the interface to allow the target to specify the maximum number of fused SUs.

This patch is to fix the missing edges. I have updated the patch.

steven.zhang added a child revision: D70066: [MacroFusion] Limit the max fused number as 2 to reduce the dependency. Nov 10 2019, 11:53 PM

https://reviews.llvm.org/D70066 was created to limit the maximum number of fused instructions.

Gentle ping.

At first glance, this patch seems sensible, but I'm not sure that D70066 is necessary.

fhahn mentioned this in D70066: [MacroFusion] Limit the max fused number as 2 to reduce the dependency. Nov 25 2019, 4:38 AM

steven.zhang removed a child revision: D70066: [MacroFusion] Limit the max fused number as 2 to reduce the dependency. Nov 26 2019, 9:11 PM

Gentle ping...

With rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba landed, is this still relevant?

In D69998#1768588, @fhahn wrote:

With rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba landed, is this still relevant?

Yes, they are different problems. rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba limits the number of chained SUs to two, while this patch fixes a problem that appears when more than two SUs are chained, even though chaining is limited to two for now. By design, it should work well if someone wants to relax the limit later. In other words, the intent is that we can chain any number of SUs but currently limit it to two, rather than that we can only chain two SUs and have bugs if we chain more.

There won't be any compile-time impact from this patch while chaining is limited to two, as the pred/succ of CurrentSU is always null when only two SUs are chained.

In D69998#1769958, @steven.zhang wrote:

Yes, they are different problems. rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba limits the number of chained SUs to two, while this patch fixes a problem that appears when more than two SUs are chained, even though chaining is limited to two for now. By design, it should work well if someone wants to relax the limit later. In other words, the intent is that we can chain any number of SUs but currently limit it to two, rather than that we can only chain two SUs and have bugs if we chain more.

Sure, but currently there can be no bug and the interface prevents that case from happening. I don't see why we would need to deal with a case that might happen in the future, if the interface changes. To me it seems like the time to fix that would be when the interface gets extended. Until then, we cannot test the patch. To support fusing more than pairs, I think it would be better to do this in a separate function and deal with those cases there, rather than unnecessarily complicating the code for pairs.

There won't be any compile-time impact from this patch while chaining is limited to two, as the pred/succ of CurrentSU is always null when only two SUs are chained.

Hm I do not think that's true, we still need to check all the predecessors/successors of the SUs, which potentially can be a large number for bad inputs. The compile-time impact of this patch on its own might be quite small, but at least in degenerate cases it could be measurable (same for rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba )

Hmm, we are leaving a bomb here for anyone who wants to extend it :P But I agree with you that, since we cannot test this patch now, it is NOT the best time to fix it.

In D69998#1772106, @steven.zhang wrote:

Hmm, we are leaving a bomb here for anyone who wants to extend it :P But I agree with you that, since we cannot test this patch now, it is NOT the best time to fix it.

One possible way to deal with that would be to add an assertion here that we chain at most 2 instructions together, with a comment explaining what the issue would be with chains longer than 2 instructions.

Good suggestion, thank you! I will post a patch to remove that bomb.

https://reviews.llvm.org/D71180 is created.
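The assertion suggested above might look roughly like the following. This is a hedged sketch on a toy model, not the code from D71180: `Unit`, `chainLength`, `wouldExceedPair`, and `fusePair` are hypothetical names, and the real check would operate on cluster edges in the scheduling DAG.

```cpp
#include <cassert>

// Toy stand-in for a scheduling unit with cluster links. NOT llvm::SUnit.
struct Unit {
  Unit *ClusterPred = nullptr;
  Unit *ClusterSucc = nullptr;
};

// Number of units in the fused chain that contains U.
static int chainLength(const Unit &U) {
  int N = 1;
  for (const Unit *P = U.ClusterPred; P; P = P->ClusterPred)
    ++N;
  for (const Unit *S = U.ClusterSucc; S; S = S->ClusterSucc)
    ++N;
  return N;
}

// True if fusing First -> Second would produce a chain of more than two.
static bool wouldExceedPair(const Unit &First, const Unit &Second) {
  return chainLength(First) + chainLength(Second) > 2;
}

static void fusePair(Unit &First, Unit &Second) {
  // Chains longer than two would need extra artificial edges for the
  // earlier members of the chain, which this sketch does not add.
  assert(!wouldExceedPair(First, Second) &&
         "fusing more than two units is not supported");
  First.ClusterSucc = &Second;
  Second.ClusterPred = &First;
}
```

Keeping the check in a separate predicate makes the limit visible and testable without triggering the assertion itself.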

nhaehnle removed a subscriber: nhaehnle. Dec 9 2019, 1:26 AM

Revision Contents

Path                                         Size
llvm/lib/CodeGen/MacroFusion.cpp             47 lines
llvm/test/CodeGen/AArch64/macro-fusion.ll    1 line

Diff 228635

llvm/lib/CodeGen/MacroFusion.cpp

(30 lines not shown)

 static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,
   cl::desc("Enable scheduling for macro fusion."), cl::init(true));

 static bool isHazard(const SDep &Dep) {
   return Dep.getKind() == SDep::Anti || Dep.getKind() == SDep::Output;
 }

+namespace {
+
+static SUnit *getPredClusterSU(const SUnit &SU) {
+  for (const SDep &SI : SU.Preds)
+    if (SI.isCluster())
+      return SI.getSUnit();
+
+  return nullptr;
+}
+
+static SUnit *getSuccClusterSU(const SUnit &SU) {
+  for (const SDep &SI : SU.Succs)
+    if (SI.isCluster())
+      return SI.getSUnit();
+
+  return nullptr;
+}
+
 static bool fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
                                 SUnit &SecondSU) {
   // Check that neither instr is already paired with another along the edge
   // between them.
   for (SDep &SI : FirstSU.Succs)
     if (SI.isCluster())
       return false;

(21 lines not shown)

   LLVM_DEBUG(
       dbgs() << "Macro fuse: "; DAG.dumpNodeName(FirstSU); dbgs() << " - ";
       DAG.dumpNodeName(SecondSU); dbgs() << " / ";
       dbgs() << DAG.TII->getName(FirstSU.getInstr()->getOpcode()) << " - "
              << DAG.TII->getName(SecondSU.getInstr()->getOpcode()) << '\n';);

   // Make data dependencies from the FirstSU also dependent on the SecondSU to
   // prevent them from being scheduled between the FirstSU and the SecondSU.
-  if (&SecondSU != &DAG.ExitSU)
+  SUnit *CurrentSU = &SecondSU;
+  while (CurrentSU && CurrentSU != &DAG.ExitSU) {
     for (const SDep &SI : FirstSU.Succs) {
       SUnit *SU = SI.getSUnit();
       if (SI.isWeak() || isHazard(SI) ||
-          SU == &DAG.ExitSU || SU == &SecondSU || SU->isPred(&SecondSU))
+          SU == &DAG.ExitSU || SU == CurrentSU ||
+          SU->isPred(CurrentSU))
         continue;
-      LLVM_DEBUG(dbgs() << " Bind "; DAG.dumpNodeName(SecondSU);
+      LLVM_DEBUG(dbgs() << " Bind "; DAG.dumpNodeName(*CurrentSU);
                  dbgs() << " - "; DAG.dumpNodeName(*SU); dbgs() << '\n';);
-      DAG.addEdge(SU, SDep(&SecondSU, SDep::Artificial));
+      DAG.addEdge(SU, SDep(CurrentSU, SDep::Artificial));
     }
+
+    CurrentSU = getSuccClusterSU(*CurrentSU);
+  }

   // Make the FirstSU also dependent on the dependencies of the SecondSU to
   // prevent them from being scheduled between the FirstSU and the SecondSU.
-  if (&FirstSU != &DAG.EntrySU) {
+  CurrentSU = &FirstSU;
+  while (CurrentSU && CurrentSU != &DAG.EntrySU) {
     for (const SDep &SI : SecondSU.Preds) {
       SUnit *SU = SI.getSUnit();
-      if (SI.isWeak() || isHazard(SI) || &FirstSU == SU || FirstSU.isSucc(SU))
+      if (SI.isWeak() || isHazard(SI) || CurrentSU == SU ||
+          CurrentSU->isSucc(SU))
         continue;
       LLVM_DEBUG(dbgs() << " Bind "; DAG.dumpNodeName(*SU); dbgs() << " - ";
-                 DAG.dumpNodeName(FirstSU); dbgs() << '\n';);
-      DAG.addEdge(&FirstSU, SDep(SU, SDep::Artificial));
+                 DAG.dumpNodeName(*CurrentSU); dbgs() << '\n';);
+      DAG.addEdge(CurrentSU, SDep(SU, SDep::Artificial));
     }
     // ExitSU comes last by design, which acts like an implicit dependency
     // between ExitSU and any bottom root in the graph. We should transfer
     // this to FirstSU as well.
     if (&SecondSU == &DAG.ExitSU) {
       for (SUnit &SU : DAG.SUnits) {
         if (SU.Succs.empty())
-          DAG.addEdge(&FirstSU, SDep(&SU, SDep::Artificial));
+          DAG.addEdge(CurrentSU, SDep(&SU, SDep::Artificial));
       }
     }
+
+    CurrentSU = getPredClusterSU(*CurrentSU);
   }

   ++NumFused;
   return true;
 }

-namespace {
-
 /// Post-process the DAG to create cluster edges between instrs that may
 /// be fused by the processor into a single operation.
 class MacroFusion : public ScheduleDAGMutation {
   ShouldSchedulePredTy shouldScheduleAdjacent;
   bool FuseBlock;
   bool scheduleAdjacentImpl(ScheduleDAGInstrs &DAG, SUnit &AnchorSU);

 public:

(67 lines not shown)

llvm/test/CodeGen/AArch64/macro-fusion.ll

 ; REQUIRES: asserts
 ; RUN: llc < %s -mtriple=aarch64-linux-gnu -mattr=+fuse-arith-logic -verify-misched -debug-only=machine-scheduler 2>&1 > /dev/null | FileCheck %s

 ; Verify that, the macro-fusion creates the necessary dependencies between SUs.
 define signext i32 @test(i32 signext %a, i32 signext %b, i32 signext %c, i32 signext %d) {
 entry:
 ; CHECK: ******** MI Scheduling ********
 ; CHECK-LABEL: %bb.0 entry
 ; CHECK: Macro fuse: SU([[SU4:[0-9]+]]) - SU([[SU5:[0-9]+]])
 ; CHECK: Bind SU([[SU1:[0-9]+]]) - SU([[SU4]])
 ; CHECK: Macro fuse: SU([[SU5]]) - SU([[SU6:[0-9]+]])
 ; CHECK: Bind SU([[SU0:[0-9]+]]) - SU([[SU5]])
+; CHECK: Bind SU([[SU0:[0-9]+]]) - SU([[SU4]])
 ; CHECK: SU([[SU0]]): %{{[0-9]+}}:gpr32 = COPY $w3
 ; CHECK: SU([[SU1]]): %{{[0-9]+}}:gpr32 = COPY $w2
 ; CHECK: SU([[SU4]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr
 ; CHECK: SU([[SU5]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr
 ; CHECK: SU([[SU6]]): %{{[0-9]+}}:gpr32 = nsw SUBWrr

   %add = add nsw i32 %b, %a
   %add1 = add nsw i32 %add, %c
   %sub = sub nsw i32 %add1, %d
   ret i32 %sub
 }