This is an archive of the discontinued LLVM Phabricator instance.

Differential D70066

[MacroFusion] Limit the max fused number as 2 to reduce the dependency
ClosedPublic

Authored by steven.zhang on Nov 10 2019, 11:51 PM.

Details

Reviewers

fhahn
evandro
MatzeB
arsenm
jsji
hfinkel

Group Reviewers

Restricted Project

Commits

rGd84b320dfd0a: [MacroFusion] Limit the max fused number as 2 to reduce the dependency

Summary

This continues the work from https://reviews.llvm.org/D36704. I described the problems with allowing more than two instructions to be fused together in https://reviews.llvm.org/D69998: we create unnecessary dependency edges that hurt the scheduler and break the fusion. Therefore, we should change the interface and leave it to the target to determine the maximum number of instructions it wants to fuse.

Diff Detail

Event Timeline

steven.zhang created this revision. Nov 10 2019, 11:51 PM

Herald added a project: Restricted Project. Nov 10 2019, 11:52 PM

Herald added subscribers: hiraditya, nhaehnle, wdng, jvesely.

steven.zhang added a parent revision: D69998: [MacroFusion] Create the missing artificial edges if there are more than 2 SU fused. Nov 10 2019, 11:53 PM

Gentle ping.

I am not sure if this example best illustrates the issue. Is your main concern here potentially longer dependency chains?

The original test case explicitly checks the 2 fused pairs, and without knowing details about the microarchitecture, there is no way of telling if it is beneficial to schedule add; add; sub adjacent. The reason 2 pairs are created is that for fuse-arith-logic patterns, the opcode of the first MI can also match the second MI. I think fuse-arith-logic is the only fusion pattern on AArch64 that has this property, and only the ExynosM4 has it enabled. I think @evandro originally added those patterns; maybe he can shed some light on the benefits?

Alternatively, are there similar cases for other targets?

Can you please clarify how the existing implementation hurts the scheduler and when fused pairs are then broken?

See the example I gave in https://reviews.llvm.org/D69998:

int foo(int a, int b, int c, int d) {
  return a + b + c + d;
}

And this is the dependency graph:

+------+       +------+       +------+       +------+
|  A   |       |  B   |       |  C   |       |  D   |
+--+--++       +---+--+       +--+---+       +--+---+
   ^  ^            ^  ^          ^              ^
   |  |            |  |          |              |
   |  |            |  |New1      +--------------+
   |  |            |  |          |
   |  |            |  |       +--+---+
   |  |New2        |  +-------+ ADD1 |
   |  |            |          +--+---+
   |  |            |    Fuse     ^
   |  |            +-------------+
   |  +------------+
   |               |
   |   Fuse     +--+---+
   +----------->+ ADD2 |
   |            +------+
+--+---+
| ADD3 |
+------+

We would also need to create an artificial edge from ADD1 to A if https://reviews.llvm.org/D69998 lands. That forces node A to be scheduled before ADD1 and ADD2. But in fact it is fine to schedule node A between ADD3 and ADD2, because ADD3 and ADD2 are NOT a fusion pair: ADD2 has already been matched to ADD1. We are creating unnecessary dependency edges that override the heuristics.

In D70066#1756949, @fhahn wrote:

I am not sure if this example best illustrates the issue. Is your main concern here potentially longer dependency chains?

My main concern is that some nodes (such as node A in my example) lose some freedom in scheduling. We are creating unnecessary dependencies for them. Things become worse if we have:

ADD
ADD
// expected some instructions being scheduled here.
ADD
ADD

The current implementation forces all four ADDs to be scheduled together, and no instruction can be inserted between {ADD,ADD} and {ADD,ADD}.
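A minimal sketch of the effect described above, using a hypothetical toy model rather than LLVM's SUnit and ScheduleDAG types. It assumes a straight-line chain in which every instruction depends only on the one before it and every adjacent pair matches the fusion pattern (as with ADD; ADD). The names `fuseChain` and `largestCluster` are illustrative only and do not exist in LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: ClusterPred[i] is true when instruction i has been fused to
// instruction i-1. Each instruction is tried once as the anchor of a pair.
std::vector<bool> fuseChain(std::size_t NumInstrs, bool LimitToPairs) {
  std::vector<bool> ClusterPred(NumInstrs, false);
  for (std::size_t Anchor = 1; Anchor < NumInstrs; ++Anchor) {
    assert(!ClusterPred[Anchor] && "each anchor is visited exactly once");
    std::size_t Dep = Anchor - 1;
    // With the limit, do not extend a chain whose first instruction is
    // already the second half of a fused pair.
    if (LimitToPairs && ClusterPred[Dep])
      continue;
    ClusterPred[Anchor] = true;
  }
  return ClusterPred;
}

// Size of the largest run of instructions that must stay adjacent.
std::size_t largestCluster(const std::vector<bool> &ClusterPred) {
  std::size_t Best = 0, Current = 0;
  for (bool Fused : ClusterPred) {
    Current = Fused ? Current + 1 : 1;
    if (Current > Best)
      Best = Current;
  }
  return Best;
}
```

In this model, four chained ADDs form a single cluster of four without the limit and two independent pairs with it, which leaves a slot between the pairs for other instructions.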

The original test case explicitly checks the 2 fused pairs, and without knowing details about the microarchitecture, there is no way of telling if it is beneficial to schedule add; add; sub adjacent. The reason 2 pairs are created is that for fuse-arith-logic patterns, the opcode of the first MI can also match the second MI. I think fuse-arith-logic is the only fusion pattern on AArch64 that has this property, and only the ExynosM4 has it enabled. I think @evandro originally added those patterns; maybe he can shed some light on the benefits?

Yeah, my point is that, as you said, we don't see a benefit in scheduling add; add; sub adjacently from a microarchitecture perspective, but we do see bad things happening, as in the example I gave.

Alternatively, are there similar cases for other targets?

No, not as far as I know. But there is no guarantee for the future, as hardware keeps moving forward.

In D70066#1758249, @steven.zhang wrote:

In D70066#1756949, @fhahn wrote:

I am not sure if this example best illustrates the issue. Is your main concern here potentially longer dependency chains?

My main concern is that some nodes (such as node A in my example) lose some freedom in scheduling.

That's what I thought, thanks for clarifying.

Alternatively, are there similar cases for other targets?

No, not as far as I know. But there is no guarantee for the future, as hardware keeps moving forward.

Sure, but then someone has to add new fusion patterns to the matcher, otherwise the case cannot happen (as I said, I am not sure what targets besides AArch64 match) and it is not possible to add a test case.

I would recommend moving the logic to count the number of fused instructions into the target specific shouldScheduleAdjacent as suggested inline. On targets that have no problematic patterns, we can just assert that we never fuse an instruction with one that's already fused. So once we have problematic patterns there, we know because of the assertion. For targets like AArch64 with problematic patterns, we can add the bail out based on the number of fused instructions. It probably makes sense to add the limit as sub-target feature. It would also be good to put up patches for each target separately, as different people might be interested in taking a look.
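The two target-side strategies suggested above could look roughly like the sketch below. It is a toy model under stated assumptions: `Opcode`, `shouldFuseSimpleTarget`, `shouldFuseChainingTarget` and `MaxLinks` are hypothetical names, and the real hook takes TargetInstrInfo, TargetSubtargetInfo and MachineInstr arguments instead of bare opcodes. `NumFused` stands for the number of cluster links already attached in front of the first instruction.

```cpp
#include <cassert>

// Hypothetical opcodes, not LLVM's.
enum class Opcode { Add, Sub, Cmp, Branch };

// Target with no self-chaining patterns: fusing onto an already-fused
// instruction should never be requested, so only assert on it. The check
// disappears in release builds.
bool shouldFuseSimpleTarget(Opcode First, Opcode Second, unsigned NumFused) {
  assert(NumFused == 0 && "unexpected chain of fused instructions");
  (void)NumFused; // Silence unused-parameter warnings when NDEBUG is set.
  return First == Opcode::Cmp && Second == Opcode::Branch;
}

// Target with a pattern whose first opcode can also act as the second
// (comparable to fuse-arith-logic): bail out once the limit is reached.
bool shouldFuseChainingTarget(Opcode First, Opcode Second, unsigned NumFused,
                              unsigned MaxLinks) {
  if (NumFused >= MaxLinks)
    return false;
  auto IsArith = [](Opcode Op) {
    return Op == Opcode::Add || Op == Opcode::Sub;
  };
  return IsArith(First) && IsArith(Second);
}
```

With `MaxLinks` set to 1, the second function only ever forms pairs, which corresponds to the `NumFused > 0` bail-out in this revision's diff.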

Thanks for working on this! With the suggestions, I think this would be good to go in, independently of D69998

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381	If we make the decision here, I think we should also just count the predecessors here, instead of doing so at the call site. If the subtarget does not have a limit or does not require one, we should not iterate over the clusters unnecessarily. We could add a helper like 'HasLessThanNumFused' in MacroFusion.h which would do something like the function below, to stop when the limit is reached:

bool HasLessThanNumFused(const SUnit &SU, unsigned FuseLimit) {
  unsigned Num = 0;
  const SUnit *CurrentSU = &SU;
  while ((CurrentSU = getPredClusterSU(*CurrentSU)) && Num < FuseLimit)
    Num++;
  return Num < FuseLimit;
}

We would have to pass in the SUnit of course.
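For readers who want to try the bounded walk outside of LLVM, here is a self-contained sketch. `Node` is a hypothetical stand-in for SUnit with a single cluster-predecessor pointer, and the function name mirrors the helper proposed in the comment; none of this is the committed implementation.

```cpp
#include <cassert>

// Hypothetical minimal node: ClusterPred points at the instruction this one
// is fused behind, or is null when there is none.
struct Node {
  const Node *ClusterPred = nullptr;
};

// Returns true when fewer than FuseLimit cluster links precede N. The walk
// stops as soon as the limit is reached, so it never visits more than
// FuseLimit predecessors.
bool hasLessThanNumFused(const Node &N, unsigned FuseLimit) {
  assert(FuseLimit > 0 && "a limit of zero would forbid all fusion");
  unsigned Num = 0;
  const Node *Current = &N;
  while (Num < FuseLimit && (Current = Current->ClusterPred))
    ++Num;
  return Num < FuseLimit;
}
```

Checking the counter before following the pointer keeps the cost proportional to the limit rather than to the length of the chain.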

steven.zhang marked an inline comment as done. Nov 25 2019, 9:02 PM

steven.zhang added inline comments.

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381	If I understand correctly, we would have to change the interface of shouldScheduleAdjacent to take the SUnit instead of the MachineInstr. That is not really clean, as the target would need to be aware of the SUnit in order to count the predecessors. Can we do it another way? Let's assume that macro fusion must be back to back. That is, check the number of SUs using HasLessThanNumFused() in the infrastructure instead of in the target hook, and bail out if it reaches the limit. Does that make sense?

fhahn added inline comments. Nov 26 2019, 12:20 PM

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381	> If I understand correctly, we would have to change the interface of shouldScheduleAdjacent to take the SUnit instead of the MachineInstr. That is not really clean, as the target would need to be aware of the SUnit in order to count the predecessors. Can we do it another way?

I am not sure if passing in the SU directly is a big deal here. Do you have a specific concern? The macro fusion implementations deal specifically with scheduling issues, so IMO passing the SU in directly makes sense, so they can make better decisions (instead of passing bits of information computed from the SU). IIRC the main reason we pass in MachineInstrs instead of SUs is that there was no need for information from the SUs initially. Also, I think there might be cases where better fusion heuristics might want to use additional info from the SU (although the current ones are mostly 'good enough'). I think it would be great to keep the API flexible here and to not impose unnecessary checks/computation for some targets.

> Let's assume that macro fusion must be back to back. That is, check the number of SUs using HasLessThanNumFused() in the infrastructure instead of in the target hook, and bail out if it reaches the limit. Does that make sense?

With the current structure, all fusion-related flags are controlled by the subtarget, which is the right level of granularity IMO. I think it would be better to not add additional fusion-related flags to TTI, if that's what you mean by infrastructure.

steven.zhang marked an inline comment as done. Nov 26 2019, 7:50 PM

steven.zhang added inline comments.

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
381	Oh, sorry about the confusion. The infrastructure I mean is inside the MacroFusion implementation, not TTI, i.e.

bool MacroFusion::scheduleAdjacentImpl(...) {
  ...
  if (!HasLessThanNumFused(DepSU) || !ShouldScheduleAdjacent(...))
    continue;
}

The target is unaware of this, so we don't need to provide a helper function or pass the SUnit. This relies on the assumption that all macro fusion is implemented by hardware as back-to-back. It will also benefit all targets if they ever add a pattern that could be fused more than once. However, I am also OK with your suggestion, as it also makes sense.
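The alternative placement discussed here, where the generic pass performs the chain check and the target predicate stays unaware of it, can be sketched with a toy model. `ToyDAG`, `TargetPred` and `tryFuseWithPred` are hypothetical names; the sketch assumes each instruction's only fusion candidate is its immediate predecessor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the scheduling DAG.
struct ToyDAG {
  std::vector<int> Opcodes;     // One opcode per instruction, in order.
  std::vector<int> ClusterPred; // Index of the fused predecessor, or -1.
};

// The target predicate only sees the two opcodes, never the chain state.
using TargetPred = std::function<bool(int FirstOpcode, int SecondOpcode)>;

// Try to fuse Anchor with its immediate predecessor. The "already fused"
// check lives in the generic code, ahead of the call into the target hook.
bool tryFuseWithPred(ToyDAG &DAG, std::size_t Anchor,
                     const TargetPred &TargetHook) {
  assert(DAG.Opcodes.size() == DAG.ClusterPred.size());
  assert(Anchor > 0 && Anchor < DAG.Opcodes.size());
  std::size_t Dep = Anchor - 1;
  if (DAG.ClusterPred[Dep] != -1) // Dep is already the second of a pair.
    return false;
  if (!TargetHook(DAG.Opcodes[Dep], DAG.Opcodes[Anchor]))
    return false;
  DAG.ClusterPred[Anchor] = static_cast<int>(Dep);
  return true;
}
```

The trade-off raised in the thread is visible here: every target pays for the extra comparison, but no target needs to know about SUnits or cluster edges.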

steven.zhang updated this revision to Diff 231175. Nov 26 2019, 9:11 PM

steven.zhang removed a parent revision: D69998: [MacroFusion] Create the missing artificial edges if there are more than 2 SU fused.

I have updated the patch to make it easier to review. Personally, I am also OK with passing the SUnit and adding a helper function for targets to check the number of fused SUs. My only concern is that other targets might not know that they need to check this manually if they happen to add a fusion pattern that could be fused more than once. As the hardware only supports back-to-back macro fusion as far as I can see, it makes sense to assume it in MacroFusion. We can change that in the future if some target wants it.

The penalty of this patch is that the number of fused SUs is checked for all targets. However, as the limit is 2, we need to walk the list at most once, so I think it is cheap.

In D70066#1761217, @steven.zhang wrote:

I have updated the patch to make it easier to review. Personally, I am also OK with passing the SUnit and adding a helper function for targets to check the number of fused SUs. My only concern is that other targets might not know that they need to check this manually if they happen to add a fusion pattern that could be fused more than once. As the hardware only supports back-to-back macro fusion as far as I can see, it makes sense to assume it in MacroFusion. We can change that in the future if some target wants it.

Right, that's why I originally suggested passing the SU and adding assertions to all targets (except AArch64 with +fuse-arith-logic) to make sure that exactly 2 instructions are chained together.

I still think that's slightly preferable to avoid any unnecessary checks in release builds. The impact on compile-time will likely be very small, but the backend is already notorious for being slow. So if there are any unnecessary checks we can skip without too much effort, IMO we should try to do so. And in this case we also give the backend specific heuristics more flexibility. But if you think the benefit of not passing the SU trumps the compile-time concern, I think that would be OK as well.

llvm/lib/CodeGen/MacroFusion.cpp
35	nit: no need to put the static functions into an anonymous namespace.
196	I think it is not too clear what back-2-back means here. IIUC it means more that we do not chain more than 2 instructions together. In a chain of 3 instructions, all instructions are still back-to-back.

Adding the assertion in each target would address my concern; thank you for the comments. As the check here is really cheap, I would rather not expose the SUnit and dependency details to the target unless we have a strong need. (We would have to check the number of chained SUnits for FirstSU, not SecondSU, and the target would have to understand why.) Anyway, it is not a problem.

Updated the patch according to the comments.

LGTM

This revision is now accepted and ready to land. Dec 3 2019, 12:46 AM

Closed by commit rGd84b320dfd0a: [MacroFusion] Limit the max fused number as 2 to reduce the dependency (authored by steven.zhang). Dec 3 2019, 9:12 PM

This revision was automatically updated to reflect the committed changes.

I realized that I had a small unsubmitted comment here. Please also update the documentation for shouldScheduleAdjacent to clarify that it won't be called if DepMI is already part of a pair.

In D70066#1921272, @fhahn wrote:

I realized that I had a small unsubmitted comment here. Please also update the documentation for shouldScheduleAdjacent to clarify that it won't be called if DepMI is already part of a pair.

Err... but it is a callback. I didn't find a good place to document this for shouldScheduleAdjacent. Any suggestions?

Revision Contents

Path                                              Size
llvm/include/llvm/CodeGen/MacroFusion.h           6 lines
llvm/lib/CodeGen/MacroFusion.cpp                  12 lines
llvm/lib/Target/AArch64/AArch64MacroFusion.cpp    7 lines
llvm/lib/Target/AMDGPU/AMDGPUMacroFusion.cpp      7 lines
llvm/lib/Target/ARM/ARMMacroFusion.cpp            7 lines
llvm/lib/Target/X86/X86MacroFusion.cpp            7 lines
llvm/test/CodeGen/AArch64/macro-fusion.ll         6 lines

Diff 228636

llvm/include/llvm/CodeGen/MacroFusion.h

[20 lines not shown]

 class MachineInstr;
 class ScheduleDAGMutation;
 class TargetInstrInfo;
 class TargetSubtargetInfo;

 /// Check if the instr pair, FirstMI and SecondMI, should be fused
 /// together. Given SecondMI, when FirstMI is unspecified, then check if
-/// SecondMI may be part of a fused pair at all.
+/// SecondMI may be part of a fused pair at all. And NumFused specify
+/// the number of SU's has been fused already in current dependency chain.
 using ShouldSchedulePredTy = std::function<bool(const TargetInstrInfo &TII,
                                                 const TargetSubtargetInfo &TSI,
                                                 const MachineInstr *FirstMI,
-                                                const MachineInstr &SecondMI)>;
+                                                const MachineInstr &SecondMI,
+                                                unsigned NumFused)>;

 /// Create a DAG scheduling mutation to pair instructions back to back
 /// for instructions that benefit according to the target-specific
 /// shouldScheduleAdjacent predicate function.
 std::unique_ptr<ScheduleDAGMutation>
 createMacroFusionDAGMutation(ShouldSchedulePredTy shouldScheduleAdjacent);

 /// Create a DAG scheduling mutation to pair branch instructions with one
 /// of their predecessors back to back for instructions that benefit according
 /// to the target-specific shouldScheduleAdjacent predicate function.
 std::unique_ptr<ScheduleDAGMutation>
 createBranchMacroFusionDAGMutation(ShouldSchedulePredTy shouldScheduleAdjacent);

 } // end namespace llvm

 #endif // LLVM_CODEGEN_MACROFUSION_H

llvm/lib/CodeGen/MacroFusion.cpp

[26 lines not shown]

 STATISTIC(NumFused, "Number of instr pairs fused");

 using namespace llvm;

 static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,
   cl::desc("Enable scheduling for macro fusion."), cl::init(true));

 static bool isHazard(const SDep &Dep) {
   return Dep.getKind() == SDep::Anti || Dep.getKind() == SDep::Output;
 }

 namespace {

 static SUnit *getPredClusterSU(const SUnit &SU) {
   for (const SDep &SI : SU.Preds)
     if (SI.isCluster())
       return SI.getSUnit();

   return nullptr;
 }

 static SUnit *getSuccClusterSU(const SUnit &SU) {
   for (const SDep &SI : SU.Succs)
     if (SI.isCluster())
       return SI.getSUnit();

   return nullptr;
 }

+static unsigned getNumOfClusterSU(const SUnit &SU) {
+  unsigned Num = 0;
+  const SUnit *CurrentSU = &SU;
+  while ((CurrentSU = getPredClusterSU(*CurrentSU))) Num ++;
+  return Num;
+}

 static bool fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
                                 SUnit &SecondSU) {
   // Check that neither instr is already paired with another along the edge
   // between them.
   for (SDep &SI : FirstSU.Succs)
     if (SI.isCluster())
       return false;

[103 lines not shown]

 /// Implement the fusion of instr pairs in the scheduling DAG,
 /// anchored at the instr in AnchorSU..
 bool MacroFusion::scheduleAdjacentImpl(ScheduleDAGInstrs &DAG, SUnit &AnchorSU) {
   const MachineInstr &AnchorMI = *AnchorSU.getInstr();
   const TargetInstrInfo &TII = *DAG.TII;
   const TargetSubtargetInfo &ST = DAG.MF.getSubtarget();

   // Check if the anchor instr may be fused.
-  if (!shouldScheduleAdjacent(TII, ST, nullptr, AnchorMI))
+  if (!shouldScheduleAdjacent(TII, ST, nullptr, AnchorMI, 0))
     return false;

   // Explorer for fusion candidates among the dependencies of the anchor instr.
   for (SDep &Dep : AnchorSU.Preds) {
     // Ignore dependencies other than data or strong ordering.
     if (Dep.isWeak() || isHazard(Dep))
       continue;

     SUnit &DepSU = *Dep.getSUnit();
     if (DepSU.isBoundaryNode())
       continue;

     const MachineInstr *DepMI = DepSU.getInstr();
-    if (!shouldScheduleAdjacent(TII, ST, DepMI, AnchorMI))
+    if (!shouldScheduleAdjacent(TII, ST, DepMI, AnchorMI,
+                                getNumOfClusterSU(DepSU)))
       continue;

     if (fuseInstructionPair(DAG, DepSU, AnchorSU))
       return true;
   }

   return false;
 }

[16 lines not shown]

llvm/lib/Target/AArch64/AArch64MacroFusion.cpp

[369 lines not shown]

 }

 /// \brief Check if the instr pair, FirstMI and SecondMI, should be fused
 /// together. Given SecondMI, when FirstMI is unspecified, then check if
 /// SecondMI may be part of a fused pair at all.
 static bool shouldScheduleAdjacent(const TargetInstrInfo &TII,
                                    const TargetSubtargetInfo &TSI,
                                    const MachineInstr *FirstMI,
-                                   const MachineInstr &SecondMI) {
+                                   const MachineInstr &SecondMI,
+                                   unsigned NumFused) {
+  // Only back to back fusion are supported.
+  if (NumFused > 0)
+    return false;

   const AArch64Subtarget &ST = static_cast<const AArch64Subtarget&>(TSI);

   // All checking functions assume that the 1st instr is a wildcard if it is
   // unspecified.
   if (ST.hasArithmeticBccFusion() && isArithmeticBccPair(FirstMI, SecondMI))
     return true;
   if (ST.hasArithmeticCbzFusion() && isArithmeticCbzPair(FirstMI, SecondMI))
     return true;

[26 lines not shown]

llvm/lib/Target/AMDGPU/AMDGPUMacroFusion.cpp

[22 lines not shown]

 namespace {

 /// Check if the instr pair, FirstMI and SecondMI, should be fused
 /// together. Given SecondMI, when FirstMI is unspecified, then check if
 /// SecondMI may be part of a fused pair at all.
 static bool shouldScheduleAdjacent(const TargetInstrInfo &TII_,
                                    const TargetSubtargetInfo &TSI,
                                    const MachineInstr *FirstMI,
-                                   const MachineInstr &SecondMI) {
+                                   const MachineInstr &SecondMI,
+                                   unsigned NumFused) {
+  // Only back to back fusion are supported.
+  if (NumFused > 0)
+    return false;

   const SIInstrInfo &TII = static_cast<const SIInstrInfo&>(TII_);

   switch (SecondMI.getOpcode()) {
   case AMDGPU::V_ADDC_U32_e64:
   case AMDGPU::V_SUBB_U32_e64:
   case AMDGPU::V_CNDMASK_B32_e64: {
     // Try to cluster defs of condition registers to their uses. This improves
     // the chance VCC will be available which will allow shrinking to VOP2

[28 lines not shown]

llvm/lib/Target/ARM/ARMMacroFusion.cpp

[45 lines not shown]

 }

 /// Check if the instr pair, FirstMI and SecondMI, should be fused
 /// together. Given SecondMI, when FirstMI is unspecified, then check if
 /// SecondMI may be part of a fused pair at all.
 static bool shouldScheduleAdjacent(const TargetInstrInfo &TII,
                                    const TargetSubtargetInfo &TSI,
                                    const MachineInstr *FirstMI,
-                                   const MachineInstr &SecondMI) {
+                                   const MachineInstr &SecondMI,
+                                   unsigned NumFused) {
+  // Only back to back fusion are supported.
+  if (NumFused > 0)
+    return false;

   const ARMSubtarget &ST = static_cast<const ARMSubtarget&>(TSI);

   if (ST.hasFuseAES() && isAESPair(FirstMI, SecondMI))
     return true;
   if (ST.hasFuseLiterals() && isLiteralsPair(FirstMI, SecondMI))
     return true;

   return false;
 }

 std::unique_ptr<ScheduleDAGMutation> createARMMacroFusionDAGMutation () {
   return createMacroFusionDAGMutation(shouldScheduleAdjacent);
 }

 } // end namespace llvm

llvm/lib/Target/X86/X86MacroFusion.cpp

[174 lines not shown]

 }

 /// Check if the instr pair, FirstMI and SecondMI, should be fused
 /// together. Given SecondMI, when FirstMI is unspecified, then check if
 /// SecondMI may be part of a fused pair at all.
 static bool shouldScheduleAdjacent(const TargetInstrInfo &TII,
                                    const TargetSubtargetInfo &TSI,
                                    const MachineInstr *FirstMI,
-                                   const MachineInstr &SecondMI) {
+                                   const MachineInstr &SecondMI,
+                                   unsigned NumFused) {
+  // Only back to back fusion are supported.
+  if (NumFused > 0)
+    return false;

   const X86Subtarget &ST = static_cast<const X86Subtarget &>(TSI);

   // Check if this processor supports any kind of fusion.
   if (!(ST.hasBranchFusion() || ST.hasMacroFusion()))
     return false;

   const JumpKind BranchKind = classifySecond(SecondMI);

[42 lines not shown]

llvm/test/CodeGen/AArch64/macro-fusion.ll

 ; REQUIRES: asserts
 ; RUN: llc < %s -mtriple=aarch64-linux-gnu -mattr=+fuse-arith-logic -verify-misched -debug-only=machine-scheduler 2>&1 > /dev/null | FileCheck %s

 ; Verify that, the macro-fusion creates the necessary dependencies between SUs.
 define signext i32 @test(i32 signext %a, i32 signext %b, i32 signext %c, i32 signext %d) {
 entry:
 ; CHECK: ******** MI Scheduling ********
 ; CHECK-LABEL: %bb.0 entry
 ; CHECK: Macro fuse: SU([[SU4:[0-9]+]]) - SU([[SU5:[0-9]+]])
 ; CHECK: Bind SU([[SU1:[0-9]+]]) - SU([[SU4]])
-; CHECK: Macro fuse: SU([[SU5]]) - SU([[SU6:[0-9]+]])
-; CHECK: Bind SU([[SU0:[0-9]+]]) - SU([[SU5]])
-; CHECK: Bind SU([[SU0:[0-9]+]]) - SU([[SU4]])
-; CHECK: SU([[SU0]]): %{{[0-9]+}}:gpr32 = COPY $w3
+; CHECK-NOT: Macro fuse:
 ; CHECK: SU([[SU1]]): %{{[0-9]+}}:gpr32 = COPY $w2
 ; CHECK: SU([[SU4]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr
 ; CHECK: SU([[SU5]]): %{{[0-9]+}}:gpr32 = nsw ADDWrr
-; CHECK: SU([[SU6]]): %{{[0-9]+}}:gpr32 = nsw SUBWrr

   %add = add nsw i32 %b, %a
   %add1 = add nsw i32 %add, %c
   %sub = sub nsw i32 %add1, %d
   ret i32 %sub
 }
