This is an archive of the discontinued LLVM Phabricator instance.

For context: I'm experimenting with turning post-ra scheduling on for SNB onwards, and I see some promising improvements. All the regressions I see are macro-fused instructions being moved apart, which this fixes.

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky with the experiments (i.e. I couldn't find significant/promising improvements). But then again, those were just quick experiments, and I didn't try it on many codebases.

If you think you can share some numbers then that would be great.

On a slightly different topic: I often wondered whether, at some point, we should start using the PostMIScheduler for X86.

Did you experiment with it?
The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.
To enable PostMIScheduler however you have to substitutePass(&PostRASchedulerID, &PostMachineSchedulerID) in the X86PassConfig (similar to how AArch64 does it).
It would be interesting to see which one gives us the best codegen in your experiments (just curious).
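To make the remark that mutators are run on the DAG as part of a dag-postprocessing step more concrete, here is a minimal, self-contained C++ sketch. It does not use LLVM at all: ToyDAG, ToyMutation, ToyMacroFusion and postprocessDAG are hypothetical names invented for illustration, and matching opcodes by name prefix is an assumption of the sketch, not how X86MacroFusion decides what is fusible.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a scheduling DAG: instruction names plus pairs of node
// indices that the scheduler should keep adjacent ("cluster edges").
struct ToyDAG {
  std::vector<std::string> Nodes;
  std::vector<std::pair<int, int>> ClusterEdges;
};

// Hypothetical mutation interface: each mutation may add constraints to the DAG.
struct ToyMutation {
  virtual ~ToyMutation() = default;
  virtual void apply(ToyDAG &DAG) = 0;
};

// Toy macro-fusion mutation: ties the most recent test/cmp to the next
// conditional jump. Prefix matching on names is an assumption of this sketch.
struct ToyMacroFusion : ToyMutation {
  void apply(ToyDAG &DAG) override {
    int LastCmp = -1;
    for (int I = 0, E = static_cast<int>(DAG.Nodes.size()); I != E; ++I) {
      const std::string &N = DAG.Nodes[I];
      bool IsCmp = N.compare(0, 4, "test") == 0 || N.compare(0, 3, "cmp") == 0;
      bool IsCondJump = !N.empty() && N[0] == 'j' && N.compare(0, 3, "jmp") != 0;
      if (IsCmp)
        LastCmp = I;
      else if (IsCondJump && LastCmp >= 0)
        DAG.ClusterEdges.emplace_back(LastCmp, I);
    }
  }
};

// Post-processing step: run every registered mutation over the DAG, in order.
inline void postprocessDAG(
    ToyDAG &DAG, const std::vector<std::unique_ptr<ToyMutation>> &Mutations) {
  for (const auto &M : Mutations)
    M->apply(DAG);
}
```

The point of the design is that the scheduler itself stays unaware of fusion: it only honors whatever constraints the registered mutations added, so the same mutation can be plugged into more than one scheduling pass.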

In D59688#1439297, @andreadb wrote:

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky with the experiments (i.e. I couldn't find significant/promising improvements). But then again, those were just quick experiments, and I didn't try it on many codebases. If you think you can share some numbers then that would be great.

Thanks Andrea.

Yes, that's essentially what the comment in X86.td says:

"This generally gives a nice performance increase on silvermont, with largely neutral behavior on other contemporary large core processors."

However, that was before the round of scheduling information fixes that Simon & I made based on llvm-exegesis. I wanted to give it another try after that, and from my first experiments it seems that it indeed makes sense to look at it again.
What I have done for now is run our (internal, sorry) main macrobenchmark with post-ra enabled. With the base code I see a consistent regression of 0.5% to 1% depending on metrics. With this patch I see a consistent improvement of 0.5% to 2%.

On a slightly different topic: I often wondered whether, at some point, we should start using the PostMIScheduler for X86.
Did you experiment with it?

I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.

When I'm done experimenting with different possibilities I'll write a summary of the results.

The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.

Yup, I have another patch coming for this :)

RKSimon added reviewers: arsenm, MatzeB. · Mar 22 2019, 6:17 AM

Herald added a subscriber: wdng. · Mar 22 2019, 6:17 AM

arsenm added a reviewer: vpykhtin. · Mar 22 2019, 6:24 AM

In D59688#1439350, @courbet wrote:

In D59688#1439297, @andreadb wrote:

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky with the experiments (i.e. I couldn't find significant/promising improvements). But then again, those were just quick experiments, and I didn't try it on many codebases. If you think you can share some numbers then that would be great.

Thanks Andrea.

Yes, that's essentially what the comment in X86.td says:

"This generally gives a nice performance increase on silvermont, with largely neutral behavior on other contemporary large core processors."

However, that was before the round of scheduling information fixes that Simon & I made based on llvm-exegesis. I wanted to give it another try after that, and from my first experiments it seems that it indeed makes sense to look at it again.
What I have done for now is run our (internal, sorry) main macrobenchmark with post-ra enabled. With the base code I see a consistent regression of 0.5% to 1% depending on metrics. With this patch I see a consistent improvement of 0.5% to 2%.

Thanks. That's good to know.
I plan to run some experiments today using your patch.

But in general, I am happy with this patch.

On a slightly different topic: I often wondered whether, at some point, we should start using the PostMIScheduler for X86.
Did you experiment with it?

I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.

When I'm done experimenting with different possibilities I'll write a summary of the results.

The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.

Yup, I have another patch coming for this :)

Cool. :-)

-Andrea

I plan to run some experiments today using your patch.

That's great, thanks.

In D59688#1439408, @courbet wrote:

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was overoptimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.
Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?
I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.
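The concern above can be illustrated with a small, self-contained C++ sketch of a fusion predicate that is parameterized by policy. The names Opcode, FusionPolicy and canFuseWithBranch are hypothetical and are not taken from X86MacroFusion. The opcode lists simply mirror what is stated in this thread (CMP/TEST only versus CMP/TEST plus ADD/SUB/INC/DEC); the sketch deliberately ignores condition codes and operand forms.

```cpp
// Hypothetical opcode set, reduced to what the discussion mentions.
enum class Opcode { CMP, TEST, ADD, SUB, INC, DEC, MOV };

// Hypothetical policies: no fusion, fusion limited to CMP/TEST, or fusion
// that also accepts the flag-setting arithmetic opcodes named in the thread.
enum class FusionPolicy { None, CmpTestOnly, CmpTestAndArith };

constexpr bool isCmpOrTest(Opcode Op) {
  return Op == Opcode::CMP || Op == Opcode::TEST;
}

constexpr bool isFlagSettingArith(Opcode Op) {
  return Op == Opcode::ADD || Op == Opcode::SUB || Op == Opcode::INC ||
         Op == Opcode::DEC;
}

// Returns true if an instruction with opcode Op may be paired with the
// conditional branch that follows it, under the given policy.
constexpr bool canFuseWithBranch(Opcode Op, FusionPolicy Policy) {
  return Policy == FusionPolicy::CmpTestOnly
             ? isCmpOrTest(Op)
             : Policy == FusionPolicy::CmpTestAndArith
                   ? (isCmpOrTest(Op) || isFlagSettingArith(Op))
                   : false;
}
```

With a split like this, a mutator that is too aggressive for one vendor only needs a narrower policy rather than being disabled outright.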

In D59688#1441464, @andreadb wrote:

In D59688#1439408, @courbet wrote:

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was overoptimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for

In D59688#1441464, @andreadb wrote:

AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.

That is consistent with what is stated in Agner's microarchitecture guide and the AMD SOG for Piledriver.

Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?

It will, as usual, depend on whether this happens to affect the hotpath or not.
I just ran my rawspeed benchmark, and I'm not observing any notable non-noise perf changes.

I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.

While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?

In D59688#1442004, @lebedev.ri wrote:

In D59688#1441464, @andreadb wrote:

In D59688#1439408, @courbet wrote:

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was overoptimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for

In D59688#1441464, @andreadb wrote:

AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.

That is consistent with what is stated in Agner's microarchitecture guide and the AMD SOG for Piledriver.

Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?

It will, as usual, depend on whether this happens to affect the hotpath or not.
I just ran my rawspeed benchmark, and I'm not observing any notable non-noise perf changes.

Thanks. That matches what I also saw in the past when I tested it.

I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.

While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?

I replied to that code review (the two small benchmarks were available from one of Xur’s older posts).

courbet mentioned this in D59872: [X86MacroFusion] Handle branch fusion (AMD CPUs). · Mar 27 2019, 4:44 AM

courbet mentioned this in rL357171: [X86MacroFusion] Handle branch fusion (AMD CPUs). · Mar 28 2019, 7:12 AM

courbet mentioned this in rG699dc025a625: [X86MacroFusion] Handle branch fusion (AMD CPUs).

I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.

For the record, this was dealt with in D59872.

Thanks Clement.

LGTM

This revision is now accepted and ready to land. · Mar 28 2019, 8:12 AM

This could use test coverage, I guess?

This could use test coverage, I guess?

This is actually an NFC as the post-ra scheduler is not on by default. The next patch that enables scheduling will show the actual changes.

In D59688#1446142, @courbet wrote:

This could use test coverage, I guess?

This is actually an NFC as the post-ra scheduler is not on by default. The next patch that enables scheduling will show the actual changes.

What about CPUs that specify "let PostRAScheduler = 1;"?

What about CPUs that specify "let PostRAScheduler = 1;"?

Ah yes, sorry I lost track of this. Interestingly there are no tests that fail currently. I'll try to come up with some that do.

For now I'll submit the refactoring part of this change (D59689).

In D59688#1446261, @courbet wrote:

What about CPUs that specify "let PostRAScheduler = 1;"?

Ah yes, sorry I lost track of this. Interestingly there are no tests that fail currently.

I'll try to come up with some that do.

For now I'll submit the refactoring part of this change (D59689).

Thanks!

courbet mentioned this in rL357381: [X86MacroFusion][NFC] Add more tests. · Apr 1 2019, 6:18 AM

courbet mentioned this in rGd9f6ee1c3cc6: [X86MacroFusion][NFC] Add more tests.

Add tests.

This revision is now accepted and ready to land. · Apr 1 2019, 6:37 AM

Harbormaster completed remote builds in B29895: Diff 193074. · Apr 1 2019, 6:39 AM

Please upload correct diff, this seems to be relative to previous patch.

Never mind, I see that it was committed in D59689.
Looks good.

Closed by commit rL357384: [X86] Make post-ra scheduling macrofusion-aware. (authored by courbet). · Apr 1 2019, 6:47 AM

This revision was automatically updated to reflect the committed changes.

lebedev.ri mentioned this in D60185: [X86] Make the post machine scheduler macrofusion-aware. · Apr 3 2019, 3:08 AM

Revision Contents

Path                                              Size
llvm/trunk/lib/Target/X86/X86Subtarget.h          3 lines
llvm/trunk/lib/Target/X86/X86Subtarget.cpp        6 lines
llvm/trunk/test/CodeGen/X86/testb-je-fusion.ll    4 lines

Diff 193081

llvm/trunk/lib/Target/X86/X86Subtarget.h

 [... 837 preceding lines not shown ...]
   bool enableIndirectBrExpand() const override {
     return useRetpolineIndirectBranches();
   }

   /// Enable the MachineScheduler pass for all X86 subtargets.
   bool enableMachineScheduler() const override { return true; }

   bool enableEarlyIfConversion() const override;

+  void getPostRAMutations(
+      std::vector<std::unique_ptr<ScheduleDAGMutation>> &Mutations) const;
+
   AntiDepBreakMode getAntiDepBreakMode() const override {
     return TargetSubtargetInfo::ANTIDEP_CRITICAL;
   }

   bool enableAdvancedRASplitCost() const override { return true; }
 };

 } // end namespace llvm

 #endif // LLVM_LIB_TARGET_X86_X86SUBTARGET_H

llvm/trunk/lib/Target/X86/X86Subtarget.cpp

 //===-- X86Subtarget.cpp - X86 Subtarget Information ----------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file implements the X86 specific subclass of TargetSubtargetInfo.
 //
 //===----------------------------------------------------------------------===//

 #include "X86.h"

 #include "X86CallLowering.h"
 #include "X86LegalizerInfo.h"
+#include "X86MacroFusion.h"
 #include "X86RegisterBankInfo.h"
 #include "X86Subtarget.h"
 #include "MCTargetDesc/X86BaseInfo.h"
 #include "X86TargetMachine.h"
 #include "llvm/ADT/Triple.h"
 #include "llvm/CodeGen/GlobalISel/CallLowering.h"
 #include "llvm/CodeGen/GlobalISel/InstructionSelect.h"
 #include "llvm/IR/Attributes.h"
 [... 336 lines not shown ...]

 const RegisterBankInfo *X86Subtarget::getRegBankInfo() const {
   return RegBankInfo.get();
 }

 bool X86Subtarget::enableEarlyIfConversion() const {
   return hasCMov() && X86EarlyIfConv;
 }
+
+void X86Subtarget::getPostRAMutations(
+    std::vector<std::unique_ptr<ScheduleDAGMutation>> &Mutations) const {
+  Mutations.push_back(createX86MacroFusionDAGMutation());
+}

llvm/trunk/test/CodeGen/X86/testb-je-fusion.ll

 [... 39 lines not shown ...]
 ; NOFUSION_POSTRA-NEXT: # %bb.1: # %if.then
 ; NOFUSION_POSTRA-NEXT: movl $1, %eax
 ; NOFUSION_POSTRA-NEXT: .LBB0_2: # %if.end
 ; NOFUSION_POSTRA-NEXT: retq
 ;
 ; BRANCHFUSION_POSTRA-LABEL: macrofuse_test_je:
 ; BRANCHFUSION_POSTRA: # %bb.0: # %entry
 ; BRANCHFUSION_POSTRA-NEXT: xorl %eax, %eax
-; BRANCHFUSION_POSTRA-NEXT: testl $512, %edi # imm = 0x200
 ; BRANCHFUSION_POSTRA-NEXT: movb $1, (%rsi)
+; BRANCHFUSION_POSTRA-NEXT: testl $512, %edi # imm = 0x200
 ; BRANCHFUSION_POSTRA-NEXT: je .LBB0_2
 ; BRANCHFUSION_POSTRA-NEXT: # %bb.1: # %if.then
 ; BRANCHFUSION_POSTRA-NEXT: movl $1, %eax
 ; BRANCHFUSION_POSTRA-NEXT: .LBB0_2: # %if.end
 ; BRANCHFUSION_POSTRA-NEXT: retq
 entry:
 %and = and i32 %flags, 512
 %tobool = icmp eq i32 %and, 0
 [... 42 lines not shown ...]
 ; NOFUSION_POSTRA-NEXT: movl $1, %eax
 ; NOFUSION_POSTRA-NEXT: retq
 ; NOFUSION_POSTRA-NEXT: .LBB1_1:
 ; NOFUSION_POSTRA-NEXT: xorl %eax, %eax
 ; NOFUSION_POSTRA-NEXT: retq
 ;
 ; BRANCHFUSION_POSTRA-LABEL: macrofuse_cmp_je:
 ; BRANCHFUSION_POSTRA: # %bb.0: # %entry
-; BRANCHFUSION_POSTRA-NEXT: cmpl $512, %edi # imm = 0x200
 ; BRANCHFUSION_POSTRA-NEXT: movb $1, (%rsi)
+; BRANCHFUSION_POSTRA-NEXT: cmpl $512, %edi # imm = 0x200
 ; BRANCHFUSION_POSTRA-NEXT: je .LBB1_1
 ; BRANCHFUSION_POSTRA-NEXT: # %bb.2: # %if.then
 ; BRANCHFUSION_POSTRA-NEXT: movl $1, %eax
 ; BRANCHFUSION_POSTRA-NEXT: retq
 ; BRANCHFUSION_POSTRA-NEXT: .LBB1_1:
 ; BRANCHFUSION_POSTRA-NEXT: xorl %eax, %eax
 ; BRANCHFUSION_POSTRA-NEXT: retq
 entry:
 [... 150 lines not shown ...]
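
The effect of the test change in testb-je-fusion.ll can be checked mechanically with a small, self-contained C++ sketch. It counts how many test/cmp instructions sit immediately before a conditional jump in an instruction sequence. The helper names startsWith and countAdjacentFusiblePairs are hypothetical, and recognizing instructions by mnemonic prefix is an assumption made for illustration only; it is not how the scheduler or FileCheck works.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True if S begins with Prefix.
inline bool startsWith(const std::string &S, const std::string &Prefix) {
  return S.compare(0, Prefix.size(), Prefix) == 0;
}

// Counts test/cmp instructions that are immediately followed by a
// conditional jump, i.e. pairs that are still adjacent after scheduling.
inline std::size_t
countAdjacentFusiblePairs(const std::vector<std::string> &Asm) {
  std::size_t Count = 0;
  for (std::size_t I = 0; I + 1 < Asm.size(); ++I) {
    bool IsCmp = startsWith(Asm[I], "test") || startsWith(Asm[I], "cmp");
    bool IsCondJump =
        startsWith(Asm[I + 1], "j") && !startsWith(Asm[I + 1], "jmp");
    if (IsCmp && IsCondJump)
      ++Count;
  }
  return Count;
}
```

Fed the old BRANCHFUSION_POSTRA sequence for macrofuse_test_je (xorl, testl, movb, je), the count is zero because movb separates the pair; fed the new sequence (xorl, movb, testl, je), the count is one.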
