This is an archive of the discontinued LLVM Phabricator instance.

[X86] Make post-ra scheduling macrofusion-aware.
ClosedPublic

Authored by courbet on Mar 22 2019, 3:36 AM.

Event Timeline

courbet created this revision.Mar 22 2019, 3:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2019, 3:36 AM

For the context: I'm experimenting with turning post-ra scheduling on for SNB onwards, and I some promising improvements. All the regressions I see are macro-fused instructions being moved apart, which this fixes.

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky wih the experiments (i.e. I couldn't find significant/promising improvements). But then - again - those were just quick experiments, and I didn't try it on many codebases.

If you think you can share some numbers then that would be great.

On a slightly different topic: I often wondered whether we - at some point - we should start using the PostMIScheduler for X86.

Did you experiment with it?
The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.
To enable PostMIScheduler however you have to substitutePass(&PostRASchedulerID, &PostMachineSchedulerID) in the X86PassConfig (similar to how AArch64 does it).
It would be interesting to see which one gives us the best codegen in your experiments.. (just curious).

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky wih the experiments (i.e. I couldn't find significant/promising improvements). But then - again - those were just quick experiments, and I didn't try it on many codebases. If you think you can share some numbers then that would be great.

Thanks Andrea.

Yes, that's essentially what the comment in X86.td says:

"This generally gives a nice performance increase on silvermont, with largely neutral behavior on other contemporary large core processors."

However, that was before the round of scheduling information fixes that Simon & I made based on llvm-exegesis. I wanted to give it another try after that, and from my first experiments it seems that it indeed makes sense to look at it again.
What I have done for now is run our (internal, sorry) main macrobenchmark with post-ra enabled. With the base code I see a consistent regression of 0.5% to 1% depending on metrics. With this patch I see a consistent improvement of 0.5% to 2%.

On a slightly different topic: I often wondered whether we - at some point - we should start using the PostMIScheduler for X86.
Did you experiment with it?

I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.

When I'm done experimenting with different possibilities I'll write a summary of the results.

The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.

Yup, I have another patch coming for this :)

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky wih the experiments (i.e. I couldn't find significant/promising improvements). But then - again - those were just quick experiments, and I didn't try it on many codebases. If you think you can share some numbers then that would be great.

Thanks Andrea.

Yes, that's essentially what the comment in X86.td says:

"This generally gives a nice performance increase on silvermont, with largely neutral behavior on other contemporary large core processors."

However, that was before the round of scheduling information fixes that Simon & I made based on llvm-exegesis. I wanted to give it another try after that, and from my first experiments it seems that it indeed makes sense to look at it again.
What I have done for now is run our (internal, sorry) main macrobenchmark with post-ra enabled. With the base code I see a consistent regression of 0.5% to 1% depending on metrics. With this patch I see a consistent improvement of 0.5% to 2%.

Thanks. That's good to know.
I plan to run some experiments today using your patch.

But in general, I am happy with this patch.

On a slightly different topic: I often wondered whether we - at some point - we should start using the PostMIScheduler for X86.
Did you experiment with it?

I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.

When I'm done experimenting with different possibilities I'll write a summary of the results.

The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.

Yup, I have another patch coming for this :)

Cool. :-)

-Andrea

I plan to run some experiments today using your patch.

That's great, thanks.

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was over optimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.
Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?
I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was over optimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for

AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.

That is consistent with what is stated in agner's microarchitecture, amd sog for piledriver.

Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?

It will, as usual, depend on whether this happens to affect the hotpath or not.
I did just run my rawspeed benchmark, and i'm not observing any notable non-noise perf changes.

I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.

While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was over optimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for

AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.

That is consistent with what is stated in agner's microarchitecture, amd sog for piledriver.

Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?

It will, as usual, depend on whether this happens to affect the hotpath or not.
I did just run my rawspeed benchmark, and i'm not observing any notable non-noise perf changes.

Thanks. That matches what I also saw in the past when I tested it.

I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.

While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?

I replied to that code review (the two small benchmarks were available from one of Xur’s older posts).

I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.

For the record, this was dealt with in D59872.

andreadb accepted this revision.Mar 28 2019, 8:12 AM

Thanks Clement.

LGTM

This revision is now accepted and ready to land.Mar 28 2019, 8:12 AM

This could use test coverage i guess?

This could use test coverage i guess?

This is actually an NFC as the post-ra scheduler is not on by default. The next patch that enables scheduling will show the actual changes.

This could use test coverage i guess?

This is actually an NFC as the post-ra scheduler is not on by default. The next patch that enables scheduling will show the actual changes.

What about CPU's that specify let PostRAScheduler = 1; ?

courbet planned changes to this revision.Mar 28 2019, 9:34 AM

What about CPU's that specify let PostRAScheduler = 1; ?

Ah yes, sorry I lost track of this. Interestingly there are no tests that fail currently. I'll try to come up with some that do.

For now I'll submit the refactoring part of this change (D59689).

What about CPU's that specify let PostRAScheduler = 1; ?

Ah yes, sorry I lost track of this. Interestingly there are no tests that fail currently.

I'll try to come up with some that do.

For now I'll submit the refactoring part of this change (D59689).

Thanks!

courbet updated this revision to Diff 193074.Apr 1 2019, 6:37 AM

Add tests.

This revision is now accepted and ready to land.Apr 1 2019, 6:37 AM

Please upload correct diff, this seems to be relative to previous patch.

lebedev.ri accepted this revision.Apr 1 2019, 6:45 AM

nevermind, i see that it was committed in D59689.
Looks good.

This revision was automatically updated to reflect the committed changes.