Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
For the context: I'm experimenting with turning post-ra scheduling on for SNB onwards, and I some promising improvements. All the regressions I see are macro-fused instructions being moved apart, which this fixes.
Nice patch Clement!
I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky wih the experiments (i.e. I couldn't find significant/promising improvements). But then - again - those were just quick experiments, and I didn't try it on many codebases.
If you think you can share some numbers then that would be great.
On a slightly different topic: I often wondered whether we - at some point - we should start using the PostMIScheduler for X86.
Did you experiment with it?
The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.
To enable PostMIScheduler however you have to substitutePass(&PostRASchedulerID, &PostMachineSchedulerID) in the X86PassConfig (similar to how AArch64 does it).
It would be interesting to see which one gives us the best codegen in your experiments.. (just curious).
Thanks Andrea.
Yes, that's essentially what the comment in X86.td says:
"This generally gives a nice performance increase on silvermont, with largely neutral behavior on other contemporary large core processors."
However, that was before the round of scheduling information fixes that Simon & I made based on llvm-exegesis. I wanted to give it another try after that, and from my first experiments it seems that it indeed makes sense to look at it again.
What I have done for now is run our (internal, sorry) main macrobenchmark with post-ra enabled. With the base code I see a consistent regression of 0.5% to 1% depending on metrics. With this patch I see a consistent improvement of 0.5% to 2%.
On a slightly different topic: I often wondered whether we - at some point - we should start using the PostMIScheduler for X86.
Did you experiment with it?
I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.
When I'm done experimenting with different possibilities I'll write a summary of the results.
The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.
Yup, I have another patch coming for this :)
Thanks. That's good to know.
I plan to run some experiments today using your patch.
But in general, I am happy with this patch.
On a slightly different topic: I often wondered whether we - at some point - we should start using the PostMIScheduler for X86.
Did you experiment with it?I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.
When I'm done experimenting with different possibilities I'll write a summary of the results.
The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.
Yup, I have another patch coming for this :)
Cool. :-)
-Andrea
Sorry, I was over optimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.
That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.
My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.
Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?
I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.
That is consistent with what is stated in agner's microarchitecture, amd sog for piledriver.
Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?
It will, as usual, depend on whether this happens to affect the hotpath or not.
I did just run my rawspeed benchmark, and i'm not observing any notable non-noise perf changes.
I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.
While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?
Thanks. That matches what I also saw in the past when I tested it.
I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.
While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?
I replied to that code review (the two small benchmarks were available from one of Xur’s older posts).
I don't think is a blocking issue, but in future we should revisit the logic in X86MacroFusion.
For the record, this was dealt with in D59872.
This could use test coverage i guess?
This is actually an NFC as the post-ra scheduler is not on by default. The next patch that enables scheduling will show the actual changes.
What about CPU's that specify let PostRAScheduler = 1; ?
Ah yes, sorry I lost track of this. Interestingly there are no tests that fail currently. I'll try to come up with some that do.
For now I'll submit the refactoring part of this change (D59689).
I'll try to come up with some that do.
For now I'll submit the refactoring part of this change (D59689).
Thanks!