Add a new pattern A - (B + C) ==> (A - B) - C to give machine combiner a chance to evaluate which instruction sequence has lower latency.
Details
Diff Detail
Event Timeline
Hello. This looks like two different patches.
The first part is the subadd - If that is reducing the critical path length then it would seem to make sense. It might be worth adding a mir test? It's hard to tell with all the other changes how profitable it would be.
The other part is fixing the machine combines costs. That is something we've noticed before in the context of transient instructions and the cost model it uses seeming to be incorrect. D123512 for example. I've not stepped through the new costs, but I think we should be preferring madd over mul and add, even if the scheduling model suggests otherwise. Most CPU implement forwarding into the accumulate of a madd, that probably isn't modelled in the A55 model that we use by default. The codegen we produce for -mcpu=generic should ideally be descent for the combination of big and little core. Scheduling we take from the little core because being in-order it requires the most help, and out-of-order cores can schedule inside the core themselves. The codegen preferably should be good on both though, if we can, and we should aim not to make the big core performance worse just because we are using a little-cores schedule.
The simplest solution is possibly to just add the forwarding paths to the A55 schedule (although every time we try things like that the performance seems to be mixed at best). For others like fmadd it is less obvious what we should do. At any rate it would be best if the addsub and the costmodel changes were two separate patches.
I did the MachineCombiner cost model improvement because it impacts my aarch64 test cases. But as you said it's reasonable to sent it as a separate patch. I will do that.
The improvement of cost model has been split out. Update this patch to contain the new pattern only.
If we add A - (B + C) are there any other patterns that would be similarly useful? This seems similar to the existing reassociate logic in the machine combiner, but AArch64InstrInfo::isAssociativeAndCommutative is not very thorough under AArch64. Would the new pattern be best to be marked as CombinerObjective::MustReduceDepth similar to the existing REASSOC patterns?
Can you add some mir tests to test some of the edge cases, especially around flag setting instructions.
I think there are more similar useful patterns, but I encountered this pattern only. We can add other patterns when we have tests.
This pattern does look like existing reassociate. The difference is its operations are not commutable, so it can't be handled by existing reassociate logic. It is already marked as MustReduceDepth in function getCombinerObjective.
I think there are more similar useful patterns, but I encountered this pattern only. We can add other patterns when we have tests.
This pattern does look like existing reassociate. The difference is its operations are not commutable, so it can't be handled by existing reassociate logic.
OK. I agree these are different to the existing reassociate patterns, but if this is worth adding we might want to look into improving those too. We needn't do that in this patch though, we can look at that separately.
It is already marked as MustReduceDepth in function getCombinerObjective.
Oh yeah, I see. The way MachineCombiner's logic is held in target independent files for target patterns always trips me up. To check - this doesn't need the changes from D125588 now? It works stand-alone?
The mir tests are good - thanks for explaining them well. Can you add a test for a extra uses of the Add.
This optimization doesn't depend on D125588, it was the test case. In the original test case there are several operands come from COPY directly, and wrong latency is computed for them. In current version I added several instructions so no operands of SUB/ADD come from COPY instructions.
The mir tests are good - thanks for explaining them well. Can you add a test for a extra uses of the Add.
Done.