This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner, AArch64] Add a new pattern A-(B+C) => (A-B)-C to reduce latency
ClosedPublic

Authored by Carrot on Apr 27 2022, 3:29 PM.

Details

Summary

Add a new pattern A - (B + C) ==> (A - B) - C to give machine combiner a chance to evaluate which instruction sequence has lower latency.

Diff Detail

Event Timeline

Carrot created this revision.Apr 27 2022, 3:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 27 2022, 3:29 PM
Carrot requested review of this revision.Apr 27 2022, 3:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 27 2022, 3:29 PM

Hello. This looks like two different patches.

The first part is the subadd - If that is reducing the critical path length then it would seem to make sense. It might be worth adding a mir test? It's hard to tell with all the other changes how profitable it would be.

The other part is fixing the machine combines costs. That is something we've noticed before in the context of transient instructions and the cost model it uses seeming to be incorrect. D123512 for example. I've not stepped through the new costs, but I think we should be preferring madd over mul and add, even if the scheduling model suggests otherwise. Most CPU implement forwarding into the accumulate of a madd, that probably isn't modelled in the A55 model that we use by default. The codegen we produce for -mcpu=generic should ideally be descent for the combination of big and little core. Scheduling we take from the little core because being in-order it requires the most help, and out-of-order cores can schedule inside the core themselves. The codegen preferably should be good on both though, if we can, and we should aim not to make the big core performance worse just because we are using a little-cores schedule.

The simplest solution is possibly to just add the forwarding paths to the A55 schedule (although every time we try things like that the performance seems to be mixed at best). For others like fmadd it is less obvious what we should do. At any rate it would be best if the addsub and the costmodel changes were two separate patches.

Hello. This looks like two different patches.

I did the MachineCombiner cost model improvement because it impacts my aarch64 test cases. But as you said it's reasonable to sent it as a separate patch. I will do that.

Carrot updated this revision to Diff 439181.Jun 22 2022, 3:30 PM
Carrot edited the summary of this revision. (Show Details)
Carrot added a reviewer: dmgreen.

The improvement of cost model has been split out. Update this patch to contain the new pattern only.

If we add A - (B + C) are there any other patterns that would be similarly useful? This seems similar to the existing reassociate logic in the machine combiner, but AArch64InstrInfo::isAssociativeAndCommutative is not very thorough under AArch64. Would the new pattern be best to be marked as CombinerObjective::MustReduceDepth similar to the existing REASSOC patterns?

Can you add some mir tests to test some of the edge cases, especially around flag setting instructions.

Carrot updated this revision to Diff 439907.Jun 24 2022, 3:27 PM

Add mir tests.

If we add A - (B + C) are there any other patterns that would be similarly useful? This seems similar to the existing reassociate logic in the machine combiner, but AArch64InstrInfo::isAssociativeAndCommutative is not very thorough under AArch64. Would the new pattern be best to be marked as CombinerObjective::MustReduceDepth similar to the existing REASSOC patterns?

I think there are more similar useful patterns, but I encountered this pattern only. We can add other patterns when we have tests.

This pattern does look like existing reassociate. The difference is its operations are not commutable, so it can't be handled by existing reassociate logic. It is already marked as MustReduceDepth in function getCombinerObjective.

I think there are more similar useful patterns, but I encountered this pattern only. We can add other patterns when we have tests.

This pattern does look like existing reassociate. The difference is its operations are not commutable, so it can't be handled by existing reassociate logic.

OK. I agree these are different to the existing reassociate patterns, but if this is worth adding we might want to look into improving those too. We needn't do that in this patch though, we can look at that separately.

It is already marked as MustReduceDepth in function getCombinerObjective.

Oh yeah, I see. The way MachineCombiner's logic is held in target independent files for target patterns always trips me up. To check - this doesn't need the changes from D125588 now? It works stand-alone?

The mir tests are good - thanks for explaining them well. Can you add a test for a extra uses of the Add.

Oh yeah, I see. The way MachineCombiner's logic is held in target independent files for target patterns always trips me up. To check - this doesn't need the changes from D125588 now? It works stand-alone?

This optimization doesn't depend on D125588, it was the test case. In the original test case there are several operands come from COPY directly, and wrong latency is computed for them. In current version I added several instructions so no operands of SUB/ADD come from COPY instructions.

The mir tests are good - thanks for explaining them well. Can you add a test for a extra uses of the Add.

Done.

Carrot updated this revision to Diff 440414.Jun 27 2022, 3:26 PM
dmgreen accepted this revision.Jun 28 2022, 12:03 AM

Thanks. Then I think this patch LGTM

This revision is now accepted and ready to land.Jun 28 2022, 12:03 AM
This revision was landed with ongoing or failed builds.Jun 28 2022, 2:49 PM
This revision was automatically updated to reflect the committed changes.