Page MenuHomePhabricator

[AArch64] Default to zero-cycle-zeroing FP registers.

Authored by SjoerdMeijer on Mar 30 2021, 6:13 AM.



It is generally beneficial to prefer "movi d0, #0" over "fmov s0, wzr" as this is most efficient across all cores. For newer cores, fmov instructions are also eliminated early and there is no difference with movi, but this is not true for other/older cores. Thus this standardises on using movi.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Mar 30 2021, 6:13 AM
SjoerdMeijer requested review of this revision.Mar 30 2021, 6:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2021, 6:13 AM
SjoerdMeijer retitled this revision from [AArch64] Default to zero-cycling-zeroing FP registers. to [AArch64] Default to zero-cycle-zeroing FP registers..Mar 30 2021, 6:14 AM
This revision is now accepted and ready to land.Mar 30 2021, 9:24 AM

What CPU is this expected to be better for? I don't buy the "int -> fp register transfer". I'm not going to pretend to know how cpus work internally, but there is no real register value it is transferring.

dmgreen requested changes to this revision.Mar 30 2021, 9:45 AM
This revision now requires changes to proceed.Mar 30 2021, 9:45 AM

OK. I think I see what's wrong. According to the A55 software optimization guide, the dual issue for a movi is a little more restrictive than fmov, which can lead to slower code. We would probably want to prefer the fmov there. Which probably applies to other inorder cpus.

I don't have great visibility on other cpus. I just happen to have some very low noise A55 tests that can show whether this kind of small change is actually beneficial.

It looks from the other optimization guides like the two instructions should be treated the same, performance wise. I would be surprised if a fmov s0, wzr wasn't really treated like a form of "FP move, immed", although I have no evidence one way or the other which way it works.

Fair enough, let's refrain from micro-architectural details. But the point is that zero-cost zeroing idioms are supported on integer operations, which is why this is preferred. This should always gives the same or better performance, but it looks like you found a bit of corner case with dual issuing, which is a bit surprising but perhaps makes some sense for smaller in-order cores. I will add FeatureNoZCZeroingFP to the A55's description.

SjoerdMeijer edited the summary of this revision. (Show Details)Mar 30 2021, 12:01 PM

This sets FeatureNoZCZeroingFP for some older cores.

I have considered not making this the default, but just setting ZCZeroingFP for each core. But I think the cores not supporting this are an exception, the trend is that this supported by newer cores, and GCC also defaults to this, see Thus making ZCZeroingFP the default still makes sense I think.

After some more discussions, it turns out the original revision was doing the right thing. Except that we should be using the .2s variant as that may be more efficient on some cores.

SjoerdMeijer edited the summary of this revision. (Show Details)

Adjust this to D99710, that uses movi d0 that zeros 64 bits and not 128 bits, which enables this as a default for all cores.

david-arm added inline comments.Apr 1 2021, 8:32 AM

This looks like a regression here I think

dmgreen accepted this revision.Apr 1 2021, 8:53 AM

Thanks. My tests agreed, LGTM


It's just regenerating the tests with the test script now, by the look of it.

This revision is now accepted and ready to land.Apr 1 2021, 8:53 AM
SjoerdMeijer added inline comments.Apr 1 2021, 9:14 AM

Yep, exactly that. I wanted to see all of the codegen for this example, so just used the script.

Thanks for reviewing!

david-arm accepted this revision.Apr 1 2021, 9:16 AM

LGTM too!

This revision was landed with ongoing or failed builds.Apr 6 2021, 1:48 AM
This revision was automatically updated to reflect the committed changes.