It is generally beneficial to prefer "movi d0, #0" over "fmov s0, wzr" as this is most efficient across all cores. For newer cores, fmov instructions are also eliminated early and there is no difference with movi, but this is not true for other/older cores. Thus this standardises on using movi.
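For anyone who wants to see which idiom their compiler picks, a minimal sketch like the following may help. The function names are hypothetical and not part of this patch. Compile it for AArch64 with optimisation enabled and inspect the assembly; the instructions listed in the comment are the two candidates named in the summary above, not a guaranteed output.

```c
/* Hypothetical test functions, not taken from the patch. */
float zero_f32(void) {
    /* Candidate idioms for materialising +0.0 in an FP register:
     *   fmov s0, wzr   (FP move from the integer zero register)
     *   movi d0, #0    (vector move-immediate of zero)
     * Which one is emitted depends on the compiler, its version,
     * and the selected target CPU. */
    return 0.0f;
}

double zero_f64(void) {
    /* Same question for the 64-bit case. */
    return 0.0;
}
```

Both functions return the same value whichever idiom is chosen; only the instruction used to produce the zero differs.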
Event Timeline
What CPU is this expected to be better for? I don't buy the "int -> fp register transfer". I'm not going to pretend to know how cpus work internally, but there is no real register value it is transferring.
OK. I think I see what's wrong. According to the A55 software optimization guide, the dual-issue rules for a movi are a little more restrictive than for an fmov, which can lead to slower code. We would probably want to prefer the fmov there, and that probably applies to other in-order cpus too.
I don't have great visibility on other cpus. I just happen to have some very low noise A55 tests that can show whether this kind of small change is actually beneficial.
It looks from the other optimization guides like the two instructions should be treated the same, performance wise. I would be surprised if a fmov s0, wzr wasn't really treated like a form of "FP move, immed", although I have no evidence one way or the other which way it works.
Fair enough, let's refrain from micro-architectural details. But the point is that zero-cost zeroing idioms are supported on integer operations, which is why this is preferred. This should always give the same or better performance, but it looks like you found a bit of a corner case with dual issuing, which is a bit surprising but perhaps makes some sense for smaller in-order cores. I will add FeatureNoZCZeroingFP to the A55's description.
This sets FeatureNoZCZeroingFP for some older cores.
I have considered not making this the default, and instead just setting ZCZeroingFP for each core. But I think the cores not supporting this are an exception: the trend is that this is supported by newer cores, and GCC also defaults to this, see https://godbolt.org/z/Mvxvze48M. Thus making ZCZeroingFP the default still makes sense, I think.
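As a rough illustration of the default-plus-opt-out scheme proposed at this point in the review, the selection could be modelled as below. This is a sketch only, not LLVM code: `fp_zero_idiom` and its parameter are hypothetical names, and the flag merely stands in for a core having FeatureNoZCZeroingFP set.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of LLVM. Models the proposal that the
 * movi zeroing idiom is the default and that a core opts out by setting
 * a "no zero-cost FP zeroing" feature, in which case fmov is kept. */
const char *fp_zero_idiom(bool no_zc_zeroing_fp) {
    return no_zc_zeroing_fp ? "fmov s0, wzr" : "movi d0, #0";
}
```

Under this model, a core such as the A55 would pass `true` and keep the fmov form, while cores without the opt-out get movi.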
After some more discussions, it turns out the original revision was doing the right thing. Except that we should be using the .2s variant as that may be more efficient on some cores.
Adjusted this to follow D99710, which uses movi d0 (zeroing 64 bits rather than 128 bits); that enables this as a default for all cores.
llvm/test/CodeGen/AArch64/arm64-fp-contract-zero.ll | ||
---|---|---|
10 | This looks like a regression here I think |
Thanks. My tests agreed, LGTM
llvm/test/CodeGen/AArch64/arm64-fp-contract-zero.ll | ||
---|---|---|
10 | It's just regenerating the tests with the test script now, by the look of it. |
llvm/test/CodeGen/AArch64/arm64-fp-contract-zero.ll | ||
---|---|---|
10 | Yep, exactly that. I wanted to see all of the codegen for this example, so just used the script. Thanks for reviewing! |