Converts RSHRN/RSHRN2 to RADDHN/RADDHN2 with a zero second operand when the shift amount is half the width of the source vector element (that is, equal to the width of the narrowed result element). The latter has twice the throughput and half the latency on Arm out-of-order cores. Setting up the zero register adds no latency.
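To illustrate why the two forms are interchangeable, here is a minimal scalar sketch of a single 16-bit lane narrowed to 8 bits. It is not code from the patch: the helper names `rshrn_lane`, `raddhn_lane`, and `lanes_agree` are hypothetical, and the lane arithmetic follows my reading of the architectural behaviour of the two instructions (rounding constant 2^(shift-1) for RSHRN, 2^7 for RADDHN on 16-bit lanes).

```cpp
#include <cstdint>

// Hypothetical scalar model of one lane of RSHRN Vd.8B, Vn.8H, #shift:
// add the rounding constant 2^(shift-1), shift right, keep the low 8 bits.
inline uint8_t rshrn_lane(uint16_t n, unsigned shift) {
    uint32_t rounded = static_cast<uint32_t>(n) + (1u << (shift - 1));
    return static_cast<uint8_t>(rounded >> shift);
}

// Hypothetical scalar model of one lane of RADDHN Vd.8B, Vn.8H, Vm.8H:
// add both inputs plus 2^7 and return bits [15:8] of the sum.
// The sum is held in 32 bits here; truncating to uint8_t after the
// shift selects the same bits [15:8] as a wrapping 16-bit sum would.
inline uint8_t raddhn_lane(uint16_t n, uint16_t m) {
    uint32_t sum = static_cast<uint32_t>(n) + m + (1u << 7);
    return static_cast<uint8_t>(sum >> 8);
}

// Exhaustively compare the two models over every 16-bit input, with the
// shift fixed at 8 (half the source element width) and the second
// RADDHN operand fixed at zero.
inline bool lanes_agree() {
    for (uint32_t n = 0; n <= 0xFFFFu; ++n) {
        uint16_t v = static_cast<uint16_t>(n);
        if (rshrn_lane(v, 8) != raddhn_lane(v, 0)) {
            return false;
        }
    }
    return true;
}
```

Under this model the equivalence holds only for a shift of exactly 8 on 16-bit lanes, which matches the "half the width" condition in the summary. The zero operand is what costs an extra instruction to materialize.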
Details
Diff Detail
Unit Tests
Time | Test |
---|---|
60 ms | x64 debian > LLVM.Bindings/Go::go.test |
Event Timeline
Hmm. Are you sure this one is a great idea? The claim that "Setting up the zero register adds no latency" won't be true on any in-order CPU, and it still has some frontend cost on an out-of-order CPU. The code size will be larger in any case, so this probably shouldn't be done at -Os/-Oz.
The idea with this kind of transform is that it is OK to do so long as it makes some CPU better without making anything else worse. This is intrinsic only, but it may be best to do it only for specific CPUs when not under minsize. Or do it at a different point where we know the movi can be pulled out of a loop. (That is if we really want to do it at all, rather than just leaving it to the programmer if they need it.)