According to the Neoverse N2 Software Optimization Guide, arithmetic ops with LSL ≤ 4, non-flag-setting logical ops, and flag-setting logical ops with LSL = 0 have a latency of 1 and use pipeline group I. However, most of these ops were being modelled with a latency of 2 on pipeline M. The affected instructions include the "unshifted" versions of ADD/SUB, among others.
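The rule in the summary can be sketched as a small classifier. This is only an illustration of the wording above, not the TableGen in the patch: the `Op` type, its field names, and `n2_latency_and_pipe` are hypothetical, and the fallback of latency 2 on pipeline M for every other case is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical description of a shifted-register ALU op."""
    kind: str            # "arith" (ADD/SUB family) or "logical" (AND/ORR family)
    sets_flags: bool     # True for flag-setting variants such as ANDS/BICS
    shift: str = "lsl"   # shift type applied to the second operand
    amount: int = 0      # shift amount; 0 with "lsl" means "unshifted"


def n2_latency_and_pipe(op: Op) -> Tuple[int, str]:
    """Return (latency, pipeline) following the rule quoted in the summary."""
    is_lsl = op.shift == "lsl"
    if op.kind == "arith":
        # Arithmetic ops are fast only with LSL <= 4.
        fast = is_lsl and op.amount <= 4
    elif op.sets_flags:
        # Flag-setting logical ops are fast only with LSL = 0.
        fast = is_lsl and op.amount == 0
    else:
        # Non-flag-setting logical ops are always fast.
        fast = True
    # Assumption: everything else stays at latency 2 on pipeline M.
    return (1, "I") if fast else (2, "M")
```

Under this sketch an unshifted ADD, `Op("arith", False)`, lands on pipeline I with latency 1, which is the case the patch corrects.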
Event Timeline
The add/sub sounds like a nice change.
From what I can tell, all the logical operators (and/orr/etc) that don't set flags are latency 1, throughput 4. The ones that set flags (ands/bics) should be latency 2, throughput 2.
llvm/lib/Target/AArch64/AArch64SchedPredNeoverse.td | ||
---|---|---|
17 | This is a quite common pattern and I think already exists somewhere. Can you move it somewhere shared? |
llvm/lib/Target/AArch64/AArch64SchedPredNeoverse.td | ||
---|---|---|
17 | I believe you are referring to the Exynos and Ampere versions? Though similar in spirit, they implement slightly different logic (I'm not sure whether intentionally). In particular, the Exynos version treats any shift of 0, or an LSL <= 3, as a "cheap shift". The Ampere version treats any shift of 0, or an LSL <= 4, as a cheap shift. As far as I could test, neither is accurate for the N2. For example, on N2 an LSR of 0 is still an "expensive shift". Since I was not sure whether these slight differences were intentional, I decided to create a separate predicate for the Neoverse cores. Having said that, I agree it would be good to move it to a shared location! Do you have any suggestions as to where that could be? |
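To make the difference between the three "cheap shift" definitions concrete, here is a minimal sketch that models them exactly as described in the comment above. The function names are hypothetical and this is not the TableGen from any of the scheduling models. The Neoverse rule is taken to be "LSL with an amount of at most 4", as stated in the patch summary.

```python
def exynos_cheap(shift: str, amount: int) -> bool:
    """Exynos, as described above: any shift of 0, or LSL <= 3."""
    return amount == 0 or (shift == "lsl" and amount <= 3)


def ampere_cheap(shift: str, amount: int) -> bool:
    """Ampere, as described above: any shift of 0, or LSL <= 4."""
    return amount == 0 or (shift == "lsl" and amount <= 4)


def neoverse_cheap(shift: str, amount: int) -> bool:
    """Neoverse N2: only LSL <= 4, so an LSR of 0 is not cheap."""
    return shift == "lsl" and amount <= 4


def compare(shift: str, amount: int) -> dict:
    """Evaluate all three predicates for one shift type and amount."""
    return {
        "exynos": exynos_cheap(shift, amount),
        "ampere": ampere_cheap(shift, amount),
        "neoverse": neoverse_cheap(shift, amount),
    }
```

In this model the predicates disagree in only two places: `compare("lsl", 4)` separates Exynos from the other two, and a non-LSL shift of 0 such as `compare("lsr", 0)` separates Neoverse from the other two.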
LGTM. Thanks
llvm/lib/Target/AArch64/AArch64SchedPredNeoverse.td | ||
---|---|---|
17 | Ah I see. It was the Ampere version I was thinking of. I would guess an LSR with a shift of 0 isn't too important, and combining the two wouldn't be a problem in practice, but if they are different then that's fine. |