This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation
Closed, Public

Authored by RKSimon on Jul 21 2022, 4:04 AM.

Details

Summary

As noticed on D129765 and reported on Issue #56531 - aarch64 targets can use the neon ctpop + add-reduce instructions to speed up scalar ctpop instructions, but we fail to do this for parity calculations.

I'm not sure where the cutoff should be, but i64 (+ i128 special case) shows a definite reduction in instruction count. i32 is about the same (not sure if scalar <-> neon transfers are particularly costly?), and sub-i32 promotion looks to be a definite regression compared to parity expansion optimized for those widths.
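
For readers unfamiliar with the two lowerings being compared, here is a minimal portable C++ sketch of the idea; it is not the code from this patch. `parity_xor_fold` models the generic scalar expansion (a chain of shifts and xors), while `parity_via_ctpop` models the neon route: a per-byte population count (what `cnt` does per lane), an add-reduce across the eight bytes, then a mask of the low bit. The function names and the byte-loop are illustrative assumptions; real codegen would use vector instructions rather than a loop.

```cpp
#include <cstdint>

// Strategy 1: xor-fold the value down to a single bit.
// Models the generic scalar PARITY expansion (a chain of eor + shift).
inline unsigned parity_xor_fold(std::uint64_t x) {
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return static_cast<unsigned>(x & 1);
}

// Population count of one byte; stands in for one lane of a neon `cnt`.
// (Hypothetical helper, written out so the sketch needs no intrinsics.)
inline unsigned popcount_byte(std::uint8_t b) {
    unsigned n = 0;
    while (b) {
        b &= static_cast<std::uint8_t>(b - 1); // clear the lowest set bit
        ++n;
    }
    return n;
}

// Strategy 2: parity(x) == ctpop(x) & 1.
// Count bits per byte, add-reduce the eight counts, keep the low bit.
inline unsigned parity_via_ctpop(std::uint64_t x) {
    unsigned sum = 0;
    for (int lane = 0; lane < 8; ++lane)
        sum += popcount_byte(static_cast<std::uint8_t>(x >> (8 * lane)));
    return sum & 1;
}
```

Both functions return the same value for every input; the question in this review is only which instruction sequence is cheaper at each width.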

Event Timeline

RKSimon created this revision. Jul 21 2022, 4:04 AM
Herald added a project: Restricted Project. Jul 21 2022, 4:04 AM
RKSimon requested review of this revision. Jul 21 2022, 4:04 AM
RKSimon edited the summary of this revision. Jul 22 2022, 7:07 AM
deadalnix accepted this revision. Jul 22 2022, 8:01 AM
deadalnix added a subscriber: deadalnix.

It took me a bit of grave digging to figure out the motivation behind PARITY, but this seems to do the right thing.

That being said, shouldn't this be the default strategy to lower PARITY, rather than special-casing it for AArch64?

This revision is now accepted and ready to land. Jul 22 2022, 8:01 AM
dmgreen accepted this revision. Jul 22 2022, 8:13 AM

I've been trying to add up latencies to see which is better between the two sequences. I think you are right about the i32 case - it is better to avoid the fpr register moves.

The code changes look good to me. I was just not sure which is better between the i64 eors and moving to float regs to use a cnt. It will depend on the cpu - an eor is either a quick 1 cycle instruction, which is hard to beat with neon instructions, or it is a 2 cycle instruction, in which case the cnt, addv and fmovs will still have longer latencies.

I ended up having to get a simulator out to measure the differences. Whilst it is slower on some cpus, it seems to be quicker in more cases and by more of a margin. So looks OK to me.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp, line 7790

Formatting.

This revision was landed with ongoing or failed builds. Jul 22 2022, 9:36 AM
This revision was automatically updated to reflect the committed changes.