This is an archive of the discontinued LLVM Phabricator instance.

[PPC] Use xxbrd to speed up bswap64
ClosedPublic

Authored by Carrot on Nov 1 2017, 2:38 PM.

Details

Summary

Power doesn't have bswap instructions, so LLVM generates the following code sequence for bswap64:

rotldi 5, 3, 16
rotldi 4, 3, 8
rotldi 9, 3, 24
rotldi 10, 3, 32
rotldi 11, 3, 48
rotldi 12, 3, 56
rldimi 4, 5, 8, 48
rldimi 4, 9, 16, 40
rldimi 4, 10, 24, 32
rldimi 4, 11, 40, 16
rldimi 4, 12, 48, 8
rldimi 4, 3, 56, 0
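
To see why these twelve instructions amount to a byte swap, the sequence can be modeled in plain C++. This is only an illustrative sketch and not part of the patch: the helper names (rotl64, rldimi, bswap64_rotate_insert) are made up for this example, and the rldimi model assumes the usual ISA semantics (rotate the source left by SH, then insert the bits covered by the mask from IBM bit MB through bit 63-SH into the target).

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 64-bit value left by n bits; models `rotldi rD, rS, n`.
std::uint64_t rotl64(std::uint64_t x, unsigned n) {
    n &= 63;
    return n == 0 ? x : (x << n) | (x >> (64 - n));
}

// Hypothetical model of `rldimi ra, rs, sh, mb` (sh and mb must be < 64).
// In IBM bit numbering (bit 0 is the MSB) the mask covers bits mb..63-sh,
// which is bits sh..63-mb when counting from the LSB.
std::uint64_t rldimi(std::uint64_t ra, std::uint64_t rs, unsigned sh, unsigned mb) {
    std::uint64_t mask = (~0ULL >> mb) & (~0ULL << sh);
    return (rotl64(rs, sh) & mask) | (ra & ~mask);
}

// Step-by-step replay of the listed sequence; variables are named after
// the registers used in the summary.
std::uint64_t bswap64_rotate_insert(std::uint64_t r3) {
    std::uint64_t r5  = rotl64(r3, 16);
    std::uint64_t r4  = rotl64(r3, 8);
    std::uint64_t r9  = rotl64(r3, 24);
    std::uint64_t r10 = rotl64(r3, 32);
    std::uint64_t r11 = rotl64(r3, 48);
    std::uint64_t r12 = rotl64(r3, 56);
    r4 = rldimi(r4, r5, 8, 48);    // fixes one byte per insert
    r4 = rldimi(r4, r9, 16, 40);
    r4 = rldimi(r4, r10, 24, 32);
    r4 = rldimi(r4, r11, 40, 16);
    r4 = rldimi(r4, r12, 48, 8);
    r4 = rldimi(r4, r3, 56, 0);
    return r4;
}
```

Each rldimi corrects a single byte of the result, which is why the scalar expansion needs so many dependent instructions.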

But Power9 has vector bswap instructions, which can also be used to speed up the scalar bswap intrinsic. With this patch, bswap64 can be translated to:

mtvsrdd 34, 3, 3
xxbrd 34, 34
mfvsrld 3, 34
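
The three-instruction form can be sketched the same way. The model below is an assumption-laden illustration, not the patch itself: the Vsr struct and the function names are invented here, a VSX register is treated as two 64-bit doublewords (dw0 being doubleword 0 in big-endian numbering), and the special case where the RA field of mtvsrdd is 0 is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 128-bit VSX register as two doublewords.
struct Vsr {
    std::uint64_t dw0;  // doubleword 0 (high half in big-endian numbering)
    std::uint64_t dw1;  // doubleword 1 (low half)
};

// Reverse the eight bytes of a 64-bit value.
std::uint64_t byte_reverse64(std::uint64_t x) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (x & 0xFF);
        x >>= 8;
    }
    return r;
}

// mtvsrdd: place two GPR values into the two doublewords.
Vsr mtvsrdd(std::uint64_t ra, std::uint64_t rb) { return {ra, rb}; }

// xxbrd: byte-reverse each doubleword independently.
Vsr xxbrd(Vsr v) { return {byte_reverse64(v.dw0), byte_reverse64(v.dw1)}; }

// mfvsrd reads doubleword 0, mfvsrld reads doubleword 1.
std::uint64_t mfvsrd(Vsr v) { return v.dw0; }
std::uint64_t mfvsrld(Vsr v) { return v.dw1; }

// The sequence from the summary: splat, reverse, extract.
std::uint64_t bswap64_via_vsx(std::uint64_t r3) {
    Vsr vs34 = mtvsrdd(r3, r3);
    vs34 = xxbrd(vs34);
    return mfvsrld(vs34);
}
```

Because mtvsrdd is given the same source register twice, both doublewords hold the same value after xxbrd, so in this model reading either half yields the swapped result.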

Diff Detail

Repository
rL LLVM

Event Timeline

Carrot created this revision. Nov 1 2017, 2:38 PM
nemanjai accepted this revision. Nov 2 2017, 12:31 AM

This is a great idea considering direct moves are so fast on Power9. I guess we just didn't think of this use when we implemented the vector byte reversal. Thanks for doing this. Other than the rather obvious change to generate the faster mfvsrd instruction, this LGTM.

lib/Target/PowerPC/PPCISelLowering.cpp
8571 ↗(On Diff #121186)

Extracting LE doubleword 1 is probably better. It'll produce mfvsrd rather than mfvsrld on LE systems. The latter uses the permute pipeline and is potentially a higher-latency instruction. And it shouldn't make a functional difference since you're populating both doublewords.

This revision is now accepted and ready to land. Nov 2 2017, 12:31 AM
Carrot updated this revision to Diff 121343. Nov 2 2017, 11:46 AM
Carrot marked an inline comment as done.

Will check in this version.

This revision was automatically updated to reflect the committed changes.