- Adds SSE4.1 coverage.
- Adds coverage both with and without prefer-256-bit.
- Adds more power-of-two tests up to 512 bytes.
- Adds power-of-two-minus-one tests to verify overlapping loads.
- Adds tests at one and a half times a power of two (48, 96, 192, and 384 bytes).
- Adds greater-than/less-than tests from 16 to 512 bytes.
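As a rough illustration of what the power-of-two-minus-one tests are meant to exercise, here is a scalar sketch of an equality-only compare that covers an odd size with two overlapping loads. It assumes 8-byte loads rather than the XMM/YMM/ZMM loads used by the real expansion, and the names `load64` and `equal_8_to_16` are hypothetical, not taken from LLVM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unaligned 8-byte load via memcpy. */
static uint64_t load64(const unsigned char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Equality-only compare of n bytes, for 8 <= n <= 16.
 * The second load starts at n - 8, so it overlaps the first
 * load whenever n < 16 (for n == 15 they share one byte). */
static bool equal_8_to_16(const void *a, const void *b, size_t n) {
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    uint64_t head = load64(pa) ^ load64(pb);
    uint64_t tail = load64(pa + n - 8) ^ load64(pb + n - 8);
    /* Any differing byte leaves a nonzero bit in head or tail. */
    return (head | tail) == 0;
}
```

The same idea should scale to wider loads, for example 31 bytes as two overlapping 16-byte loads. It only works when the memcmp result is compared against zero, since the overlap discards ordering information.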
Details
- Reviewers
craig.topper RKSimon spatel xbolva00
Diff Detail
- Repository
rG LLVM Github Monorepo
Event Timeline
Please note: 48 byte and 96 byte comparisons have terrible code gen on AVX2 and later.
And now that I've had a chance to rebase D69044 ("up to four load pairs") on top of this updated test file, I can report that:
- The 48 and 96 byte memcmps do not improve for AVX2 or AVX512.
- The AVX1 code gen is relatively reasonable for 48 bytes: three XMM compares. It could have been one YMM compare and one zero-extended XMM compare.
I think I figured out why 48 and 96 bytes are awful. It seems that lowering a vector that is the result of a zero-extended scalar generates terrible code. Should combineVectorSizedSetCCEquality detect the zero extend and create an ISD::INSERT_SUBVECTOR node? Or should something more fundamental detect this scenario and create the ISD::INSERT_SUBVECTOR?
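For readers following along, the "one YMM compare and one zero-extended XMM compare" shape for 48 bytes can be modeled in portable C roughly as below. This is only a sketch of the intended data flow: `vec256` and the helper functions are hypothetical stand-ins for vector registers and operations, and nothing here reflects the code LLVM actually emits.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 256-bit register: four 64-bit lanes. */
typedef struct { uint64_t lane[4]; } vec256;

/* Load 32 bytes into all four lanes. */
static vec256 load256(const unsigned char *p) {
    vec256 v;
    memcpy(v.lane, p, 32);
    return v;
}

/* Load 16 bytes into the low two lanes and zero the upper two,
 * modeling a 128-bit value zero-extended to 256 bits. */
static vec256 load128_zext(const unsigned char *p) {
    vec256 v = {{0, 0, 0, 0}};
    memcpy(v.lane, p, 16);
    return v;
}

/* Lane-wise XOR: nonzero bits mark differing bytes. */
static vec256 vxor(vec256 a, vec256 b) {
    vec256 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] ^ b.lane[i];
    return r;
}

/* Lane-wise OR: merges the difference masks of both chunks. */
static vec256 vor(vec256 a, vec256 b) {
    vec256 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] | b.lane[i];
    return r;
}

/* True when every lane is zero. */
static bool is_zero(vec256 v) {
    return (v.lane[0] | v.lane[1] | v.lane[2] | v.lane[3]) == 0;
}

/* 48-byte equality: bytes 0..31 as one full-width chunk,
 * bytes 32..47 as one zero-extended half-width chunk. */
static bool equal48(const void *a, const void *b) {
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    vec256 d0 = vxor(load256(pa), load256(pb));
    vec256 d1 = vxor(load128_zext(pa + 32), load128_zext(pb + 32));
    return is_zero(vor(d0, d1));
}
```

The zeroed upper lanes of the second chunk cancel out in the XOR, so they never contribute a false difference.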
Ping. Is this ready to land? I also have a forthcoming patch for the 48 and 96 byte scenarios.