This is the last step needed to fix PR33325:
https://bugs.llvm.org/show_bug.cgi?id=33325
We're trading branches and compares for loads and logic ops. This makes the code smaller and hopefully faster in most cases.
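As a rough illustration of that trade, here is a minimal C sketch of a 16-byte equality check done with loads and logic ops instead of a compare-and-branch per block. The names `load64` and `eq16` are hypothetical and only for illustration; this is source-level code showing the idea, not the IR or assembly the patch actually produces.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unaligned 8-byte load (memcpy avoids aliasing issues). */
static uint64_t load64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Sketch of memcmp(a, b, 16) == 0 without a branch per block:
   XOR each pair of loads, OR the differences, compare once against zero. */
static int eq16(const void *a, const void *b) {
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    uint64_t d0 = load64(pa) ^ load64(pb);         /* nonzero if bytes 0..7 differ  */
    uint64_t d1 = load64(pa + 8) ^ load64(pb + 8); /* nonzero if bytes 8..15 differ */
    return (d0 | d1) == 0;
}
```

The single final comparison replaces the early-exit branch that a block-by-block compare would need after each load pair.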
The 24-byte case shows an interesting construct: we load the trailing scalar elements into vector registers and generate the same pcmpeq+movmsk code that we expected for a pair of full vector elements (see the 32- and 64-byte tests).
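The 24-byte shape can be sketched with SSE2 intrinsics, assuming an x86 target. The function name `eq24` is hypothetical, and the way the two compare results are combined (an AND before a single movmsk) is an assumption for illustration; the instruction sequence in the actual tests may differ. On targets without SSE2 the sketch falls back to plain `memcmp`.

```c
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sketch of memcmp(a, b, 24) == 0: one full 16-byte vector block plus
   a trailing 8-byte element that is also loaded into a vector register. */
static int eq24(const void *a, const void *b) {
#if defined(__SSE2__)
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    /* Full 16-byte block: unaligned vector loads. */
    __m128i a0 = _mm_loadu_si128((const __m128i *)pa);
    __m128i b0 = _mm_loadu_si128((const __m128i *)pb);
    /* Trailing 8 bytes: loaded into the low half, upper half zeroed,
       so the upper lanes of a1 and b1 always compare equal. */
    __m128i a1 = _mm_loadl_epi64((const __m128i *)(pa + 16));
    __m128i b1 = _mm_loadl_epi64((const __m128i *)(pb + 16));
    /* pcmpeqb on each pair, combine, then one pmovmskb. */
    __m128i eq = _mm_and_si128(_mm_cmpeq_epi8(a0, b0),
                               _mm_cmpeq_epi8(a1, b1));
    return _mm_movemask_epi8(eq) == 0xFFFF; /* all 16 byte lanes equal */
#else
    return memcmp(a, b, 24) == 0; /* portable fallback */
#endif
}
```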
We're losing coverage for the case where we're testing for equality with only one load per block. What about adding a line with
-memcmp-num-loads-per-block=1 to keep these alive?