This is the payoff for D31156: if a target has efficient comparison instructions for vector-sized equality, we can replace memcmp calls with inline code that is both smaller and faster.
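
For illustration only, here is a source-level sketch of the kind of inline code meant above, assuming an x86 target with SSE2 and a memcmp of exactly 16 bytes whose result is only tested against zero. The helper names equal16_sse2 and equal16_libc are hypothetical and not part of this patch, and the instruction sequence the backend actually emits may differ.

```c
#include <emmintrin.h> /* SSE2 intrinsics (assumes an x86 target) */
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when the 16 bytes at a and b are identical.
 * Uses unaligned loads, a bytewise compare, and a mask extraction. */
static inline bool equal16_sse2(const void *a, const void *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Each byte lane becomes 0xFF where the inputs match, 0x00 otherwise. */
    __m128i eq = _mm_cmpeq_epi8(va, vb);
    /* Collect the top bit of every lane; all 16 bits set means all equal. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* Reference form: the library call that the inline sequence stands in for. */
static inline bool equal16_libc(const void *a, const void *b) {
    return memcmp(a, b, 16) == 0;
}
```

Both helpers should agree for any pair of 16-byte buffers. The sketch covers only the equality case; it says nothing about the ordering that a general memcmp result encodes.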
It looks like we're missing a load-folding opportunity in the first test, but that's a separate problem.
I can enable the 32-byte case for AVX2 as an immediate follow-up, but I want to make sure this part looks ok before adding that.