This is the payoff for D31156: if a target has efficient comparison instructions for vector-sized equality, we can replace memcmp calls with inline code that is both smaller and faster.
It looks like we're missing a load-folding opportunity in the first test, but that's a separate problem.
I can enable the 32-byte case for AVX2 as an immediate follow-up, but I want to make sure this part looks ok before adding that.
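For readers skimming the thread, here is a rough source-level sketch of what a 16-byte memcmp equality check could look like once expanded inline. It is only an illustration, not the patch's actual lowering: the helper name `equal16` is hypothetical, and the byte-compare plus movemask sequence is an assumption about one reasonable SSE2 form. On targets without SSE2 the sketch falls back to a plain memcmp call.

```cpp
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not from the patch): returns true iff the 16 bytes
// at a and b are identical, i.e. the same answer as memcmp(a, b, 16) == 0.
bool equal16(const void *a, const void *b) {
#if defined(__SSE2__)
  // Unaligned 16-byte loads; no alignment is assumed for a or b.
  __m128i va = _mm_loadu_si128(static_cast<const __m128i *>(a));
  __m128i vb = _mm_loadu_si128(static_cast<const __m128i *>(b));
  // Byte-wise compare: each lane becomes 0xFF where the bytes match.
  __m128i eq = _mm_cmpeq_epi8(va, vb);
  // Collect the top bit of every lane; 0xFFFF means all 16 bytes matched.
  return _mm_movemask_epi8(eq) == 0xFFFF;
#else
  // Portable fallback when SSE2 is not available.
  return std::memcmp(a, b, 16) == 0;
#endif
}
```

This only covers the equality-only use of memcmp (the result compared against zero); an ordered result would need a different sequence.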
What's the point of performing the load in a vector type if you're going to immediately bitcast the result to an integer type? IIRC DAGCombine will fold this away.