The test diff for PowerPC is minimal, but for x86, there's a substantial difference because branches are assumed cheap and SDAG can't optimize across blocks. Instead of this:
  _cmp_eq8:
    movq  (%rdi), %rax
    cmpq  (%rsi), %rax
    je    LBB23_1
  ## BB#2:                  ## %res_block
    movl  $1, %ecx
    jmp   LBB23_3
  LBB23_1:
    xorl  %ecx, %ecx
  LBB23_3:                  ## %endblock
    xorl  %eax, %eax
    testl %ecx, %ecx
    sete  %al
    retq
We get this:
  _cmp_eq8:
    movq  (%rdi), %rcx
    xorl  %eax, %eax
    cmpq  (%rsi), %rcx
    sete  %al
    retq
And that matches the optimal codegen that we get from the current expansion in SelectionDAGBuilder::visitMemCmpCall(). If this looks right, then I just need to confirm that vector-sized expansion will work from here, and we can enable CGP memcmp() expansion for x86. That is, we'll bypass the power-of-2 special cases currently optimized in SDAG because we can lower the IR produced here optimally.
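For readers who don't have the IR in front of them, here is a rough C-level sketch of what the single-block case computes. It is only an illustration, not the IR the patch emits: the function names are hypothetical, and the memcpy() calls stand in for the two 8-byte loads (they are the portable way to express an unaligned load in C).

```c
#include <stdint.h>
#include <string.h>

/* Library-call form: the equality-only memcmp() that the expansion starts from.
   Hypothetical name, used only for this sketch. */
static int cmp_eq8_libcall(const void *x, const void *y) {
    return memcmp(x, y, 8) == 0;
}

/* Expanded form: one 8-byte load per operand, then a single compare.
   No branches and no extra blocks are needed, because only equality
   (not ordering) is requested. */
static int cmp_eq8_expanded(const void *x, const void *y) {
    uint64_t a, b;
    memcpy(&a, x, sizeof a); /* stands in for the load from x */
    memcpy(&b, y, sizeof b); /* stands in for the load from y */
    return a == b;
}
```

The second function has the same shape as the shorter assembly listing above: two loads feeding one cmpq/sete, with no res_block/endblock control flow left to clean up.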
I think the comment should also explain why, in this case (only one block), we don't want to abort right away and let SDAG do the lowering (IIUC, something along the lines of "in that case, we still want to do the memcmp expansion here because this code handles vector expansions better").