Compiling the attached new test case with the original LLVM generates:
    movzwl a, %esi        // %esi already contains zero extended value
    calll v1
    jmp .LBB0_3

    movzwl a+2, %esi      // %esi already contains zero extended value
    calll v2

    movzwl %si, %eax      // another zext, we should avoid it.
    cmpl $4, %eax
In the source code all related values are 16 bits wide. During lowering, X86TargetLowering::EmitCmp intentionally inserts a zero extension before the cmp in order to avoid a 16-bit immediate, which is slow on modern x86. Later, X86FixupBWInsts.cpp rewrites the 16-bit loads into zero-extending loads, which makes the extension before the cmp redundant.
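For illustration, the shape that leads to this can be sketched in source form. This is a hypothetical reconstruction, not the attached test case: the name foo, the counters v1_calls and v2_calls, and the bodies of v1 and v2 are assumptions, and the snippet is not guaranteed to reproduce the exact assembly above.

```cpp
#include <cstdint>

// Hypothetical reconstruction of the pattern; NOT the attached test case.
uint16_t a[2] = {4, 9};  // 16-bit globals, addressed as `a` and `a+2`
int v1_calls = 0;        // assumption: counters so the sketch is observable
int v2_calls = 0;

void v1(uint16_t) { ++v1_calls; }
void v2(uint16_t) { ++v2_calls; }

// Each branch loads a 16-bit value in its own basic block. The compare
// against the immediate 4 happens later, in the join block.
bool foo(bool first) {
  uint16_t t;
  if (first) {
    t = a[0];  // 16-bit load in one predecessor block
    v1(t);
  } else {
    t = a[1];  // 16-bit load in the other predecessor block
    v2(t);
  }
  return t == 4;  // compare of a 16-bit value with an immediate
}
```

The point of the sketch is only that the loads and the compare live in different basic blocks, so the extension inserted before the compare has no load in its own block to pair with.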
The extension created by X86TargetLowering::EmitCmp is beneficial, so to get the best performance we need to fold it into the preceding load instructions. The existing function optimizeLoadInstr() can only fold a load into its single user in the same BB. It therefore handles the simple ext(load) pair, but it cannot fold across BBs and it cannot fold an ext into a load, so it cannot be used for this purpose.
So I wrote a new pass, X86FoldXBBExtLoad.cpp, to fold 16-bit extensions into the preceding load instructions.
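The idea of a cross-BB fold can be sketched with a toy model. This is a minimal sketch under stated assumptions, not the code in X86FoldXBBExtLoad.cpp: the types Op, Inst and Block and the functions lastDef and foldZext are hypothetical, registers are plain names with no sub-register modeling, and the legality conditions are simplified guesses (every predecessor must define the source with a 16-bit load, and the source must not be redefined before the extension).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: hypothetical types, not LLVM's MachineInstr API.
enum class Op { Load16, LoadZext16, Zext16, Copy, Other };

struct Inst {
  Op op;
  std::string dst;  // register written ("" if none)
  std::string src;  // register or memory operand read
};

struct Block {
  std::vector<Inst> insts;
  std::vector<int> preds;  // indices of predecessor blocks
};

// Index of the last instruction in B that writes Reg, or -1 if none.
int lastDef(const Block &B, const std::string &Reg) {
  for (int i = static_cast<int>(B.insts.size()) - 1; i >= 0; --i)
    if (B.insts[i].dst == Reg)
      return i;
  return -1;
}

// Try to fold the 16-bit zext at F[BI].insts[Idx] into the loads that
// define its source in every predecessor block. On success the loads
// become zero-extending loads and the zext becomes a plain copy.
bool foldZext(std::vector<Block> &F, int BI, int Idx) {
  Block &B = F[BI];
  assert(Idx >= 0 && Idx < static_cast<int>(B.insts.size()));
  Inst &Z = B.insts[Idx];
  if (Z.op != Op::Zext16 || B.preds.empty())
    return false;

  // The source must reach the zext unchanged from the block entry.
  for (int i = 0; i < Idx; ++i)
    if (B.insts[i].dst == Z.src)
      return false;

  // Every predecessor must end with the source defined by a 16-bit load.
  std::vector<Inst *> Defs;
  for (int P : B.preds) {
    if (P == BI)
      return false;  // self loops are skipped in this sketch
    int D = lastDef(F[P], Z.src);
    if (D < 0)
      return false;
    Op DefOp = F[P].insts[D].op;
    if (DefOp != Op::Load16 && DefOp != Op::LoadZext16)
      return false;
    Defs.push_back(&F[P].insts[D]);
  }

  // All checks passed: extend at the loads, drop the extension here.
  for (Inst *D : Defs)
    D->op = Op::LoadZext16;
  Z.op = Op::Copy;
  return true;
}
```

Applied to a model of the assembly above (two predecessor blocks loading into esi, a join block extending esi into eax), foldZext marks both loads as zero-extending and leaves only a copy in the join block. The fold is all-or-nothing: if any predecessor defines the source some other way, nothing is changed.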