This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] try to fold insertelt + vector op into scalar op + insertelt
AbandonedPublic

Authored by spatel on Aug 20 2018, 1:05 PM.

Details

Summary

This is a transform suggested in PR37463:
https://bugs.llvm.org/show_bug.cgi?id=37463

This makes sense for IR because we should always have equal or better analysis for a scalar binop vs. a vector binop. It can also lead to follow-on transforms/simplifications as seen in some of the div/rem tests that change opcodes.

It's not as clearly a win for codegen (where keeping everything as vector ops might be better for perf), but I haven't seen any regressions yet, and it should be possible to reverse this if needed.

A couple of details worth pointing out:

  1. This does not need to be limited with isSafeToSpeculativelyExecute() like the related shuffle transforms in this function. We're doing strictly less work by only operating on 1 lane of a vector, so there's no opportunity for executing on unknown bits. So div/rem are included in the diffs.
  2. As noted in the last code comment, we can't just insert into an undef vector constant because that could be more undef than the original code was. But we'll probably end up inserting into an undef vector anyway in most cases because that's what the constants simplify to (as seen in the test results).

Diff Detail

Event Timeline

spatel created this revision.Aug 20 2018, 1:05 PM

I'm concerned the backend won't reliably reverse the transform. For integer operations, SelectionDAG heavily depends on IR types to decide whether to perform an operation in integer or SIMD registers, and transferring values between the two register files is slow. Yes, the second of an insertelement is a scalar, but many backends pattern-match a load+insertelement to a vector register load.

I agree this makes sense for IR.

I'm concerned the backend won't reliably reverse the transform. For integer operations, SelectionDAG heavily depends on IR types to decide whether to perform an operation in integer or SIMD registers, and transferring values between the two register files is slow. Yes, the second of an insertelement is a scalar, but many backends pattern-match a load+insertelement to a vector register load.

If we add a load to the sequence, we have something like this:

define <4 x i32> @vector_add_constant(i32* %p) {
  %x = load i32, i32* %p
  %ins = insertelement <4 x i32> undef, i32 %x, i32 0
  %bo = add <4 x i32> %ins, <i32 42, i32 42, i32 42, i32 42>
  ret <4 x i32> %bo
}

define <4 x i32> @scalar_add_constant(i32* %p) {
  %x = load i32, i32* %p
  %b = add i32 %x, 42
  %bo = insertelement <4 x i32> undef, i32 %b, i32 0
  ret <4 x i32> %bo
}

And for x86, that's:

vmovd	(%rdi), %xmm0           ## xmm0 = mem[0],zero,zero,zero
vpaddd	LCPI4_0(%rip), %xmm0, %xmm0

vs.

movl	(%rdi), %eax
addl	$42, %eax
vmovd	%eax, %xmm0

Let me see about adding a DAG reversal...

Looks ok to me.

lib/Transforms/InstCombine/InstructionCombining.cpp
1362

I'd prefer ConstIsOp1

1372

Hm, why dyn_cast<>(), we did just use CreateBinOp()?

lebedev.ri accepted this revision.Aug 24 2018, 11:30 AM
This revision is now accepted and ready to land.Aug 24 2018, 11:30 AM

Note: this patch is on hold pending backend enhancements that try to avoid known codegen regressions.
The first 2 of those are D51186 and D51125.

lib/Transforms/InstCombine/InstructionCombining.cpp
1362

Yes, will fix.

1372

CreateBinOp() will return a Constant if it manages to constant fold, so we can't be sure that 'ScalarOp' is actually a BinaryOperator. I suppose that would mean X was a constant...which seems unlikely. But it's the kind of thing a fuzzer finds N months after commit. :)

Note: this patch is on hold pending backend enhancements that try to avoid known codegen regressions.
The first 2 of those are D51186 and D51125.

Yep, i noticed, thank you for doing these!

lib/Transforms/InstCombine/InstructionCombining.cpp
1372

Right, constant-folding again. I should try to keep that in mind.

spatel abandoned this revision.Jun 1 2020, 6:56 AM

I'm finally going to abandon this. D80885 would introduce the same transform, but it includes load-from-memory and cost model constraints, so should be safer. And I expect that we would still run -instcombine after -vector-combine in a typical opt pipeline, so there shouldn't be much chance of losing overall optimization power.