This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] canonicalize a scalar-select-of-vectors to vector select
ClosedPublic

Authored by spatel on Aug 12 2019, 10:26 AM.

Details

Summary

This pattern may arise with an enhancement to SLP vectorization suggested in PR42755:
https://bugs.llvm.org/show_bug.cgi?id=42755

For all in-tree targets that I looked at, codegen looks better when we change to a vector select, so I think this is safe to do without a cost model (in other words, as a target-independent canonicalization).

For example, if the condition of the select is a scalar, we end up with something like this on x86:

	vpcmpgtd	%xmm0, %xmm1, %xmm0
	vpextrb	$12, %xmm0, %eax
	testb	$1, %al
	jne	LBB0_2
## %bb.1:
	vmovaps	%xmm3, %xmm2
LBB0_2:
	vmovaps	%xmm2, %xmm0

Rather than the splat-condition variant:

	vpcmpgtd	%xmm0, %xmm1, %xmm0
	vpshufd	$255, %xmm0, %xmm0      ## xmm0 = xmm0[3,3,3,3]
	vblendvps	%xmm0, %xmm2, %xmm3, %xmm0

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Aug 12 2019, 10:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 12 2019, 10:26 AM

The codegen improvement is a bug in X86TargetLowering::LowerSELECT(), not dis-similar to https://bugs.llvm.org/show_bug.cgi?id=42903
It never even tries to lower integer vector select to blend, but produces cmov

Optimized type-legalized selection DAG: %bb.0 'extract_cond:'
SelectionDAG has 17 nodes:
  t0: ch = EntryToken
            t6: v4i32,ch = CopyFromReg t0, Register:v4i32 %2
          t21: v16i8 = bitcast t6
        t23: i8 = extract_vector_elt t21, Constant:i64<12>
      t20: i8 = and t23, Constant:i8<1>
      t2: v4i32,ch = CopyFromReg t0, Register:v4i32 %0
      t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
    t11: v4i32 = select t20, t2, t4
  t14: ch,glue = CopyToReg t0, Register:v4i32 $xmm0, t11
  t15: ch = X86ISD::RET_FLAG t14, TargetConstant:i32<0>, Register:v4i32 $xmm0, t14:1



Legalizing vector op: t11: v4i32 = select t20, t2, t4
Trying custom legalization
Creating constant: t24: i8 = Constant<5>
Creating constant: t25: i8 = Constant<0>
Creating new node: t26: i32 = X86ISD::CMP t20, Constant:i8<0>
Creating new node: t27: v4i32 = X86ISD::CMOV t4, t2, Constant:i8<5>, t26
Successfully custom legalized node
Vector-legalized selection DAG: %bb.0 'extract_cond:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
      t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
      t2: v4i32,ch = CopyFromReg t0, Register:v4i32 %0
              t6: v4i32,ch = CopyFromReg t0, Register:v4i32 %2
            t21: v16i8 = bitcast t6
          t23: i8 = extract_vector_elt t21, Constant:i64<12>
        t20: i8 = and t23, Constant:i8<1>
      t26: i32 = X86ISD::CMP t20, Constant:i8<0>
    t27: v4i32 = X86ISD::CMOV t4, t2, Constant:i8<5>, t26
  t14: ch,glue = CopyToReg t0, Register:v4i32 $xmm0, t27
  t15: ch = X86ISD::RET_FLAG t14, TargetConstant:i32<0>, Register:v4i32 $xmm0, t14:1

I'm honestly not sure here if i would consider splat-of-i1 or i1 more canonical,
i would kind-of guessed i1 since it is a single bit while vector isn't.

I'm honestly not sure here if i would consider splat-of-i1 or i1 more canonical,
i would kind-of guessed i1 since it is a single bit while vector isn't.

Yes, I can see that argument. However, if we consider the i1 canonical, then we need to reverse this patch + add the codegen fixup + make sure that IR transforms based on vector splat patterns can match that potential variation of the pattern. That seems like a lot more work than what we have here: try to convert a select with vector true/false operands to also have a vector condition because we know that plays nicely with vector code in general.

lebedev.ri accepted this revision.Aug 14 2019, 2:46 PM

I'm honestly not sure here if i would consider splat-of-i1 or i1 more canonical,
i would kind-of guessed i1 since it is a single bit while vector isn't.

Yes, I can see that argument.

However, if we consider the i1 canonical, then we need to reverse this patch +

True.

add the codegen fixup

To be noted, that codegen fix is wanted regardless.

+ make sure that IR transforms based on vector splat patterns can match that potential variation of the pattern.

I don't have any good grasp to guess what those would be.

That seems like a lot more work than what we have here: try to convert a select with vector true/false operands to also have a vector condition because we know that plays nicely with vector code in general.

I just feel like pointing out that it is the case because we don't do said opposite transform (select based on splat -> select-of-vecs) in dagcombine as of now.
But yeah, i suspect this may be more canonical; e.g. on x86 there is no i128 blend, only <? X ?> blends.

Looks ok to me, but would be good for someone else to double-check.

This revision is now accepted and ready to land.Aug 14 2019, 2:46 PM

What about very long vectors that do no fit into single register? Is it cost effective for such vectors too?

What about very long vectors that do no fit into single register? Is it cost effective for such vectors too?

We would split the long vectors into values that fit the target registers in the backend. At that point, the target can decide if N vector selects are better or worse than a transfer to scalar compare and branch. As @lebedev.ri mentioned, we're missing that logic in SDAG, but given that this transform produces the better code for the default case, I don't think we need to make this patch dependent on backend fixups.

What about very long vectors that do no fit into single register? Is it cost effective for such vectors too?

We would split the long vectors into values that fit the target registers in the backend. At that point, the target can decide if N vector selects are better or worse than a transfer to scalar compare and branch. As @lebedev.ri mentioned, we're missing that logic in SDAG, but given that this transform produces the better code for the default case, I don't think we need to make this patch dependent on backend fixups.

But it adds extra cost for vectors splitting. Maybe limit the size of the vectors in the patch?

What about very long vectors that do no fit into single register? Is it cost effective for such vectors too?

We would split the long vectors into values that fit the target registers in the backend. At that point, the target can decide if N vector selects are better or worse than a transfer to scalar compare and branch. As @lebedev.ri mentioned, we're missing that logic in SDAG, but given that this transform produces the better code for the default case, I don't think we need to make this patch dependent on backend fixups.

But it adds extra cost for vectors splitting. Maybe limit the size of the vectors in the patch?

How would we do that? This is canonicalization, so we are not using any target-dependent information here. Presumably, whoever or whatever created the illegal vectors in the first place knows that codegen will have to alter the those ops to create legal code, so that will be handled by a pass that has a cost model.

What about very long vectors that do no fit into single register? Is it cost effective for such vectors too?

We would split the long vectors into values that fit the target registers in the backend. At that point, the target can decide if N vector selects are better or worse than a transfer to scalar compare and branch. As @lebedev.ri mentioned, we're missing that logic in SDAG, but given that this transform produces the better code for the default case, I don't think we need to make this patch dependent on backend fixups.

But it adds extra cost for vectors splitting. Maybe limit the size of the vectors in the patch?

How would we do that? This is canonicalization, so we are not using any target-dependent information here. Presumably, whoever or whatever created the illegal vectors in the first place knows that codegen will have to alter the those ops to create legal code, so that will be handled by a pass that has a cost model.

It means that you can make a transformation that may be less cost-effective than the original code.

What about very long vectors that do no fit into single register? Is it cost effective for such vectors too?

We would split the long vectors into values that fit the target registers in the backend. At that point, the target can decide if N vector selects are better or worse than a transfer to scalar compare and branch. As @lebedev.ri mentioned, we're missing that logic in SDAG, but given that this transform produces the better code for the default case, I don't think we need to make this patch dependent on backend fixups.

But it adds extra cost for vectors splitting. Maybe limit the size of the vectors in the patch?

How would we do that? This is canonicalization, so we are not using any target-dependent information here. Presumably, whoever or whatever created the illegal vectors in the first place knows that codegen will have to alter the those ops to create legal code, so that will be handled by a pass that has a cost model.

It means that you can make a transformation that may be less cost-effective than the original code.

That's correct. As for much of instcombine, this is not proposed as an optimization, just a canonicalization. If there's reason to believe that this will induce more perf regressions than wins, then we should prepare the backend to reverse the transform in advance. But as I wrote before, I don't see that happening for the common cases/targets that I looked at.

What about very long vectors that do no fit into single register? Is it cost effective for such vectors too?

We would split the long vectors into values that fit the target registers in the backend. At that point, the target can decide if N vector selects are better or worse than a transfer to scalar compare and branch. As @lebedev.ri mentioned, we're missing that logic in SDAG, but given that this transform produces the better code for the default case, I don't think we need to make this patch dependent on backend fixups.

But it adds extra cost for vectors splitting. Maybe limit the size of the vectors in the patch?

How would we do that? This is canonicalization, so we are not using any target-dependent information here. Presumably, whoever or whatever created the illegal vectors in the first place knows that codegen will have to alter the those ops to create legal code, so that will be handled by a pass that has a cost model.

It means that you can make a transformation that may be less cost-effective than the original code.

I agree with @spatel, there is no TTI in InstCombine, neither should there be.

In general, i do think this fold is the opposite from what the correct canonicalization is,
but as @spatel notes, this fix clearly results in better results right as of this moment.
Things can be readjusted later.

This revision was automatically updated to reflect the committed changes.