Page MenuHomePhabricator

[X86] Legalize v2i32 via widening rather than promoting

Authored by craig.topper on Dec 10 2017, 10:31 PM.



Previously we promoted v2i32 to v2i64. This patch changes this to widen to v4i32 instead.

I think widening is a better behavior for illegal vectors. In fact we have an experimental flag to do just that, that has been around for a few years.

There are definitely a few deficiencies observed in here, but I think overall this is an improvement. I'll submit more patches for some of the issues.

One of my goals is to try to clean up some of the handling we have to do to account for the current legalization around masked load, store, gather, and scatter.

Diff Detail

Event Timeline

craig.topper created this revision.Dec 10 2017, 10:31 PM

Rebase and remove some code I forgot remove earlier

zvi added a comment.Jan 7 2018, 11:02 PM

There are some regressions that need to be addressed (or we decide to accept), but overall your approach seems right to me.


This patch does not change mask argumenent representation, so his compare is redundant, right?


Any way to easily fix vmovd+vpinsrd -> vmovq?


Is this redundant move a known issue?


Two more missed vmovq opportunities


What happened here?


What about this?

Rebase on top of other recent changes.

Some random minor comments, but I'm not great on calling conventions and what issues we might encounter with this change.


Is there a helper function we should be using instead of this?


What happened? This is way out!


Again, way too high


I think this is OK, but it still makes me nervous. We go from accessing 64-bits to 128-bits per argument.




Yes - why didn't EltsFromConsecutiveLoads convert this to a i64 VZEXT_LOAD (VMOVQ)?



Rebase. Still need to go through Simon's comments.

craig.topper added inline comments.Feb 12 2018, 4:59 PM

I assume it switched between these rows in the cost table.

{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v2i64, 2*10 },

{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v4i32, 4*10 },

The 64-bits we get today is just dumb luck. If we compile with sse2 we get

subq    $16, %rsp
.seh_stackalloc 16
pshufd  $212, (%rdx), %xmm1     # xmm1 = mem[0,1,1,3]
pshufd  $212, (%rcx), %xmm0     # xmm0 = mem[0,1,1,3]
psubq   %xmm1, %xmm0
addq    $16, %rsp

This compare is emulating an a v2i64 arithmetic shift right since we don't have that instruction. We only consider the lower bit of each mask to be valid coming in so we have to do a sign_extend_inreg operation.

I thought we had a combine that used demanded bits that should have removed the right shift. But I think its getting tripped up by the concat_vectors that's in front of the gather.


I'm not sure anything ever saw the VPINSRD as anything more than a insert_vector_elt. We never had it as a shuffle or build_vector where we could detect multiple elements.

I wonder if we shouldn't just custom legalize v2i32 loads to VZEXT_LOAD during type legalization?


It's there to clear bits 255:128 because we don't do a good job of detecting when the producer already zeroed those bits. I think we only whitelist a couple of instructions today.


After type leglaization we have this. We could probably add a combine to catch the truncated build vector with sign extended inputs and squash them.

Type-legalized selection DAG: %bb.0 'signbits_sext_v2i64_sitofp_v2f64:'
SelectionDAG has 21 nodes:

t0: ch = EntryToken
      t33: i32 = extract_vector_elt t27, Constant:i64<0>
      t35: i32 = extract_vector_elt t27, Constant:i64<1>
    t37: v4i32 = BUILD_VECTOR t33, t35, undef:i32, undef:i32
  t30: v2f64 = X86ISD::CVTSI2P t37
t17: ch,glue = CopyToReg t0, Register:v2f64 $xmm0, t30
      t2: i32,ch = CopyFromReg t0, Register:i32 %0
    t5: i64 = sign_extend t2
      t4: i32,ch = CopyFromReg t0, Register:i32 %1
    t6: i64 = sign_extend t4
  t26: v4i64 = BUILD_VECTOR t5, t6, undef:i64, undef:i64
t27: v4i32 = truncate t26
t18: ch = X86ISD::RET_FLAG t17, TargetConstant:i32<0>, Register:v2f64 $xmm0, t17:1

We type legalized the v2i64->v2i32 truncate by widening to v4i64 and then truncating. Maybe we just need to emit a bitcast to v4i32 and a vector shuffle ourselves?

We now use getTypeAction in place of ExperimentalVectorWideningLegalization in most places so we no longer have to check for v2i32 in the bitcast code.

craig.topper marked an inline comment as done.Feb 12 2018, 5:58 PM

Add custom legalization for v2i32 loads to v2f64 in 32-bit mode to avoid extract and insert.

Custom widen v2i32 stores as well.

FWIW, I'm a *huge* fan of the approach of legalizing via widening (and have advocated for this in the past). Is there any specific review feedback you're looking for here beyond what you've already got?

craig.topper planned changes to this revision.Mar 26 2018, 1:44 PM

No I just need to sit down and try to fix the regressions some more.

Rebase and improve gather/scatter with v2i32 indices

Fix bug in scatter legalization.

Try to use pmaddwd for v2i32 muls before type legalization obscures the zext.

craig.topper planned changes to this revision.Mar 27 2018, 9:01 AM
RKSimon added inline comments.Mar 28 2018, 5:49 AM
726 ↗(On Diff #139873)

convert to for-range loop or for (unsigned i = 0, e = Op.getNumOperands(); i != e; ++i) {

BTW - Can this diff be pulled out?

craig.topper added inline comments.Apr 3 2018, 2:32 PM
726 ↗(On Diff #139873)

It doesn't change any test cases in the existing lit tests today.

Address a review comment.

craig.topper abandoned this revision.Dec 3 2018, 10:15 PM

Going to replace with a patch for all narrow types