This is an archive of the discontinued LLVM Phabricator instance.

[X86] Legalize v32i1 without BWI via splitting to v16i1 rather than the default of promoting to v32i8.
Closed, Public

Authored by craig.topper on Jan 13 2018, 11:52 AM.

Details

Summary

For the most part it's better to keep v32i1 as a mask type of narrower width than to promote it to a ymm register.

I had to add some overrides to the methods that get the types for the calling convention so that we still use v32i8 for argument/return purposes.
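
Roughly the shape of those overrides (a hypothetical sketch only; the exact hook signatures and the conditions used in the actual patch may differ):

  // Hypothetical sketch, not the patch itself: keep v32i1 as a mask type
  // internally, but still pass and return it as a single v32i8 so the
  // calling convention is unchanged.
  MVT X86TargetLowering::getRegisterTypeForCallingConv(LLVMContext &Context,
                                                       EVT VT) const {
    if (VT == MVT::v32i1 && Subtarget.hasAVX512() && !Subtarget.hasBWI())
      return MVT::v32i8;
    return TargetLowering::getRegisterTypeForCallingConv(Context, VT);
  }

  unsigned X86TargetLowering::getNumRegistersForCallingConv(LLVMContext &Context,
                                                            EVT VT) const {
    if (VT == MVT::v32i1 && Subtarget.hasAVX512() && !Subtarget.hasBWI())
      return 1; // one YMM register holding the v32i8
    return TargetLowering::getNumRegistersForCallingConv(Context, VT);
  }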

There are still some regressions in here. I definitely saw some around shuffles. I think we should probably move vXi1 shuffle handling from lowering to a DAG combine, where the extend and truncate we have to emit would be better combined.

I think we also need a DAG combine to remove the trunc from (extract_vector_elt (trunc)).
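
Something like the following (a hypothetical sketch with an assumed helper name; a real combine would need legality and profitability checks):

  // Hypothetical sketch: rewrite (extract_vector_elt (truncate V), Idx) to
  // extract directly from the wide source and adjust the scalar afterwards,
  // so the vector truncate disappears when the extract was its only user.
  static SDValue combineExtractOfTruncate(SDNode *N, SelectionDAG &DAG) {
    SDValue Vec = N->getOperand(0);
    SDValue Idx = N->getOperand(1);
    if (Vec.getOpcode() != ISD::TRUNCATE || !Vec.hasOneUse())
      return SDValue();
    SDLoc DL(N);
    SDValue WideVec = Vec.getOperand(0);
    EVT WideEltVT = WideVec.getValueType().getVectorElementType();
    SDValue Elt =
        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, WideEltVT, WideVec, Idx);
    return DAG.getAnyExtOrTrunc(Elt, DL, N->getValueType(0));
  }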

Overall this removes something like 13000 CHECK lines from lit tests.

Diff Detail

Event Timeline

craig.topper created this revision. Jan 13 2018, 11:52 AM

Rebase after improving vXi16/vXi8 select combines

There are still some regressions in here. I definitely saw some around shuffles. I think we should probably move vXi1 shuffle handling from lowering to a DAG combine, where the extend and truncate we have to emit would be better combined.

I've been putting off looking at this, but I've been thinking about ways to improve shuffle combining - two things come to mind: some form of DemandedVectorElts (preferably in DAGCombine, but just in x86 if necessary) and the ability for the x86 shuffle combines to combine through subvector insertion/extraction/concat (and avx512 compress/expand?). Supporting extend/truncate could fit in with the second.

While the v32i1 case is technically a regression, it's not a fair test of what happens with v32i1 shuffles. Previously we were getting lucky with type promotion working favorably with the types used for argument passing. I think a shuffle sandwiched between, say, an icmp and a select condition would be very different.

I'll add more test cases to show what happens when vXi1 shuffles are used near k-registers.

Add diffs for more v32i1 shuffle test cases. I haven't committed the current versions of the tests to the repo yet, but you can see the before and after here.

Looks like we still have a regression when we have a shuffle between a v32i8 icmp and a v32i8 select. We could add a DAG combine to detect sign extends of i1 shuffles, or i1 shuffles of truncates, and adjust the type to match instead of somewhat arbitrarily picking a type to extend to during lowering.
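
For example, a combine along these lines (a hypothetical sketch covering only the sext case, with an assumed helper name):

  // Hypothetical sketch: fold (sext (shuffle X, Y, M)) with vXi1 operands
  // into (shuffle (sext X), (sext Y), M), so the shuffle happens at the type
  // the surrounding code already uses instead of one picked during lowering.
  static SDValue combineSExtOfBoolShuffle(SDNode *N, SelectionDAG &DAG) {
    SDValue Src = N->getOperand(0);
    EVT VT = N->getValueType(0);
    if (Src.getOpcode() != ISD::VECTOR_SHUFFLE ||
        Src.getValueType().getVectorElementType() != MVT::i1)
      return SDValue();
    auto *SVN = cast<ShuffleVectorSDNode>(Src.getNode());
    SDLoc DL(N);
    SDValue Ext0 = DAG.getNode(ISD::SIGN_EXTEND, DL, VT, Src.getOperand(0));
    SDValue Ext1 = DAG.getNode(ISD::SIGN_EXTEND, DL, VT, Src.getOperand(1));
    return DAG.getVectorShuffle(VT, DL, Ext0, Ext1, SVN->getMask());
  }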

RKSimon added inline comments. Jan 19 2018, 7:28 AM
test/CodeGen/X86/avx512-insert-extract.ll
1068

Is the andb necessary? Are we missing some known bits handling here?

craig.topper added inline comments. Jan 19 2018, 9:39 AM
test/CodeGen/X86/avx512-insert-extract.ll
1068

Yeah, it's unnecessary based on how it gets isel'd. But I'm not sure that, with the way the DAG is written, we can guarantee it.

t0: ch = EntryToken
                t2: v64i8,ch = CopyFromReg t0, Register:v64i8 %0
                t4: v64i8,ch = CopyFromReg t0, Register:v64i8 %1
              t24: v64i1 = X86ISD::CMPMU t2, t4, Constant:i8<6>
            t28: v64i1 = X86ISD::KSHIFTR t24, Constant:i8<63>
          t30: i32 = extract_vector_elt t28, Constant:i64<0>
        t26: i8 = truncate t30
      t22: i8 = and t26, Constant:i8<1>
    t19: i8 = sub Constant:i8<4>, t22
  t13: i32 = zero_extend t19
t16: ch,glue = CopyToReg t0, Register:i32 %eax, t13
t17: ch = X86ISD::RET_FLAG t16, TargetConstant:i32<0>, Register:i32 %eax, t16:1

I don't think we can make any assumptions about the upper bits of the input to the extract_vector_elt being passed through.

We could maybe pattern match this out during isel?

Or we could legalize to a DAG that explicitly contains a KMOVW node instead of an extract_vector_elt. Then we could probably emit an AssertZExt.

RKSimon added inline comments. Jan 19 2018, 10:07 AM
test/CodeGen/X86/avx512-insert-extract.ll
1068

Couldn't we handle this by adding X86ISD::KSHIFTR handling to X86TargetLowering::computeKnownBitsForTargetNode (with suitable DemandedElts twiddling)?
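
Something along these lines, perhaps (a hypothetical sketch of such a case; as the reply below notes, vector KnownBits may not be able to express what is needed here):

  case X86ISD::KSHIFTR: {
    // Hypothetical sketch: a logical right shift by Amt fills the top Amt
    // mask elements with zeros, so if every demanded element is one of the
    // shifted-in zeros the result is known to be zero.
    unsigned Amt = cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue();
    unsigned NumElts = Op.getValueType().getVectorNumElements();
    APInt ZeroElts = APInt::getHighBitsSet(NumElts, std::min(Amt, NumElts));
    if (DemandedElts.isSubsetOf(ZeroElts))
      Known.Zero.setAllBits();
    break;
  }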

craig.topper added inline comments. Jan 19 2018, 10:15 AM
test/CodeGen/X86/avx512-insert-extract.ll
1068

But computeKnownBits for extract_vector_elt won't pass the info through. It doesn't know that the bits from the upper elements are going to make it through to the result (they don't for any other vector type). It would just take the known-bits info about element 0 and any-extend that to i32.

Rebase to change vmovdqa32 to vmovdqa64 in some tests.

RKSimon added inline comments. Jan 20 2018, 9:49 AM
test/CodeGen/X86/avx512-insert-extract.ll
1068

OK - sorry, I was thinking that the extract_vector_elt disappeared when it became a KSHIFT+KMOVW.

Is this patch good enough to go in as is? Then we can file bugs and work on any problems afterwards.

test/CodeGen/X86/avx512-insert-extract.ll
1068

Oh, sorry. If we use KMOVW explicitly in the DAG then, yes, the extract_vector_elt goes away. But it still looks sort of like a vector-to-scalar bitcast.

The KMOVW will demand all elements from the KSHIFT. And the KSHIFT won't be able to say that some elements are 0 while the lowest element is unknown, will it?

Rebase after prefer vector width.

RKSimon accepted this revision. Jan 23 2018, 4:02 AM

LGTM

test/CodeGen/X86/avx512-insert-extract.ll
1068

OK - it sounds like we need some better computeKnownBits / demanded bits support for target nodes - that was what D38832 was trying to do, yes?

This revision is now accepted and ready to land. Jan 23 2018, 4:02 AM
craig.topper added inline comments. Jan 23 2018, 6:07 AM
test/CodeGen/X86/avx512-insert-extract.ll
1068

I don't know if that helps here. For vectors, computeKnownBits reports the bits known in common across all elements. So for the KSHIFT it would only return a KnownBits object with width 1. That won't be able to convey anything about the upper elements separately from the lower element.

craig.topper added inline comments. Jan 23 2018, 6:26 AM
test/CodeGen/X86/avx512-insert-extract.ll
1068

Maybe it makes more sense to do the KMOV and then a GPR shift, assuming we have a wide enough KMOV available? KSHIFT has a 3 cycle latency on SKX while a GPR shift is only 1 cycle.

Known bits would handle that well.

This revision was automatically updated to reflect the committed changes.
test/CodeGen/X86/avx512-vec-cmp.ll