
[x86] vectorize more cast ops in lowering to avoid register file transfers
ClosedPublic

Authored by spatel on Feb 13 2019, 11:15 AM.

Details

Summary

This is a follow-up to D56864.

If we're extracting from a non-zero index before casting to FP, then shuffle the source element into lane 0 and optionally narrow the vector before doing the cast:

cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0

This might be enough to close PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Feb 13 2019, 11:15 AM
Herald added a project: Restricted Project.Feb 13 2019, 11:15 AM
RKSimon added inline comments.Feb 13 2019, 12:49 PM
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

Any idea why this still fails?

spatel marked an inline comment as done.Feb 13 2019, 1:39 PM
spatel added inline comments.
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

This is almost the same example as in:
https://bugs.llvm.org/show_bug.cgi?id=39975

On x86-64 only (because the 64-bit shift isn't legal on i686), we scalarize the shift. That leaves the shift sitting between the extract and the cast, so the pattern doesn't match.

RKSimon added inline comments.Feb 14 2019, 4:17 AM
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

Hmm - is it worth us investigating a trunc(lshr(extract(v2i64 x, i), 32)) -> trunc(extract(v2i64 x, i+1)) combine? (and variants)

spatel marked an inline comment as done.Feb 14 2019, 6:02 AM
spatel added inline comments.
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

Yeah, I thought we already had that transform, but that case isn't matched.

Note that this example is an ashr, not lshr, and the example in PR39975 is probably tougher because it's a shift-by-33. We might want to carve out a general exception to the scalarization transform for these kinds of cases.

RKSimon accepted this revision.Feb 20 2019, 12:18 PM

LGTM, cheers

This revision is now accepted and ready to land.Feb 20 2019, 12:18 PM
This revision was automatically updated to reflect the committed changes.
efriedma added inline comments.
llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
5874

Why not "vcvtudq2pd %xmm0, %xmm0"?

spatel marked an inline comment as done.Feb 22 2019, 6:36 AM
spatel added inline comments.
llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
5874

We're only matching the generic UINT_TO_FP node, so we go from <4 x i32> to <4 x double>. That's also why the SSE targets don't narrow the similar SINT_TO_FP test above. I can look into how the SINT_TO_FP example gets narrowed and try to make that happen here too.

There's no documentation that the generic nodes can change the number of elements in the vector, so I'm assuming they don't have that ability. Currently, we use the X86ISD::CVTSI2P for those patterns, so I think we need to extend the matching logic to handle that case to solve this more completely.

spatel marked an inline comment as done.Feb 22 2019, 7:55 AM
spatel added inline comments.
llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
5874