This is an archive of the discontinued LLVM Phabricator instance.

[x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507)
ClosedPublic

Authored by spatel on May 5 2015, 2:01 PM.

Details

Summary

This patch attempts to finish the job that was abandoned in D6958 following the refactoring in http://reviews.llvm.org/rL230221:

  1. Uncomment the intrinsic def for the AVX r_Int instruction.
  2. Add missing r_Int entries to the load folding tables; there are already tests that check these in "test/CodeGen/X86/fold-load-unops.ll", so I haven't added any more in this patch.
  3. Add patterns to solve PR21507 ( https://llvm.org/bugs/show_bug.cgi?id=21507 ).

So instead of this:

movaps	%xmm0, %xmm1
rcpss	%xmm1, %xmm1
movss	%xmm1, %xmm0

We should now get:

rcpss	%xmm0, %xmm0

And instead of this:

vsqrtss	%xmm0, %xmm0, %xmm1
vblendps	$1, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm1[0],xmm0[1,2,3]

We should now get:

vsqrtss	%xmm0, %xmm0, %xmm0
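For context, here is a minimal C sketch (hypothetical, not part of the patch) of the kind of source that exercises these patterns; `_mm_sqrt_ss` computes the square root of lane 0 and passes the upper lanes through, which is exactly the merge semantics the extra move/blend previously implemented:

```c
#include <xmmintrin.h>

/* Square root of lane 0 only; lanes 1-3 of x pass through unchanged.
   Before this patch, codegen emitted the sqrt into a temp register
   plus a move/blend to merge lanes; afterward it is a single sqrtss. */
__m128 sqrt_lane0(__m128 x) {
    return _mm_sqrt_ss(x);
}
```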

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 24978.May 5 2015, 2:01 PM
spatel retitled this revision from to [x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: delena, mkuper, RKSimon.
spatel added a subscriber: Unknown Object (MLST).
RKSimon edited edge metadata.May 6 2015, 2:48 AM

Thanks Sanjay, in D9095 I added some tests to sse-scalar-fp-arith.ll for llvm.sqrt.f32 / llvm.sqrt.f64 tests - these don't appear to be optimized by this patch - is this something that could be easily added? Feel free to transfer them to sse-scalar-fp-arith-unary.ll.

lib/Target/X86/X86InstrInfo.cpp
529 ↗(On Diff #24978)

excess whitespace before the comma

spatel added a comment.May 6 2015, 8:58 AM

> Thanks Sanjay, in D9095 I added some tests to sse-scalar-fp-arith.ll for llvm.sqrt.f32 / llvm.sqrt.f64 tests - these don't appear to be optimized by this patch - is this something that could be easily added? Feel free to transfer them to sse-scalar-fp-arith-unary.ll.

Ah, the sqrt IR intrinsics as opposed to the sqrt SSE intrinsics. I didn't consider them specifically, but I did look at patterns that would match the X86frcp and X86frsqrt SDNodes + blend/move. It seemed to me that the conditions needed to produce that pattern were so far-fetched (-ffast-math + reciprocals enabled + NR refinement turned off + a scalar op in the middle of vector code?) that it wasn't worth the effort.

I'm probably not being imaginative enough, but here's my thinking: in order to get a scalar sqrt *IR* intrinsic operating on vector operands from C source, a coder would have to be explicitly using SSE intrinsics and then throw a libm sqrt() call into the mix. If the coder used _mm_sqrt_ss(), we wouldn't see a sqrt IR intrinsic. If the code was auto-vectorized, we also wouldn't see a scalar sqrt intrinsic; it would be a vector intrinsic, and then we wouldn't see this insert/extract pattern where we're just operating on the scalar lane.

That said, if this is a common enough occurrence, then what I'd hope to do is just add more defm lines instead of duplicating the multiclass of patterns to match SDNodes rather than Intrinsics, e.g.:

defm : scalar_unary_math_patterns<fsqrt, "SQRTSD", X86Movsd, v2f64, UseSSE2>;

...but I'm not sure how to do that in tablegen. Any suggestions?

spatel added a comment.May 6 2015, 9:17 AM

> That said, if this is a common enough occurrence, then what I'd hope to do is just add more defm lines instead of duplicating the multiclass of patterns to match SDNodes rather than Intrinsics, e.g.:
>
> defm : scalar_unary_math_patterns<fsqrt, "SQRTSD", X86Movsd, v2f64, UseSSE2>;
>
> ...but I'm not sure how to do that in tablegen. Any suggestions?

I think I have a hack-around: just make the param a string and then cast it. But after a little more thought, it's not enough. If we do want to optimize the scalar op case, we need a whole set of different patterns to match (as we do for the binops earlier in this file). I.e., we're looking for something like this:

    0x7fcbf987f8c0: f64 = extract_vector_elt 0x7fcbf987f530, 0x7fcbf987f790 [ORD=2]
  0x7fcbf987f9f0: f64 = fsqrt 0x7fcbf987f8c0 [ORD=3]
0x7fcbf987fb20: v2f64 = insert_vector_elt 0x7fcbf987f530, 0x7fcbf987f9f0, 0x7fcbf987f790 [ORD=4]

spatel updated this revision to Diff 25064.May 6 2015, 10:25 AM
spatel edited edge metadata.

Patch updated:
Simon and I discussed supporting the cases in sse-scalar-fp-arith.ll and how that IR can occur via generic vector code that relies on compiler builtins. We'll need more patterns to handle those; I don't see a quick fix.

  1. Added a TODO comment regarding additional patterns.
  2. Fixed extra space in memory folding table entry.
  3. Changed test file to specify a triple rather than a cpu.
  4. Modified check lines in the test file for generic x86-64 matches.

RKSimon accepted this revision.May 6 2015, 1:44 PM
RKSimon edited edge metadata.

LGTM

This revision is now accepted and ready to land.May 6 2015, 1:44 PM
andreadb added inline comments.
lib/Target/X86/X86InstrSSE.td
3417–3420 ↗(On Diff #25064)

Hi Sanjay,

do we still need this pattern? I may be wrong, but your new pattern (at line 3413) should make this one dead.

spatel added inline comments.May 7 2015, 8:42 AM
lib/Target/X86/X86InstrSSE.td
3417–3420 ↗(On Diff #25064)

Thanks, Andrea.

Yes, I'm not seeing how this pattern will do anything now, and there's no difference on any regression tests. I'll remove it.

This revision was automatically updated to reflect the committed changes.