This is an archive of the discontinued LLVM Phabricator instance.

[x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507)
Closed, Public

Authored by spatel on May 5 2015, 2:01 PM.

Details

Summary

This patch attempts to finish the job that was abandoned in D6958 following the refactoring in http://reviews.llvm.org/rL230221:

  1. Uncomment the intrinsic def for the AVX r_Int instruction.
  2. Add missing r_Int entries to the load folding tables; there are already tests that check these in "test/CodeGen/X86/fold-load-unops.ll", so I haven't added any more in this patch.
  3. Add patterns to solve PR21507 (https://llvm.org/bugs/show_bug.cgi?id=21507); see the IR sketch below.
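
To make that concrete, here is a sketch of the kind of IR these patterns target (a hypothetical example; the intrinsic signature is the standard one for llvm.x86.sse.rcp.ss, and the shufflevector is the redundant lane-0 merge that previously forced the extra moves):

define <4 x float> @rcp(<4 x float> %x) {
  ; rcp.ss computes the reciprocal of lane 0 and passes lanes 1-3 through
  %y = call <4 x float> @llvm.x86.sse.rcp.ss(<4 x float> %x)
  ; redundant lane-0 merge: lane 0 from %y, lanes 1-3 from %x
  %r = shufflevector <4 x float> %y, <4 x float> %x, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
  ret <4 x float> %r
}
declare <4 x float> @llvm.x86.sse.rcp.ss(<4 x float>)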

So instead of this:

movaps	%xmm0, %xmm1
rcpss	%xmm1, %xmm1
movss	%xmm1, %xmm0

We should now get:

rcpss	%xmm0, %xmm0

And instead of this:

vsqrtss	%xmm0, %xmm0, %xmm1
vblendps	$1, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm1[0],xmm0[1,2,3]

We should now get:

vsqrtss	%xmm0, %xmm0, %xmm0

Diff Detail

Event Timeline

spatel updated this revision to Diff 24978. May 5 2015, 2:01 PM
spatel retitled this revision to [x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507).
spatel updated this object.
spatel edited the test plan for this revision.
spatel added reviewers: delena, mkuper, RKSimon.
RKSimon edited edge metadata. May 6 2015, 2:48 AM

Thanks Sanjay, in D9095 I added some llvm.sqrt.f32 / llvm.sqrt.f64 tests to sse-scalar-fp-arith.ll - these don't appear to be optimized by this patch - is this something that could be easily added? Feel free to transfer them to sse-scalar-fp-arith-unary.ll.
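
For reference, such a test boils down to applying the scalar IR intrinsic to lane 0 of a vector, roughly like this (a sketch; the actual tests are in D9095 and may differ in detail):

define <2 x double> @sqrt_f64_lane0(<2 x double> %x) {
  ; take lane 0, sqrt it as a scalar, and put the result back into lane 0
  %e = extractelement <2 x double> %x, i32 0
  %s = call double @llvm.sqrt.f64(double %e)
  %r = insertelement <2 x double> %x, double %s, i32 0
  ret <2 x double> %r
}
declare double @llvm.sqrt.f64(double)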

lib/Target/X86/X86InstrInfo.cpp
529

excess whitespace before the comma

spatel added a comment. May 6 2015, 8:58 AM

Thanks Sanjay, in D9095 I added some llvm.sqrt.f32 / llvm.sqrt.f64 tests to sse-scalar-fp-arith.ll - these don't appear to be optimized by this patch - is this something that could be easily added? Feel free to transfer them to sse-scalar-fp-arith-unary.ll.

Ah, the sqrt IR intrinsics as opposed to the sqrt SSE intrinsics. I didn't consider them specifically, but I did look at patterns that would match the X86frcp and X86frsqrt SDNodes + blend/move. It seemed to me that the conditions needed to produce that pattern were so far-fetched (-ffast-math + reciprocals enabled + NR refinement turned off + a scalar op in the middle of vector code?) that it wasn't worth the effort.

I'm probably not being imaginative enough, but here's my thinking: to get a scalar sqrt *IR* intrinsic operating on vector operands from C source, a coder would have to be explicitly using SSE intrinsics and then throw a libm sqrt() call into the mix. If the coder used _mm_sqrt_ss(), we wouldn't see a sqrt IR intrinsic. If the code was auto-vectorized, we also wouldn't see a scalar sqrt intrinsic; it would be a vector intrinsic, and we wouldn't see this insert/extract pattern where we're just operating on the scalar lane.
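
To illustrate the contrast (a sketch, assuming Clang's usual lowering of the builtin):

; _mm_sqrt_ss(x): the SSE intrinsic, which this patch's patterns match
define <4 x float> @via_builtin(<4 x float> %x) {
  %r = call <4 x float> @llvm.x86.sse.sqrt.ss(<4 x float> %x)
  ret <4 x float> %r
}

; sqrtf() on lane 0: the generic IR intrinsic plus a lane merge, which they don't
define <4 x float> @via_libm(<4 x float> %x) {
  %e = extractelement <4 x float> %x, i32 0
  %s = call float @llvm.sqrt.f32(float %e)
  %r = insertelement <4 x float> %x, float %s, i32 0
  ret <4 x float> %r
}

declare <4 x float> @llvm.x86.sse.sqrt.ss(<4 x float>)
declare float @llvm.sqrt.f32(float)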

That said, if this is a common enough occurrence, then what I'd hope to do is just add more defm lines, instead of duplicating the multiclass of patterns, to match SDNodes rather than Intrinsics, e.g.:

defm : scalar_unary_math_patterns<fsqrt, "SQRTSD", X86Movsd, v2f64, UseSSE2>;

...but I'm not sure how to do that in TableGen. Any suggestions?

spatel added a comment. May 6 2015, 9:17 AM

That said, if this is a common enough occurrence, then what I'd hope to do is just add more defm lines, instead of duplicating the multiclass of patterns, to match SDNodes rather than Intrinsics, e.g.:

defm : scalar_unary_math_patterns<fsqrt, "SQRTSD", X86Movsd, v2f64, UseSSE2>;

...but I'm not sure how to do that in TableGen. Any suggestions?

I think I have a hack-around: just make the param a string and then cast it. But after a little more thought, that's not enough: if we do want to optimize the scalar op case, we need a whole set of different patterns to match (as we do for the binops earlier in this file). I.e., we're looking for something like this:

    0x7fcbf987f8c0: f64 = extract_vector_elt 0x7fcbf987f530, 0x7fcbf987f790 [ORD=2]
  0x7fcbf987f9f0: f64 = fsqrt 0x7fcbf987f8c0 [ORD=3]
0x7fcbf987fb20: v2f64 = insert_vector_elt 0x7fcbf987f530, 0x7fcbf987f9f0, 0x7fcbf987f790 [ORD=4]
spatel updated this revision to Diff 25064. May 6 2015, 10:25 AM
spatel edited edge metadata.

Patch updated:
Simon and I discussed supporting the cases in sse-scalar-fp-arith.ll and how that IR can occur via generic vector code that relies on compiler builtins. We'll need more patterns to handle those; I don't see a quick fix.

  1. Added a TODO comment regarding additional patterns.
  2. Fixed extra space in memory folding table entry.
  3. Changed test file to specify a triple rather than a cpu (see the RUN line sketch after this list).
  4. Modified check lines in the test file for generic x86-64 matches.
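
For item 3, the change amounts to a RUN line along these lines (a sketch; the exact triple in the committed test may differ):

; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s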
RKSimon accepted this revision. May 6 2015, 1:44 PM
RKSimon edited edge metadata.

LGTM

This revision is now accepted and ready to land. May 6 2015, 1:44 PM
andreadb added inline comments.
lib/Target/X86/X86InstrSSE.td
3417–3420

Hi Sanjay,

do we still need this pattern? I may be wrong, but your new pattern (at line 3413) should make this one dead.

spatel added inline comments. May 7 2015, 8:42 AM
lib/Target/X86/X86InstrSSE.td
3417–3420

Thanks, Andrea.

Yes, I'm not seeing how this pattern will do anything now, and there's no difference in any regression tests. I'll remove it.

This revision was automatically updated to reflect the committed changes.