This is an archive of the discontinued LLVM Phabricator instance.

Add intrinsic definitions for unary op AVX instructions [x86]
AbandonedPublic

Authored by spatel on Jan 13 2015, 12:57 PM.

Details

Summary

This is a 'no functional change' patch to ease writing patterns to fix PR21507 ( http://llvm.org/bugs/show_bug.cgi?id=21507 ).

Adding these definitions makes even the existing patterns match other instructions, so I assume it was just an oversight that they weren't included in the first place.

Diff Detail

Event Timeline

spatel updated this revision to Diff 18106.Jan 13 2015, 12:57 PM
spatel retitled this revision from to Add intrinsic definitions for unary op AVX instructions [x86].
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: delena, nadav, craig.topper, RKSimon.
spatel added a subscriber: Unknown Object (MLST).
craig.topper edited edge metadata.Jan 13 2015, 3:39 PM

I believe they used to exist and were removed to reduce the number of
instructions. Instructions aren't completely free. They increase the static
size of the asm writer and intruction info tables. And they have to be
replicated in the folding tables in X86InstrInfo.td (which this patch
neglects to do).

Hi Craig -

Thank you for looking at this. I was hoping to solve PR21507 using patterns like this:

def : Pat<(v4f32 (X86Blendi (v4f32 VR128:$dst), (int_x86_sse_rcp_ss VR128:$src), (i8 1))),
          (VRCPSSr_Int VR128:$dst, VR128:$src)>;

I thought that would be similar to the existing patterns used for binops in this file.

Do you think it would better to not create these new defs and just write out the patterns using COPY_TO_REGCLASSes instead? Eg:

def : Pat<(v4f32 (X86Blendi (v4f32 VR128:$dst), (int_x86_sse_rcp_ss VR128:$src), (i8 1))),
          (COPY_TO_REGCLASS (VRCPSSr (COPY_TO_REGCLASS VR128:$dst, FR32), (COPY_TO_REGCLASS VR128:$src, FR32)), VR128)>;
RKSimon edited edge metadata.Jan 22 2015, 10:33 AM

Hi Sanjay, a minor query about VSQRTSDr_Int

lib/Target/X86/X86InstrSSE.td
3810

Since you've added it above shouldn't this be VSQRTSDr_Int ?

spatel added inline comments.Jan 22 2015, 10:49 AM
lib/Target/X86/X86InstrSSE.td
3810

Good eye - yes, that was a bug in the uploaded patch that I had corrected locally.

Do you think it would better to not create these new defs and just write out the patterns using COPY_TO_REGCLASSes instead?

Ping.

Please can you add the new V*r_Int instructions to the relevant tables / switches in X86InstrInfo.cpp (and add tests if you're adding memory folding patterns)?

spatel updated this revision to Diff 20234.Feb 18 2015, 4:41 PM
spatel edited edge metadata.

Patch updated to include new intrinsics in the memory folding table. Also added AVX test cases for load folding of each unary op.

In the interest of patch minimalism, I'm not fixing the SSE variants of these in this patch. The loads in the new test cases are not getting folded without -mattr=avx. Presumably this is because we don't have patterns to match those and/or the load folding tables have holes. It's still not clear to me exactly when the pattern is matched vs. peephole pass load folding is needed.

I also did not add the new intrinsics to the switch in hasUndefRegUpdate(). Please correct me if I'm misunderstanding, but these additions are AVX-only and don't have any partial update problem. It seems like a bug to me that the existing AVX unop instructions are in that switch because the destination register is not partially updated - the 2nd source operand high bits are passed through, so the destination register is always fully updated.

delena edited edge metadata.Feb 19 2015, 4:09 AM

You actually can use one class to define all instruction (FR32, FR64) and intrinsics (VR128) using COPY_TO_REGCLASS. in order to convert VR128 to FR32 or FR64.
See http://reviews.llvm.org/rL229837.

You also can define V#NAME#SDm_Int, if you don't like the COPY_TO_REGCLASS patterns.
There are too many intrinsics. I don't think that we should put them all in the peephole optimization pass.

You actually can use one class to define all instruction (FR32, FR64) and intrinsics (VR128) using COPY_TO_REGCLASS. in order to convert VR128 to FR32 or FR64.
See http://reviews.llvm.org/rL229837.

Hi Elena,

Thanks for the suggestion. But I'm not sure how that applies in this case because the rcp and rsqrt instructions only exist as f32 (not f64). If I am not understanding correctly, can you please provide a code example of how the AVX-512 method would apply here?

You also can define V#NAME#SDm_Int, if you don't like the COPY_TO_REGCLASS patterns.
There are too many intrinsics. I don't think that we should put them all in the peephole optimization pass.

I agree that we have a mess. But once these intrinsics are defined, we must add them to the peephole load folding table in order to catch the cases in llvm/test/CodeGen/X86/stack-folding-fp-avx1.ll.

The end goal that I was hoping to reach was to reuse this blob of shuffle matching patterns for the unary ops:
http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/lib/Target/X86/X86InstrSSE.td;229871$3166

But now that I'm looking at it again, I don't see a way to do this cleanly. Trying to squeeze the unary ops in with the binary ops will just make everything even more unreadable.

I've checked the test case file in at r229870, so I'll abandon this patch and work on the fix for PR21507 using whatever bits are lying around.

spatel abandoned this revision.Mar 1 2015, 9:16 AM

Elena rewrote the unop patterns (thanks!) in:
http://reviews.llvm.org/rL230221

Abandoning this patch as it's now obsolete; will retry with a new patch to solve PR21507.