This is an archive of the discontinued LLVM Phabricator instance.

[X86] Replacing X86-specific floor and ceil vector intrinsics with generic LLVM intrinsics
AbandonedPublic

Authored by mike.dvoretsky on Apr 3 2018, 2:12 AM.

Details

Summary

Currently, X86 floor and ceil intrinsics for vectors are implemented as target-specific intrinsics that use the generic rounding instruction of the corresponding vector processing feature (ROUND* or VRNDSCALE*). This patch replaces those specific cases with calls to target-independent @llvm.floor.* and @llvm.ceil.* intrinsics. This doesn't affect the resulting machine code, as those intrinsics are lowered to the same instructions, but exposes these specific rounding cases to generic optimizations.

This patch also has an LLVM part, D45203. An alternative InstCombine-based implementation is proposed in D48067.

Diff Detail

Event Timeline

mike.dvoretsky created this revision.Apr 3 2018, 2:12 AM
craig.topper added inline comments.Apr 3 2018, 1:05 PM
include/clang/Basic/BuiltinsX86.def
951 ↗(On Diff #140745)

I'd prefer CGBuiltin to detect the specific immediates on the rndscale value. Primarily because we should be able to optimize _mm512_roundscale_pd when the ceil/floor immediate is used.

mike.dvoretsky edited the summary of this revision. (Show Details)

On suggestion from @craig.topper moved all lowering to CGBuiltin.cpp with no new builtins added. Instead the existing builtins are lowered if their immediate values correspond to generic ceil and floor operations. D45203 is now required to enable transformations.

What about rndscaless/rndscalesd?

clang/lib/CodeGen/CGBuiltin.cpp
8477

I'm not sure we should even try to emit a mask for the legacy scalar intrinsics. Does this get removed well by the middle or backend?

8490

Why Int32? That's not the right mask width for the legacy intrinsics.

mike.dvoretsky added inline comments.Apr 5 2018, 7:09 AM
clang/lib/CodeGen/CGBuiltin.cpp
8477

The masking is done to represent all operations handled here in a uniform way. D45203 removes it in the backend.

But it’s not really consistent because the mask is being removed early for the packed intrinsics, but late for the scalar intrinsics. Doesn’t it also introduce extra code for fast isel?

There's a similar patch for sqrt here https://reviews.llvm.org/D41168 and it uses a scalar sqrt and insert element for the scalar case. I think we need a consistent direction here.

Changed the scalar intrinsic lowering to work via extract-insert. D45203 contains tests for folding the resulting IR patterns.

I'm not sure whether we should be doing this here or in InstCombine. @spatel, what do you think?

I'm not sure whether we should be doing this here or in InstCombine. @spatel, what do you think?

It's been a while since I looked at these. Last memory I have is for the conversion from x86 masked ops to the generic LLVM intrinsics, and we did that in InstCombineCalls. I don't know if there was any sound reasoning for that though. If it makes no functional difference, I'd continue with that structure just so we don't become scattered in the transform.

mike.dvoretsky abandoned this revision.Jun 15 2018, 12:43 AM

Abandoning this due to D48067 being accepted instead.