This is an archive of the discontinued LLVM Phabricator instance.

Adding missing intrinsics _cvtsh_ss and _cvtss_sh
ClosedPublic

Authored by kromanova on Jan 13 2016, 9:09 PM.

Details

Summary

I have added 2 missing intrinsics _cvtss_sh and _mm_cvtps_ph to the intrinsics header f16intrin.h.

GCC has these intrinsics in f16cintrin.h. Here is the definition:

extern __inline float __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_cvtsh_ss (unsigned short __S)
{
  __v8hi __H = __extension__ (__v8hi){ __S, 0, 0, 0, 0, 0, 0, 0 };
  __v4sf __A = __builtin_ia32_vcvtph2ps (__H);
  return __builtin_ia32_vec_ext_v4sf (__A, 0);
`}
#ifdef __OPTIMIZE__
extern __inline unsigned short __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_cvtss_sh (float __F, const int __I)
{
  __v4sf __A =  __extension__ (__v4sf){ __F, 0, 0, 0 };
  __v8hi __H = __builtin_ia32_vcvtps2ph (__A, __I);
  return (unsigned short) __builtin_ia32_vec_ext_v8hi (__H, 0);
`}
#else
#define _cvtss_sh(__F, __I)                                             \
  (__extension__                                                        \
   ({                                                                   \
      __v4sf __A =  __extension__ (__v4sf){ __F, 0, 0, 0 };             \
      __v8hi __H = __builtin_ia32_vcvtps2ph (__A, __I);                 \
      (unsigned short) __builtin_ia32_vec_ext_v8hi (__H, 0);            \
    }))
#endif /* __OPTIMIZE */

Intel's documentation expects _cvtsh_ss to have 2 parameters (instead of one)
https://software.intel.com/en-us/node/524287
but most likely the documentation is wrong, because Intel’s' headers contain these intrinsic prototypes in emmintrin.h

extern float          __ICL_INTRINCC _cvtsh_ss(unsigned short);
extern unsigned short __ICL_INTRINCC _cvtss_sh(float, int);

BTW, emmintrin.h includes f16cintrin.h, so it should be OK to place these 2 intrinsics in f16cintrin.h. This should satisfy both Intel's and GCC's expectations.

Clang generates the following IR for cvtsh_ss (see below). In the test I simply checked that the builtin @llvm.x86.vcvtph2ps.128 is generated for _cvtsh_ss. I was afraid that checks for initialization of the vector 'v' might be too lengthy and the IR is prone to change frequently, so I didn't add these checks. However, if you think that this is important and it won't create too much headache because IR will keep changing over time, I could certainly add them.
The same goes for test_cvtss_sh.

static __inline float __DEFAULT_FN_ATTRS
cvtsh_ss(unsigned short a)
{
  __v8hi v = {(short)a, 0, 0, 0, 0, 0, 0, 0};
  __v4sf r = __builtin_ia32_vcvtph2ps(v);
  return r[0];
}
define float @test_cvtsh_ss(i16 zeroext %a) #0 {
entry:
  %a.addr.i = alloca i16, align 2
  %v.i = alloca <8 x i16>, align 16
  %r.i = alloca <4 x float>, align 16
  %a.addr = alloca i16, align 2
  store i16 %a, i16* %a.addr, align 2
  %0 = load i16, i16* %a.addr, align 2
  store i16 %0, i16* %a.addr.i, align 2
  %1 = load i16, i16* %a.addr.i, align 2
  %vecinit.i = insertelement <8 x i16> undef, i16 %1, i32 0
  %vecinit1.i = insertelement <8 x i16> %vecinit.i, i16 0, i32 1
  %vecinit2.i = insertelement <8 x i16> %vecinit1.i, i16 0, i32 2
  %vecinit3.i = insertelement <8 x i16> %vecinit2.i, i16 0, i32 3
  %vecinit4.i = insertelement <8 x i16> %vecinit3.i, i16 0, i32 4
  %vecinit5.i = insertelement <8 x i16> %vecinit4.i, i16 0, i32 5
  %vecinit6.i = insertelement <8 x i16> %vecinit5.i, i16 0, i32 6
  %vecinit7.i = insertelement <8 x i16> %vecinit6.i, i16 0, i32 7
  store <8 x i16> %vecinit7.i, <8 x i16>* %v.i, align 16
  %2 = load <8 x i16>, <8 x i16>* %v.i, align 16
  %3 = call <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16> %2) #2
  store <4 x float> %3, <4 x float>* %r.i, align 16
  %4 = load <4 x float>, <4 x float>* %r.i, align 16
  %vecext.i = extractelement <4 x float> %4, i32 0
  ret float %vecext.i
}

Diff Detail

Repository
rL LLVM

Event Timeline

kromanova updated this revision to Diff 44832.Jan 13 2016, 9:09 PM
kromanova retitled this revision from to Adding missing intrinsics _cvtsh_ss and _cvtss_sh.
kromanova updated this object.
kromanova set the repository for this revision to rL LLVM.

Adding cfe-commits as a subscriber.

kromanova updated this object.Jan 13 2016, 9:18 PM
kromanova updated this object.Jan 13 2016, 9:22 PM
kromanova updated this object.Jan 13 2016, 9:26 PM
craig.topper added inline comments.Jan 19 2016, 10:03 PM
lib/Headers/f16cintrin.h
47

Would it be possible to do this without temporaries? Temporaries in macros can cause -Wshadow warnings if the macro is used multiple times.

kromanova updated this revision to Diff 45458.Jan 20 2016, 3:55 PM

Craig, thank you for the review. Here are the changes that you requested.
Katya.

craig.topper added inline comments.Jan 20 2016, 9:52 PM
lib/Headers/f16cintrin.h
47

Can we do something like this to remove the last temporary?

#define _cvtss_sh(a, imm) extension ({ \

(unsigned short)((__v8hi)__builtin_ia32_vcvtps2ph((__v4sf){a, 0, 0, 0}, (imm))[0]); \

})

kromanova updated this revision to Diff 45632.Jan 21 2016, 5:55 PM
kromanova marked an inline comment as done.

Updated patch to address Craig's comments.

kromanova marked an inline comment as done.Jan 21 2016, 6:05 PM
kromanova added inline comments.
lib/Headers/f16cintrin.h
47

Hi Craig,

I should have looked how it's done just a few lines below. Sorry.
I had to slightly modify the body of the define that you proposed by adding an additional pair of round brackets, otherwise I got compilation errors like this:

 ~/ngh/ToT_commit/build/bin/clang  intr.cpp  -mf16c intr.cpp:10:7: 
error: C-style cast from scalar 'short' to vector '__v8hi'
    (vector of 8 'short' values) of different size
a = _cvtss_sh(res, imm);
    ^~~~~~~~~~~~~~~~~~~
~/ngh/ToT_commit/build/bin/../lib/clang/3.8.0/include/f16cintrin.h:43:20: note:
    expanded from macro '_cvtss_sh'
(unsigned short)((__v8hi)__builtin_ia32_vcvtps2ph((__v4sf){a, 0, 0, 0}, \
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
intr.cpp:10:5: error: assigning to 'unsigned short' from incompatible type
    'void'
a = _cvtss_sh(res, imm);
  ^ ~~~~~~~~~~~~~~~~~~~
2 errors generated.
kromanova updated this revision to Diff 45635.Jan 21 2016, 6:32 PM
kromanova marked an inline comment as done.

I further simplified the macros by removing the statement for the define that I added (_cvtss_sh) and for the one that was there before (_mm_cvtps_ph).
I also formatted __DEFAULT_FN_ATTRS macro to comply with 80 characters limitation.

Craig, do you think it's necessary to make the tests more fancy by checking how the vector is initialized before the builtin invocation and/or that one element is extracted from the vector after the builtin returned a value? It will add additional 10-15 check lines to each test. My concern is that these additional lines might change from time to time.

craig.topper accepted this revision.Jan 21 2016, 10:16 PM
craig.topper edited edge metadata.

I agree that the vector initialization code will be prone to changing. I think what you have is fine.

LGTM

This revision is now accepted and ready to land.Jan 21 2016, 10:16 PM
This revision was automatically updated to reflect the committed changes.