This is an archive of the discontinued LLVM Phabricator instance.

Convert __m64 intrinsics to unconditionally use SSE2 instead of MMX instructions.
Needs ReviewPublic

Authored by jyknight on Aug 30 2020, 3:29 PM.



Preliminary patch, posted to go along with discussion on llvm-dev.

3DNow! intrinsics are not converted, as of yet.

Tests have not been updated to match new expected IR output. Currently failing:
Clang :: CodeGen/attr-target-x86-mmx.c
Clang :: CodeGen/mmx-builtins.c
Clang :: CodeGen/mmx-shift-with-immediate.c
Clang :: Headers/xmmintrin.c

Diff Detail

Event Timeline

jyknight created this revision.Aug 30 2020, 3:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 30 2020, 3:29 PM
jyknight requested review of this revision.Aug 30 2020, 3:29 PM
craig.topper added inline comments.Aug 30 2020, 10:44 PM

I think you we should use __v8qu to match what we do in emmintrin.h. We don't currently set nsw on signed vector arithmetic, but we should be careful in case that changes in the future.


I think we probably want to use a v2su or v2si here. Using v1di scalarizes and splits on 32-bit targets. On 64-bit targets it emits GPR code.


Need to use v8qs here to force "signed char" elements. v8qi uses "char" which has platform dependent signedness or can be changed with a command line.


Same here


Is this needed?


I don't think this change is needed. And I think the operands are in the wrong order.


I'm worried that using v1di with the shuffles will lead to scalarization in the type legalizer. Should we use v2si instead?


This doesn't guarantee zeroes in bits 15:8 does it?


Does this work with large pages?

jyknight marked 7 inline comments as done.Jan 6 2021, 8:30 PM

I've finally got back to moving this patch forward -- PTAL, thanks!

To start with, I wrote a simple test-suite to verify the functionality of these changes. I've included the tests I wrote under mmx-tests/ in this version of the patch -- although it doesn't actually belong there. I'm not exactly sure where it _does_ belong, however.

The test-suite runs a number of combinations of inputs against two different compilers' implementations of these intrinsics, and makes sure they produce identical results. I used this to ensure that there are no changes in behavior between old clang and clang after this change, as well as compared clang to GCC. Using that, I've fixed and verified all the bugs you noticed in codereview already, as well as additional bugs the testsuite found (in _mm_maddubs_pi16 and _mm_shuffle_pi8). I'm feeling reasonably confident, now, that this change will not change behavior of these functions. The tests also discovered two bugs in GCC,,

Some other changes in this update:

I switched _mm_extract_pi16 and _mm_insert_pi16 back to using an clang intrinsic, for consistency with the other extract/insert macros, which are using an intrinsic function simply to force the element-number to be a compile-time constant, and produce an error when it's not. But, the intrinsic now lowers to generic IR like all the other __builtin_ia32_vec_{ext,set}_*, rather than an llvm intrinsic forcing MMX. I modified the "composite" functions in xmmintrin.h to directly use 128-bit operations, instead of composites of multiple 64bit operations, where possible.

Finally, the clang tests have been updated, so that all tests pass again.


Done, here and everywhere else I was using signed math (except the comparisons).


AFAICT, this doesn't matter? It seems to emit GPR or XMM code just depending on whether the result values are needed as XMM or not, independent of whether the type is specified as v2su or v1du.




This is a short, which is always signed, so it should be ok as written.


No, reverted this change and the others like it.


Change was unnecessary, so reverted. (But operands are supposed to be backwards here.)


Converting __trunc64 to v4si (and thus v2si return value) seems to make codegen _worse_ in some cases, and I don't see any case where it gets better.

For example,

#define __trunc64_1(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
#define __trunc64_2(x) (__m64)__builtin_shufflevector((__v4si)(x), __extension__ (__v4si){}, 0, 1)
__m64 trunc1(__m128 a, int i) { return __trunc64_1(__builtin_ia32_psllqi128(a, i)); }
__m64 trunc2(__m128 a, int i) { return __trunc64_2(__builtin_ia32_psllqi128(a, i)); }

In trunc2, you get two extraneous moves at the end:

movd    %edi, %xmm1
psllq   %xmm1, %xmm0
movq    %xmm0, %rax // extra
movq    %rax, %xmm0 // extra

I guess that's related to calling-convention lowering which turns m64 into "double" confusing the various IR simplifications?

Similarly, there's also extraneous moves to/from a GPR for argument passing sometimes. But I don't see an easy way around that. Both variants do that here, instead of just movq %xmm0, %xmm0:

#define __anyext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
#define __anyext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
#define __zext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, 1)
#define __zext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)

__m128 ext1(__m64 a) { return __builtin_convertvector((__v4si)__zext128_1(a), __v4sf)); }
__m128 ext2(__m64 a) { return __builtin_convertvector((__v4si)__zext128_2(a), __v4sf)); }

Both produce:

movq    %xmm0, %rax
movq    %rax, %xmm0
cvtdq2ps        %xmm0, %xmm0

However, switching to variant 2 of anyext128 and zext128 does seem to improve things in other cases, avoiding _some_ of those sorts of extraneous moves to a scalar register and back again. So I've made that change.


It does not. Switched to zext128.


Yes -- this needs to be the boundary at which a trap _might_ occur if we crossed it. Whether it's in fact the end of of a page or not is irrelevant, only that it _could_ be.

jyknight updated this revision to Diff 315039.Jan 6 2021, 8:30 PM
jyknight marked 7 inline comments as done.

Fix and test.

Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2021, 8:30 PM
craig.topper added inline comments.Jan 6 2021, 11:58 PM

Yeah. I don't know why I wrote that now.

Ping, thanks!

Or, if you have suggestions on how to make it easier to review, I'd be open to that.

mr-c added a subscriber: mr-c.May 27 2023, 3:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2023, 3:26 AM
Herald added a subscriber: StephenFan. · View Herald Transcript

Reverse ping. Any progress or plan for this patch?