This is an archive of the discontinued LLVM Phabricator instance.

[x86] Fix wrong prototypes for AVX mask load/store intrinsics.
ClosedPublic

Authored by andreadb on Oct 15 2015, 9:02 AM.

Details

Summary

The LLVM types used for the mask operand of the AVX maskload/maskstore intrinsics are incorrect.

In particular:

  1. The mask argument for builtin_ia32_maskloadpd and builtin_ia32_maskstorepd should be of type llvm_v2i64_ty and not llvm_v2f64_ty.
  2. The mask argument for builtin_ia32_maskloadpd256 and builtin_ia32_maskstorepd256 should be of type llvm_v4i64_ty and not llvm_v4f64_ty.
  3. The mask argument for builtin_ia32_maskloadps and builtin_ia32_maskstoreps should be of type llvm_v4i32_ty and not llvm_v4f32_ty.
  4. The mask argument for builtin_ia32_maskloadps256 and builtin_ia32_maskstoreps256 should be of type llvm_v8i32_ty and not llvm_v8f32_ty.

Basically, the mask type for maskload/maskstore GCC builtins is never a vector of packed floats/doubles.

I also noticed that the Clang definitions for those builtins in BuiltinsX86.def are incorrect, and that the definitions of the maskload/maskstore intrinsics in the Clang header file avxintrin.h wrongly use packed floats/doubles instead of packed ints/longs for the mask operands.

For example, _mm_maskstore_pd is currently defined in avxintrin.h as:

static __inline void __DEFAULT_FN_ATTRS
_mm_maskstore_pd(double *__p, __m128d __m, __m128d __a)
{
  __builtin_ia32_maskstorepd((__v2df *)__p, (__v2df)__m, (__v2df)__a);
}

According to the Intel documentation, the correct prototype for _mm_maskstore_pd should be:
void _mm_maskstore_pd(double *p, __m128i m, __m128d a).

So, I think the definition should be something like:

static __inline void __DEFAULT_FN_ATTRS
_mm_maskstore_pd(double *__p, __m128i __m, __m128d __a)
{
  __builtin_ia32_maskstorepd((__v2df *)__p, (__v2di)__m, (__v2df)__a);
}
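
To illustrate the corrected prototype, here is a small usage sketch (the helper function and the chosen mask value are purely illustrative, not part of the patch). With an __m128i mask, the caller can build the mask with plain integer operations:

#include <immintrin.h>

/* Store only the low double of __v to __dst; the high lane in memory is
   left untouched. */
static void store_low_lane(double *__dst, __m128d __v)
{
  /* Element 0 is all-ones (MSB set -> stored), element 1 is zero (skipped).
     Note that _mm_set_epi64x takes the high element first. */
  __m128i __mask = _mm_set_epi64x(0, -1);
  _mm_maskstore_pd(__dst, __mask, __v);
}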

If you agree with this patch, I plan to send a follow-on patch (this time a Clang patch) that also fixes the intrinsic header file avxintrin.h (and the prototype definitions for the x86 maskload/maskstore builtins in BuiltinsX86.def).

Please let me know if it is okay to submit.

-Andrea

Diff Detail

Event Timeline

andreadb updated this revision to Diff 37488. Oct 15 2015, 9:02 AM
andreadb retitled this revision from to [x86] Fix wrong prototypes for AVX mask load/store intrinsics..
andreadb updated this object.
andreadb added reviewers: qcolombet, rnk, nadav, bruno, mkuper.
andreadb updated this object.
andreadb added a subscriber: llvm-commits.
bruno edited edge metadata. Oct 15 2015, 10:07 AM

Hi Andrea,

I don't recall offhand, but my guess here is that the mask is a vector float type because VMASKMOV* uses two FP ports. In practice I'm not sure what the side effects of changing this are, but it might be something worth considering.

Hi Bruno,

You are right, the VMASKMOV* instructions are definitely in the floating-point domain. Depending on the subtarget, and on whether the VMASKMOV is a load or a store, it may use one or more ports.

That said, in practice the only side effect of changing those intrinsics is that we end up with an extra bitcast when the input mask is a vector of packed floats. However, that bitcast is just a bitconvert between types of the same register class, so it is a no-op and no extra instructions are generated (tested on small examples using both constant and non-constant mask values).

Since VMASKMOV is in the floating-point domain, AMD chips could suffer a potential stall if the mask value originates in the integer domain: a domain crossing caused by data movement (for example, VInt -> VFP) incurs a 1-cycle penalty on AMD chips.
However, the domain-crossing issue is not introduced by this change; it already affected the previous intrinsic definition (i.e. the source of this "problem" has nothing to do with this change).
The backend already knows how to mitigate this problem by running the "execution dependency fix" pass.

Example:

define void @foo(<4 x float>* %dst, <4 x float> %InVec, <4 x i32> %Mask1, <4 x i32> %Mask2) {
  %xor = xor <4 x i32> %Mask1, %Mask2
  %0 = bitcast <4 x float>* %dst to i8*
  tail call void @llvm.x86.avx.maskstore.ps(i8* %0, <4 x i32> %xor, <4 x float> %InVec)
  ret void
}

declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>)

In this example, ISel would select a VPXORrr for the 'xor' operation.
Before code emission, we end up with the sequence:

%XMM1<def> = VPXORrr %XMM1<kill>, %XMM2<kill>
VMASKMOVPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM1<kill>, %XMM0<kill>

After 'exedep-fix' we have:

%XMM1<def> = VXORPSrr %XMM1<kill>, %XMM2<kill>
VMASKMOVPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM1<kill>, %XMM0<kill>

As a developer, it is more natural to think of the mask as a vector of integers rather than floats. For example, we may want to manipulate the mask with logical operators before passing it to (V)MASKMOV* instructions. Having to pass it as a float simply forces an explicit (no-op) cast.
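
As a concrete (hypothetical) illustration: with the current float-typed mask parameter, a mask produced by integer logic has to be cast before the call, whereas with the corrected __m128i prototype it can be passed directly. The helper below assumes the corrected prototype:

#include <immintrin.h>

/* Combine two integer masks and use the result to guard a masked store. */
static void combined_masked_store(double *__dst, __m128d __v,
                                  __m128i __m1, __m128i __m2)
{
  __m128i __mask = _mm_and_si128(__m1, __m2);
  /* With the current avxintrin.h prototype this call would need
     _mm_castsi128_pd(__mask); with an __m128i mask parameter the integer
     mask is passed as-is. */
  _mm_maskstore_pd(__dst, __mask, __v);
}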

That said, I don't know how much we care about being consistent with GCC. GCC defines those builtins differently from us (i.e. always with vector int/long mask values). This "problem" was found internally when testing the codegen of intrinsic calls; we spotted the discrepancy, and that is the main reason why I uploaded this patch.
At the very least, I suggest that we fix our intrinsic definitions in avxintrin.h, just to be consistent with what the Intel documentation says.
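
For reference, here is a sketch of what the corresponding avxintrin.h fix could look like for one of the load intrinsics (based on the Intel prototype for _mm_maskload_pd; the exact form of the follow-on patch may differ):

static __inline __m128d __DEFAULT_FN_ATTRS
_mm_maskload_pd(double const *__p, __m128i __m)
{
  return (__m128d)__builtin_ia32_maskloadpd((const __v2df *)__p, (__v2di)__m);
}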

What do you think?

mkuper edited edge metadata.

I don't really have a strong opinion on this.

I definitely agree that we should fix the definitions in avxintrin.h, though, regardless of the way the internals work.

Thanks for the detailed explanation, LGTM

bruno accepted this revision. Oct 19 2015, 6:17 AM
bruno edited edge metadata.

LGTM

This revision is now accepted and ready to land. Oct 19 2015, 6:17 AM
This revision was automatically updated to reflect the committed changes.