This is an archive of the discontinued LLVM Phabricator instance.

[X86] Allow unaligned stores with KMOV* intrinsics
Needs Review · Public

Authored by kalcutter on Jan 21 2023, 9:42 AM.

Details

Summary

Avoid undefined behavior when _store_mask* intrinsics are used with an unaligned memory address. The corresponding KMOVW/KMOVQ/KMOVD instructions allow unaligned stores.
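
For illustration, a minimal sketch of the affected usage pattern (the function, buffer, and offset below are hypothetical, not from the patch); without this change, the cast yields a store through a pointer that may violate the alignment of __mmask64:

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical example: write a 64-bit mask at an arbitrary byte offset of a
 * caller-supplied buffer (requires AVX512BW). KMOVQ itself has no alignment
 * requirement, but before this change storing through a __mmask64 * that is
 * not suitably aligned was undefined behavior. */
void store_mask_at(unsigned char *out, size_t offset, __mmask64 m) {
    _store_mask64((__mmask64 *)(out + offset), m);
}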

Diff Detail

Event Timeline

kalcutter created this revision. · Jan 21 2023, 9:42 AM
Herald added a project: Restricted Project. · Jan 21 2023, 9:42 AM
kalcutter requested review of this revision. · Jan 21 2023, 9:42 AM

Added tests.

Matt added a subscriber: Matt. · Jan 25 2023, 9:08 AM

Since the function takes a __mmask16 *, wouldn't the user have had to do an explicit cast to call the function with a misaligned pointer?

Yes. The user would have to do an explicit cast. This is the same as with many other X86 load/store intrinsics, for example:

__m128i _mm_loadl_epi64 (__m128i const* mem_addr)
void _mm256_storeu_pd (double * mem_addr, __m256d a)

Both of these functions work with unaligned data and require the user to do an explicit cast (which itself pedantically invokes UB if the alignment is wrong). Ideally, all intrinsics supporting unaligned addresses would take void *, and the other AVX-512 intrinsics do in fact use void * for unaligned arguments. I think these intrinsics should also have taken void *; sadly, I don't think that can be changed now. Maybe someone from Intel can chime in?
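
For illustration (the buffer below is hypothetical), this is how the existing unaligned intrinsics are already used today, cast included:

#include <immintrin.h>

/* Hypothetical example: store four doubles at a deliberately misaligned byte
 * position of a raw buffer (requires AVX). _mm256_storeu_pd performs an
 * unaligned store, but the caller still has to write the cast to double *. */
void store4(unsigned char *buf, __m256d v) {
    _mm256_storeu_pd((double *)(buf + 1), v);
}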

I think it's OK to use double * for unaligned memory, given there's no difference in the backend between alignment 1 and 4.
For _mm_loadl_epi64, the trunk code has already been using __m128i_u. This was changed by @craig.topper in https://github.com/llvm/llvm-project/commit/4390c721cba09597037578100948bbc83cc41b16

I don't see the benefit of using an unaligned type explicitly for the mask. Just to save memory?
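
For reference, the __m128i_u mentioned above is, roughly, an unaligned may_alias variant of __m128i; a paraphrased sketch of how the x86 headers define it (not copied verbatim):

/* Same size as __m128i, but with alignment 1 and may_alias, so it can be
 * used to access arbitrarily aligned memory. */
typedef long long __m128i_u
    __attribute__((__vector_size__(16), __may_alias__, __aligned__(1)));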

One use-case is bit-scattering, like:

const __m512i a = _mm512_loadu_epi64((const __m512i*)&in[i * 8]);
_store_mask64((__mmask64*)&out[offset_0 + i], _mm512_bitshuffle_epi64_mask(a, c0));
_store_mask64((__mmask64*)&out[offset_1 + i], _mm512_bitshuffle_epi64_mask(a, c1));
_store_mask64((__mmask64*)&out[offset_2 + i], _mm512_bitshuffle_epi64_mask(a, c2));
_store_mask64((__mmask64*)&out[offset_3 + i], _mm512_bitshuffle_epi64_mask(a, c3));
_store_mask64((__mmask64*)&out[offset_4 + i], _mm512_bitshuffle_epi64_mask(a, c4));
_store_mask64((__mmask64*)&out[offset_5 + i], _mm512_bitshuffle_epi64_mask(a, c5));
_store_mask64((__mmask64*)&out[offset_6 + i], _mm512_bitshuffle_epi64_mask(a, c6));
_store_mask64((__mmask64*)&out[offset_7 + i], _mm512_bitshuffle_epi64_mask(a, c7));

This doesn't show why out cannot be 64-bit aligned. I assume it is defined like long long out[N];. Using the type long long should ensure it is aligned to at least 64 bits.

in and out are user-supplied byte buffers with no alignment requirements.
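
For concreteness, a hypothetical signature for such a routine; the offsets are byte counts into a caller-provided buffer, so out + offset_N can land on any byte boundary:

#include <stddef.h>

/* Hypothetical interface: in and out are raw byte buffers with no alignment
 * guarantee beyond 1. */
void bit_scatter(const unsigned char *in, unsigned char *out, size_t n,
                 const size_t offsets[8]);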

Who is the best person to review this?

Do you know what gcc does here?

Could you memcpy the __mmask64 into the byte buffer?

I think gcc does the same thing as clang. In both cases the issue can be observed with UBSAN.

memcpy can be used as a workaround, but it is less ergonomic: both arguments are pointers, so it can only be used directly with lvalues. memcpy is also less explicit.
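
For comparison, a sketch of the two forms (function and variable names are hypothetical; requires AVX512BW and AVX512BITALG):

#include <immintrin.h>
#include <string.h>

/* Workaround: materialize the mask in a named variable (an lvalue), then
 * memcpy it into the byte buffer. */
void store_via_memcpy(unsigned char *out, __m512i a, __m512i c0) {
    __mmask64 m = _mm512_bitshuffle_epi64_mask(a, c0);
    memcpy(out, &m, sizeof(m));
}

/* With this patch, the intrinsic can be used directly on the (possibly
 * unaligned) destination, with no temporary. */
void store_via_intrinsic(unsigned char *out, __m512i a, __m512i c0) {
    _store_mask64((__mmask64 *)out, _mm512_bitshuffle_epi64_mask(a, c0));
}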

Intel's documentation of _store_mask64 does not specify any kind of alignment. Also, they document this intrinsic as kmovq m64, k, which has no alignment restrictions. I don't think it makes sense for intrinsics to enforce an arbitrarily stricter alignment than the instructions they represent.

For one thing, I think we should stay aligned with GCC; otherwise, code may fail when cross-compiling. For another, forcing the alignment should be better for performance.

I think this should be changed in GCC too. Also, this patch doesn't seem to meaningfully change the generated code (except for quieting UBSAN).

You don't always have a choice of what alignment is used. This change doesn't make aligned code any slower. It is the same instruction whether the address happens to be aligned or not.
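
Not taken from this patch, but for reference: one common way the x86 intrinsic headers express a deliberately unaligned access is through a packed, may_alias wrapper struct. A sketch of that pattern applied here, with a hypothetical helper name:

#include <immintrin.h>

/* Sketch only: the packed, may_alias wrapper removes the alignment
 * requirement of the store in C, while the backend can still select the same
 * KMOVQ instruction for aligned and unaligned addresses alike. */
static inline void store_mask64_unaligned(void *p, __mmask64 m) {
    struct storeu_mask64 {
        __mmask64 v;
    } __attribute__((__packed__, __may_alias__));
    ((struct storeu_mask64 *)p)->v = m;
}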