Avoid undefined behavior when _store_mask* intrinsics are used with an unaligned memory address. The corresponding KMOVW/KMOVQ/KMOVD instructions allow unaligned stores.
Since the function takes a __mmask16 *, wouldn't the user have had to do an explicit cast to call the function with a misaligned pointer?
Yes. The user would have to do an explicit cast. This is the same as many other X86 load/store intrinsics, for example:
__m128i _mm_loadl_epi64 (__m128i const* mem_addr)
void _mm256_storeu_pd (double * mem_addr, __m256d a)
Both of these functions work with unaligned data and require the user to do an explicit cast (which itself pedantically invokes UB if the alignment is wrong). Ideally, all intrinsics supporting unaligned addresses would take void *; in fact, the other AVX-512 intrinsics all use void * for unaligned arguments. I think these intrinsics should also have taken void *, but sadly I don't think that can be changed now. Maybe someone from Intel can chime in?
I think it's OK to use double * for unaligned memory, given there's no difference in the backend between alignment 1 and 4.
For _mm_loadl_epi64, the trunk code has already been using __m128i_u. This was changed by @craig.topper in https://github.com/llvm/llvm-project/commit/4390c721cba09597037578100948bbc83cc41b16
I don't see the benefit of using an unaligned type explicitly for the mask. Just to save memory?
One use-case is bit-scattering, like:
const __m512i a = _mm512_loadu_epi64((const __m512i*)&in[i * 8]);
_store_mask64((__mmask64*)&out[offset_0 + i], _mm512_bitshuffle_epi64_mask(a, c0));
_store_mask64((__mmask64*)&out[offset_1 + i], _mm512_bitshuffle_epi64_mask(a, c1));
_store_mask64((__mmask64*)&out[offset_2 + i], _mm512_bitshuffle_epi64_mask(a, c2));
_store_mask64((__mmask64*)&out[offset_3 + i], _mm512_bitshuffle_epi64_mask(a, c3));
_store_mask64((__mmask64*)&out[offset_4 + i], _mm512_bitshuffle_epi64_mask(a, c4));
_store_mask64((__mmask64*)&out[offset_5 + i], _mm512_bitshuffle_epi64_mask(a, c5));
_store_mask64((__mmask64*)&out[offset_6 + i], _mm512_bitshuffle_epi64_mask(a, c6));
_store_mask64((__mmask64*)&out[offset_7 + i], _mm512_bitshuffle_epi64_mask(a, c7));
This doesn't show why out cannot be aligned to 64-bit. I assume it is defined like long long out[N];. The use of type long long should make sure it's aligned to 64-bit at least.
in and out are user-supplied byte buffers with no alignment requirements.
Do you know what gcc does here?
Could you memcpy the __mmask64 into the byte buffer?
I think gcc does the same thing as clang. In both cases the issue can be observed with UBSAN.
memcpy can be used as a workaround, but it is less ergonomic: both of its arguments are pointers, so it can only be used directly with lvalues. It is also less explicit.
Intel's documentation of _store_mask64 does not specify any kind of alignment. Also, they document this intrinsic as kmovq m64, k, which has no alignment restrictions. I don't think it makes sense for intrinsics to enforce an arbitrarily stricter alignment than the instructions they represent.
For one thing, I think we should be aligned with GCC; otherwise, code may fail when cross-compiling. For another, forcing the alignment should be better for performance.
I think this should be changed in GCC too. Also, this patch doesn't seem to meaningfully change the generated code (apart from quieting UBSAN).
You don't always have a choice of what alignment is used. This change doesn't make aligned code any slower. It is the same instruction whether the address happens to be aligned or not.