This is an archive of the discontinued LLVM Phabricator instance.

[X86] Allow unaligned stores with KMOV* intrinsics
Needs Review · Public

Authored by kalcutter on Jan 21 2023, 9:42 AM.

Details

Summary

Avoid undefined behavior when _store_mask* intrinsics are used with an unaligned memory address. The corresponding KMOVW/KMOVQ/KMOVD instructions allow unaligned stores.
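
For illustration, a minimal sketch of the affected usage pattern (the function, buffer, and offset below are hypothetical, not from the patch); without this change, the cast yields a store through a pointer that may violate the alignment of __mmask64:

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical example: write a 64-bit mask at an arbitrary byte offset of a
 * caller-supplied buffer (requires AVX512BW). KMOVQ itself has no alignment
 * requirement, but before this change storing through a __mmask64 * that is
 * not suitably aligned was undefined behavior. */
void store_mask_at(unsigned char *out, size_t offset, __mmask64 m) {
    _store_mask64((__mmask64 *)(out + offset), m);
}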

Diff Detail

Event Timeline

kalcutter created this revision. · Jan 21 2023, 9:42 AM
Herald added a project: Restricted Project. · Jan 21 2023, 9:42 AM
kalcutter requested review of this revision. · Jan 21 2023, 9:42 AM

Added tests.

Matt added a subscriber: Matt. · Jan 25 2023, 9:08 AM

Since the function takes a __mmask16 *, wouldn't the user have had to do an explicit cast to call the function with a misaligned pointer?

Yes. The user would have to do an explicit cast. This is the same as with many other X86 load/store intrinsics, for example:

__m128i _mm_loadl_epi64 (__m128i const* mem_addr)
void _mm256_storeu_pd (double * mem_addr, __m256d a)

Both of these functions work with unaligned data and require the user to do an explicit cast (which itself pedantically invokes UB if the alignment is wrong). Ideally, all intrinsics supporting unaligned addresses would take void *, and the other AVX-512 intrinsics do in fact use void * for unaligned arguments. I think these intrinsics should also have taken void *; sadly, I don't think that can be changed now. Maybe someone from Intel can chime in?
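
For illustration (the buffer below is hypothetical), this is how the existing unaligned intrinsics are already used today, cast included:

#include <immintrin.h>

/* Hypothetical example: store four doubles at a deliberately misaligned byte
 * position of a raw buffer (requires AVX). _mm256_storeu_pd performs an
 * unaligned store, but the caller still has to write the cast to double *. */
void store4(unsigned char *buf, __m256d v) {
    _mm256_storeu_pd((double *)(buf + 1), v);
}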

I think it's OK to use double * for unaligned memory, given there's no difference in the backend between alignment 1 and 4.
For _mm_loadl_epi64, the trunk code has already been using __m128i_u. This was changed by @craig.topper in https://github.com/llvm/llvm-project/commit/4390c721cba09597037578100948bbc83cc41b16

I don't see the benefit of using an unaligned type explicitly for the mask. Just to save memory?
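
For reference, the __m128i_u mentioned above is, roughly, an unaligned may_alias variant of __m128i; a paraphrased sketch of how the x86 headers define it (not copied verbatim):

/* Same size as __m128i, but with alignment 1 and may_alias, so it can be
 * used to access arbitrarily aligned memory. */
typedef long long __m128i_u
    __attribute__((__vector_size__(16), __may_alias__, __aligned__(1)));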

One use-case is bit-scattering, like:

const __m512i a = _mm512_loadu_epi64((const __m512i*)&in[i * 8]);
_store_mask64((__mmask64*)&out[offset_0 + i], _mm512_bitshuffle_epi64_mask(a, c0));
_store_mask64((__mmask64*)&out[offset_1 + i], _mm512_bitshuffle_epi64_mask(a, c1));
_store_mask64((__mmask64*)&out[offset_2 + i], _mm512_bitshuffle_epi64_mask(a, c2));
_store_mask64((__mmask64*)&out[offset_3 + i], _mm512_bitshuffle_epi64_mask(a, c3));
_store_mask64((__mmask64*)&out[offset_4 + i], _mm512_bitshuffle_epi64_mask(a, c4));
_store_mask64((__mmask64*)&out[offset_5 + i], _mm512_bitshuffle_epi64_mask(a, c5));
_store_mask64((__mmask64*)&out[offset_6 + i], _mm512_bitshuffle_epi64_mask(a, c6));
_store_mask64((__mmask64*)&out[offset_7 + i], _mm512_bitshuffle_epi64_mask(a, c7));

This doesn't show why out cannot be 64-bit aligned. I assume it is defined like long long out[N];. Using the type long long should ensure it is aligned to at least 64 bits.

in and out are user-supplied byte buffers with no alignment requirements.
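
For concreteness, a hypothetical signature for such a routine; the offsets are byte counts into a caller-provided buffer, so out + offset_N can land on any byte boundary:

#include <stddef.h>

/* Hypothetical interface: in and out are raw byte buffers with no alignment
 * guarantee beyond 1. */
void bit_scatter(const unsigned char *in, unsigned char *out, size_t n,
                 const size_t offsets[8]);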

Who is the best person to review this?

Do you know what gcc does here?

Could you memcpy the __mmask64 into the byte buffer?

I think gcc does the same thing as clang. In both cases the issue can be observed with UBSAN.

memcpy can be used as a workaround, but it is less ergonomic: both arguments are pointers, so it can only be used directly with lvalues. memcpy is also less explicit.
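
For comparison, a sketch of the two forms (function and variable names are hypothetical; requires AVX512BW and AVX512BITALG):

#include <immintrin.h>
#include <string.h>

/* Workaround: materialize the mask in a named variable (an lvalue), then
 * memcpy it into the byte buffer. */
void store_via_memcpy(unsigned char *out, __m512i a, __m512i c0) {
    __mmask64 m = _mm512_bitshuffle_epi64_mask(a, c0);
    memcpy(out, &m, sizeof(m));
}

/* With this patch, the intrinsic can be used directly on the (possibly
 * unaligned) destination, with no temporary. */
void store_via_intrinsic(unsigned char *out, __m512i a, __m512i c0) {
    _store_mask64((__mmask64 *)out, _mm512_bitshuffle_epi64_mask(a, c0));
}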

Intel's documentation of _store_mask64 does not specify any kind of alignment. Also, they document this intrinsic as kmovq m64, k, which has no alignment restrictions. I don't think it makes sense for intrinsics to enforce an arbitrarily stricter alignment than the instructions they represent.

For one thing, I think we should stay aligned with GCC; otherwise, code may fail when cross-compiling. For another, forcing the alignment should be better for performance.

I think this should be changed in GCC too. Also, this patch doesn't seem to meaningfully change the generated code (except for quieting UBSAN).

You don't always have a choice of what alignment is used. This change doesn't make aligned code any slower. It is the same instruction whether the address happens to be aligned or not.
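
Not taken from this patch, but for reference: one common way the x86 intrinsic headers express a deliberately unaligned access is through a packed, may_alias wrapper struct. A sketch of that pattern applied here, with a hypothetical helper name:

#include <immintrin.h>

/* Sketch only: the packed, may_alias wrapper removes the alignment
 * requirement of the store in C, while the backend can still select the same
 * KMOVQ instruction for aligned and unaligned addresses alike. */
static inline void store_mask64_unaligned(void *p, __mmask64 m) {
    struct storeu_mask64 {
        __mmask64 v;
    } __attribute__((__packed__, __may_alias__));
    ((struct storeu_mask64 *)p)->v = m;
}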