
Details

Reviewers

anemet
Ayal
mkuper
hfinkel
hsaito
dcaballe

Commits

rG0ef2ce366722: Added documentation for Masked Vector Expanding Load and Compressing Store…
rL334075: Added documentation for Masked Vector Expanding Load and Compressing Store…

Summary

Added description of new intrinsics masked.expandload and masked.compressstore to the LLVM LangRef.

Implementation is partially committed. I'm working on the rest.
The related discussion is here:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/104985.html

Diff Detail

Repository: rL LLVM

Event Timeline

delena updated this revision to Diff 78170. Nov 16 2016, 5:17 AM

delena retitled this revision from to Expandload and Compressing store - documentation update.

delena updated this object.

delena added reviewers: Ayal, mkuper.

delena set the repository for this revision to rL LLVM.

delena added a subscriber: llvm-commits.

Ayal added inline comments. Nov 16 2016, 11:26 PM

docs/LangRef.rst
11857 ↗	(On Diff #78170)	Data selected from a vector according to a mask is stored in consecutive memory addresses (compressed store), and vice-versa (expanding load). These operations effective map to "if (cond) a[i++] = v.i" and "if (cond) v.i = a[i++]" patterns, respectively. Note that when the mask starts with '1' bits followed by '0' bits, these operations are identical to llvm.masked.store and llvm.masked.load [link to them].
11866 ↗	(On Diff #78170)	"The loaded data is a number of scalar values of any integer," >> "Several scalar values of integer," "loaded together" >> "are loaded from consecutive memory addresses" "spread according to the mask into one vector" >> "stored into the elements of a vector according to the mask"
11876 ↗	(On Diff #78170)	spread >> spreads If the mask >> E.g., if the mask the "expandload" >> "expandload" from memory >> from memory addresses ptr, ptr+1, ptr+2 "position" >> "positions" (or "places")
11902 ↗	(On Diff #78170)	N?
11905 ↗	(On Diff #78170)	%Aptr should have pointer type
11908 ↗	(On Diff #78170)	load operation >> load operations
11909 ↗	(On Diff #78170)	regular >> regular unmasked

Updated, thanks Ayal for the comments.

minor change + ping..

Ayal added inline comments. Nov 28 2016, 12:27 AM

docs/LangRef.rst
11866 ↗	(On Diff #79352)	"The loaded data is several values of integer," >> "Several values of integer," "memory address" >> "memory addresses"
11882 ↗	(On Diff #79352)	Explain about the type of the first operand.
11887 ↗	(On Diff #79352)	If the terms "dense" or "sparse" are to be used they should be defined to avoid confusion - a sparse representation is often the one that is condensed. Alternatively: "designed for sequential reading of multiple scalar values from memory into a sparse vector in a single IR operation" >> "designed for reading multiple scalar values from adjacent memory addresses into possibly non-adjacent vector lanes in a single IR operation"
11902 ↗	(On Diff #79352)	for consistency, use "'1' bits" or "'true' elements" but not both.
11903 ↗	(On Diff #79352)	%Bptr should have type <8 x double>*, right?
11918 ↗	(On Diff #79352)	"The stored data is a number of scalar values of any integer, floating point or pointer data type picked up from an input vector and stored as a contiguous vector in memory" >> "A number of scalar values of integer, floating point or pointer data type are collected from an input vector and stored into adjacent memory addresses" "The mask defines active elements from the input vector that should be stored" >> "A mask defines which elements to collect from the vector"
11922–11923 ↗	(On Diff #79352)	ptr should have pointer-to-vector types.
11928 ↗	(On Diff #79352)	"Writes all selected elements from lower to higher sequentially to memory '`ptr`' as one contiguous vector." >> "All selected elements are written into adjacent memory addresses starting at address '`ptr`', from lower to higher." "equal to number" >> "equal to the number"
11933 ↗	(On Diff #79352)	"The first operand is the vector value, which elements to be picked up and written to memory." >> "The first operand is the input vector, from which elements are collected and written to memory." "vector value operand." >> "input vector operand." "The types of the mask and the value operand" >> "The mask and the input vector"
11939 ↗	(On Diff #79352)	"data compressing" >> "compressing data in memory" "to pick up single elements" >> "to collect elements from possibly non-adjacent lanes of a vector" "store operation" >> "store operations" "vectorizing loop" >> "vectorizing loops" "a cross-iteration dependency" >> "cross-iteration dependences"
11943 ↗	(On Diff #79352)	"dense them" >> "store them consecutively"
11955 ↗	(On Diff #79352)	"densely" >> "consecutively"

delena marked 8 inline comments as done. Dec 6 2016, 2:49 AM

delena added inline comments.

docs/LangRef.rst
11882 ↗	(On Diff #79352)	"The underlying type of the pointer is a scalar type of vector element" - is this description clear?
11903 ↗	(On Diff #79352)	No, a pointer to scalar.
11922–11923 ↗	(On Diff #79352)	I defined them as pointers to scalar values.

I changed the text according to Ayal's comments.

This version of the documentation LGTM, thanks for addressing.

Michael, ok to land, given the discussion in http://lists.llvm.org/pipermail/llvm-dev/2016-September/104985.html?

../docs/LangRef.rst
11857 ↗	(On Diff #80545)	May be clearer to write "if(cond) a[j++] = v.i" >> "if (cond.i) a[j++] = v.i" and "if (cond) v.i = a[j++]" >> "if (cond.i) v.i = a[j++]" showing the condition as a mask vector similar to the data vector.

The way I understand it, the mailing list discussion ended with "let's discuss this at the BoF", and the decision post-BoF was to have a working group to decide on idiom representation, etc.

Having said that, I'm ok with this going on, since I don't see a sane way to represent this in IR that's selectable in DAG.

But for the signatures, I don't have enough non-X86 context.
Hal, Adam, does this seem sensible to you too?

../docs/LangRef.rst
11861 ↗	(On Diff #80545)	Bikeshedding - will "llvm.masked.load.expand." and "llvm.masked.store.compress." make more or less sense than the current names?
11882 ↗	(On Diff #80545)	"are the same vector types" -> "have the same vector type"?
11887 ↗	(On Diff #80545)	"in a single IR operation" is redundant.
11901 ↗	(On Diff #80545)	This example is slightly confusing to me, because it's not clear "what happens to Bptr" - we need to advance it to the next iteration by adding the popcount of %mask, right? Do we have a good way to represent that right now? It seems like "llvm.ctpop" for a <k x i1> type does the wrong thing (it's basically a nop).

In D26743#616109, @mkuper wrote:

Hal, Adam, does this seem sensible to you too?

My preference would be to wait for the outcome of the working group on vectorization idioms. This is not the only idiom that requires control flow, so some common mechanism may be developed, and as a result this would have to be completely redone/autoupgraded, etc.

But if you and Hal feel strongly about this I won't stand in the way.

Adam

This patch complements the LangRef documentation, which we seem to be converging on above, following http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161031/402101.html. Changing the signatures and/or a complete redoing/autoupgrade may be needed, but best keep documentation and codegen in sync, right?

../docs/LangRef.rst
11901 ↗	(On Diff #80545)	I extended the code to show popcnt operation. Mask should be converted to integer, otherwise llvm.ctpop does not help.

Updated according to Michael's and Ayal's comments.
Codegen Implementation was partially committed based on RFC discussion. This doc update synchronizes LangRef and CodeGen.
I can change "expandload" to "load.expand" in both places, if it is important.

May I commit this patch?

Ping. The intrinsics have been implemented with partial support for a year and a half now. I'd like to use them to replace the X86 specific expand load intrinsics. Can we commit this documentation?

craig.topper added reviewers: hsaito, dcaballe. Jun 3 2018, 12:07 AM

In D26743#1120061, @craig.topper wrote:

Ping. The intrinsics have been implemented with partial support for a year and a half now. I'd like to use them to replace the X86 specific expand load intrinsics. Can we commit this documentation?

Are the LLVM intrinsics already in the trunk? Does the behavior match with the documentation here? I expect the answer is yes, but I need someone who knows enough about actual implementation to say yes, or verify that myself before marking this good to go.

Thanks,
Hideki

I wrote this documentation after the implementation. I haven't worked on X86 for about a year, but I doubt that anybody has touched this code.

In D26743#1121969, @delena wrote:

I wrote this documentation after the implementation. I haven't worked on X86 for about a year, but I doubt that anybody has touched this code.

Then, the doc update is long overdue, and the current description reasonably reflects the review feedback. If anything else should be done, it can be done after commit.
Sanity checked intrinsic names/args matching here and in the Intrinsics.td. LGTM.

This revision is now accepted and ready to land. Jun 5 2018, 8:18 AM

Closed by commit rL334075: Added documentation for Masked Vector Expanding Load and Compressing Store… (authored by delena). Jun 6 2018, 2:17 AM

This revision was automatically updated to reflect the committed changes.

Diff 150092

llvm/trunk/docs/LangRef.rst


::
%ptr7 = extractelement <8 x i32*> %ptrs, i32 7
;; Note: the order of the following stores is important when they overlap:
store i32 %val0, i32* %ptr0, align 4
store i32 %val1, i32* %ptr1, align 4
..
store i32 %val7, i32* %ptr7, align 4


		Masked Vector Expanding Load and Compressing Store Intrinsics
		-------------------------------------------------------------

		LLVM provides intrinsics for expanding load and compressing store operations. Data selected from a vector according to a mask is stored in consecutive memory addresses (compressing store), and vice-versa (expanding load). These operations effectively map to "if (cond.i) a[j++] = v.i" and "if (cond.i) v.i = a[j++]" patterns, respectively. Note that when the mask starts with '1' bits followed by '0' bits, these operations are identical to :ref:`llvm.masked.store <int_mstore>` and :ref:`llvm.masked.load <int_mload>`.
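
		The special case of a mask that starts with '1' bits followed by '0' bits can be recognized with simple bit arithmetic. The following is a minimal sketch in C, assuming the mask is packed into an 8-bit integer with lane ``i`` in bit ``i``; the helper name ``is_prefix_mask`` is hypothetical and not part of LLVM.

		.. code-block:: c

		    #include <stdbool.h>
		    #include <stdint.h>

		    /* Hypothetical helper: returns true when all set bits of the mask
		       are contiguous starting at lane 0 (for example 0x07).
		       Assumes lane i is stored in bit i of the integer. */
		    static bool is_prefix_mask(uint8_t mask) {
		        /* Adding 1 to a run of low '1' bits carries past the run, so
		           the AND is zero exactly for masks of the form 2^k - 1. */
		        return (mask & (uint8_t)(mask + 1u)) == 0;
		    }

		Under this assumption, an all-zero mask and an all-ones mask both count as prefix masks.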

		.. _int_expandload:

		'``llvm.masked.expandload.*``' Intrinsics
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		Syntax:
		"""""""
		This is an overloaded intrinsic. Several values of integer, floating point or pointer data type are loaded from consecutive memory addresses and stored into the elements of a vector according to the mask.

		::

		declare <16 x float> @llvm.masked.expandload.v16f32 (float* <ptr>, <16 x i1> <mask>, <16 x float> <passthru>)
		declare <2 x i64> @llvm.masked.expandload.v2i64 (i64* <ptr>, <2 x i1> <mask>, <2 x i64> <passthru>)

		Overview:
		"""""""""

		Reads a number of scalar values sequentially from the memory location provided in '``ptr``' and spreads them in a vector. The '``mask``' holds a bit for each vector lane. The number of elements read from memory is equal to the number of '1' bits in the mask. The loaded elements are positioned in the destination vector according to the sequence of '1' and '0' bits in the mask. E.g., if the mask vector is '10010001', "expandload" reads 3 values from memory addresses ptr, ptr+1, ptr+2 and places them in lanes 0, 3 and 7, respectively. The masked-off lanes are filled by elements from the corresponding lanes of the '``passthru``' operand.
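
		The lane mapping described above can be modeled with a scalar loop. The following is a minimal sketch in C, not the intrinsic's actual lowering; the function name ``expandload_ref`` and the use of a ``bool`` array for the mask are assumptions made for illustration.

		.. code-block:: c

		    #include <stdbool.h>
		    #include <stddef.h>

		    /* Hypothetical scalar reference model of an expanding load.
		       Reads one element from ptr for every active mask lane, in lane
		       order, and copies passthru into the masked-off lanes.
		       Returns the number of elements read from memory. */
		    static size_t expandload_ref(const double *ptr, const bool *mask,
		                                 const double *passthru, double *result,
		                                 size_t lanes) {
		        size_t next = 0; /* index of the next unread memory element */
		        for (size_t i = 0; i < lanes; ++i)
		            result[i] = mask[i] ? ptr[next++] : passthru[i];
		        return next;
		    }

		With the mask '10010001' from the text, this model reads three elements and writes them to lanes 0, 3 and 7.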


		Arguments:
		""""""""""

		The first operand is the base pointer for the load. It has the same underlying type as the element of the returned vector. The second operand, mask, is a vector of boolean values with the same number of elements as the return type. The third is a pass-through value that is used to fill the masked-off lanes of the result. The return value and the '``passthru``' operand have the same vector type.

		Semantics:
		""""""""""

		The '``llvm.masked.expandload``' intrinsic is designed for reading multiple scalar values from adjacent memory addresses into possibly non-adjacent vector lanes. It is useful for targets that support vector expanding loads and allows vectorizing loops with cross-iteration dependencies, as in the following example:

		.. code-block:: c

		    // In this loop we load from B and spread the elements into array A.
		    double *A, *B; int *C;
		    for (int i = 0, j = 0; i < size; ++i) {
		      if (C[i] != 0)
		        A[i] = B[j++];
		    }


		.. code-block:: llvm

		; Load several elements from array B and expand them in a vector.
		; The number of loaded elements is equal to the number of '1' elements in the Mask.
		%Tmp = call <8 x double> @llvm.masked.expandload.v8f64(double* %Bptr, <8 x i1> %Mask, <8 x double> undef)
		; Store the result in A
		call void @llvm.masked.store.v8f64.p0v8f64(<8 x double> %Tmp, <8 x double>* %Aptr, i32 8, <8 x i1> %Mask)

		; %Bptr should be increased on each iteration according to the number of '1' elements in the Mask.
		%MaskI = bitcast <8 x i1> %Mask to i8
		%MaskIPopcnt = call i8 @llvm.ctpop.i8(i8 %MaskI)
		%MaskI64 = zext i8 %MaskIPopcnt to i64
		%BNextInd = add i64 %BInd, %MaskI64


		Other targets may support this intrinsic differently, for example, by lowering it into a sequence of conditional scalar load operations and shuffles.
		If all mask elements are '1', the intrinsic behavior is equivalent to the regular unmasked vector load.
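
		The way the source index advances between vector iterations can be sketched in scalar C. The following is an illustrative model that processes the loop above in blocks of 8 lanes and advances ``j`` by the number of active lanes in each block; the function name ``expand_loop_blocked`` and the constant ``LANES`` are hypothetical, and ``size`` is assumed to be a multiple of ``LANES``.

		.. code-block:: c

		    #include <stddef.h>

		    enum { LANES = 8 };

		    /* Hypothetical model of "if (C[i] != 0) A[i] = B[j++]" processed
		       in blocks of LANES elements. Assumes size is a multiple of LANES.
		       Returns the final value of j. */
		    static size_t expand_loop_blocked(double *A, const double *B,
		                                      const int *C, size_t size) {
		        size_t j = 0;
		        for (size_t base = 0; base < size; base += LANES) {
		            size_t active = 0; /* population count of this block's mask */
		            for (size_t lane = 0; lane < LANES; ++lane) {
		                if (C[base + lane] != 0) {
		                    /* expanding load plus masked store for one lane */
		                    A[base + lane] = B[j + active];
		                    ++active;
		                }
		            }
		            j += active; /* advance B by the number of active lanes */
		        }
		        return j;
		    }

		The ``j += active`` step plays the role of the ``llvm.ctpop`` sequence in the IR example.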

		.. _int_compressstore:

		'``llvm.masked.compressstore.*``' Intrinsics
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		Syntax:
		"""""""
		This is an overloaded intrinsic. A number of scalar values of integer, floating point or pointer data type are collected from an input vector and stored into adjacent memory addresses. A mask defines which elements to collect from the vector.

		::

		declare void @llvm.masked.compressstore.v8i32 (<8 x i32> <value>, i32* <ptr>, <8 x i1> <mask>)
		declare void @llvm.masked.compressstore.v16f32 (<16 x float> <value>, float* <ptr>, <16 x i1> <mask>)

		Overview:
		"""""""""

		Selects elements from the input vector '``value``' according to the '``mask``'. All selected elements are written into adjacent memory addresses starting at address '``ptr``', from lower to higher. The mask holds a bit for each vector lane, and is used to select elements to be stored. The number of elements to be stored is equal to the number of '1' bits in the mask.
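
		The selection and packing described above can be modeled with a scalar loop. The following is a minimal sketch in C, not the intrinsic's actual lowering; the function name ``compressstore_ref`` and the use of a ``bool`` array for the mask are assumptions made for illustration.

		.. code-block:: c

		    #include <stdbool.h>
		    #include <stddef.h>

		    /* Hypothetical scalar reference model of a compressing store.
		       Writes the active lanes of value to consecutive elements
		       starting at ptr, from the lowest lane to the highest.
		       Returns the number of elements written. */
		    static size_t compressstore_ref(const double *value, double *ptr,
		                                    const bool *mask, size_t lanes) {
		        size_t next = 0; /* index of the next free memory element */
		        for (size_t i = 0; i < lanes; ++i)
		            if (mask[i])
		                ptr[next++] = value[i];
		        return next;
		    }

		Memory beyond the returned element count is left untouched by this model.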

		Arguments:
		""""""""""

		The first operand is the input vector, from which elements are collected and written to memory. The second operand is the base pointer for the store. It has the same underlying type as the elements of the input vector operand. The third operand is the mask, a vector of boolean values. The mask and the input vector must have the same number of vector elements.


		Semantics:
		""""""""""

		The '``llvm.masked.compressstore``' intrinsic is designed for compressing data in memory. It allows collecting elements from possibly non-adjacent lanes of a vector and storing them contiguously in memory in one IR operation. It is useful for targets that support compressing store operations and allows vectorizing loops with cross-iteration dependences, as in the following example:

		.. code-block:: c

		    // In this loop we load elements from A and store them consecutively in B.
		    double *A, *B; int *C;
		    for (int i = 0, j = 0; i < size; ++i) {
		      if (C[i] != 0)
		        B[j++] = A[i];
		    }


		.. code-block:: llvm

		; Load elements from A.
		%Tmp = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %Aptr, i32 8, <8 x i1> %Mask, <8 x double> undef)
		; Store all selected elements consecutively in array B
		call void @llvm.masked.compressstore.v8f64(<8 x double> %Tmp, double* %Bptr, <8 x i1> %Mask)

		; %Bptr should be increased on each iteration according to the number of '1' elements in the Mask.
		%MaskI = bitcast <8 x i1> %Mask to i8
		%MaskIPopcnt = call i8 @llvm.ctpop.i8(i8 %MaskI)
		%MaskI64 = zext i8 %MaskIPopcnt to i64
		%BNextInd = add i64 %BInd, %MaskI64


		Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.


Memory Use Markers
------------------

This class of intrinsics provides information about the lifetime of
memory objects and ranges where variables are immutable.

.. _int_lifestart:


This is an archive of the discontinued LLVM Phabricator instance.

Expandload and Compressing store - documentation update
ClosedPublic
