Index: docs/LangRef.rst
===================================================================
--- docs/LangRef.rst
+++ docs/LangRef.rst
@@ -11851,6 +11851,114 @@
     store i32 %val7, i32* %ptr7, align 4
 
+Masked Vector Expanding Load and Compressing Store Intrinsics
+-------------------------------------------------------------
+
+LLVM provides intrinsics for expanding load and compressing store operations. A compressing store selects single elements from a data vector and stores them in a dense form. The selection is done using a mask operand, which holds one bit per vector element. The number of stored elements is equal to the number of '1' bits in the mask. An expanding load performs the opposite operation: it reads a number of sequential scalar elements from memory and spreads them across a vector according to the mask. The number and positions of the '1' bits in the mask determine the number of loaded elements and their placement in the result vector.
+
+.. _int_expandload:
+
+'``llvm.masked.expandload.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic. The loaded data is a number of scalar values of any integer, floating-point or pointer data type loaded together and spread according to the mask into one vector.
+
+::
+
+    declare <16 x float> @llvm.masked.expandload.v16f32 (float* , <16 x i1> , <16 x float> )
+    declare <2 x i64> @llvm.masked.expandload.v2i64 (i64* , <2 x i1> , <2 x i64> )
+
+Overview:
+"""""""""
+
+Reads a number of scalar values sequentially from the memory location provided in '``ptr``' and spreads them into a vector. The '``mask``' holds a bit for each vector lane. The number of elements read from memory is equal to the number of '1' bits in the mask. The loaded elements are positioned in the destination vector according to the sequence of '1' and '0' bits in the mask.
If the mask vector is '10010001', the expandload reads 3 values from memory and places them in lanes 0, 3 and 7 respectively. The masked-off lanes are filled with elements from the corresponding lanes of the '``passthru``' operand.
+
+
+Arguments:
+""""""""""
+
+The first operand is the base pointer for the load. The second operand, the mask, is a vector of boolean values with the same number of elements as the return type. The third operand is a pass-through value that is used to fill the masked-off lanes of the result. The return type and the type of the '``passthru``' operand must be the same vector type.
+
+Semantics:
+""""""""""
+
+The '``llvm.masked.expandload``' intrinsic is designed for reading multiple scalar values sequentially from memory into a sparse vector in a single IR operation. It is useful for targets that support vector expanding loads, and it allows vectorizing loops with a cross-iteration dependency, as in the following example:
+
+.. code-block:: c
+
+    // In this loop we load from B and spread the elements into array A.
+    double *A, *B; int *C;
+    int j = 0;
+    for (int i = 0; i < size; ++i) {
+      if (C[i] != 0)
+        A[i] = B[j++];
+    }
+
+
+.. code-block:: llvm
+
+    ; Load N elements from array B and expand them in a vector.
+    ; N is equal to the number of 'true' elements in the mask.
+    %Tmp = call <8 x double> @llvm.masked.expandload.v8f64(double* %Bptr, <8 x i1> %mask, <8 x double> undef)
+    ; Store the result in A
+    call void @llvm.masked.store.v8f64.p0v8f64(<8 x double> %Tmp, <8 x double>* %Aptr, i32 8, <8 x i1> %mask)
+
+
+Other targets may support this intrinsic differently, for example, by lowering it into a sequence of conditional scalar load operations and shuffles.
+If all mask elements are 'true', the intrinsic's behavior is equivalent to a regular vector load.
+
+.. _int_compressstore:
+
+'``llvm.masked.compressstore.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
The stored data is a number of scalar values of any integer, floating-point or pointer data type picked from an input vector and stored as a contiguous vector in memory. The mask selects the elements of the input vector that should be stored.
+
+::
+
+    declare void @llvm.masked.compressstore.v8i32 (<8 x i32> , i32* , <8 x i1> )
+    declare void @llvm.masked.compressstore.v16f32 (<16 x float> , float* , <16 x i1> )
+
+Overview:
+"""""""""
+
+Selects elements from the input vector '``value``' according to the '``mask``' and writes them, from the lowest lane to the highest, sequentially to memory at '``ptr``' as one contiguous vector. The mask holds a bit for each vector lane and is used to select the elements to be stored. The number of elements stored is equal to the number of '1' bits in the mask.
+
+Arguments:
+""""""""""
+
+The first operand is the vector value whose elements are to be picked up and written to memory. The second operand is the base pointer for the store; it has the same underlying type as the elements of the value operand. The third operand is the mask, a vector of boolean values. The mask and the value operand must have the same number of vector elements.
+
+
+Semantics:
+""""""""""
+
+The '``llvm.masked.compressstore``' intrinsic is designed for compressing data. It allows selecting single elements from a vector and storing them contiguously in memory in one IR operation. It is useful for targets that support compressing stores, and it allows vectorizing loops with a cross-iteration dependency, as in the following example:
+
+.. code-block:: c
+
+    // In this loop we load elements from A and pack them densely into B.
+    double *A, *B; int *C;
+    int j = 0;
+    for (int i = 0; i < size; ++i) {
+      if (C[i] != 0)
+        B[j++] = A[i];
+    }
+
+
+.. code-block:: llvm
+
+    ; Load elements from A.
+    %Tmp = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %Aptr, i32 8, <8 x i1> %mask, <8 x double> undef)
+    ; Store all selected elements densely in array B
+    call void @llvm.masked.compressstore.v8f64(<8 x double> %Tmp, double* %Bptr, <8 x i1> %mask)
+
+
+Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
+
 Memory Use Markers
 ------------------