Index: docs/Frontend/PerformanceTips.rst
===================================================================
--- docs/Frontend/PerformanceTips.rst
+++ docs/Frontend/PerformanceTips.rst
@@ -120,6 +120,93 @@
 lower an under aligned access into a sequence of natively aligned accesses.
 As a result, alignment is mandatory for atomic loads and stores.
 
+Architecture-specific code
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+Whenever possible, emit generic IR rather than architecture-specific IR
+(i.e. target intrinsics).
+If LLVM cannot lower the generic code to the desired intrinsic, start a
+discussion on
+`llvm-dev <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ about the
+missing lowering opportunity.
+A few known patterns that are lowered to intrinsics are listed below.
+
+The *interleaved access pass* performs the following lowerings (tests can be
+found in CodeGen/ARM/arm-interleaved-accesses.ll and
+CodeGen/AArch64/aarch64-interleaved-accesses.ll):
+
+#. ARM/AArch64: lower an interleaved/strided load into a vldN/ldN intrinsic.
+
+   * General rule, with Factor = F and Lane Length = L::
+
+       %wide.vec = load %ptr
+       %v1 = shufflevector %wide.vec, undef, <0, F, ..., (L-1)*F>
+       [...]
+       %vF = shufflevector %wide.vec, undef, <F-1, 2*F-1, ..., L*F-1>
+
+     Is lowered to::
+
+       %ldF = call @llvm.arm.neon.vldF(%ptr, L)
+       ; on AArch64: @llvm.aarch64.neon.ldF(%ptr)
+       %vec1 = extractvalue %ldF, 0
+       [...]
+       %vecF = extractvalue %ldF, F-1
+
+   * E.g. Factor = 2, Lane Length = 4:
+
+     .. code-block:: llvm
+
+        %wide.vec = load <8 x i32>, <8 x i32>* %ptr
+        %v0 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6> ; Extract even elements
+        %v1 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7> ; Extract odd elements
+
+     Is lowered to:
+
+     .. code-block:: llvm
+
+        %ld2 = call { <4 x i32>, <4 x i32> } @llvm.arm.neon.vld2(<8 x i32>* %ptr, i32 4)
+        ; on AArch64: @llvm.aarch64.neon.ld2(<8 x i32>* %ptr)
+        %vec0 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 0
+        %vec1 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 1
+
+#. ARM/AArch64: lower an interleaved/strided store into a vstN/stN intrinsic.
+
+   * General rule, with Factor = F and Lane Length = L::
+
+       %i.vec = shufflevector %v0, %v1, <0, L, 2*L, ..., 1, L+1, 2*L+1, ...>
+       store %i.vec, %ptr
+
+     Is lowered to::
+
+       %sub.v1 = shufflevector %v0, %v1, <0, 1, ..., L-1>
+       [...]
+       %sub.vF = shufflevector %v0, %v1, <(F-1)*L, ..., F*L-1>
+       call void @llvm.arm.neon.vstF(%ptr, %sub.v1, ..., %sub.vF, L)
+       ; on AArch64: @llvm.aarch64.neon.stF(%sub.v1, ..., %sub.vF, %ptr)
+
+   * E.g. Factor = 3, Lane Length = 4:
+
+     .. code-block:: llvm
+
+        %i.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1, <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9, i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>
+        store <12 x i32> %i.vec, <12 x i32>* %ptr
+
+     Is lowered to:
+
+     .. code-block:: llvm
+
+        %sub.v0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+        %sub.v1 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+        %sub.v2 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
+        call void @llvm.arm.neon.vst3(<12 x i32>* %ptr, <4 x i32> %sub.v0, <4 x i32> %sub.v1, <4 x i32> %sub.v2, i32 4)
+        ; on AArch64: @llvm.aarch64.neon.st3(<4 x i32> %sub.v0, <4 x i32> %sub.v1, <4 x i32> %sub.v2, <12 x i32>* %ptr)
+
+LLVM makes no guarantee that an arbitrary generic IR pattern is lowered to the
+optimal instruction sequence, so the patterns above, while still generic IR,
+are the recommended forms on the platforms listed.
+To suggest additional architecture-specific patterns, please send a patch to
+`llvm-commits <http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for
+review.
+
 Other Things to Consider
 ^^^^^^^^^^^^^^^^^^^^^^^^