This is an archive of the discontinued LLVM Phabricator instance.

[mlir] MathApproximations: unroll virtual vectors into hardware vectors for ISA-specific operations
ClosedPublic

Authored by ezhulenev on Oct 28 2021, 10:38 AM.

Diff Detail

Event Timeline

ezhulenev created this revision. Oct 28 2021, 10:38 AM
ezhulenev requested review of this revision. Oct 28 2021, 10:38 AM
cota accepted this revision. Oct 28 2021, 12:20 PM

Thanks for this! Just some nits below, feel free to ignore.

One question: do you think the added helper would be useful to others? I wonder whether putting it somewhere in the realm of the Vector dialect (VectorUtils?) would make sense.

mlir/lib/Dialect/Math/Transforms/PolynomialApproximation.cpp
131

'rank' is only used here so consider dropping it and doing innerDim = inputShape.back().

132

Would it be worth asserting that vectorWidth > 0 before dividing by it?

141

Consider adding this comment after the if instead of the two inline comments below it:
// Expand shape from {..., innerDim} to {..., expansionDim, vectorWidth}

142

Would be more readable to use vectorWidth here
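
For context, the shape expansion these inline comments refer to can be sketched in isolation as follows. This is a hypothetical standalone helper; the name expandInnerDim and its signature are illustrative, not the actual code in the patch:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: given a virtual-vector shape {..., innerDim} and a hardware vector
// width, expand the innermost dimension into {..., expansionDim, vectorWidth}
// so the trailing dimension matches the ISA register width.
std::vector<int64_t> expandInnerDim(const std::vector<int64_t> &inputShape,
                                    int64_t vectorWidth) {
  assert(vectorWidth > 0 && "vector width must be positive");
  int64_t innerDim = inputShape.back();
  assert(innerDim % vectorWidth == 0 &&
         "inner dimension must be divisible by the hardware vector width");
  // Keep all leading dimensions, then split the inner one in two.
  std::vector<int64_t> expandedShape(inputShape.begin(), inputShape.end() - 1);
  expandedShape.push_back(innerDim / vectorWidth); // expansionDim
  expandedShape.push_back(vectorWidth);
  return expandedShape;
}
```

For example, a virtual vector<2x32xf32> with an 8-wide AVX2 register would expand to {2, 4, 8}.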

mlir/test/Dialect/Math/polynomial-approximation.mlir
470

Consider adding AVX2-NOT: vector.shape_cast here, as below.

This revision is now accepted and ready to land. Oct 28 2021, 12:20 PM
ezhulenev marked 5 inline comments as done. Oct 28 2021, 12:37 PM

Address comments

silvas added a subscriber: silvas. Oct 28 2021, 2:05 PM

This doesn't seem to be the right pass to do target-specific lowerings. Can you add math.rsqrt and move the vector unrolling to a generic pass in the vector dialect? Then this pass wouldn't depend on target-details.

> This doesn't seem to be the right pass to do target-specific lowerings. Can you add math.rsqrt and move the vector unrolling to a generic pass in the vector dialect? Then this pass wouldn't depend on target-details.

We already have math.rsqrt, and it can take virtual vectors of arbitrary rank; the problem is that math.rsqrt is approximated using the hardware rsqrt instruction (which is weird, but the AVX2 rsqrt instruction on its own does not have enough precision).

I didn't add this vector unrolling anywhere in the vector passes/utils because, in my opinion, that would violate the principles of the vector dialect: everything should stay in virtual vectors for as long as possible.

Can we move this RsqrtOp lowering to the x86vector dialect? It seems really out of place in the math dialect, which should remain target agnostic. (also, since this approximation is newton-raphson based instead of polynomial based, it seems out of place in PolynomialApproximation.cpp as well)

> Can we move this RsqrtOp lowering to the x86vector dialect? It seems really out of place in the math dialect, which should remain target agnostic. (also, since this approximation is newton-raphson based instead of polynomial based, it seems out of place in PolynomialApproximation.cpp as well)

  1. Do you mean to update the x86vector.avx operation to take arbitrary virtual vectors and do the unrolling in the lowering to LLVM? I have mixed feelings: I kind of like this idea, but I'm not a big fan of the avx operation no longer mapping directly to an avx intrinsic. Inviting @aartbik for his opinion. Although if we have this "virtual avx" we can unroll it to avx2 or avx512 depending on the vector width.

Anyway, if we do this, we still won't be able to get rid of the avx flag; it will just become an x86 flag.

  1. We can just remove the polynomial bit from the file name, I'd say any math approximation should be able to live here.
ezhulenev added a comment (edited). Oct 28 2021, 2:41 PM

Maybe we can have a generic x86vector.rsqrt operation (without an avx or avx512 prefix), and then it can be unrolled and lowered to ISA-specific intrinsics by the x86vector dialect passes? /cc @aartbik

  1. I would just like Math dialect transforms to not depend on target-specific dialects. So concretely just RsqrtOpApproximation pattern would move to x86vector/Transform/, since it is the target specific one.
  2. I agree, removing Polynomial would be good.

But avx and avx512 rsqrt have different accuracy:

_mm256_rsqrt_ps: the maximum relative error for this approximation is less than 1.5*2^-12.
_mm512_rsqrt23_ps: computes the approximate reciprocal square root to 23 bits of accuracy.

(+ another version with 14 bits of accuracy)

So this approximation only makes sense in the AVX2 case, not the AVX512 one.
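
For reference, the Newton-Raphson refinement this approximation is based on is the standard one-step iteration; each step roughly doubles the bits of accuracy, which is why a single step can take AVX2's ~12-bit rsqrt estimate close to full float precision. A generic scalar sketch, not the patch's exact vector lowering:

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step refining a low-precision estimate y0 ~ 1/sqrt(x):
//   y1 = y0 * (3 - x * y0^2) / 2
// Each step roughly doubles the number of correct bits in the estimate.
float refineRsqrt(float x, float y0) {
  return y0 * (1.5f - 0.5f * x * y0 * y0);
}
```

Applied to the hardware estimate produced by _mm256_rsqrt_ps (elementwise over the vector lanes), this is what brings the AVX2 result up to usable precision.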

Basically add a pass something like X86vectorMathApproximationsPass?

> Basically add a pass something like X86vectorMathApproximationsPass?

Correct (something that lives in X86Vector/Transforms). That pass would have target-specific knowledge of avx vs avx512 approximations.

cota added a comment. Oct 28 2021, 3:54 PM
>   1. I would just like Math dialect transforms to not depend on target-specific dialects. So concretely just RsqrtOpApproximation pattern would move to x86vector/Transform/, since it is the target specific one.
>   2. I agree, removing Polynomial would be good.

Sounds good to me, I'll do both. Thanks for the feedback.