Details

Reviewers

aartbik
nicolasvasilache
ThomasRaoux

Commits

rG0d6e4199e32a: [mlir][vector] Order parallel indices before transposing the input in…

Summary

The current code does not preserve the order of the parallel
dimensions when doing multi-reductions and thus we can end
up in scenarios where the result shape does not match the
desired shape after reduction.

This patch fixes that by ensuring that the parallel indices
are in order and then concatenates them to the reduction dimensions
so that the reduction dimensions are innermost.
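
As a rough illustration of the ordering described in the summary (a standalone sketch, not the code from the patch), the transpose permutation can be built by listing the parallel dimensions in increasing order and then appending the reduction dimensions. The function name `innermostReductionPermutation` is hypothetical, and `std::vector` stands in for MLIR's `SmallVector` so the example compiles on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of MLIR): returns a transpose permutation
// that keeps the parallel dims in their original order and places the
// reduction dims innermost.
// Assumes reductionDims is sorted and every entry lies in [0, rank).
std::vector<int64_t>
innermostReductionPermutation(int64_t rank,
                              const std::vector<int64_t> &reductionDims) {
  assert(std::is_sorted(reductionDims.begin(), reductionDims.end()) &&
         "binary_search below requires sorted reduction dims");
  std::vector<int64_t> perm;
  perm.reserve(rank);
  // Parallel (non-reduced) dims first, in increasing order.
  for (int64_t i = 0; i < rank; ++i)
    if (!std::binary_search(reductionDims.begin(), reductionDims.end(), i))
      perm.push_back(i);
  // Reduction dims last, so they end up innermost.
  perm.insert(perm.end(), reductionDims.begin(), reductionDims.end());
  return perm;
}
```

For the `vector<2x3x4x5xf32>` test case in this revision, reducing dims `[1, 2]` yields the permutation `[0, 3, 1, 2]`, which matches the `vector.transpose` expected by the test.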

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

harsh created this revision. Jun 24 2021, 4:33 PM

Herald added a reviewer: aartbik. Jun 24 2021, 4:33 PM

Herald added subscribers: dcaballe, cota, teijeong and 17 others.

harsh requested review of this revision. Jun 24 2021, 4:33 PM

Herald added a reviewer: nicolasvasilache. Jun 24 2021, 4:33 PM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B110920: Diff 354393. Jun 24 2021, 5:37 PM

Fixed lint changes

Harbormaster completed remote builds in B110933: Diff 354411. Jun 24 2021, 6:54 PM

ThomasRaoux added a subscriber: ThomasRaoux. Jun 27 2021, 11:11 PM

ThomasRaoux added inline comments.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3954	Why don't we sort the indices so that the non reduction dimensions are in order in this transpose instead? This way we don't need a second transpose if they are not in the right order.

ThomasRaoux added a reviewer: ThomasRaoux. Jun 27 2021, 11:19 PM

ThomasRaoux added inline comments.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3938–3947	I think you can simplify this logic and solve the problem by just pushing the non reduced dimensions first in order then push the reduction dimensions.

asaadaldien added a subscriber: asaadaldien. Jun 27 2021, 11:19 PM

asaadaldien added inline comments.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3939	Good catch, the bug is here! The linear swap isn't correct because it shuffles parallel dims around. You can replace this by inserting parallel indices first, then reductions, e.g. something like the following:

+1 to what @ThomasRaoux said.
It should also be possible to rewrite that logic in a much cleaner and simpler form like:

auto reductionDimsRange = op.reduction_dims.getAsValueRange<int64_t>();
SmallVector<int64_t, 4> reductionDims = llvm::to_vector<4>(llvm::map_range(reductionDimsRange.begin(), reductionDimsRange.end(), [](APInt a) { return a.getZExtValue(); }));
SmallDenseSet<int64_t> reductionDimsSet;
// fill it from reductionDims or from llvm::map_range, whichever makes most sense. 
SmallVector<int64_t> parallelDims;
for (int64_t idx = 0, e = ; idx != e; ++idx)
  if (!reductionDimsSet.contains(idx))
    parallelDims.push_back(idx);

if (parallelDims != llvm::seq<int64_t>(0, parallelDims.size())) { // This is the condition for "reductions must be innermost", see comment below.
  // concat parallelDims, reductionDims + add transpose. 
}

One thing to note is that the current lowering seems to always put reductions inside and use vector.reduction.
I think in most cases this will result in quite bad code because it will rely on horizontal reductions and add a ton of insert/extract.

We should have an option to emit the other way around too: put the reductionDims outside and just unroll pointwise operations on the parallel dims.
This is closer to the "outerproduct-like" semantics and has much better chances of being rewritten as fmas when possible; missing fma instructions is immediately a 2x performance penalty on many systems.

With the code structuring I propose above, the proper transposition amounts to just changing the order of concatenated dims.
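
A minimal sketch of that last point, under stated assumptions: `std::vector` replaces the LLVM containers, and the `ReductionPlacement` enum and `buildPermutation` function are hypothetical names that do not exist in MLIR. Once the parallel and reduction dims are collected separately, choosing between reductions-innermost and reductions-outermost only changes the order in which the two lists are concatenated.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical option (not an MLIR type): where the reduction dims should go.
enum class ReductionPlacement { Innermost, Outermost };

// Builds a transpose permutation from the parallel dims (kept in increasing
// order) and the reduction dims (kept in the order given by the caller).
std::vector<int64_t> buildPermutation(int64_t rank,
                                      const std::vector<int64_t> &reductionDims,
                                      ReductionPlacement placement) {
  // Mark which dims are reduced.
  std::vector<bool> isReduction(rank, false);
  for (int64_t d : reductionDims) {
    assert(d >= 0 && d < rank && "reduction dim out of range");
    isReduction[d] = true;
  }
  // Every remaining dim is parallel; iterating upward keeps them ordered.
  std::vector<int64_t> parallelDims;
  for (int64_t i = 0; i < rank; ++i)
    if (!isReduction[i])
      parallelDims.push_back(i);

  // The only difference between the two strategies is the concatenation order.
  std::vector<int64_t> perm;
  perm.reserve(rank);
  if (placement == ReductionPlacement::Innermost) {
    perm.insert(perm.end(), parallelDims.begin(), parallelDims.end());
    perm.insert(perm.end(), reductionDims.begin(), reductionDims.end());
  } else {
    perm.insert(perm.end(), reductionDims.begin(), reductionDims.end());
    perm.insert(perm.end(), parallelDims.begin(), parallelDims.end());
  }
  return perm;
}
```

With rank 4 and reduction dims `[1, 2]`, the innermost placement gives `[0, 3, 1, 2]` and the outermost placement gives `[1, 2, 0, 3]`.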

Thanks @ThomasRaoux , @asaadaldien , @nicolasvasilache for the comments. @nicolasvasilache I have modified the patch as per your changes but kept the current patch to just handle moving reductions to the inner most dimensions. I will put up another patch to handle moving reductions to the outermost dimensions and based on the performance of inner vs outer, we can decide which path we want to take.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3938–3947	Thanks you are right. I have fixed this as per Nicolas' comments.
3939	Thanks! Since I will be attempting moving the reduction dims to be outermost also, I have modified the patch as per Nicolas' comments.
3954	Makes sense.

In D104884#2845548, @harsh wrote:

Thanks @ThomasRaoux , @asaadaldien , @nicolasvasilache for the comments. @nicolasvasilache I have modified the patch as per your changes but kept the current patch to just handle moving reductions to the inner most dimensions. I will put up another patch to handle moving reductions to the outermost dimensions and based on the performance of inner vs outer, we can decide which path we want to take.

Did you upload the new patch?

Fixed as per comments

@ThomasRaoux - yes I just uploaded it.

LGTM

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3934	nit: formatting looks off.
3939	nit: skip braces on single line if statements (https://llvm.org/docs/CodingStandards.html#don-t-use-braces-on-simple-single-statement-bodies-of-if-else-loop-statements)
3947	nit: In general prefer early exit (https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code)

This revision is now accepted and ready to land. Jun 28 2021, 3:12 PM

harsh marked 3 inline comments as done. Jun 28 2021, 4:10 PM

harsh added inline comments.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3947	Just to clarify, this means I should check if the inner most dims are reduction and if so, exit early here?

Address comments and linting errors

Harbormaster completed remote builds in B111398: Diff 355059. Jun 28 2021, 4:52 PM

ThomasRaoux accepted this revision. Jun 28 2021, 5:07 PM

ThomasRaoux added inline comments.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3947	Yes, looks good.

Thanks, and feel free to merge because I don't have merge privileges.

Closed by commit rG0d6e4199e32a: [mlir][vector] Order parallel indices before transposing the input in… (authored by harsh, committed by ThomasRaoux). Jun 28 2021, 6:47 PM

This revision was automatically updated to reflect the committed changes.

ThomasRaoux added a commit: rG0d6e4199e32a: [mlir][vector] Order parallel indices before transposing the input in….

In D104884#2845835, @harsh wrote:

Thanks, and feel free to merge because I don't have merge privileges.

Done

Diff 354411

mlir/lib/Dialect/Vector/VectorTransforms.cpp

//===- VectorTransforms.cpp - Conversion within the Vector dialect --------===//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//===----------------------------------------------------------------------===//
// This file implements target-independent rewrites as 1->N patterns.
//===----------------------------------------------------------------------===//

#include <numeric>
#include <type_traits>

#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Affine/Utils.h"
#include "mlir/Dialect/Linalg/IR/LinalgOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/SCF/SCF.h"
#include "mlir/Dialect/StandardOps/IR/Ops.h"

LogicalResult matchAndRewrite(vector::MultiDimReductionOp multiReductionOp,

  int64_t reductionSize = multiReductionOp.reduction_dims().size();
  // Fails if already inner most reduction.
  bool innerMostReduction = true;
  for (int i = 0; i < reductionSize; ++i) {
    if (reductionDims[reductionSize - i - 1] != srcRank - i - 1) {
      innerMostReduction = false;
    }
  }

  if (innerMostReduction)
    return failure();
  // Permutes the indices so reduction dims are inner most dims.
  SmallVector<int64_t> indices;

asaadaldien: Good catch, the bug is here! The linear swap isn't correct because it shuffles parallel dims around. You can replace this by inserting parallel indices first, then reductions, e.g. something like the following:

  // Permutes the indices so reduction dims are inner most dims.
  SmallVector<int64_t> indices;
  for (int i = 0; i < srcRank; ++i) {
+   if (!std::binary_search(reductionDims.begin(), reductionDims.end(), i)) {
+     indices.push_back(i);
+   }
+ }
+ for (int rd : reductionDims) {
+   indices.push_back(rd);
+ }
+ for (int i = 0; i < srcRank; ++i) {

  for (int i = 0; i < srcRank; ++i) {
    indices.push_back(i);
  }
  int ir = reductionSize - 1;
  int id = srcRank - 1;
  while (ir >= 0) {
    std::swap(indices[reductionDims[ir--]], indices[id--]);
  }

  // Sets inner most dims as reduction.
  SmallVector<bool> reductionMask(srcRank, false);
  for (int i = 0; i < reductionSize; ++i) {
    reductionMask[srcRank - i - 1] = true;
  }
  auto transposeOp = rewriter.create<vector::TransposeOp>(loc, src, indices);

- rewriter.replaceOpWithNewOp<vector::MultiDimReductionOp>(
-     multiReductionOp, transposeOp.result(), reductionMask,
-     multiReductionOp.kind());
+ auto newMultiReductionOp = rewriter.create<vector::MultiDimReductionOp>(
+     loc, transposeOp.result(), reductionMask, multiReductionOp.kind());
+ if (!std::is_sorted(indices.begin(), indices.end() - reductionSize)) {
+   // Add additional transpose to restore to original shape
+   SmallVector<int64_t> newIndices(srcRank - reductionSize);
+   std::iota(std::begin(newIndices), std::end(newIndices), 0);
+   std::sort(newIndices.begin(), newIndices.end(),
+             [&](int i, int j) { return indices[i] < indices[j]; });
+   auto transposeOp = rewriter.create<vector::TransposeOp>(
+       loc, newMultiReductionOp.getResult(), newIndices);
+   rewriter.replaceOp(multiReductionOp, transposeOp.result());
+ } else {
+   rewriter.replaceOp(multiReductionOp, newMultiReductionOp.getResult());
+ }

  return success();
}
};

// Reduces the rank of vector.mult_reduction nd -> 2d given all reduction
// dimensions are inner most.
struct ReduceMultiDimReductionRank
    : public OpRewritePattern<vector::MultiDimReductionOp> {

mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir

  func @vector_multi_reduction_transposed(%arg0: vector<2x3x4x5xf32>) -> vector<2x5xf32> {
  %0 = vector.multi_reduction #vector.kind<add>, %arg0 [1, 2] : vector<2x3x4x5xf32> to vector<2x5xf32>
  return %0 : vector<2x5xf32>
  }

  // CHECK-LABEL: func @vector_multi_reduction_transposed
  // CHECK-SAME: %[[INPUT:.+]]: vector<2x3x4x5xf32>
  // CHECK: %[[TRANSPOSED_INPUT:.+]] = vector.transpose %[[INPUT]], [0, 3, 1, 2] : vector<2x3x4x5xf32> to vector<2x5x3x4xf32>
- // CHEKC: vector.shape_cast %[[TRANSPOSED_INPUT]] : vector<2x5x3x4xf32> to vector<10x12xf32>
+ // CHECK: vector.shape_cast %[[TRANSPOSED_INPUT]] : vector<2x5x3x4xf32> to vector<10x12xf32>
  // CHECK: %[[RESULT:.+]] = vector.shape_cast %{{.*}} : vector<10xf32> to vector<2x5xf32>
  // CHECK: return %[[RESULT]]

				func @vector_multi_reduction_additional_transpose(%arg0: vector<3x2x4xf32>) -> vector<2x4xf32> {
				%0 = vector.multi_reduction #vector.kind<mul>, %arg0 [0] : vector<3x2x4xf32> to vector<2x4xf32>
				return %0 : vector<2x4xf32>
				}
				// CHECK-LABEL: func @vector_multi_reduction_additional_transpose
				// CHECK-SAME: %[[INPUT:.+]]: vector<3x2x4xf32>
				// CHECK: %[[RESULT_VEC_0:.+]] = constant dense<{{.*}}> : vector<8xf32>
				// CHECK: %[[C0:.+]] = constant 0 : i32
				// CHECK: %[[C1:.+]] = constant 1 : i32
				// CHECK: %[[C2:.+]] = constant 2 : i32
				// CHECK: %[[C3:.+]] = constant 3 : i32
				// CHECK: %[[C4:.+]] = constant 4 : i32
				// CHECK: %[[C5:.+]] = constant 5 : i32
				// CHECK: %[[C6:.+]] = constant 6 : i32
				// CHECK: %[[C7:.+]] = constant 7 : i32
				// CHECK: %[[TRANSPOSED_INPUT:.+]] = vector.transpose %[[INPUT]], [2, 1, 0] : vector<3x2x4xf32> to vector<4x2x3xf32>
				// CHECK: %[[V0:.+]] = vector.extract %[[TRANSPOSED_INPUT]][0, 0]
				// CHECK: %[[RV0:.+]] = vector.reduction "mul", %[[V0]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC_1:.+]] = vector.insertelement %[[RV0:.+]], %[[RESULT_VEC_0]][%[[C0]] : i32] : vector<8xf32>
				// CHECK: %[[V1:.+]] = vector.extract %[[TRANSPOSED_INPUT]][0, 1]
				// CHECK: %[[RV1:.+]] = vector.reduction "mul", %[[V1]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC_2:.+]] = vector.insertelement %[[RV1:.+]], %[[RESULT_VEC_1]][%[[C1]] : i32] : vector<8xf32>
				// CHECK: %[[V2:.+]] = vector.extract %[[TRANSPOSED_INPUT]][1, 0]
				// CHECK: %[[RV2:.+]] = vector.reduction "mul", %[[V2]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC_3:.+]] = vector.insertelement %[[RV2:.+]], %[[RESULT_VEC_2]][%[[C2]] : i32] : vector<8xf32>
				// CHECK: %[[V3:.+]] = vector.extract %[[TRANSPOSED_INPUT]][1, 1]
				// CHECK: %[[RV3:.+]] = vector.reduction "mul", %[[V3]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC_4:.+]] = vector.insertelement %[[RV3:.+]], %[[RESULT_VEC_3]][%[[C3]] : i32] : vector<8xf32>
				// CHECK: %[[V4:.+]] = vector.extract %[[TRANSPOSED_INPUT]][2, 0]
				// CHECK: %[[RV4:.+]] = vector.reduction "mul", %[[V4]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC_5:.+]] = vector.insertelement %[[RV4:.+]], %[[RESULT_VEC_4]][%[[C4]] : i32] : vector<8xf32>
				// CHECK: %[[V5:.+]] = vector.extract %[[TRANSPOSED_INPUT]][2, 1]
				// CHECK: %[[RV5:.+]] = vector.reduction "mul", %[[V5]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC_6:.+]] = vector.insertelement %[[RV5:.+]], %[[RESULT_VEC_5]][%[[C5]] : i32] : vector<8xf32>
				// CHECK: %[[V6:.+]] = vector.extract %[[TRANSPOSED_INPUT]][3, 0]
				// CHECK: %[[RV6:.+]] = vector.reduction "mul", %[[V6]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC_7:.+]] = vector.insertelement %[[RV6:.+]], %[[RESULT_VEC_6]][%[[C6]] : i32] : vector<8xf32>
				// CHECK: %[[V7:.+]] = vector.extract %[[TRANSPOSED_INPUT]][3, 1]
				// CHECK: %[[RV7:.+]] = vector.reduction "mul", %[[V7]] : vector<3xf32> into f32
				// CHECK: %[[RESULT_VEC:.+]] = vector.insertelement %[[RV7:.+]], %[[RESULT_VEC_7]][%[[C7]] : i32] : vector<8xf32>
				// CHECK: %[[RESHAPED_VEC:.+]] = vector.shape_cast %[[RESULT_VEC]] : vector<8xf32> to vector<4x2xf32>
				// CHECK: %[[TRANSPOSED_VEC:.+]] = vector.transpose %[[RESHAPED_VEC]], [1, 0] : vector<4x2xf32> to vector<2x4xf32>
				// CHECK: return %[[TRANSPOSED_VEC]]
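
The trailing `vector.transpose ... [1, 0]` expected by this test corresponds to the extra transpose that Diff 354411 inserts when the parallel dims come out of the first transpose in the wrong order. The following is a sketch of how that second permutation can be derived, mirroring the `std::iota` and `std::sort` calls in the diff; the function name `restoreParallelOrder` is hypothetical, and `std::vector` is used in place of `SmallVector` so the example is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: given the permutation `indices` used for the first
// transpose (parallel dims first, reduction dims last) and the number of
// reduction dims, returns the permutation that puts the remaining parallel
// dims back into their original relative order.
std::vector<int64_t> restoreParallelOrder(const std::vector<int64_t> &indices,
                                          int64_t reductionSize) {
  assert(reductionSize >= 0 &&
         static_cast<std::size_t>(reductionSize) <= indices.size());
  // One entry per parallel dim that survives the reduction.
  std::vector<int64_t> newIndices(indices.size() - reductionSize);
  std::iota(newIndices.begin(), newIndices.end(), 0);
  // Sort result positions by the original dim each one currently holds.
  std::sort(newIndices.begin(), newIndices.end(),
            [&](int64_t i, int64_t j) { return indices[i] < indices[j]; });
  return newIndices;
}
```

For the `vector<3x2x4xf32>` case above, the first transpose uses `[2, 1, 0]` with one reduction dim, so the parallel part is `[2, 1]` and the sketch returns `[1, 0]`, the permutation the test expects for the final transpose from `vector<4x2xf32>` to `vector<2x4xf32>`.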

This is an archive of the discontinued LLVM Phabricator instance.

Order parallel indices before transposing the input in multireductions
ClosedPublic
