Details

Reviewers

nicolasvasilache
aartbik

Commits

rG5d45f758f0fb: [mlir][vector] Improve vector distribute integration test and fix block…

Summary

Fix the semantics of the distribute integration test based on offline feedback. This exposed a bug in block distribution: the id must be multiplied by the stride of the vector. Fix the transformation and the unit test.
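As a rough illustration of the block distribution fix described above, here is a minimal C++ sketch, assuming each id owns one contiguous chunk of the vector (as in the test's 64-element vector split across 32 ids). The helpers `chunkOffset`, `extractMap` and `insertMap` are hypothetical names for this sketch only; they are not MLIR APIs, and the real ops work on SSA vector values rather than `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the first element owned by `id` when a
// vector of `size` elements is split into `multiplicity` contiguous chunks.
// Assumes `size` is a multiple of `multiplicity`.
std::size_t chunkOffset(std::size_t size, std::size_t multiplicity,
                        std::size_t id) {
  assert(multiplicity > 0 && size % multiplicity == 0 && id < multiplicity);
  std::size_t stride = size / multiplicity;  // elements per chunk
  return id * stride;                        // the id is scaled by the stride
}

// Models vector.extract_map in this sketch: copy out the chunk owned by `id`.
std::vector<float> extractMap(const std::vector<float> &src,
                              std::size_t multiplicity, std::size_t id) {
  std::size_t stride = src.size() / multiplicity;
  auto begin = src.begin() + static_cast<std::ptrdiff_t>(
                                 chunkOffset(src.size(), multiplicity, id));
  return std::vector<float>(begin, begin + static_cast<std::ptrdiff_t>(stride));
}

// Models vector.insert_map in this sketch: write `chunk` back into the
// position owned by `id`.
void insertMap(const std::vector<float> &chunk, std::vector<float> &dst,
               std::size_t multiplicity, std::size_t id) {
  assert(chunk.size() == dst.size() / multiplicity);
  std::size_t begin = chunkOffset(dst.size(), multiplicity, id);
  for (std::size_t i = 0; i < chunk.size(); ++i)
    dst[begin + i] = chunk[i];
}
```

Under these assumptions, using the raw id as the offset would make neighbouring ids overlap whenever a chunk holds more than one element, which is the kind of bug the summary refers to.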

Diff Detail

Event Timeline

ThomasRaoux created this revision. Oct 12 2020, 8:59 PM

Herald added a reviewer: aartbik. Oct 12 2020, 8:59 PM

Herald added a project: Restricted Project.

Herald added subscribers: rdzhabarov, tatianashp, msifontes and 13 others.

ThomasRaoux requested review of this revision. Oct 12 2020, 8:59 PM

Herald added a subscriber: stephenneuendorffer. Oct 12 2020, 8:59 PM

ThomasRaoux updated this revision to Diff 297765. Oct 12 2020, 10:10 PM

ThomasRaoux retitled this revision from [mlir][vector] Improve vector distribute integration test to [mlir][vector] Improve vector distribute integration test and fix block distribution.

ThomasRaoux edited the summary of this revision.

aartbik added inline comments. Oct 13 2020, 3:16 PM

mlir/integration_test/Dialect/Vector/CPU/test-vector-distribute.mlir
22–24	Not in this CL, but probably during a later cleanup, I would rename the "distribution" part to something better. Loop distribution is typically reserved for the transformation `for { s1; s2 }` -> `for { s1 } for { s2 }`; what is done here is more like strip-mining, blocking, tiling, or chunking in 1-D, or something named like that.
35–40	c32 because 2x32=64, right? I am not super convinced that this intermediate step is much easier to understand than generating the chunked loop right away, but I hope you convince me in the Discourse discussion.
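For readers less familiar with the terminology in the comment on lines 22–24, the following C++ sketch contrasts loop distribution with strip-mining on a plain scalar loop. It is only an illustration: the function names are hypothetical, the statements `s1` and `s2` are arbitrary, and none of this is code from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Original loop: two statements, s1 and s2, share one loop body.
// Assumes a.size() == b.size().
void fused(std::vector<int> &a, std::vector<int> &b) {
  for (std::size_t i = 0; i < a.size(); ++i) {
    a[i] += 1;  // s1
    b[i] *= 2;  // s2
  }
}

// Loop distribution (fission): each statement gets its own loop.
void distributed(std::vector<int> &a, std::vector<int> &b) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] += 1;  // s1
  for (std::size_t i = 0; i < b.size(); ++i)
    b[i] *= 2;  // s2
}

// Strip-mining: the body is unchanged, but the iteration space is split
// into chunks of `chunk` iterations.
void stripMined(std::vector<int> &a, std::vector<int> &b, std::size_t chunk) {
  assert(chunk > 0);
  for (std::size_t base = 0; base < a.size(); base += chunk) {
    std::size_t end = std::min(base + chunk, a.size());
    for (std::size_t i = base; i < end; ++i) {
      a[i] += 1;  // s1
      b[i] *= 2;  // s2
    }
  }
}
```

All three functions compute the same result here because `s1` and `s2` are independent; the difference is only in how the loop structure is rearranged, which is the naming point the comment raises.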

mehdi_amini added inline comments. Oct 14 2020, 10:20 AM

mlir/test/Dialect/Vector/vector-distribution.mlir
3 ↗	(On Diff #297765)	Why remove the "-LABEL" here?

Changes to preserve semantics in the integration test, based on the Discourse discussion:
https://llvm.discourse.group/t/vector-vector-distribution-large-vector-to-small-vector/1983/12

I changed the integration test to match what I described here: https://llvm.discourse.group/t/vector-vector-distribution-large-vector-to-small-vector/1983/12

If it makes sense to all of you I can do the rest of the changes.

mlir/integration_test/Dialect/Vector/CPU/test-vector-distribute.mlir
22–24	Makes sense. I can rename it in a future patch once we get more agreement on the design.
35–40	Correct, right now the extract_map expects contiguous IDs (%arg5 : 32 goes from 0 to 31). About iterative vs. all-at-once transformation, let's keep talking on Discourse :)
mlir/test/Dialect/Vector/vector-distribution.mlir
3 ↗	(On Diff #297765)	That was done by mistake. I do need to change the CHECK-LABEL in the 3rd test otherwise the [[MAP]] variable gets reset after the CHECK-LABEL.

Let's also update the ops' documentation to remove this section:

For instance, the following code:

   %a = vector.transfer_read %A[%c0]: memref<32xf32>, vector<32xf32>
   %b = vector.transfer_read %B[%c0]: memref<32xf32>, vector<32xf32>
   %c = addf %a, %b: vector<32xf32>
   vector.transfer_write %c, %C[%c0]: memref<32xf32>, vector<32xf32>

   can be rewritten to:
   %a = vector.transfer_read %A[%c0]: memref<32xf32>, vector<32xf32>
   %b = vector.transfer_read %B[%c0]: memref<32xf32>, vector<32xf32>
   %ea = vector.extract_map %a[%id : 32] : vector<32xf32> to vector<1xf32>
   %eb = vector.extract_map %b[%id : 32] : vector<32xf32> to vector<1xf32>
   %ec = addf %ea, %eb : vector<1xf32>
   %c = vector.insert_map %ec, %id, 32 : vector<1xf32> to vector<32xf32>
   vector.transfer_write %c, %C[%c0]: memref<32xf32>, vector<32xf32>

   Where %id can be an induction variable or an SPMD id going from 0 to 31.

   And then be rewritten to:
   %a = vector.transfer_read %A[%id]: memref<32xf32>, vector<1xf32>
   %b = vector.transfer_read %B[%id]: memref<32xf32>, vector<1xf32>
   %c = addf %a, %b: vector<1xf32>
   vector.transfer_write %c, %C[%id]: memref<32xf32>, vector<1xf32>

and instead focus on the actual op semantics, with an example e.g.

Examples:

%idx0 = ... : index // dynamic computation producing the value 0 of index type
%idx1 = ... : index // dynamic computation producing the value 1 of index type
%0 = constant dense<[0, 1, 2, 3]> : vector<4xi32>
%1 = vector.extract_map %0[%idx0 : 2] : vector<4xi32> to vector<2xi32> // extracts values [0, 1]
%2 = vector.extract_map %0[%idx1 : 2] : vector<4xi32> to vector<2xi32> // extracts values [1, 2]
... (same for insert variant)

As the op semantics evolve, the examples will too.

Thanks @ThomasRaoux !

mlir/integration_test/Dialect/Vector/CPU/test-vector-distribute.mlir

35–40

As discussed on Discourse, this is transient state internal to the test pass and should not be exposed.

Let's please have the test do 2 things.

Input IR:

%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32>
vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>

Output IR:

scf.for %arg5 = %c0 to %c256 step %c8 {
  %a = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>
  %b = vector.transfer_read %in2[%arg5], %cf0: memref<?xf32>, vector<8xf32>
  %acc = addf %a, %b: vector<8xf32>
  vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
}

The test should also run with and without the application of the test pass and produce the same result.
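The requirement that the test gives the same result with and without the pass can be sketched in plain C++ as follows. This is a model of the idea only, under the assumption that the vector length is a multiple of the chunk size; `filledInc`, `addWhole` and `addChunked` are hypothetical names, with `filledInc` loosely mirroring `alloc_1d_filled_inc_f32` from the test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loosely mirrors alloc_1d_filled_inc_f32: element i holds i + base.
std::vector<float> filledInc(std::size_t n, float base) {
  std::vector<float> v(n);
  for (std::size_t i = 0; i < n; ++i)
    v[i] = static_cast<float>(i) + base;
  return v;
}

// Models the "Input IR": a single add over the whole vector.
std::vector<float> addWhole(const std::vector<float> &a,
                            const std::vector<float> &b) {
  assert(a.size() == b.size());
  std::vector<float> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = a[i] + b[i];
  return out;
}

// Models the "Output IR": a loop stepping by `chunk`, where each iteration
// reads, adds and writes `chunk` elements starting at the induction variable.
std::vector<float> addChunked(const std::vector<float> &a,
                              const std::vector<float> &b, std::size_t chunk) {
  assert(chunk > 0 && a.size() == b.size() && a.size() % chunk == 0);
  std::vector<float> out(a.size());
  for (std::size_t iv = 0; iv < a.size(); iv += chunk)
    for (std::size_t lane = 0; lane < chunk; ++lane)
      out[iv + lane] = a[iv + lane] + b[iv + lane];
  return out;
}
```

With 64 elements and bases 1.0 and 2.0, both versions produce 3, 5, ..., 129 (element i is 2i + 3), which matches the CHECK lines in the integration test.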

Update the integration test to start from a large vector without a loop.

I re-based the patch and changed the test to avoid starting from an intermediate state. Please take another look.

mlir/integration_test/Dialect/Vector/CPU/test-vector-distribute.mlir
35–40	Done. Starting from just the vector add and running with and without the transformation pass.

nicolasvasilache added inline comments. Oct 29 2020, 12:35 PM

mlir/integration_test/Dialect/Vector/CPU/test-vector-distribute.mlir
1–2	Could we add an extra RUN command that just does `mlir-opt %s -test-vector-to-forloop` and checks for the presence of the for loop and vectors?
6	Blank line to delimit commands?

nicolasvasilache accepted this revision. Oct 29 2020, 12:36 PM

This revision is now accepted and ready to land. Oct 29 2020, 12:36 PM

Add extra check to the integration test to make sure the transformation happened.

ThomasRaoux updated this revision to Diff 301758. Oct 29 2020, 2:21 PM

ThomasRaoux marked 2 inline comments as done.

Closed by commit rG5d45f758f0fb: [mlir][vector] Improve vector distribute integration test and fix block… (authored by ThomasRaoux). Oct 29 2020, 2:55 PM

This revision was automatically updated to reflect the committed changes.

ThomasRaoux added a commit: rG5d45f758f0fb: [mlir][vector] Improve vector distribute integration test and fix block….

Diff 297753

mlir/integration_test/Dialect/Vector/CPU/test-vector-distribute.mlir

	// RUN: mlir-opt %s -test-vector-distribute-patterns=distribution-multiplicity=32 \			// RUN: mlir-opt %s -test-vector-distribute-patterns -convert-vector-to-scf \
	// RUN: -convert-vector-to-scf -lower-affine -convert-scf-to-std -convert-vector-to-llvm \| \			// RUN: -lower-affine -convert-scf-to-std -convert-vector-to-llvm \| \
	// RUN: mlir-cpu-runner -e main -entry-point-result=void \			// RUN: mlir-cpu-runner -e main -entry-point-result=void \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \| \			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \| \
	// RUN: FileCheck %s			// RUN: FileCheck %s

	func @print_memref_f32(memref<*xf32>)			func @print_memref_f32(memref<*xf32>)

	func @alloc_1d_filled_inc_f32(%arg0: index, %arg1: f32) -> memref<?xf32> {			func @alloc_1d_filled_inc_f32(%arg0: index, %arg1: f32) -> memref<?xf32> {
	%c0 = constant 0 : index			%c0 = constant 0 : index
	%c1 = constant 1 : index			%c1 = constant 1 : index
	%0 = alloc(%arg0) : memref<?xf32>			%0 = alloc(%arg0) : memref<?xf32>
	scf.for %arg2 = %c0 to %arg0 step %c1 {			scf.for %arg2 = %c0 to %arg0 step %c1 {
	%tmp = index_cast %arg2 : index to i32			%tmp = index_cast %arg2 : index to i32
	%tmp1 = sitofp %tmp : i32 to f32			%tmp1 = sitofp %tmp : i32 to f32
	%tmp2 = addf %tmp1, %arg1 : f32			%tmp2 = addf %tmp1, %arg1 : f32
	store %tmp2, %0[%arg2] : memref<?xf32>			store %tmp2, %0[%arg2] : memref<?xf32>
	}			}
	return %0 : memref<?xf32>			return %0 : memref<?xf32>
	}			}

	func @vector_add_cycle(%id : index, %A: memref<?xf32>, %B: memref<?xf32>, %C: memref<?xf32>) {			// Loop over a vector add being distributed into a loop of vec2 and make sure
	%c0 = constant 0 : index			// distribution is being propagated.
	%cf0 = constant 0.0 : f32
	%a = vector.transfer_read %A[%c0], %cf0: memref<?xf32>, vector<64xf32>
	%b = vector.transfer_read %B[%c0], %cf0: memref<?xf32>, vector<64xf32>
	%acc = addf %a, %b: vector<64xf32>
	vector.transfer_write %acc, %C[%c0]: vector<64xf32>, memref<?xf32>
	return
	}

	// Loop over a function containing a large add vector and distribute it so that
	// each iteration of the loop processes part of the vector operation.
	func @main() {			func @main() {
				%cf0 = constant 0.0 : f32
	%cf1 = constant 1.0 : f32			%cf1 = constant 1.0 : f32
	%cf2 = constant 2.0 : f32			%cf2 = constant 2.0 : f32
	%c0 = constant 0 : index			%c0 = constant 0 : index
	%c1 = constant 1 : index			%c1 = constant 1 : index
	%c64 = constant 64 : index			%c64 = constant 64 : index
	%out = alloc(%c64) : memref<?xf32>			%out = alloc(%c64) : memref<?xf32>
	%in1 = call @alloc_1d_filled_inc_f32(%c64, %cf1) : (index, f32) -> memref<?xf32>			%in1 = call @alloc_1d_filled_inc_f32(%c64, %cf1) : (index, f32) -> memref<?xf32>
	%in2 = call @alloc_1d_filled_inc_f32(%c64, %cf2) : (index, f32) -> memref<?xf32>			%in2 = call @alloc_1d_filled_inc_f32(%c64, %cf2) : (index, f32) -> memref<?xf32>
	scf.for %arg5 = %c0 to %c64 step %c1 {			scf.for %arg5 = %c0 to %c64 step %c1 {
	call @vector_add_cycle(%arg5, %in1, %in2, %out) : (index, memref<?xf32>, memref<?xf32>, memref<?xf32>) -> ()			%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<64xf32>
				%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<64xf32>
				%acc = addf %a, %b: vector<64xf32>
				%ext = vector.extract_map %acc[%arg5 : 32] : vector<64xf32> to vector<2xf32>
				%ins = vector.insert_map %ext, %arg5, 32 : vector<2xf32> to vector<64xf32>
				vector.transfer_write %ins, %out[%c0]: vector<64xf32>, memref<?xf32>
	}			}
	%converted = memref_cast %out : memref<?xf32> to memref<*xf32>			%converted = memref_cast %out : memref<?xf32> to memref<*xf32>
	call @print_memref_f32(%converted): (memref<*xf32>) -> ()			call @print_memref_f32(%converted): (memref<*xf32>) -> ()
	// CHECK: Unranked{{.*}}data =			// CHECK: Unranked{{.*}}data =
	// CHECK: [			// CHECK: [
	// CHECK-SAME: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27,			// CHECK-SAME: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27,
	// CHECK-SAME: 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51,			// CHECK-SAME: 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51,
	// CHECK-SAME: 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75,			// CHECK-SAME: 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75,
	// CHECK-SAME: 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99,			// CHECK-SAME: 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99,
	// CHECK-SAME: 101, 103, 105, 107, 109, 111, 113, 115, 117, 119,			// CHECK-SAME: 101, 103, 105, 107, 109, 111, 113, 115, 117, 119,
	// CHECK-SAME: 121, 123, 125, 127, 129]			// CHECK-SAME: 121, 123, 125, 127, 129]
	dealloc %out : memref<?xf32>			dealloc %out : memref<?xf32>
	dealloc %in1 : memref<?xf32>			dealloc %in1 : memref<?xf32>
	dealloc %in2 : memref<?xf32>			dealloc %in2 : memref<?xf32>
	return			return
	}			}

mlir/test/lib/Transforms/TestVectorTransforms.cpp

struct TestVectorDistributePatterns
TestVectorDistributePatterns(const TestVectorDistributePatterns &pass) {}		TestVectorDistributePatterns(const TestVectorDistributePatterns &pass) {}
void getDependentDialects(DialectRegistry &registry) const override {		void getDependentDialects(DialectRegistry &registry) const override {
registry.insert<VectorDialect>();		registry.insert<VectorDialect>();
registry.insert<AffineDialect>();		registry.insert<AffineDialect>();
}		}
Option<int32_t> multiplicity{		Option<int32_t> multiplicity{
*this, "distribution-multiplicity",		*this, "distribution-multiplicity",
llvm::cl::desc("Set the multiplicity used for distributing vector"),		llvm::cl::desc("Set the multiplicity used for distributing vector"),
llvm::cl::init(32)};		llvm::cl::init(1)};
void runOnFunction() override {		void runOnFunction() override {
MLIRContext *ctx = &getContext();		MLIRContext *ctx = &getContext();
OwningRewritePatternList patterns;		OwningRewritePatternList patterns;
FuncOp func = getFunction();		FuncOp func = getFunction();
		if (multiplicity > 1) {
func.walk([&](AddFOp op) {		func.walk([&](AddFOp op) {
OpBuilder builder(op);		OpBuilder builder(op);
Optional<mlir::vector::DistributeOps> ops = distributPointwiseVectorOp(		Optional<mlir::vector::DistributeOps> ops = distributPointwiseVectorOp(
builder, op.getOperation(), func.getArgument(0), multiplicity);		builder, op.getOperation(), func.getArgument(0), multiplicity);
if (ops.hasValue()) {		if (ops.hasValue()) {
SmallPtrSet<Operation *, 1> extractOp({ops->extract});		SmallPtrSet<Operation *, 1> extractOp({ops->extract});
op.getResult().replaceAllUsesExcept(ops->insert.getResult(), extractOp);		op.getResult().replaceAllUsesExcept(ops->insert.getResult(),
		extractOp);
}		}
});		});
		}
patterns.insert<PointwiseExtractPattern>(ctx);		patterns.insert<PointwiseExtractPattern>(ctx);
populateVectorToVectorTransformationPatterns(patterns, ctx);		populateVectorToVectorTransformationPatterns(patterns, ctx);
applyPatternsAndFoldGreedily(getFunction(), patterns);		applyPatternsAndFoldGreedily(getFunction(), patterns);
}		}
};		};

struct TestVectorTransferFullPartialSplitPatterns		struct TestVectorTransferFullPartialSplitPatterns
: public PassWrapper<TestVectorTransferFullPartialSplitPatterns,		: public PassWrapper<TestVectorTransferFullPartialSplitPatterns,

This is an archive of the discontinued LLVM Phabricator instance.

[mlir][vector] Improve vector distribute integration test and fix block distribution
ClosedPublic
