This is an archive of the discontinued LLVM Phabricator instance.


Differential D106473

[NFC][MLIR] Split large fusion test file into two test files
Closed · Public

Authored by sumesh13 on Jul 21 2021, 12:07 PM.


Details

Reviewers

bondhugula

Commits

rG24b0df868604: [NFC][MLIR] Split large fusion test file into 4 test files

Summary

mlir/test/Transforms/loop-fusion.mlir is too big and is split into mlir/test/Transforms/loop-fusion.mlir, mlir/test/Transforms/loop-fusion-2.mlir, mlir/test/Transforms/loop-fusion-3.mlir, and mlir/test/Transforms/loop-fusion-4.mlir. Further tests can be added in mlir/test/Transforms/loop-fusion-4.mlir.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sumesh13 created this revision. · Jul 21 2021, 12:07 PM

Herald added subscribers: dcaballe, cota, teijeong and 16 others. · Jul 21 2021, 12:07 PM

sumesh13 requested review of this revision. · Jul 21 2021, 12:07 PM

Herald added a project: Restricted Project. · Jul 21 2021, 12:07 PM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B115382: Diff 360546. · Jul 21 2021, 2:19 PM

Thanks for splitting these test cases. A couple of minor comments.

mlir/test/Transforms/loop-fusion-2.mlir
4	Nit: space after `//`
4	Instead of this, you could just have a comment in loop-fusion.mlir, `// Part I of loop fusion tests. Part II in ...`, and then a `// Part II of loop fusion tests` for this file.
mlir/test/Transforms/loop-fusion.mlir
3151	There shouldn't be a `-----` separator above.

mlir/test/transforms/loop-fusion.mlir is too big and is split into mlir/test/transforms/loop-fusion.mlir and mlir/test/transforms/loop-fusion-2.mlir

On what basis? Don't you think having _two similar names_ will confuse people who don't know the context of the test split?

mehdi_amini added inline comments. · Jul 21 2021, 7:58 PM

mlir/test/Transforms/loop-fusion-2.mlir
2	Since we're here, can someone look into removing the `-allow-unregistered-dialect` flag?

bondhugula added inline comments. · Jul 21 2021, 8:10 PM

mlir/test/Transforms/loop-fusion-2.mlir
2	I can do it - we should do this in another commit. Just wondering what are all the strong motivations to free these passes from `allow-unregistered-dialect`?

In D106473#2895372, @xgupta wrote:

mlir/test/transforms/loop-fusion.mlir is too big and is split into mlir/test/transforms/loop-fusion.mlir and mlir/test/transforms/loop-fusion-2.mlir

On what basis? Don't you think having _two similar names_ will confuse people who don't know the context of the test split?

This test case file has been slowing down check-mlir for everyone. I won't be surprised if this is the one that runs the longest across all test cases in the codebase. It has to be split. I'm open to a better/more logical way of partitioning (e.g., sibling fusion vs. producer-consumer fusion), but the two might be too intertwined to separate in a balanced way. It is common for large files to be split this way, both sources and test cases. For example, in the TF/MLIR repo, tf_ops.cc became too large and they went with something like tf_ops_a_m.cc and tf_ops_n_z.cc to cut down the 60+s compile time.

In D106473#2895386, @bondhugula wrote:

In D106473#2895372, @xgupta wrote:

mlir/test/transforms/loop-fusion.mlir is too big and is split into mlir/test/transforms/loop-fusion.mlir and mlir/test/transforms/loop-fusion-2.mlir

On what basis? Don't you think having _two similar names_ will confuse people who don't know the context of the test split?

This test case file has been slowing down check-mlir for everyone. I won't be surprised if this is the one that runs the longest across all test cases in the codebase. It has to be split. I'm open to a better/more logical way of partitioning (e.g., sibling fusion vs. producer-consumer fusion), but the two might be too intertwined to separate in a balanced way. It is common for large files to be split this way, both sources and test cases. For example, in the TF/MLIR repo, tf_ops.cc became too large and they went with something like tf_ops_a_m.cc and tf_ops_n_z.cc to cut down the 60+s compile time.

The main problem for me is that this split feels rather arbitrary. The original file still has >3000 lines. Has any benchmarking gone into determining the split point/number of partitions/etc?

In D106473#2895388, @rriddle wrote:

In D106473#2895386, @bondhugula wrote:

In D106473#2895372, @xgupta wrote:

mlir/test/transforms/loop-fusion.mlir is too big and is split into mlir/test/transforms/loop-fusion.mlir and mlir/test/transforms/loop-fusion-2.mlir

On what basis? Don't you think having _two similar names_ will confuse people who don't know the context of the test split?

This test case file has been slowing down check-mlir for everyone. I won't be surprised if this is the one that runs the longest across all test cases in the codebase. It has to be split. I'm open to a better/more logical way of partitioning (e.g., sibling fusion vs. producer-consumer fusion), but the two might be too intertwined to separate in a balanced way. It is common for large files to be split this way, both sources and test cases. For example, in the TF/MLIR repo, tf_ops.cc became too large and they went with something like tf_ops_a_m.cc and tf_ops_n_z.cc to cut down the 60+s compile time.

The main problem for me is that this split feels rather arbitrary. The original file still has >3000 lines. Has any benchmarking gone into determining the split point/number of partitions/etc?

This came as feedback in a previous review. Basically, all tests added since that review have been moved to this new file, which also serves as a placeholder for future tests. It would be ideal to have some logical division such as sibling fusion; however, there seem to be very few tests that are purely sibling fusion.

mehdi_amini added inline comments. · Jul 21 2021, 9:00 PM

mlir/test/Transforms/loop-fusion-2.mlir
2	This flag makes the testing looser: the passes may be buggy (not declaring the dependent dialect) but we won't catch it during testing.

This test case file has been slowing down check-mlir for everyone.

Parallelism might help https://reviews.llvm.org/D106521 ([Docs] Mention how to run lit tests in parallel)?

I won't be surprised if this is the one that runs the longest across all test cases in the codebase.

This file is just 3398 lines long, but see https://reviews.llvm.org/D103873: there are many test cases longer than 14000 lines, for example.

In D106473#2895498, @xgupta wrote:

This test case file has been slowing down check-mlir for everyone.

Parallelism might help https://reviews.llvm.org/D106521 ([Docs] Mention how to run lit tests in parallel)?

Parallelism has no effect when a single test is slow. On the contrary: you are providing an argument in favor of splitting the test: after splitting it will be able to run in parallel threads!

In D106473#2895519, @mehdi_amini wrote:

In D106473#2895498, @xgupta wrote:

This test case file has been slowing down check-mlir for everyone.

Parallelism might help https://reviews.llvm.org/D106521 ([Docs] Mention how to run lit tests in parallel)?

Parallelism has no effect when a single test is slow. On the contrary: you are providing an argument in favor of splitting the test: after splitting it will be able to run in parallel threads!

Oh, nice. But I didn't object to splitting this test case; my question was on what basis the split happens. I wish we could split it into 30 files!

In D106473#2895498, @xgupta wrote:

This test case file has been slowing down check-mlir for everyone.

Parallelism might help https://reviews.llvm.org/D106521 ([Docs] Mention how to run lit tests in parallel)?

I won't be surprised if this is the one that runs the longest across all test cases in the codebase.

This file is just 3398 lines long, but see https://reviews.llvm.org/D103873: there are many test cases longer than 14000 lines, for example.

Just the number of lines would barely mean anything! Affine fusion that happens here uses a lot of analysis utilities (dependences, slices) and does a lot more in terms of checks and iterating over the graph; so it takes a noticeable time to run on these 3000 lines. OTOH, there could be 20000+ lines of MLIR that I've seen being canonicalized in a hundred milliseconds - so it's almost immaterial how long the test cases are.

@sumesh13 Could you please time just this test case before the split and the two parts separately after the split? I can do this as well on a 32-core system I have here. A 4/8-core one and a larger one would be useful data points.
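A minimal sketch of how such per-file timings could be collected in a repeatable way, assuming Python 3 is available. `time_command` is a hypothetical helper written for illustration, not part of lit; the demo times a trivial Python command so the snippet runs anywhere, and the lit invocation shown in the comment is the one used elsewhere in this thread.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run `argv` `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken test is not timed silently.
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Demo with a command that exists wherever Python does. For the real measurement,
# replace the argument list with something like (paths are assumptions):
#   ["bin/llvm-lit", "-sv", "../mlir/test/Transforms/loop-fusion.mlir"]
elapsed = time_command([sys.executable, "-c", "pass"], repeats=2)
print(f"best of 2 runs: {elapsed:.3f}s")
```

Taking the best of several runs reduces noise from a cold file cache; it measures wall-clock time only, which is what matters for the critical path discussed here.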

Just the number of lines would barely mean anything! Affine fusion that happens here uses a lot of analysis utilities (dependences, slices) and does a lot more in terms of checks and iterating over the graph; so it takes a noticeable time to run on these 3000 lines. OTOH, there could be 20000+ lines of MLIR that I've seen being canonicalized in a hundred milliseconds - so it's almost immaterial how long the test cases are.

Oh, thanks for explaining the details; I didn't know about that. My point of view was that it would be better understood if the file had a name that conveys why it is named that way. Otherwise it is not a big issue; the patch author can proceed without considering it 👍

Here are the timings using `time` on a 16-core machine:

loop-fusion.mlir
real 0m2.324s
user 0m2.335s
sys 0m0.154s

loop-fusion-2.mlir
real 0m0.417s
user 0m0.377s
sys 0m0.052s

Is there a benchmark threshold to target? If there is, I can split to achieve that (leaving some room for the last test file to grow).

In D106473#2897540, @sumesh13 wrote:

Here are the timings using `time` on a 16-core machine:

loop-fusion.mlir
real 0m2.324s
user 0m2.335s
sys 0m0.154s

loop-fusion-2.mlir
real 0m0.417s
user 0m0.377s
sys 0m0.052s

Is there a benchmark threshold to target? If there is, I can split to achieve that (leaving some room for the last test file to grow).

Is the timing for loop-fusion.mlir above for before the split or after? We really want to see how much it was before the split as well to see how much this helped. On a really fast 32-core (64 with SMT) system, here's what I see:

BEFORE

time bin/llvm-lit -sv ../mlir/test/Transforms/loop-fusion.mlir

Testing Time: 1.01s

Passed: 1

real 0m1.081s
user 0m0.627s
sys 0m1.131s

AFTER

time bin/llvm-lit -sv ../mlir/test/Transforms/loop-fusion.mlir

Testing Time: 1.01s

Passed: 1

real 0m1.081s
user 0m0.591s
sys 0m1.143s

time bin/llvm-lit -sv ../mlir/test/Transforms/loop-fusion-2.mlir

Testing Time: 0.21s

Passed: 1

real 0m0.281s
user 0m0.198s
sys 0m0.128s

So, there's really no impact on the actual bottleneck on a "high"-way multicore; the critical path is still inside loop-fusion.mlir. The split might definitely help when running on fewer cores but I don't think there is a way to make lit run on a specified fewer number of threads to immediately test that.
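The critical-path argument can be illustrated with a small simulation. This is a sketch under simplifying assumptions: it models only the two files measured above, hands each test to the first free worker, and ignores lit's own startup overhead and the rest of check-mlir. `makespan` is a hypothetical helper, and the "even split" numbers are invented for comparison, not measurements. Recent lit versions should also accept `-j N` to cap the number of workers, which would be one way to check the few-core case on a real machine.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when tests are handed, in order, to the first free worker."""
    free_at = [0.0] * workers  # min-heap of times at which each worker becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available worker
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


before = [1.081]                 # loop-fusion.mlir before the split ("real" time above)
after = [1.081, 0.281]           # loop-fusion.mlir and loop-fusion-2.mlir after the split
even_split = [0.681, 0.681]      # hypothetical balanced split, NOT a measurement

for name, durations in [("before", before), ("after", after), ("even", even_split)]:
    for workers in (1, 32):
        print(f"{name:6s} workers={workers:2d}: {makespan(durations, workers):.3f}s")
```

With many workers the result is bounded below by the slowest single file, so only a split that shrinks the largest file shortens the critical path.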

In D106473#2897540, @sumesh13 wrote:

Here are the timings using `time` on a 16-core machine:

loop-fusion.mlir
real 0m2.324s
user 0m2.335s
sys 0m0.154s

loop-fusion-2.mlir
real 0m0.417s
user 0m0.377s
sys 0m0.052s

Is there a benchmark threshold to target? If there is, I can split to achieve that (leaving some room for the last test file to grow).

I'd consider 2s to be quite a lot on a 16-core system. We should get this to one second through the split if possible.
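One way such a split could be automated is sketched below, assuming the test file uses `// -----` separators (as `-split-input-file` expects) and keeps its `// RUN:` lines at the top. `split_test_file` is a hypothetical helper, not an existing LLVM tool. It balances by line count, which, as noted earlier in this thread, is only a rough proxy for run time; measured per-case times would be a better weight.

```python
def split_test_file(text, parts, separator="// -----"):
    """Split a -split-input-file test into up to `parts` chunks of similar line count.

    Every `// RUN:` line is copied into each chunk so each output file can be
    run on its own. Test cases are never cut in the middle.
    """
    run_lines = []
    cases = [[]]
    for line in text.splitlines():
        if line.startswith("// RUN:"):
            run_lines.append(line)
        elif line.strip() == separator:
            cases.append([])          # a separator starts a new test case
        else:
            cases[-1].append(line)

    total = sum(len(case) for case in cases)
    target = total / parts            # ideal number of lines per chunk
    chunks, current, size = [], [], 0
    for case in cases:
        current.append(case)
        size += len(case)
        # Close the chunk once it reaches the target; the last chunk takes the rest.
        if size >= target and len(chunks) < parts - 1:
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)

    sep = f"\n{separator}\n"
    header = "\n".join(run_lines) + "\n"
    return [header + sep.join("\n".join(case) for case in chunk) + "\n"
            for chunk in chunks]


# Tiny demo on a made-up input with four two-line test cases.
sample = "// RUN: tool %s\n" + "\n// -----\n".join(
    f"func @f{i}()\n// CHECK: f{i}" for i in range(4)) + "\n"
for index, part in enumerate(split_test_file(sample, parts=2), start=1):
    print(f"part {index}: {len(part.splitlines())} lines")
```

The output strings would still need to be written to files by hand, and any "Part I / Part II" comments added as suggested in the inline review.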

xgupta removed a subscriber: xgupta. · Aug 1 2021, 1:04 AM

Herald added a subscriber: Chia-hungDuan. · Aug 1 2021, 1:04 AM

A 4-way split that brings down the time per test from 5.5 seconds on a 16-core machine to about 1.5 seconds. The last file has some more room to grow.

sumesh13 marked 2 inline comments as done. · Aug 1 2021, 3:13 PM

sumesh13 edited the summary of this revision.

Harbormaster completed remote builds in B117367: Diff 363356. · Aug 1 2021, 3:44 PM

LGTM - thanks.

This revision is now accepted and ready to land. · Aug 3 2021, 3:28 AM

Herald added a subscriber: wrengr. · Aug 3 2021, 3:28 AM

Closed by commit rG24b0df868604: [NFC][MLIR] Split large fusion test file into 4 test files (authored by sumesh13). · Aug 3 2021, 10:13 AM

This revision was automatically updated to reflect the committed changes.

sumesh13 added a commit: rG24b0df868604: [NFC][MLIR] Split large fusion test file into 4 test files.

Revision Contents

Path                                      Size
mlir/test/Transforms/loop-fusion-2.mlir   389 lines
mlir/test/Transforms/loop-fusion.mlir     384 lines

Diff 360546

mlir/test/Transforms/loop-fusion-2.mlir

This file was added.

				// RUN: mlir-opt -allow-unregistered-dialect %s -affine-loop-fusion -split-input-file | FileCheck %s
				// RUN: mlir-opt -allow-unregistered-dialect %s -affine-loop-fusion="fusion-maximal" -split-input-file | FileCheck %s --check-prefix=MAXIMAL
				[Inline comment by mehdi_amini, Not Done] Since we're here, can someone look into removing the `-allow-unregistered-dialect` flag?
				[Inline comment by bondhugula, Not Done] I can do it - we should do this in another commit. Just wondering what are all the strong motivations to free these passes from `allow-unregistered-dialect`?
				[Inline comment by mehdi_amini, Not Done] This flag makes the testing looser: the passes may be buggy (not declaring the dependent dialect) but we won't catch it during testing.

				//Part 1 of fusion tests in mlir/test/Transforms/loop-fusion.mlir
				[Inline comment by bondhugula, Done] Nit: space after `//`
				[Inline comment by bondhugula, Done] Instead of this, you could just have a comment in loop-fusion.mlir, `// Part I of loop fusion tests. Part II in ...`, and then a `// Part II of loop fusion tests` for this file.
				// ----

				// MAXIMAL-LABEL: func @reduce_add_f32_f32(
				func @reduce_add_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
				%cst_0 = constant 0.000000e+00 : f32
				%cst_1 = constant 1.000000e+00 : f32
				%0 = memref.alloca() : memref<f32, 1>
				%1 = memref.alloca() : memref<f32, 1>
				affine.for %arg3 = 0 to 1 {
				affine.for %arg4 = 0 to 64 {
				%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
				%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
				%5 = addf %prevAccum, %4 : f32
				affine.yield %5 : f32
				}
				%accum_dbl = addf %accum, %accum : f32
				affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
				}
				}
				affine.for %arg3 = 0 to 1 {
				affine.for %arg4 = 0 to 64 {
				%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_1) -> f32 {
				%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
				%5 = mulf %prevAccum, %4 : f32
				affine.yield %5 : f32
				}
				%accum_sqr = mulf %accum, %accum : f32
				affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
				}
				}
				return
				}
				// The two loops here get maximally sibling-fused at the innermost
				// insertion point. Test checks if the innermost reduction loop of the fused loop
				// gets promoted into its outerloop.
				// MAXIMAL-SAME: %[[arg_0:.*]]: memref<64x64xf32, 1>,
				// MAXIMAL-SAME: %[[arg_1:.*]]: memref<1x64xf32, 1>,
				// MAXIMAL-SAME: %[[arg_2:.*]]: memref<1x64xf32, 1>) {
				// MAXIMAL: %[[cst:.*]] = constant 0 : index
				// MAXIMAL-NEXT: %[[cst_0:.*]] = constant 0.000000e+00 : f32
				// MAXIMAL-NEXT: %[[cst_1:.*]] = constant 1.000000e+00 : f32
				// MAXIMAL: affine.for %[[idx_0:.*]] = 0 to 1 {
				// MAXIMAL-NEXT: affine.for %[[idx_1:.*]] = 0 to 64 {
				// MAXIMAL-NEXT: %[[results:.*]]:2 = affine.for %[[idx_2:.*]] = 0 to 64 iter_args(%[[iter_0:.*]] = %[[cst_1]], %[[iter_1:.*]] = %[[cst_0]]) -> (f32, f32) {
				// MAXIMAL-NEXT: %[[val_0:.*]] = affine.load %[[arg_0]][%[[idx_2]], %[[idx_1]]] : memref<64x64xf32, 1>
				// MAXIMAL-NEXT: %[[reduc_0:.*]] = addf %[[iter_1]], %[[val_0]] : f32
				// MAXIMAL-NEXT: %[[val_1:.*]] = affine.load %[[arg_0]][%[[idx_2]], %[[idx_1]]] : memref<64x64xf32, 1>
				// MAXIMAL-NEXT: %[[reduc_1:.*]] = mulf %[[iter_0]], %[[val_1]] : f32
				// MAXIMAL-NEXT: affine.yield %[[reduc_1]], %[[reduc_0]] : f32, f32
				// MAXIMAL-NEXT: }
				// MAXIMAL-NEXT: %[[reduc_0_dbl:.*]] = addf %[[results:.*]]#1, %[[results]]#1 : f32
				// MAXIMAL-NEXT: affine.store %[[reduc_0_dbl]], %[[arg_1]][%[[cst]], %[[idx_1]]] : memref<1x64xf32, 1>
				// MAXIMAL-NEXT: %[[reduc_1_sqr:.*]] = mulf %[[results]]#0, %[[results]]#0 : f32
				// MAXIMAL-NEXT: affine.store %[[reduc_1_sqr]], %[[arg_2]][%[[idx_0]], %[[idx_1]]] : memref<1x64xf32, 1>
				// MAXIMAL-NEXT: }
				// MAXIMAL-NEXT: }
				// MAXIMAL-NEXT: return
				// MAXIMAL-NEXT: }

				// -----

				// CHECK-LABEL: func @reduce_add_non_innermost
				func @reduce_add_non_innermost(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
				%cst = constant 0.000000e+00 : f32
				%cst_0 = constant 1.000000e+00 : f32
				%0 = memref.alloca() : memref<f32, 1>
				%1 = memref.alloca() : memref<f32, 1>
				affine.for %arg3 = 0 to 1 {
				affine.for %arg4 = 0 to 64 {
				%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst) -> f32 {
				%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
				%5 = addf %prevAccum, %4 : f32
				affine.yield %5 : f32
				}
				%accum_dbl = addf %accum, %accum : f32
				affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
				}
				}
				affine.for %arg3 = 0 to 1 {
				affine.for %arg4 = 0 to 64 {
				%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
				%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
				%5 = mulf %prevAccum, %4 : f32
				affine.yield %5 : f32
				}
				%accum_sqr = mulf %accum, %accum : f32
				affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
				}
				}
				return
				}
				// Test checks the loop structure is preserved after sibling fusion.
				// CHECK: affine.for
				// CHECK-NEXT: affine.for
				// CHECK-NEXT: affine.for
				// CHECK affine.for

				// -----
				func @reduce_add_non_maximal_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1 : memref<1x64xf32, 1>, %arg2 : memref<1x64xf32, 1>) {
				%cst_0 = constant 0.000000e+00 : f32
				%cst_1 = constant 1.000000e+00 : f32
				affine.for %arg3 = 0 to 1 {
				affine.for %arg4 = 0 to 64 {
				%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
				%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
				%5 = addf %prevAccum, %4 : f32
				affine.yield %5 : f32
				}
				%accum_dbl = addf %accum, %accum : f32
				affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
				}
				}
				affine.for %arg3 = 0 to 1 {
				affine.for %arg4 = 0 to 64 {
				// Following loop trip count does not match the corresponding source trip count.
				%accum = affine.for %arg5 = 0 to 32 iter_args (%prevAccum = %cst_1) -> f32 {
				%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
				%5 = mulf %prevAccum, %4 : f32
				affine.yield %5 : f32
				}
				%accum_sqr = mulf %accum, %accum : f32
				affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
				}
				}
				return
				}
				// Test checks the loop structure is preserved after sibling fusion
				// since the destination loop and source loop trip counts do not
				// match.
				// MAXIMAL-LABEL: func @reduce_add_non_maximal_f32_f32(
				// MAXIMAL: %[[cst_0:.*]] = constant 0.000000e+00 : f32
				// MAXIMAL-NEXT: %[[cst_1:.*]] = constant 1.000000e+00 : f32
				// MAXIMAL-NEXT: affine.for %[[idx_0:.*]]= 0 to 1 {
				// MAXIMAL-NEXT: affine.for %[[idx_1:.*]] = 0 to 64 {
				// MAXIMAL-NEXT: %[[result_1:.*]] = affine.for %[[idx_2:.*]] = 0 to 32 iter_args(%[[iter_0:.*]] = %[[cst_1]]) -> (f32) {
				// MAXIMAL-NEXT: %[[result_0:.*]] = affine.for %[[idx_3:.*]] = 0 to 64 iter_args(%[[iter_1:.*]] = %[[cst_0]]) -> (f32) {

				// -----

				// CHECK-LABEL: func @fuse_large_number_of_loops
				func @fuse_large_number_of_loops(%arg0: memref<20x10xf32, 1>, %arg1: memref<20x10xf32, 1>, %arg2: memref<20x10xf32, 1>, %arg3: memref<20x10xf32, 1>, %arg4: memref<20x10xf32, 1>, %arg5: memref<f32, 1>, %arg6: memref<f32, 1>, %arg7: memref<f32, 1>, %arg8: memref<f32, 1>, %arg9: memref<20x10xf32, 1>, %arg10: memref<20x10xf32, 1>, %arg11: memref<20x10xf32, 1>, %arg12: memref<20x10xf32, 1>) {
				%cst = constant 1.000000e+00 : f32
				%0 = memref.alloc() : memref<f32, 1>
				affine.store %cst, %0[] : memref<f32, 1>
				%1 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg6[] : memref<f32, 1>
				affine.store %21, %1[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%2 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %1[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %arg3[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = mulf %22, %21 : f32
				affine.store %23, %2[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%3 = memref.alloc() : memref<f32, 1>
				%4 = affine.load %arg6[] : memref<f32, 1>
				%5 = affine.load %0[] : memref<f32, 1>
				%6 = subf %5, %4 : f32
				affine.store %6, %3[] : memref<f32, 1>
				%7 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %3[] : memref<f32, 1>
				affine.store %21, %7[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%8 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg1[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %7[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = mulf %22, %21 : f32
				affine.store %23, %8[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%9 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg1[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %8[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = mulf %22, %21 : f32
				affine.store %23, %9[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %9[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %2[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = addf %22, %21 : f32
				affine.store %23, %arg11[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%10 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %1[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %arg2[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = mulf %22, %21 : f32
				affine.store %23, %10[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %8[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %10[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = addf %22, %21 : f32
				affine.store %23, %arg10[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%11 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg10[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %arg10[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = mulf %22, %21 : f32
				affine.store %23, %11[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%12 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %11[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %arg11[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = subf %22, %21 : f32
				affine.store %23, %12[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%13 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg7[] : memref<f32, 1>
				affine.store %21, %13[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%14 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg4[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %13[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = mulf %22, %21 : f32
				affine.store %23, %14[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%15 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg8[] : memref<f32, 1>
				affine.store %21, %15[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%16 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %15[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %12[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = addf %22, %21 : f32
				affine.store %23, %16[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%17 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %16[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = math.sqrt %21 : f32
				affine.store %22, %17[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%18 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg5[] : memref<f32, 1>
				affine.store %21, %18[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%19 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg1[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %18[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = mulf %22, %21 : f32
				affine.store %23, %19[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				%20 = memref.alloc() : memref<20x10xf32, 1>
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %17[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %19[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = divf %22, %21 : f32
				affine.store %23, %20[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %20[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %14[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = addf %22, %21 : f32
				affine.store %23, %arg12[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				affine.for %arg13 = 0 to 20 {
				affine.for %arg14 = 0 to 10 {
				%21 = affine.load %arg12[%arg13, %arg14] : memref<20x10xf32, 1>
				%22 = affine.load %arg0[%arg13, %arg14] : memref<20x10xf32, 1>
				%23 = subf %22, %21 : f32
				affine.store %23, %arg9[%arg13, %arg14] : memref<20x10xf32, 1>
				}
				}
				return
				}
				// CHECK: affine.for
				// CHECK: affine.for
				// CHECK-NOT: affine.for

				// -----

				// Expects fusion of producer into consumer at depth 4 and subsequent removal of
				// source loop.
				// CHECK-LABEL: func @unflatten4d
				func @unflatten4d(%arg1: memref<7x8x9x10xf32>) {
				%m = memref.alloc() : memref<5040xf32>
				%cf7 = constant 7.0 : f32

				affine.for %i0 = 0 to 7 {
				affine.for %i1 = 0 to 8 {
				affine.for %i2 = 0 to 9 {
				affine.for %i3 = 0 to 10 {
				affine.store %cf7, %m[720 * %i0 + 90 * %i1 + 10 * %i2 + %i3] : memref<5040xf32>
				}
				}
				}
				}
				affine.for %i0 = 0 to 7 {
				affine.for %i1 = 0 to 8 {
				affine.for %i2 = 0 to 9 {
				affine.for %i3 = 0 to 10 {
				%v0 = affine.load %m[720 * %i0 + 90 * %i1 + 10 * %i2 + %i3] : memref<5040xf32>
				affine.store %v0, %arg1[%i0, %i1, %i2, %i3] : memref<7x8x9x10xf32>
				}
				}
				}
				}
				return
				}

				// CHECK: affine.for
				// CHECK-NEXT: affine.for
				// CHECK-NEXT: affine.for
				// CHECK-NEXT: affine.for
				// CHECK-NOT: affine.for
				// CHECK: return

				// -----

				// Expects fusion of producer into consumer at depth 2 and subsequent removal of
				// source loop.
				// CHECK-LABEL: func @unflatten2d_with_transpose
				func @unflatten2d_with_transpose(%arg1: memref<8x7xf32>) {
				%m = memref.alloc() : memref<56xf32>
				%cf7 = constant 7.0 : f32

				affine.for %i0 = 0 to 7 {
				affine.for %i1 = 0 to 8 {
				affine.store %cf7, %m[8 * %i0 + %i1] : memref<56xf32>
				}
				}
				affine.for %i0 = 0 to 8 {
				affine.for %i1 = 0 to 7 {
				%v0 = affine.load %m[%i0 + 8 * %i1] : memref<56xf32>
				affine.store %v0, %arg1[%i0, %i1] : memref<8x7xf32>
				}
				}
				return
				}

				// CHECK: affine.for
				// CHECK-NEXT: affine.for
				// CHECK-NOT: affine.for
				// CHECK: return

mlir/test/Transforms/loop-fusion.mlir

	[3,142 lines not shown]
	// CHECK: affine.for
	// CHECK-NEXT: affine.load
	// CHECK-NEXT: affine.store
	// CHECK: affine.for
	// CHECK-NEXT: affine.load
	// CHECK-NEXT: mulf
	// CHECK-NEXT: affine.store

	// -----
				[Inline comment by bondhugula, Not Done] There shouldn't be a `-----` separator above.
				// Add further tests in mlir/test/Transforms/loop-fusion-2.mlir

	// MAXIMAL-LABEL: func @reduce_add_f32_f32(
	func @reduce_add_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
	%cst_0 = constant 0.000000e+00 : f32
	%cst_1 = constant 1.000000e+00 : f32
	%0 = memref.alloca() : memref<f32, 1>
	%1 = memref.alloca() : memref<f32, 1>
	affine.for %arg3 = 0 to 1 {
	affine.for %arg4 = 0 to 64 {
	%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
	%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
	%5 = addf %prevAccum, %4 : f32
	affine.yield %5 : f32
	}
	%accum_dbl = addf %accum, %accum : f32
	affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
	}
	}
	affine.for %arg3 = 0 to 1 {
	affine.for %arg4 = 0 to 64 {
	%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_1) -> f32 {
	%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
	%5 = mulf %prevAccum, %4 : f32
	affine.yield %5 : f32
	}
	%accum_sqr = mulf %accum, %accum : f32
	affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
	}
	}
	return
	}
	// The two loops here get maximally sibling-fused at the innermost
	// insertion point. Test checks if the innermost reduction loop of the fused loop
	// gets promoted into its outerloop.
	// MAXIMAL-SAME: %[[arg_0:.*]]: memref<64x64xf32, 1>,
	// MAXIMAL-SAME: %[[arg_1:.*]]: memref<1x64xf32, 1>,
	// MAXIMAL-SAME: %[[arg_2:.*]]: memref<1x64xf32, 1>) {
	// MAXIMAL: %[[cst:.*]] = constant 0 : index
	// MAXIMAL-NEXT: %[[cst_0:.*]] = constant 0.000000e+00 : f32
	// MAXIMAL-NEXT: %[[cst_1:.*]] = constant 1.000000e+00 : f32
	// MAXIMAL: affine.for %[[idx_0:.*]] = 0 to 1 {
	// MAXIMAL-NEXT: affine.for %[[idx_1:.*]] = 0 to 64 {
	// MAXIMAL-NEXT: %[[results:.*]]:2 = affine.for %[[idx_2:.*]] = 0 to 64 iter_args(%[[iter_0:.*]] = %[[cst_1]], %[[iter_1:.*]] = %[[cst_0]]) -> (f32, f32) {
	// MAXIMAL-NEXT: %[[val_0:.*]] = affine.load %[[arg_0]][%[[idx_2]], %[[idx_1]]] : memref<64x64xf32, 1>
	// MAXIMAL-NEXT: %[[reduc_0:.*]] = addf %[[iter_1]], %[[val_0]] : f32
	// MAXIMAL-NEXT: %[[val_1:.*]] = affine.load %[[arg_0]][%[[idx_2]], %[[idx_1]]] : memref<64x64xf32, 1>
	// MAXIMAL-NEXT: %[[reduc_1:.*]] = mulf %[[iter_0]], %[[val_1]] : f32
	// MAXIMAL-NEXT: affine.yield %[[reduc_1]], %[[reduc_0]] : f32, f32
	// MAXIMAL-NEXT: }
	// MAXIMAL-NEXT: %[[reduc_0_dbl:.*]] = addf %[[results:.*]]#1, %[[results]]#1 : f32
	// MAXIMAL-NEXT: affine.store %[[reduc_0_dbl]], %[[arg_1]][%[[cst]], %[[idx_1]]] : memref<1x64xf32, 1>
	// MAXIMAL-NEXT: %[[reduc_1_sqr:.*]] = mulf %[[results]]#0, %[[results]]#0 : f32
	// MAXIMAL-NEXT: affine.store %[[reduc_1_sqr]], %[[arg_2]][%[[idx_0]], %[[idx_1]]] : memref<1x64xf32, 1>
	// MAXIMAL-NEXT: }
	// MAXIMAL-NEXT: }
	// MAXIMAL-NEXT: return
	// MAXIMAL-NEXT: }

	// -----

	// CHECK-LABEL: func @reduce_add_non_innermost
	func @reduce_add_non_innermost(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
	%cst = constant 0.000000e+00 : f32
	%cst_0 = constant 1.000000e+00 : f32
	%0 = memref.alloca() : memref<f32, 1>
	%1 = memref.alloca() : memref<f32, 1>
	affine.for %arg3 = 0 to 1 {
	affine.for %arg4 = 0 to 64 {
	%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst) -> f32 {
	%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
	%5 = addf %prevAccum, %4 : f32
	affine.yield %5 : f32
	}
	%accum_dbl = addf %accum, %accum : f32
	affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
	}
	}
	affine.for %arg3 = 0 to 1 {
	affine.for %arg4 = 0 to 64 {
	%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
	%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
	%5 = mulf %prevAccum, %4 : f32
	affine.yield %5 : f32
	}
	%accum_sqr = mulf %accum, %accum : f32
	affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
	}
	}
	return
	}
	// Test checks the loop structure is preserved after sibling fusion.
	// CHECK: affine.for
	// CHECK-NEXT: affine.for
	// CHECK-NEXT: affine.for
	// CHECK: affine.for

	// -----
	func @reduce_add_non_maximal_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1 : memref<1x64xf32, 1>, %arg2 : memref<1x64xf32, 1>) {
	%cst_0 = constant 0.000000e+00 : f32
	%cst_1 = constant 1.000000e+00 : f32
	affine.for %arg3 = 0 to 1 {
	affine.for %arg4 = 0 to 64 {
	%accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
	%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
	%5 = addf %prevAccum, %4 : f32
	affine.yield %5 : f32
	}
	%accum_dbl = addf %accum, %accum : f32
	affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
	}
	}
	affine.for %arg3 = 0 to 1 {
	affine.for %arg4 = 0 to 64 {
	// The following loop's trip count does not match the corresponding source trip count.
	%accum = affine.for %arg5 = 0 to 32 iter_args (%prevAccum = %cst_1) -> f32 {
	%4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
	%5 = mulf %prevAccum, %4 : f32
	affine.yield %5 : f32
	}
	%accum_sqr = mulf %accum, %accum : f32
	affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
	}
	}
	return
	}
	// Test checks the loop structure is preserved after sibling fusion
	// since the destination loop and source loop trip counts do not
	// match.
	// MAXIMAL-LABEL: func @reduce_add_non_maximal_f32_f32(
	// MAXIMAL: %[[cst_0:.*]] = constant 0.000000e+00 : f32
	// MAXIMAL-NEXT: %[[cst_1:.*]] = constant 1.000000e+00 : f32
	// MAXIMAL-NEXT: affine.for %[[idx_0:.*]] = 0 to 1 {
	// MAXIMAL-NEXT: affine.for %[[idx_1:.*]] = 0 to 64 {
	// MAXIMAL-NEXT: %[[result_1:.*]] = affine.for %[[idx_2:.*]] = 0 to 32 iter_args(%[[iter_0:.*]] = %[[cst_1]]) -> (f32) {
	// MAXIMAL-NEXT: %[[result_0:.*]] = affine.for %[[idx_3:.*]] = 0 to 64 iter_args(%[[iter_1:.*]] = %[[cst_0]]) -> (f32) {

	// -----

	// CHECK-LABEL: func @fuse_large_number_of_loops
	func @fuse_large_number_of_loops(%arg0: memref<20x10xf32, 1>, %arg1: memref<20x10xf32, 1>, %arg2: memref<20x10xf32, 1>, %arg3: memref<20x10xf32, 1>, %arg4: memref<20x10xf32, 1>, %arg5: memref<f32, 1>, %arg6: memref<f32, 1>, %arg7: memref<f32, 1>, %arg8: memref<f32, 1>, %arg9: memref<20x10xf32, 1>, %arg10: memref<20x10xf32, 1>, %arg11: memref<20x10xf32, 1>, %arg12: memref<20x10xf32, 1>) {
	%cst = constant 1.000000e+00 : f32
	%0 = memref.alloc() : memref<f32, 1>
	affine.store %cst, %0[] : memref<f32, 1>
	%1 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg6[] : memref<f32, 1>
	affine.store %21, %1[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%2 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %1[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %arg3[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = mulf %22, %21 : f32
	affine.store %23, %2[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%3 = memref.alloc() : memref<f32, 1>
	%4 = affine.load %arg6[] : memref<f32, 1>
	%5 = affine.load %0[] : memref<f32, 1>
	%6 = subf %5, %4 : f32
	affine.store %6, %3[] : memref<f32, 1>
	%7 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %3[] : memref<f32, 1>
	affine.store %21, %7[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%8 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg1[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %7[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = mulf %22, %21 : f32
	affine.store %23, %8[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%9 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg1[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %8[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = mulf %22, %21 : f32
	affine.store %23, %9[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %9[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %2[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = addf %22, %21 : f32
	affine.store %23, %arg11[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%10 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %1[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %arg2[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = mulf %22, %21 : f32
	affine.store %23, %10[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %8[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %10[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = addf %22, %21 : f32
	affine.store %23, %arg10[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%11 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg10[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %arg10[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = mulf %22, %21 : f32
	affine.store %23, %11[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%12 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %11[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %arg11[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = subf %22, %21 : f32
	affine.store %23, %12[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%13 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg7[] : memref<f32, 1>
	affine.store %21, %13[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%14 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg4[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %13[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = mulf %22, %21 : f32
	affine.store %23, %14[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%15 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg8[] : memref<f32, 1>
	affine.store %21, %15[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%16 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %15[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %12[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = addf %22, %21 : f32
	affine.store %23, %16[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%17 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %16[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = math.sqrt %21 : f32
	affine.store %22, %17[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%18 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg5[] : memref<f32, 1>
	affine.store %21, %18[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%19 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg1[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %18[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = mulf %22, %21 : f32
	affine.store %23, %19[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	%20 = memref.alloc() : memref<20x10xf32, 1>
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %17[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %19[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = divf %22, %21 : f32
	affine.store %23, %20[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %20[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %14[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = addf %22, %21 : f32
	affine.store %23, %arg12[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	affine.for %arg13 = 0 to 20 {
	affine.for %arg14 = 0 to 10 {
	%21 = affine.load %arg12[%arg13, %arg14] : memref<20x10xf32, 1>
	%22 = affine.load %arg0[%arg13, %arg14] : memref<20x10xf32, 1>
	%23 = subf %22, %21 : f32
	affine.store %23, %arg9[%arg13, %arg14] : memref<20x10xf32, 1>
	}
	}
	return
	}
	// CHECK: affine.for
	// CHECK: affine.for
	// CHECK-NOT: affine.for

	// -----

	// Expects fusion of producer into consumer at depth 4 and subsequent removal of
	// source loop.
	// CHECK-LABEL: func @unflatten4d
	func @unflatten4d(%arg1: memref<7x8x9x10xf32>) {
	%m = memref.alloc() : memref<5040xf32>
	%cf7 = constant 7.0 : f32

	affine.for %i0 = 0 to 7 {
	affine.for %i1 = 0 to 8 {
	affine.for %i2 = 0 to 9 {
	affine.for %i3 = 0 to 10 {
	affine.store %cf7, %m[720 * %i0 + 90 * %i1 + 10 * %i2 + %i3] : memref<5040xf32>
	}
	}
	}
	}
	affine.for %i0 = 0 to 7 {
	affine.for %i1 = 0 to 8 {
	affine.for %i2 = 0 to 9 {
	affine.for %i3 = 0 to 10 {
	%v0 = affine.load %m[720 * %i0 + 90 * %i1 + 10 * %i2 + %i3] : memref<5040xf32>
	affine.store %v0, %arg1[%i0, %i1, %i2, %i3] : memref<7x8x9x10xf32>
	}
	}
	}
	}
	return
	}
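	// Worked index check (an illustrative note, not part of the original test):
	// for a row-major 7x8x9x10 layout the strides are 8*9*10 = 720, 9*10 = 90,
	// 10 and 1, so 720 * %i0 + 90 * %i1 + 10 * %i2 + %i3 ranges from 0 up to
	// 720*6 + 90*7 + 10*8 + 9 = 5039, which exactly fills memref<5040xf32>.
	// Both nests use the same subscript, so each consumer iteration reads only
	// the element stored by the producer iteration with the same indices, which
	// is presumably why fusion at depth 4 is expected.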

	// CHECK: affine.for
	// CHECK-NEXT: affine.for
	// CHECK-NEXT: affine.for
	// CHECK-NEXT: affine.for
	// CHECK-NOT: affine.for
	// CHECK: return

	// -----

	// Expects fusion of producer into consumer at depth 2 and subsequent removal of
	// source loop.
	// CHECK-LABEL: func @unflatten2d_with_transpose
	func @unflatten2d_with_transpose(%arg1: memref<8x7xf32>) {
	%m = memref.alloc() : memref<56xf32>
	%cf7 = constant 7.0 : f32

	affine.for %i0 = 0 to 7 {
	affine.for %i1 = 0 to 8 {
	affine.store %cf7, %m[8 * %i0 + %i1] : memref<56xf32>
	}
	}
	affine.for %i0 = 0 to 8 {
	affine.for %i1 = 0 to 7 {
	%v0 = affine.load %m[%i0 + 8 * %i1] : memref<56xf32>
	affine.store %v0, %arg1[%i0, %i1] : memref<8x7xf32>
	}
	}
	return
	}
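	// Worked index check (an illustrative note, not part of the original test):
	// the producer stores to %m[8 * %i0 + %i1] with %i0 in [0, 7) and %i1 in
	// [0, 8); the consumer loads %m[%i0 + 8 * %i1] with %i0 in [0, 8) and %i1 in
	// [0, 7). Consumer iteration (a, b) therefore reads offset a + 8 * b, the
	// element stored by producer iteration (b, a). For example, consumer (3, 2)
	// reads offset 3 + 8 * 2 = 19, which producer (2, 3) stored at 8 * 2 + 3.
	// Both nests cover offsets 0 through 55.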

	// CHECK: affine.for
	// CHECK-NEXT: affine.for
	// CHECK-NOT: affine.for
	// CHECK: return