mlir/test/transforms/loop-fusion.mlir is too big and is split into mlir/test/transforms/loop-fusion.mlir,  mlir/test/transforms/loop-fusion-2.mlir, mlir/test/transforms/loop-fusion-3.mlir
and mlir/test/transforms/loop-fusion-4.mlir. Further tests can be added in mlir/test/transforms/loop-fusion-4.mlir
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for splitting these test cases. A couple of minor comments.
| mlir/test/Transforms/loop-fusion-2.mlir | ||
|---|---|---|
| 5 | Nit: space after // | |
| 5 | Instead of this - you could just have a comment in loop-fusion.mlir // Part I of loop fusion tests. Part II in ... and then a // Part II of loop fusion tests for this file. | |
| mlir/test/Transforms/loop-fusion.mlir | ||
| 1548–1549 | There shouldn't be a ----- separator above. | |
mlir/test/transforms/loop-fusion.mlir is too big and is split into mlir/test/transforms/loop-fusion.mlir and mlir/test/transforms/loop-fusion-2.mlir
On which basis? Don't you think having _two similar names_ make confusion who don't know the context of test splitting.
| mlir/test/Transforms/loop-fusion-2.mlir | ||
|---|---|---|
| 3 | Since we're here, can someone look into removing the -allow-unregistered-dialect flag? | |
| mlir/test/Transforms/loop-fusion-2.mlir | ||
|---|---|---|
| 3 | I can do it - we should do this in another commit. Just wondering what are all the strong motivations to free these passes from allow-unregistered-dialect? | |
This test case file has been slowing down check-mlir for everyone. I won't be surprised if this is the one that runs the longest across all test cases in the codebase. It has to be split. Open to a better/logical way of partitioning (for eg. sibling fusion vs producer consumer fusion) but they both might just be intertwined to separate in a balanced way. It is common for large files to be split this way - both sources and test cases. For eg. in the TF/MLIR repo, tf_ops.cc became too large and they went with something like t_ops_a_m.cc and tf_ops_n_z.cc to cut down the 60+s compile time.
The main problem for me is that this split feels rather arbitrary. The original file still has >3000 lines. Has any benchmarking gone into determining the split point/number of partitions/etc?
This came as a feedback in a previous review. Basically all tests since that review have been moved to this new file and it also serves as a placeholder for future tests. Would be ideal if there were some logical division like sibling fusion however there seem to be very few tests that are purely sibling fusion.
| mlir/test/Transforms/loop-fusion-2.mlir | ||
|---|---|---|
| 3 | This flag makes the testing loses: they may be buggy (not declaring the dependent dialect) but we won't catch it during testing. | |
This test case file has been slowing down check-mlir for everyone.
Parallelism might help https://reviews.llvm.org/D106521 ([Docs] Mention how to run lit tests in parallel)?
I won't be surprised if this is the one that runs the longest across all test cases in the codebase.
This file is just 3398 lines long but see https://reviews.llvm.org/D103873 there are many many test cases longer than 14000 lines long for example.
Parallelism has no effect when a single test is slow. On the contrary: you are providing an argument in favor of splitting the test: after splitting it will be able to run in parallel threads!
Oh, nice. But I didn't object to spitting this testcase, problem was on which basis this splitting happen. I wish we can splits it into 30 files!
Just the number of lines would barely mean anything! Affine fusion that happens here uses a lot of analysis utilities (dependences, slices) and does a lot more in terms of checks and iterating over the graph; so it takes a noticeable time to run on these 3000 lines. OTOH, there could be 20000+ lines of MLIR that I've seen being canonicalized in a hundred milliseconds - so it's almost immaterial how long the test cases are.
@sumesh13 Could you please time just this test case before the split and the two parts separately after the split? I can do this as well on a 32-core system I have here. A 4/8-core one and a larger one would be useful data points.
Just the number of lines would barely mean anything! Affine fusion that happens here uses a lot of analysis utilities (dependences, slices) and does a lot more in terms of checks and iterating over the graph; so it takes a noticeable time to run on these 3000 lines. OTOH, there could be 20000+ lines of MLIR that I've seen being canonicalized in a hundred milliseconds - so it's almost immaterial how long the test cases are.
Oh thanks for explaining details, don't know about it. My pov was that it will understood better if we have name with a meaning of why it is named as it is, otherwise that is not a big issue, patch author can proceed without considering it 👍
Here is using time on a 16 core machine
loop-fusion.mlir 
real    0m2.324s
user    0m2.335s
sys     0m0.154s
loop-fusion-2.mlir
real    0m0.417s
user    0m0.377s
sys     0m0.052s
Is there a benchmark threshold to target ? If there is, I can split to achieve that (leaving some room for the last test file to grow)
Is the timing for loop-fusion.mlir above for before the split or after? We really want to see how much it was before the split as well to see how much this helped. On a really fast 32-core (64 with SMT) system, here's what I see:
BEFORE
time bin/llvm-lit -sv ../mlir/test/Transforms/loop-fusion.mlir
Testing Time: 1.01s
Passed: 1
real	0m1.081s
user	0m0.627s
sys	0m1.131s
AFTER
time bin/llvm-lit -sv ../mlir/test/Transforms/loop-fusion.mlir
Testing Time: 1.01s
Passed: 1
real	0m1.081s
user	0m0.591s
sys	0m1.143s
time bin/llvm-lit -sv ../mlir/test/Transforms/loop-fusion-2.mlir
Testing Time: 0.21s
Passed: 1
real	0m0.281s
user	0m0.198s
sys	0m0.128s
So, there's really no impact on the actual bottleneck on a "high"-way multicore; the critical path is still inside loop-fusion.mlir. The split might definitely help when running on fewer cores but I don't think there is a way to make lit run on a specified fewer number of threads to immediately test that.
I'd consider 2s to be quite a lot on a 16-core system. We should get this to one second through the split if possible.
4 -way split that brings down the time per test from 5.5 seconds on a 16 core machine to about 1.5 second. The last file has some more room to grow.
Since we're here, can someone look into removing the -allow-unregistered-dialect flag?