LoopFlatten improves a well-known embedded benchmark, one with widely used industry applications, by a few percentage points. But it is not restricted to optimising a single benchmark. Below are results for the llvm test suite, showing the number of loops flattened per test:
Test                                                                       # Loops flattened
--------------------------------------------------------------------------------------------
MultiSource/Applications/JM/lencod/lencod                                  3
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg                       1
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg                 3
MultiSource/Applications/JM/ldecod/ldecod                                  1
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000                         3
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4                               17
SingleSource/Benchmarks/Misc/himenobmtxpa                                  2
MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion  2
MicroBenchmarks/ImageProcessing/BilateralFiltering/BilateralFilter         2
MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG                     20
MultiSource/Benchmarks/Rodinia/pathfinder/pathfinder                       1
MicroBenchmarks/ImageProcessing/Blur/blur                                  2
MicroBenchmarks/ImageProcessing/Dither/Dither                              2
MicroBenchmarks/ImageProcessing/Dilate/Dilate                              2
MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG                       1
MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC                 1
MicroBenchmarks/ImageProcessing/Interpolation/Interpolation                2
MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk                             2
MultiSource/Benchmarks/Rodinia/backprop/backprop                           1
--------------------------------------------------------------------------------------------
Total                                                                      68
While the implementation of LoopFlatten recognises only a few patterns and could be made more generic, I believe these numbers show that it's generic enough to trigger on a wide variety of code bases, making it worthwhile to enable it by default.
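For readers unfamiliar with the transformation, here is a minimal source-level sketch of the kind of loop nest flattening targets: the inner and outer induction variables are only used together as i * m + j. This is an illustration in C rather than the pass's actual IR-level logic, and the function names (sum_nested, sum_flattened, check_flatten_equivalence) are hypothetical. It also assumes that n * m does not overflow, a condition the pass has to establish before it can transform.

```c
#include <assert.h>
#include <stddef.h>

/* Nested form: i and j are only ever used combined as i * m + j. */
long sum_nested(const int *a, size_t n, size_t m) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            sum += a[i * m + j];
    return sum;
}

/* Flattened form: a single loop over n * m iterations.
   Assumption: n * m does not overflow size_t. */
long sum_flattened(const int *a, size_t n, size_t m) {
    long sum = 0;
    for (size_t k = 0; k < n * m; ++k)
        sum += a[k];
    return sum;
}

/* Hypothetical helper: checks that both forms agree on a 3x4 array. */
void check_flatten_equivalence(void) {
    int a[12];
    for (int k = 0; k < 12; ++k)
        a[k] = k + 1;
    assert(sum_nested(a, 3, 4) == sum_flattened(a, 3, 4));
}
```

Calling check_flatten_equivalence() exercises both versions on the same data; the flattened version carries one induction variable and one loop back-edge instead of two.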
LoopFlatten is a relatively simple pass: it doesn't implement a computationally expensive algorithm, and it doesn't require more analysis than a typical loop pass. Compile times for the llvm test suite (ClamAV, 7zip, tramp3d-v4, kimwitu++, sqlite3, mafft, SPASS, lencod, Bullet) show a very minor increase of ~0.04% to 0.28%. In some cases compile times improve, but I haven't analysed why, and I don't want to claim that it will improve compile times in general.
We have had LoopFlatten enabled by default downstream for many years now, so it has had a lot of exposure and usage, and we are not aware of any problems.
I'm somewhat confused by what is going on here. Why do we now calculate MemorySSA and why does LoopUnroll get split into a separate LPM?