This is a implementation of unroll and jam, which is something that comes up as useful for our smaller embedded processors (and hopefully for other systems in general)
The basic idea is that we take an outer loop of the for:
for i.. ForeBlocks(i) for j.. SubLoopBlocks(i, j) AftBlocks(i)
Instead of doing normal inner or outer unrolling, we unroll as follows:
for i... i+=2 ForeBlocks(i) ForeBlocks(i+1) for j.. SubLoopBlocks(i, j) SubLoopBlocks(i+1, j) AftBlocks(i) AftBlocks(i+1) Remainder
To do this we need to ensure that the ForeBlocks(i+1) can be moved before the SubLoopBlocks(i) and AftBlocks(i), which means potentially moving the phi node operands from AftBlocks into Fore. There are also memory dependency checks and other safety checks that are needed. The transform is then a fairly simple job of using the excellent existing unroll code for cloning blocks and gluing them all back together correctly.
- The dependency analysis is built upon DependencyAnalysis that loop interchange uses. This might have been a mistake. I have made some changes to DA to ensure that the AA gave correct results (and enable TBAA). I have split these changes out into D42381. There may be more to do here to get values outside the loop to be ignored.
- I have made this into a separate pass that runs just before the existing loop unroll.
- The performance heuristics might not be sorted correctly yet. Several parts might be over-conservative when disabling for safety. The remainder loop may not be the best, I'm not sure how it will play with vectorisers, etc. It is hopefully a good enough start.
- I have tested this from the C level (i.e. csmith) but not a lot from the IR level with some form of fuzzer.
- Pragmas are mostly copied from unroll.
Can this overflow if the exit count is INT_MAX (or similar)? It's generally a good idea to comment on how that's handled.