This patch adds several tatics for RISCVMatInt::generateInstSeq to try decompose the constant/immediate to materialize, which can find less costly materialization alternatives for RV64 than what we emit with the generic approach. Some of those tactics depend on each other to be effective. This addresses many of the less-than-optimal materializations in the imm.ll test.
Thanks, this is a good improvement. Two thoughts, mainly for discussion rather than blocking issues:
- I don't want to unreasonably expand the scope of this patch, but I do note that at least the srli transformation may be beneficial on RV32 as well for code size (still two instructions, but more compressible). Did you consider adding this or look at other RVC considerations?
- Although it has the advantage that mistakes in edge cases etc won't lead to regressions in terms of longer instruction sequences, I'm slightly wary of the "generate and see if it's shorter" approach. Though perhaps this is going to come way down the list on any compile-time profiling vs all the other analyses+transformations being done. Any thoughts on this?
Originally I started by writing the optimizations in terms of "If this condition matches, do this", and I was rather please with the results, as the code was quite straightforward, clean and efficient. I was also trying keep the original structure of the materialization code, and just add those tweaks. But as I started trying to cover the last cases it kept getting hairier and hairier. Since that approach didn't backtrack it was very easy for those tweaks to interfere with each other, or to get ever more convoluted preconditions to avoid doing so. Maybe I was just tired and it shouldn't be that difficult. Eventually I decided to cut my losses and move some of the optimizations to the "give it a try" approach, but the other ones kept interfering, so eventually I ended up moving them all to the try approach. I have several ideas on how this could be improved, which would also address compressibility issues, but I wanted to get this baseline improvement merged.
Alex's concerns about the retry-based algorithm seem reasonable to me, but I have been advocating the approach of "this is an improvement, we can make more improvements later", so I think I still err towards landing this patch.
When it comes to improving the implementation approach here, it would be good to ensure that the overhead of getIntMatCost is also reduced, given it has an inefficient implementation based on materialising and then counting, rather than anything else.
These have been changed from (shift_left (add X, C) C) to (add (shift_left X, C), C).
This isn't an issue per-se, as the sequences are still the same number of instructions - it's likely due to slight changes in the results of getIntMatCost for these examples, and potentially it means we need to change these test cases to maintain coverage of the shift/add hook (which we can do in a follow-up commit).
Having checked the compression (something I know the cost model does not do), it seems the new code does not compress as well as the old code (the lui a0, 4095 is beyond its capabilities). Again, I think this is something to address in a follow-up commit.
I'm in favour of merging a patch that is functionally correct and makes incremental improvements. I'd rather avoid any regression in codesize though if possible. How feasible would it to be to avoid that case? It may be worthwhile leaving some of the materialisation improvements for a follow-up patch in order to land the clear wins now.