This rewrites big parts of the fast register allocator. The basic strategy of
doing block-local allocation hasn't changed but I tweaked several details:
- Track register state on register units instead of physical registers. This simplifies and speeds up handling of register aliases.
- Process basic blocks in reverse order: When walking backwards, a definition is known to end a register's live range. (In contrast, when walking forward a use may or may not be a kill, so we need heuristics.)
- Check register mask operands (calls) instead of conservatively assuming everything is clobbered.
- Enhance heuristics to detect killing uses: If a register has only a small number of defs/uses, check whether they are all in the same basic block; if so, the last one is a killing use.
- Enhance heuristic for copy-coalescing through hinting: We check the first k defs of a register for COPYs rather than relying on there just being a single definition.
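
To illustrate the first two points, here is a minimal standalone sketch, not the code from the patch. It assumes a toy register file (AL, AH, AX, BL) where each register is a bitmask of the register units it covers; `ToyReg`, `ToyInstr`, `scanBlockBackwards` and `demoScan` are hypothetical names made up for this example. In LLVM the unit lists would come from the target's register info rather than a hardcoded table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model (assumption): a register is the set of register units it covers.
// Two registers alias exactly when their unit masks overlap.
using UnitMask = std::uint32_t;

enum ToyReg { AL, AH, AX, BL, NumToyRegs };

constexpr UnitMask UnitsOf[NumToyRegs] = {
    0b001, // AL
    0b010, // AH
    0b011, // AX covers the units of AL and AH
    0b100, // BL
};

struct ToyInstr {
  std::vector<ToyReg> Defs;
  std::vector<ToyReg> Uses;
};

struct ScanResult {
  // UseIsKill[I][J] is true if the J-th use of instruction I is a kill.
  std::vector<std::vector<bool>> UseIsKill;
  UnitMask LiveIn = 0; // units live at the top of the block
};

// Walk the block bottom-up while tracking liveness per register unit.
ScanResult scanBlockBackwards(const std::vector<ToyInstr> &Block,
                              UnitMask LiveOut) {
  ScanResult R;
  R.UseIsKill.resize(Block.size());
  UnitMask Live = LiveOut;
  for (std::size_t I = Block.size(); I-- > 0;) {
    const ToyInstr &MI = Block[I];
    // A def ends the live range: its units are free above this point.
    for (ToyReg D : MI.Defs) {
      assert(D < NumToyRegs && "unknown toy register");
      Live &= ~UnitsOf[D];
    }
    // A use is a kill if none of its units is needed below this point.
    for (ToyReg U : MI.Uses)
      R.UseIsKill[I].push_back((Live & UnitsOf[U]) == 0);
    // The used units are live above this instruction.
    for (ToyReg U : MI.Uses)
      Live |= UnitsOf[U];
  }
  R.LiveIn = Live;
  return R;
}

// Example block:
//   0: AX = ...
//   1: use AX        (not a kill: AH is still read below)
//   2: BL = use AH   (kill of AH)
//   3: use BL        (kill of BL)
ScanResult demoScan() {
  std::vector<ToyInstr> Block = {
      {{AX}, {}},
      {{}, {AX}},
      {{BL}, {AH}},
      {{}, {BL}},
  };
  return scanBlockBackwards(Block, /*LiveOut=*/0);
}
```

Because AX and AH share a unit, the aliasing between them falls out of a plain bitmask test with no separate alias walk, and no kill heuristic is needed within the block when scanning in this direction.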
When testing this on the full llvm test-suite including SPEC externals I measured:
- average 5.1% reduction in code size for X86, 4.9% reduction in code size for AArch64 (ranging between 0% and 20% depending on the test)
- 0.5% faster compile time (some analysis suggests the pass itself is slightly slower than before, but we more than make up for it because later passes are faster with the reduced instruction count)
I'm still running benchmarks and I need to fix 120 tests with hardcoded registers (ugh)...
Should we add a comment explaining the meaning / reason for these values?