On SystemZ, eliminateFrameIndex() may use a scratch register to load part of an offset that is out of range for the using instruction (this is likely also done on other targets - I believe at least AArch64 also does this). Since this is done per instruction (FI operand), several identical addressing "anchor points" may result in the MBB where actually only the first one is needed (provided the GPR is the same). I was expecting this to be cleaned up somehow afterwards, but to my surprise it is not.
I have therefore now made an initial experiment consisting of making PrologEpilogInserter try to clean up redundant definitions after all frame indices have been eliminated, and it would be great to get feedback from other targets to see if this is generally useful. It seems beneficial on SystemZ at least (see below). The patch isn't quite ready yet but I hope it can be the start of a discussion at least. It actually now seems that even though there are "load address" instructions removed, there are in addition many more redundant immediate loads removed (see below). This makes me think that maybe this would better belong in a separate pass run after PEI (or perhaps even later)?
I should also say that I looked into MachineCSE and DeadMachineInstructionElim, but they are at least not readily available: MachineCSE requires SSA, and DeadMachineInstructionElim works bottom-up which does not quite seem to work (the redundant definitions are not "dead").
The patch currently looks for simple "candidate" instructions which may be removed if an identical preceding one exists without any intervening clobbering of the value. The CFG is traversed breadth-first so that live-in definitions can be reused.
Given that the the global analysis adds complexity, I tried also doing this just locally (by passing -pei-localonly). I did find however that the reuse from predecessors is what gives the bulk of improvement, so I would say it is worth that extra effort. On SystemZ/SPEC17:
main <> "local only": lghi : 445967 443571 -2396 // Immediate load lay : 55090 54783 -307 // Load address vgbm : 10974 10848 -126 // Immediate load ... OPCDIFFS: -2969
main <> patch: lghi : 445967 432678 -13289 // Immediate load lhi : 219663 216594 -3069 // Immediate load la : 533531 531952 -1579 // Load address lay : 55090 54774 -316 // Load address ... OPCDIFFS: -19862
I would say that this is a number of instructions removed (~20k) that makes this look interesting to me... (*) Some examples:
LAYs removed in ./f507.cactuBSSN_r/build/ML_BSSN_Advect.s:
lay %r1, 4096(%r15) std %f4, 1024(%r1) - lay %r1, 4096(%r15) std %f1, 1000(%r1) - lay %r1, 4096(%r15) std %f5, 1016(%r1) - lay %r1, 4096(%r15)
LGHI removed in ./f507.cactuBSSN_r/build/RadiationBoundary.s:
lghi %r3, 0 clgijl %r5, 12, .LBB2_23 // Conditional branch risbgn %r4, %r4, 1, 189, 0 lghi %r5, 0 - lghi %r3, 0
VGBMs (vector immediate loads) removed in ./f510.parest_r/build/derivative_approximation.s
... vst %v0, 344(%r15), 3 - vgbm %v0, 0 vst %v0, 360(%r15), 3 - vgbm %v0, 0 vst %v0, 312(%r15), 3 - vgbm %v0, 0 vst %v0, 328(%r15), 3 - vgbm %v0, 0 vst %v0, 280(%r15), 3 - vgbm %v0, 0 vst %v0, 296(%r15), 3
(*) I also see a few hundred more lmg/br insructions in total, which I am guessing come from some missed CFG optimization, but not sure if this is important. For example:
- j .LBB50_33 + lochie %r13, 1 + llgfr %r2, %r13 + lmg %r10, %r15, 240(%r15) + br %r14 .LBB50_32: lghi %r0, -1 sllg %r0, %r0, 0(%r12) xihf %r0, 4294967295 xilf %r0, 4294967295 cg %r0, 24(%r11) -.LBB50_33: - lhi %r13, 0 lochie %r13, 1 llgfr %r2, %r13 lmg %r10, %r15, 240(%r15) br %r14
Current status is that benchmarks build with -verify-machineinstrs, but there are many regression tests failing (hopefully due to removed redundant defs :-). I am hoping that people working with those targets can help me understand those...
This looks like it should be sorted?