I recently found out that LBM performance can be improved by 5-10% on SystemZ if the addressing in the hot loop (LBM_performStreamCollideTRT) is improved. It currently has a lot of unfolded offsets, each of which has to be computed with a register move plus a (two-address) 32-bit immediate addition. I have experimented with LSR and found that this can be handled by doing two things:
- Reconcile unfoldable offsets. Currently, a Fixup with a foldable offset is placed into a pre-existing LSRUse. But all Fixups with unfoldable offsets get their own LSRUse - they are never grouped together even when their huge offsets have small (foldable) differences. A new method reconcileUnfoldedAddressOffsets() performs this task.
- Limit the number of filtered-out Formulas in NarrowSearchSpaceByFilterFormulaWithSameScaledReg() so that those without unfoldable offsets do not get lost.
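
To illustrate the first point, here is a minimal standalone sketch of the grouping idea. It is not the code of reconcileUnfoldedAddressOffsets(): the `OffsetGroup` struct, the `groupOffsets()` helper, and the single `MaxFoldableDelta` limit are hypothetical stand-ins for LSRUses, Fixups, and the target's real displacement check. Offsets that are individually too large to fold are sorted and greedily bucketed, so that each bucket needs only one materialized base and every member folds its small remaining difference as a displacement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one LSRUse: a single base offset that must be
// materialized in a register, plus the small per-access displacements
// (relative to that base) that can be folded into the addressing mode.
struct OffsetGroup {
  int64_t BaseOffset;
  std::vector<int64_t> Residuals;
};

// Greedily group offsets so that every member of a group lies within
// MaxFoldableDelta of the group's smallest offset. MaxFoldableDelta is an
// assumed, simplified model of "the target can fold this displacement".
// Assumes the differences between offsets do not overflow int64_t.
std::vector<OffsetGroup> groupOffsets(std::vector<int64_t> Offsets,
                                      int64_t MaxFoldableDelta) {
  assert(MaxFoldableDelta >= 0 && "foldable range must be non-negative");
  std::sort(Offsets.begin(), Offsets.end());

  std::vector<OffsetGroup> Groups;
  for (int64_t Off : Offsets) {
    // Open a new group when the offset is too far from the current base.
    if (Groups.empty() || Off - Groups.back().BaseOffset > MaxFoldableDelta)
      Groups.push_back(OffsetGroup{Off, {}});
    Groups.back().Residuals.push_back(Off - Groups.back().BaseOffset);
  }
  return Groups;
}
```

With five large offsets clustered around two bases, this model produces two groups, so only two base computations are needed instead of five. The real pass additionally has to consider the registers and access types of each Fixup, which this sketch ignores.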
Overall, this is an improvement in terms of AGFIs on SPEC, but there are also some rare cases where it gets worse. I think this is because SystemZTTI accepts long displacements in the LSR phase that builds the LSRUses with their Fixups. Then, during Solve(), the Instruction pointer is passed to SystemZTTI::isLSRCostLess(), which then says that those offsets/Fixups are in fact not foldable, and a good solution is not found. I experimented with disallowing the long displacements (for vector/fp) in the early phase as well, but this changed a tremendous number of files with mixed benchmark effects, so that also seems to be a matter of tuning. Since the cases that get worse with this patch are rare, and the patch is now considerably simpler with a clear benchmark improvement, I would like to return to the other issues after this.
Four tests failed with this, and looking at CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll, it seemed that there were now more spills/reloads. I am not sure why, so I made this optional (for now) with a target hook TTI.LSRUnfOffsetsReconc().
LLVM :: CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll
LLVM :: CodeGen/ARM/loop-indexing.ll
LLVM :: CodeGen/PowerPC/bdzlr.ll
LLVM :: CodeGen/PowerPC/lsr-profitable-chain.ll
Is this the right approach to remedy the LBM loop?
This name could be more descriptive, like LSRShouldGroupUnfoldableOffsets or LSRShouldReconcileUnfoldableOffsets, though "reconcile" is kind of generic and I don't really have a better term. Even "unfoldable" isn't clear unless you're deep into LSR. Is there some name that people developing targets would look at and have some idea of what their target should return, without being an LSR expert?