I recently found out that LBM performance can be improved by 5-10 % on SystemZ if the addressing in the hot loop (LBM_performStreamCollideTRT) is improved. It currently has a lot of unfolded offsets which all have to be computed with a register move + (two address) 32 bit immediate addition. I have experimented with LSR and found that this can be handled by doing two things:
- Reconcile unfoldable offsets. Currently, a Fixup with a foldable offset is placed into a pre-existing LSRUse. But all Fixups with unfoldable offsets get their own LSRUse - they are never grouped together even when their huge offsets have small (foldable) differences. A new method reconcileUnfoldedAddressOffsets() performs this task.
- Limit the number of filtered-out Formulas in NarrowSearchSpaceByFilterFormulaWithSameScaledReg() so that those without unfoldable offsets do not get lost.
Overall, this is an improvement of the AGFIs on SPEC, but there are also some rare cases where this gets worse. I think this is because SystemZTTI accepts long displacements in the LSR phase of building the LSRUses with their Fixups. Then, during Solve(), the Instruction pointer is passed to SystemZTTI::isLSRCostLess() which now then says that those offsets/Fixups are in fact not foldable, and a good solution is not to be found. I experimented with dissallowing the long displacements (for vector/fp) also in the early phase, but this changed a tremendous amount of files with mixed benchmark effects, so that seems to also be a matter of tuning. Since the cases that get worse with this patch are rare, and the patch now is relatively much simpler with a clear benchmark improvement, I would like to return to the other issues after this.
Four tests failed with this, and looking at CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll, it seemed that there were now more spills/reloads. I am not sure why, so I made this optional (for now) with a target hook TTI.LSRUnfOffsetsReconc().
LLVM :: CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll LLVM :: CodeGen/ARM/loop-indexing.ll LLVM :: CodeGen/PowerPC/bdzlr.ll LLVM :: CodeGen/PowerPC/lsr-profitable-chain.ll
Is this the right approach to remedy the LBM loop?
clang-tidy: warning: invalid case style for function 'LSRUnfOffsetsReconc' [readability-identifier-naming]
not useful