This is an archive of the discontinued LLVM Phabricator instance.

[LSR] Add reconciliation of unfoldable offsets
Needs ReviewPublic

Authored by jonpa on Mar 8 2021, 6:29 PM.

Details

Summary

I recently found out that LBM performance can be improved by 5-10 % on SystemZ if the addressing in the hot loop (LBM_performStreamCollideTRT) is improved. It currently has a lot of unfolded offsets which all have to be computed with a register move + (two address) 32 bit immediate addition. I have experimented with LSR and found that this can be handled by doing two things:

  • Reconcile unfoldable offsets. Currently, a Fixup with a foldable offset is placed into a pre-existing LSRUse. But all Fixups with unfoldable offsets get their own LSRUse - they are never grouped together even when their huge offsets have small (foldable) differences. A new method reconcileUnfoldedAddressOffsets() performs this task.
  • Limit the number of filtered-out Formulas in NarrowSearchSpaceByFilterFormulaWithSameScaledReg() so that those without unfoldable offsets do not get lost.

Overall, this is an improvement of the AGFIs on SPEC, but there are also some rare cases where this gets worse. I think this is because SystemZTTI accepts long displacements in the LSR phase of building the LSRUses with their Fixups. Then, during Solve(), the Instruction pointer is passed to SystemZTTI::isLSRCostLess() which now then says that those offsets/Fixups are in fact not foldable, and a good solution is not to be found. I experimented with dissallowing the long displacements (for vector/fp) also in the early phase, but this changed a tremendous amount of files with mixed benchmark effects, so that seems to also be a matter of tuning. Since the cases that get worse with this patch are rare, and the patch now is relatively much simpler with a clear benchmark improvement, I would like to return to the other issues after this.

Four tests failed with this, and looking at CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll, it seemed that there were now more spills/reloads. I am not sure why, so I made this optional (for now) with a target hook TTI.LSRUnfOffsetsReconc().

LLVM :: CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll
LLVM :: CodeGen/ARM/loop-indexing.ll
LLVM :: CodeGen/PowerPC/bdzlr.ll
LLVM :: CodeGen/PowerPC/lsr-profitable-chain.ll

Is this the right approach to remedy the LBM loop?

Diff Detail

Event Timeline

jonpa created this revision.Mar 8 2021, 6:29 PM
jonpa requested review of this revision.Mar 8 2021, 6:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 8 2021, 6:29 PM
jonpa updated this revision to Diff 330741.Mar 15 2021, 11:28 AM

Patch rebased.

lebedev.ri added inline comments.
llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
1976

Should this be llvm::SmallMapVector<const SCEV *, SmallVector<size_t, 8>, 8> ?

jonpa added inline comments.Mar 16 2021, 1:03 PM
llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
1976

I agree we should pick the best data structure available, but I would prefer to do wait until we are sure exactly what the patch will look like and then I can do some experiment to figure out the typical need here.

At this point I would first like to have a general opinion about the patch as a whole, please...

jonpa added a comment.Mar 23 2021, 1:47 PM

Ping!

Grouping huge offsets (like foldable ones) really should be a general win for most targets, although this patch is only enabled on SystemZ for now.

PING!

This could be enabled for SystemZ only for now, but review is still needed...

greened added inline comments.Jan 12 2022, 1:53 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
696

This name could be more descriptive, like LSRShouldGroupUnfoldableOffsets or LSRShouldReconcileUnfoldableOffsets, though "reconcile" is kind of generic and I don't really have a better term. Even "unfoldable" isn't clear unless you're deep into LSR. Is there some name that people developing targets would look at and have some idea of what their target should return, without being an LSR expert?

jonpa updated this revision to Diff 411550.Feb 25 2022, 4:24 PM
jonpa added reviewers: qcolombet, efriedma, eli.friedman.

I saw the need for this patch again recently when needing the DAGCombiner to handle multiple out-of-range offsets, which it did not. The problem is that we "lie" in SystemZ::isLegalAddressingMode() currently by saying that a vector type memory access accepts a big offset, which is not true. If I change this, the DAGCombiner does a good job, however LSR then produces worse code in some cases, which makes me come back to this.

For some reason or other, "lying" to LSR about these displacements is sometimes better. For instance with this loop:

define void @fun(i8* %arg) {
bb:
  %i = getelementptr inbounds i8, i8* %arg, i64 27548
  %i1 = bitcast i8* %i to [12 x [16 x i16]]*
  %i2 = getelementptr inbounds i8, i8* %arg, i64 28028
  %i3 = bitcast i8* %i2 to [12 x [16 x i16]]*
  br label %bb4

bb4:                                              ; preds = %bb4, %bb
  %i5 = phi i64 [ %i10, %bb4 ], [ 0, %bb ]
  %i6 = getelementptr inbounds [12 x [16 x i16]], [12 x [16 x i16]]* %i1, i64 0, i64 3, i64 %i5
  %i7 = bitcast i16* %i6 to <8 x i16>*
  store <8 x i16> zeroinitializer, <8 x i16>* %i7
  %i8 = getelementptr inbounds [12 x [16 x i16]], [12 x [16 x i16]]* %i3, i64 0, i64 3, i64 %i5
  %i9 = bitcast i16* %i8 to <16 x i8>*
  store <16 x i8> zeroinitializer, <16 x i8>* %i9
  %i10 = add nuw i64 %i5, 8
  br label %bb4
}

trunk:

	vgbm	%v0, 0
	aghi	%r2, 27644              ### Single increment
.LBB0_1:                                # %bb4
                                        # =>This Inner Loop Header: Depth=1
	vst	%v0, 0(%r2), 3
	vst	%v0, 480(%r2), 3
	la	%r2, 16(%r2)
	j	.LBB0_1

if rejecting big offsets for the Vector STores:

	vgbm	%v0, 0
	lay	%r1, 28124(%r2)         ### Two separate regs
	lay	%r2, 27644(%r2)
	lghi	%r3, 0
.LBB0_1:                                # %bb4
                                        # =>This Inner Loop Header: Depth=1
	vst	%v0, 0(%r3,%r2), 3
	vst	%v0, 0(%r3,%r1), 3
	la	%r3, 16(%r3)
	j	.LBB0_1

This is a small example, but it illustrates that LSR recognizes that the two offsets are out of range, and it then puts them in two separate registers. In this case it seems that trunk LSR managed to chose the better formula even when "thinking" the huge offsets were in range. Maybe the huge offset was added before the loop as a general heuristic to reduce the offsets in the fixups...? In real code, those LAYs are more plentiful and also inside the loop... :-/

So my simple idea here is to have LSR group these fixups together under one reg even though they are initially unfoldable: Together they form a group which can have foldable offsets if the right immediate is added to the formula before the loop. It seems to me that this would make sense and is kind of simply missing - unless there is some other reason for not doing this?

The first step is to make sure the fixup groups are formed under one LSRUse when they are all out of range, but within range between themselves.

When rejecting the addressing modes properly for the vector-type, I realized I also had to adjust these new groups (second step), which is done in adjustInitialFormulaeForOffsets(). This is what happens:

Fixup #1:   Offset = 28124  => Initial formula:  reg(A) + 28124
Fixup #2:   Offset = 27644  => reg(A) - 480.

reconcileNewOffset() checked that the offset range between #1 and #2 is in range, which it is. However the offset -480 is not legal, so adjustInitialFormulaeForOffsets() adjusts by subtracting 480 to the Formula and adds it to the fixup offsets:

Initial formula, adjusted: reg(A) + 27644
Fixup #1:   reg(A) + 480
Fixup #2:   reg(A) +   0

In English: The SystemZ::isLegalAddressingMode() returns true for a VectorType only if the offset is within 0 - 4095, so all fixups can only use such an offset against the formula.

The reason this adjustment is made after the fact (of creating the initial formula) is that it can only be done after the group of fixups have been found. But maybe this could be gotten for free if the user instructions where sorted by offsets somehow so that the lowest offset would be seen first, which should give the same result...?

I left some "EXPERIMENTAL" (/SystemZ) options in the patch for now, so that anyone who wanted could look into this. To investigate the small loop above:

trunk:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o -

reject large displacements for vector type (as should be), with worse results:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o - -legalam-vec

to handle the problem with this patch:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o - -legalam-vec -lsr-unfolded-offs

Is this making sense? Is there some other way of handling this that would be even better? I am thinking that of course those addresses could be somewhat cleaned up later, but it would be really best to handle the loop addressing in LSR as loops are after all very important.

llvm/include/llvm/Analysis/TargetTransformInfo.h
696

Yeah, that's a better name :-)

Herald added a project: Restricted Project. · View Herald TranscriptMar 1 2022, 2:32 PM

Hi Jonas,

It seems to me that this would make sense and is kind of simply missing

Make sense to me.

I only glanced at the code, but the approach looks reasonable.
I haven't touched LSR code for quite some time so if someone else can do the actual review that would be best, but if not, I'll jump in.

Cheers,
-Quentin

jonpa added a comment.May 24 2022, 1:49 AM

I currently don't see any performance wins on SystemZ with this patch, although there are still nice improvements in the output. My original motivation for this patch was improvement of lbm with 5-10%, but although I see that same hot loop still being helped there is no longer any impact on performance.

On SystemZ I see over SPEC:

lay            :                55777                55520     -257
agfi           :                  426                  297     -129
lgr            :               842567               842448     -119
lg             :              1049005              1048942      -63
...
OPCDIFFS: -705

This is a nice improvement (especially as this is all about loops), but I am not sure it would be right to extend LSR with more than just a few lines without a clear motivation. So unless there is interest in this from other targets, I will pause my work on this until I see a clear win from it. As mentioned earlier there might be simpler ways of achieving the same result of the patch but in principal it does what it needs to now.

eopXD added a reviewer: Restricted Project.May 24 2022, 6:13 PM
eopXD added a subscriber: eopXD.May 24 2022, 7:14 PM

Hi Jonas,

I am trying to wrap my head around this patch. I would really appreciate
if you can tell me more on your approach. (or please correct me if I am
saying something stupid)

It sounds like you are trying to have an alternative Formula-s when two
or more LSRUse have mutual unfold-able offset. This sounds like what
LSRInstance::GenerateConstantOffsets is doing. Won't adding more
Formula-s to an LSRUse do the job (rather than overriding the offset)?

jonpa added a comment.May 25 2022, 1:45 AM

Hi Jonas,

I am trying to wrap my head around this patch. I would really appreciate
if you can tell me more on your approach. (or please correct me if I am
saying something stupid)

It sounds like you are trying to have an alternative Formula-s when two
or more LSRUse have mutual unfold-able offset. This sounds like what
LSRInstance::GenerateConstantOffsets is doing. Won't adding more
Formula-s to an LSRUse do the job (rather than overriding the offset)?

Hi Yueh-Ting,

thanks for taking a look :-)

LSR begins with looking over the interesting instructions and sorting them into groups (LSRUse:s) in CollectFixupsAndInitialFormulae(). This is needed to generate efficient code, i.e. to use the same IV register for more than just a single instruction if possible. This is implemented in getUse(): a previous LSRUse with the same SCEV and Kind is searched for, and if found it is used if reconcileNewOffset() succeeds. If the offset does not go with the pre-existing LSRUse a new LSRUse is created instead.

The problem then with your idea of GenerateConstantOffsets() is that per the above the two instructions end up in their own LSRUse with their own set of formulae which will then not work as they will always be separate. With that said, it might be interesting to experiment with GenerateCrossUseConstantOffsets() to see if the problem could be solved that way. There is however an advantage with putting the instructions in the same LSRUse right away as LSR is generating a lot of formulae and then prune the search space for the sake of compile time, so I would say getting something obvious like this right early is better.

Did you try the patch on your target and find any improvements?

eopXD added a comment.May 25 2022, 6:33 AM

Hi Jonas,

I am trying to wrap my head around this patch. I would really appreciate
if you can tell me more on your approach. (or please correct me if I am
saying something stupid)

It sounds like you are trying to have an alternative Formula-s when two
or more LSRUse have mutual unfold-able offset. This sounds like what
LSRInstance::GenerateConstantOffsets is doing. Won't adding more
Formula-s to an LSRUse do the job (rather than overriding the offset)?

Hi Yueh-Ting,

thanks for taking a look :-)

LSR begins with looking over the interesting instructions and sorting them into groups (LSRUse:s) in CollectFixupsAndInitialFormulae(). This is needed to generate efficient code, i.e. to use the same IV register for more than just a single instruction if possible. This is implemented in getUse(): a previous LSRUse with the same SCEV and Kind is searched for, and if found it is used if reconcileNewOffset() succeeds. If the offset does not go with the pre-existing LSRUse a new LSRUse is created instead.

The problem then with your idea of GenerateConstantOffsets() is that per the above the two instructions end up in their own LSRUse with their own set of formulae which will then not work as they will always be separate. With that said, it might be interesting to experiment with GenerateCrossUseConstantOffsets() to see if the problem could be solved that way. There is however an advantage with putting the instructions in the same LSRUse right away as LSR is generating a lot of formulae and then prune the search space for the sake of compile time, so I would say getting something obvious like this right early is better.

Did you try the patch on your target and find any improvements?

Thank you for the clear explanation. The approach make sense to me.
However, I can't successfully apply the patch, may you rebase it?

It looks like LSR can surely find improvement here, at least for the
code-gen I observed from your test case on the RISC-V backend.

Other than the rebase, may you add debug logs to show the changes
for your enhancement, thanks for the work!

jonpa updated this revision to Diff 432326.May 26 2022, 10:25 AM

However, I can't successfully apply the patch, may you rebase it?

Patch rebased.

It looks like LSR can surely find improvement here, at least for the code-gen I observed from your test case on the RISC-V backend.

I agree, but we need to find real benchmark improvements to motivate this, I think...

Other than the rebase, may you add debug logs to show the changes for your enhancement, thanks for the work!

I will do that later if there is an interest for the patch being committed...

lebedev.ri resigned from this revision.Jan 12 2023, 5:30 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.