This is an archive of the discontinued LLVM Phabricator instance.

[LSR] Add reconciliation of unfoldable offsets
Needs ReviewPublic

Authored by jonpa on Mar 8 2021, 6:29 PM.

Details

Summary

I recently found that LBM performance can be improved by 5-10% on SystemZ if the addressing in the hot loop (LBM_performStreamCollideTRT) is improved. That loop currently has a lot of unfolded offsets, each of which has to be computed with a register move plus a (two-address) 32-bit immediate addition. I have experimented with LSR and found that this can be handled by doing two things:

  • Reconcile unfoldable offsets. Currently, a Fixup with a foldable offset is placed into a pre-existing LSRUse, but every Fixup with an unfoldable offset gets its own LSRUse - they are never grouped together even when their huge offsets have small (foldable) differences. A new method reconcileUnfoldedAddressOffsets() performs this grouping (see the sketch after this list).
  • Limit the number of filtered-out Formulas in NarrowSearchSpaceByFilterFormulaWithSameScaledReg() so that those without unfoldable offsets do not get lost.
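As an illustration of the first point, here is a minimal standalone sketch of the grouping idea, written in plain C++ rather than against the LSR data structures. The function names and the 0-4095 displacement limit are assumptions standing in for the target's legality check; this is not the patch's actual code:

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the target's displacement check: a 12-bit
// unsigned displacement (0..4095) is assumed foldable (hypothetical limit).
static bool fitsDisplacement(int64_t Range) { return Range >= 0 && Range <= 4095; }

// Group fixup offsets so that offsets which are individually unfoldable but
// whose mutual differences fit the displacement range end up in one group,
// i.e. could share one LSRUse / one base register.
std::vector<std::vector<int64_t>>
groupOffsets(const std::vector<int64_t> &FixupOffsets) {
  std::vector<std::vector<int64_t>> Groups;
  for (int64_t Off : FixupOffsets) {
    bool Placed = false;
    for (auto &G : Groups) {
      int64_t Lo = std::min(Off, *std::min_element(G.begin(), G.end()));
      int64_t Hi = std::max(Off, *std::max_element(G.begin(), G.end()));
      if (fitsDisplacement(Hi - Lo)) { // all offsets reachable from one base
        G.push_back(Off);
        Placed = true;
        break;
      }
    }
    if (!Placed)
      Groups.push_back({Off});
  }
  return Groups;
}

// E.g. {27548, 28028, 1000000} would give {27548, 28028} in one group and
// {1000000} in another.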

Overall, this reduces the number of AGFIs on SPEC, but there are also some rare cases where things get worse. I think this is because SystemZTTI accepts long displacements in the LSR phase of building the LSRUses with their Fixups. Then, during Solve(), the Instruction pointer is passed to SystemZTTI::isLSRCostLess(), which then says that those offsets/Fixups are in fact not foldable, and no good solution can be found. I experimented with disallowing the long displacements (for vector/fp) also in the early phase, but this changed a tremendous number of files with mixed benchmark effects, so that also seems to be a matter of tuning. Since the cases that get worse with this patch are rare, and the patch is now relatively simple with a clear benchmark improvement, I would like to return to those other issues after this.

Four tests failed with this, and looking at CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll, it seemed that there were now more spills/reloads. I am not sure why, so I made this optional (for now) with a target hook TTI.LSRUnfOffsetsReconc().

LLVM :: CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll
LLVM :: CodeGen/ARM/loop-indexing.ll
LLVM :: CodeGen/PowerPC/bdzlr.ll
LLVM :: CodeGen/PowerPC/lsr-profitable-chain.ll

Is this the right approach to remedy the LBM loop?

Diff Detail

Event Timeline

jonpa created this revision.Mar 8 2021, 6:29 PM
jonpa requested review of this revision.Mar 8 2021, 6:29 PM
Herald added a project: Restricted Project.Mar 8 2021, 6:29 PM
jonpa updated this revision to Diff 330741.Mar 15 2021, 11:28 AM

Patch rebased.

lebedev.ri added inline comments.
llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
2011

Should this be llvm::SmallMapVector<const SCEV *, SmallVector<size_t, 8>, 8> ?

jonpa added inline comments.Mar 16 2021, 1:03 PM
llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
2011

I agree we should pick the best data structure available, but I would prefer to wait until we are sure exactly what the patch will look like, and then I can do some experiments to figure out the typical need here.

At this point I would first like to have a general opinion about the patch as a whole, please...

jonpa added a comment.Mar 23 2021, 1:47 PM

Ping!

Grouping huge offsets (just like foldable ones are grouped) really should be a general win for most targets, although this patch is only enabled on SystemZ for now.

PING!

This could be enabled for SystemZ only for now, but review is still needed...

greened added inline comments.Jan 12 2022, 1:53 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
721

This name could be more descriptive, like LSRShouldGroupUnfoldableOffsets or LSRShouldReconcileUnfoldableOffsets, though "reconcile" is kind of generic and I don't really have a better term. Even "unfoldable" isn't clear unless you're deep into LSR. Is there some name that people developing targets would look at and have some idea of what their target should return, without being an LSR expert?

jonpa updated this revision to Diff 411550.Feb 25 2022, 4:24 PM
jonpa added reviewers: qcolombet, efriedma, eli.friedman.

I saw the need for this patch again recently when I needed the DAGCombiner to handle multiple out-of-range offsets, which it did not do. The problem is that we currently "lie" in SystemZ::isLegalAddressingMode() by saying that a vector-type memory access accepts a big offset, which is not true. If I change this, the DAGCombiner does a good job; however, LSR then produces worse code in some cases, which brings me back to this patch.

For some reason or other, "lying" to LSR about these displacements is sometimes better. For instance with this loop:

define void @fun(i8* %arg) {
bb:
  %i = getelementptr inbounds i8, i8* %arg, i64 27548
  %i1 = bitcast i8* %i to [12 x [16 x i16]]*
  %i2 = getelementptr inbounds i8, i8* %arg, i64 28028
  %i3 = bitcast i8* %i2 to [12 x [16 x i16]]*
  br label %bb4

bb4:                                              ; preds = %bb4, %bb
  %i5 = phi i64 [ %i10, %bb4 ], [ 0, %bb ]
  %i6 = getelementptr inbounds [12 x [16 x i16]], [12 x [16 x i16]]* %i1, i64 0, i64 3, i64 %i5
  %i7 = bitcast i16* %i6 to <8 x i16>*
  store <8 x i16> zeroinitializer, <8 x i16>* %i7
  %i8 = getelementptr inbounds [12 x [16 x i16]], [12 x [16 x i16]]* %i3, i64 0, i64 3, i64 %i5
  %i9 = bitcast i16* %i8 to <16 x i8>*
  store <16 x i8> zeroinitializer, <16 x i8>* %i9
  %i10 = add nuw i64 %i5, 8
  br label %bb4
}

trunk:

	vgbm	%v0, 0
	aghi	%r2, 27644              ### Single increment
.LBB0_1:                                # %bb4
                                        # =>This Inner Loop Header: Depth=1
	vst	%v0, 0(%r2), 3
	vst	%v0, 480(%r2), 3
	la	%r2, 16(%r2)
	j	.LBB0_1

if rejecting big offsets for the vector stores:

	vgbm	%v0, 0
	lay	%r1, 28124(%r2)         ### Two separate regs
	lay	%r2, 27644(%r2)
	lghi	%r3, 0
.LBB0_1:                                # %bb4
                                        # =>This Inner Loop Header: Depth=1
	vst	%v0, 0(%r3,%r2), 3
	vst	%v0, 0(%r3,%r1), 3
	la	%r3, 16(%r3)
	j	.LBB0_1

This is a small example, but it illustrates that LSR recognizes that the two offsets are out of range, and it then puts them in two separate registers. In this case it seems that trunk LSR managed to choose the better formula even while "thinking" the huge offsets were in range. Maybe the huge offset was added before the loop as a general heuristic to reduce the offsets in the fixups...? In real code, those LAYs are more plentiful and also appear inside the loop... :-/

So my simple idea here is to have LSR group these fixups together under one reg even though they are initially unfoldable: Together they form a group which can have foldable offsets if the right immediate is added to the formula before the loop. It seems to me that this would make sense and is kind of simply missing - unless there is some other reason for not doing this?

The first step is to make sure the fixup groups are formed under one LSRUse when they are all out of range, but within range between themselves.

When rejecting the addressing modes properly for the vector-type, I realized I also had to adjust these new groups (second step), which is done in adjustInitialFormulaeForOffsets(). This is what happens:

Fixup #1:   Offset = 28124  => Initial formula:  reg(A) + 28124
Fixup #2:   Offset = 27644  => reg(A) - 480.

reconcileNewOffset() checked that the distance between the offsets of #1 and #2 is in range, which it is. However, the offset -480 is not legal, so adjustInitialFormulaeForOffsets() adjusts the formula by subtracting 480 from it and adds 480 to the fixup offsets:

Initial formula, adjusted: reg(A) + 27644
Fixup #1:   reg(A) + 480
Fixup #2:   reg(A) +   0

In English: SystemZ::isLegalAddressingMode() returns true for a VectorType only if the offset is within 0-4095, so all fixups can only use such an offset against the formula.
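Here is a minimal standalone sketch of this rebase step, under the same assumptions as before (the 0-4095 window stands in for the target's legal displacement range, and the names are illustrative; this is not the actual adjustInitialFormulaeForOffsets() code):

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct RebasedGroup {
  int64_t FormulaOffset;        // immediate folded into the formula: reg(A) + FormulaOffset
  std::vector<int64_t> Fixups;  // per-fixup offsets relative to the formula
};

// Fold the smallest absolute offset into the formula so that every fixup
// offset becomes non-negative and (by construction of the group) legal.
RebasedGroup rebaseGroup(const std::vector<int64_t> &AbsoluteOffsets) {
  RebasedGroup G;
  G.FormulaOffset =
      *std::min_element(AbsoluteOffsets.begin(), AbsoluteOffsets.end());
  for (int64_t Off : AbsoluteOffsets) {
    int64_t Rel = Off - G.FormulaOffset;
    assert(Rel >= 0 && Rel <= 4095 && "group wider than the legal range");
    G.Fixups.push_back(Rel);
  }
  return G;
}

// For the example above: {28124, 27644} -> formula offset 27644, fixups {480, 0}.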

The reason this adjustment is made after the fact (of creating the initial formula) is that it can only be done after the group of fixups has been found. But maybe this could be gotten for free if the user instructions were sorted by offset somehow so that the lowest offset would be seen first, which should give the same result...?

I left some "EXPERIMENTAL" (/SystemZ) options in the patch for now, so that anyone who wanted could look into this. To investigate the small loop above:

trunk:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o -

reject large displacements for the vector type (as it should be), with worse results:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o - -legalam-vec

to handle the problem with this patch:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o - -legalam-vec -lsr-unfolded-offs

Is this making sense? Is there some other way of handling this that would be even better? Of course those addresses could be somewhat cleaned up later, but it would really be best to handle the loop addressing in LSR, as loops are after all very important.

llvm/include/llvm/Analysis/TargetTransformInfo.h
721

Yeah, that's a better name :-)

Herald added a project: Restricted Project.Mar 1 2022, 2:32 PM

Hi Jonas,

> It seems to me that this would make sense and is kind of simply missing

Makes sense to me.

I only glanced at the code, but the approach looks reasonable.
I haven't touched LSR code for quite some time so if someone else can do the actual review that would be best, but if not, I'll jump in.

Cheers,
-Quentin

jonpa added a comment.May 24 2022, 1:49 AM

I currently don't see any performance wins on SystemZ with this patch, although there are still nice improvements in the output. My original motivation for this patch was a 5-10% improvement of lbm, but although I see that same hot loop still being helped, there is no longer any impact on performance.

On SystemZ I see over SPEC (instruction counts: trunk / with patch / difference):

lay            :                55777                55520     -257
agfi           :                  426                  297     -129
lgr            :               842567               842448     -119
lg             :              1049005              1048942      -63
...
OPCDIFFS: -705

This is a nice improvement (especially as this is all about loops), but I am not sure it would be right to extend LSR with more than just a few lines without a clear motivation. So unless there is interest in this from other targets, I will pause my work on this until I see a clear win from it. As mentioned earlier, there might be simpler ways of achieving the same result as the patch, but in principle it does what it needs to now.

eopXD added a reviewer: Restricted Project.May 24 2022, 6:13 PM
eopXD added a subscriber: eopXD.May 24 2022, 7:14 PM

Hi Jonas,

I am trying to wrap my head around this patch. I would really appreciate
it if you could tell me more about your approach (or please correct me if I am
saying something stupid).

It sounds like you are trying to have alternative Formulas when two
or more LSRUses have mutually unfoldable offsets. This sounds like what
LSRInstance::GenerateConstantOffsets is doing. Won't adding more
Formulas to an LSRUse do the job (rather than overriding the offset)?

jonpa added a comment.May 25 2022, 1:45 AM

> Hi Jonas,
>
> I am trying to wrap my head around this patch. I would really appreciate
> it if you could tell me more about your approach (or please correct me if I am
> saying something stupid).
>
> It sounds like you are trying to have alternative Formulas when two
> or more LSRUses have mutually unfoldable offsets. This sounds like what
> LSRInstance::GenerateConstantOffsets is doing. Won't adding more
> Formulas to an LSRUse do the job (rather than overriding the offset)?

Hi Yueh-Ting,

thanks for taking a look :-)

LSR begins by looking over the interesting instructions and sorting them into groups (LSRUses) in CollectFixupsAndInitialFormulae(). This is needed to generate efficient code, i.e. to use the same IV register for more than just a single instruction if possible. This is implemented in getUse(): a previous LSRUse with the same SCEV and Kind is searched for, and if one is found it is reused, provided that reconcileNewOffset() succeeds. If the offset does not fit with the pre-existing LSRUse, a new LSRUse is created instead.

The problem with your idea of using GenerateConstantOffsets() is that, per the above, the two instructions end up in their own LSRUses with their own sets of formulae, which will not work as they will always remain separate. With that said, it might be interesting to experiment with GenerateCrossUseConstantOffsets() to see if the problem could be solved that way. There is however an advantage to putting the instructions in the same LSRUse right away: LSR generates a lot of formulae and then prunes the search space for the sake of compile time, so I would say getting something obvious like this right early is better.
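To make that flow concrete, here is a much simplified standalone sketch of the reuse-or-create decision. The real getUse() keys on the SCEV and the LSRUse kind and reconcileNewOffset() consults the target, so everything below (the string key, the 4095 limit, the names) is an illustrative assumption only:

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Use {
  std::vector<int64_t> Offsets; // fixup offsets already attached to this use
};

// Illustrative stand-in for reconcileNewOffset(): accept the new offset if it
// stays within a foldable distance (assumed 0..4095) of the existing offsets.
static bool reconcileNewOffset(const Use &U, int64_t NewOff) {
  for (int64_t Off : U.Offsets) {
    int64_t Dist = NewOff > Off ? NewOff - Off : Off - NewOff;
    if (Dist > 4095)
      return false;
  }
  return true;
}

// Illustrative stand-in for getUse(): look for an existing use with the same
// (expression, kind) key and reuse it only if the new offset reconciles with
// it; otherwise create a new use. Returns the index of the chosen use.
size_t getUse(std::map<std::pair<std::string, int>, std::vector<Use>> &Uses,
              const std::string &Expr, int Kind, int64_t Offset) {
  std::vector<Use> &Candidates = Uses[{Expr, Kind}];
  for (size_t I = 0; I != Candidates.size(); ++I)
    if (reconcileNewOffset(Candidates[I], Offset)) {
      Candidates[I].Offsets.push_back(Offset);
      return I;
    }
  Candidates.push_back(Use{{Offset}});
  return Candidates.size() - 1;
}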

Did you try the patch on your target and find any improvements?

eopXD added a comment.May 25 2022, 6:33 AM

Thank you for the clear explanation. The approach makes sense to me.
However, I can't successfully apply the patch; could you rebase it?

It looks like LSR can certainly find improvements here, at least judging by the
code-gen I observed from your test case on the RISC-V backend.

Other than the rebase, could you add debug output to show the changes made
by your enhancement? Thanks for the work!

jonpa updated this revision to Diff 432326.May 26 2022, 10:25 AM

> However, I can't successfully apply the patch; could you rebase it?

Patch rebased.

> It looks like LSR can certainly find improvements here, at least judging by the code-gen I observed from your test case on the RISC-V backend.

I agree, but we need to find real benchmark improvements to motivate this, I think...

> Other than the rebase, could you add debug output to show the changes made by your enhancement? Thanks for the work!

I will do that later if there is interest in getting the patch committed...

lebedev.ri resigned from this revision.Jan 12 2023, 5:30 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.