This is an archive of the discontinued LLVM Phabricator instance.

[LSR] Canonicalize reg1 + ... + regN into reg1 + ... + 1*regN
ClosedPublic

Authored by qcolombet on May 19 2014, 1:59 PM.

Details

Reviewers
atrick
Summary

Hi,

This patch introduces a canonical representation for the formulae used in loop strength reduce.

Thanks for your review/feedbacks.

  • Context **

Loop strength reduce represents differently the formulae reg1 + reg2 + ... + regN and reg1 + reg2 + ... + 1*regN. Indeed, the first form keeps the list of registers reg1...regN within the BaseRegs field, whereas the second keeps the list of registers reg1...regN-1 within the BaseRegs field and regN in the ScaledReg field.

These two representations cannot live at the same time in an LSR instance (they are uniqued the same way and thus cannot be inserted together in the set of formulae) but yield different costs.
Moreover, at several places in our cost model we assume that ScaledReg == nullptr implies that only one base register is set, which is currently wrong.
For instance, an addressing mode using reg1 + reg2 will not query the target hook getScalingFactorCost, whereas reg1 + 1*reg2 will!

  • Proposed Solution **

This patch introduces a canonical representation for the formulae.
Basically, as soon as a formula has more that one base register, the scaled register field is used for one of them. The register put into the scaled register is preferably a loop variant.
The proposed patch refactors how the formulae are built in order to produce such representation.
This yields a more accurate, but still perfectible, cost model.

Note: This patch will change the way the code is generated for x86 since even a simple reg1 + reg2 addressing mode is more expensive than a simple reg1 access for this target. To some extend this patch may change the code for other targets as the cost model is supposedly more accurate.

Note2: We could do some more refactoring, (see the FIXME for instance) but those would induce a change in behavior whereas the goal here was to not change the behavior if possible.

  • Current Results **

I have benchmarked the patch on a Ivy Bridge machine fixed at 2900Mhz for both O3 and Os. On average I do not see any difference, but there are a few speed-ups.
Here are the results for the best of 10 runs for tests that run for at least 1 second. Smaller is better.

  • O3 *

Benchmark_ID Reference Test Expansion Percent

CINT2000/186.crafty/186.crafty 4.2485 4.2097 0.99 -1%
CINT2000/253.perlbmk/253.perlbmk 5.7467 5.5281 0.96 -4%
CINT2000/300.twolf/300.twolf 3.2226 3.2407 1.01 +1%
Polybench/stencils/fdtd-apml/fdtd-apml 1.0065 1.0123 1.01 +1%
Shootout-C++/lists 10.8331 10.8998 1.01 +1%
Shootout-C++/sieve 1.9528 1.9906 1.02 +2%
Shootout/lists 5.0692 5.1081 1.01 +1%
Shootout/matrix 1.4662 1.4505 0.99 -1%
TSVC/ControlFlow-dbl/ControlFlow-dbl 3.629 3.6646 1.01 +1%
TSVC/ControlFlow-flt/ControlFlow-flt 3.1577 3.215 1.02 +2%
TSVC/CrossingThresholds-flt/CrossingThr 2.3061 2.2619 0.98 -2%
lemon/lemon 1.151 1.141 0.99 -1%

llubenchmark/llu 3.8547 3.8213 0.99 -1%

Min (13) - - 0.96 -

Max (13) - - 1.02 -

Sum (13) 48 48 1 +0%

A.Mean (13) - - 1 +0%

G.Mean 2 (13) - - 1 +0%

Regressions:

  • CINT2000/300.twolf/300.twolf: The changes should not affect the performances. I have checked with IACA (Intel Architecture Code Analyzer) the few loops that are modified and the throughput and latency are identical.

I have looked at the encoding and the new encoding for this case is slightly more compact (incq uses 1-byte less encoding space than addq). Some alignment issues? Anyhow, this seems a side effect.

  • Polybench/stencils/fdtd-apml: Slightly different spill placement. Average is better after the refactoring but the minimum is slightly better for the old implementation, though in the noise (< 0.006s).
  • Shootout-C++/lists: Noise, same binary.
  • Shootout-C++/sieve: Noise, same binary.
  • Shootout/lists: Noise, same binary.
  • TSVC/ControlFlow-dbl/ControlFlow-dbl: The code looks slightly better: With the refactoring we got rid of one complex addressing mode in a loop. I do not know why we are observing that! Looks like a side effect too.
  • TSVC/ControlFlow-flt/ControlFlow-flt: Same as ControlFlow-dbl.
  • Os *

Benchmark_ID Reference Test Expansion Percent

7zip/7zip-benchmark 8.1713 8.1235 0.99 -1%
ASC_Sequoia/AMGmk/AMGmk 7.0611 6.9804 0.99 -1%
CFP2006/447.dealII/447.dealII 15.8169 15.294 0.97 -3%
CINT2000/175.vpr/175.vpr 2.8925 2.8674 0.99 -1%
CINT2000/197.parser/197.parser 2.2631 2.2849 1.01 +1%
CINT2000/300.twolf/300.twolf 3.2271 3.1912 0.99 -1%
CINT2006/401.bzip2/401.bzip2 2.0734 2.0838 1.01 +1%
CINT2006/456.hmmer/456.hmmer 2.5935 2.5781 0.99 -1%
CINT2006/464.h264ref/464.h264ref 12.3972 12.4701 1.01 +1%
CoyoteBench/lpbench 2.9263 2.9418 1.01 +1%
SIBsim4/SIBsim4 2.7361 2.6705 0.98 -2%
Shootout-C++/lists 10.8231 10.9064 1.01 +1%
TSVC/ControlFlow-flt/ControlFlow-flt 3.2565 3.3259 1.02 +2%
TSVC/CrossingThresholds-dbl/CrossingThr 3.4094 3.2462 0.95 -5%
TSVC/CrossingThresholds-flt/CrossingThr 2.6827 2.4896 0.93 -7%
lemon/lemon 1.1847 1.0853 0.92 -8%

mafft/pairlocalalign 24.6637 24.9703 1.01 +1%

Min (17) - - 0.92 -

Max (17) - - 1.02 -

Sum (17) 108 108 0.99 +1%

A.Mean (17) - - 0.99 -1%

G.Mean 2 (17) - - 0.99 -1%

Regressions:

  • CINT2000/197.parser/197.parser: Similar assembly, IACA gives the same throughput for the few loops that changed that I have checked.
  • CINT2006/401.bzip2/401.bzip2: The picked solutions for the loops in blocksort.c are slightly different due to the new cost. This results in different register allocation and different spill code placement. I do not see any wrong here.
  • CINT2006/464.h264ref/464.h264ref: Some of the loops (in dct_luma and dct_chroma for instance) avoid the cost of the 1* scaling factor but use a different comparison (i.e., increase in ImmCost). As a result, the comparison is now against 2 instead of 0. Therefore, we were previously able to remove the comparison since the addq was setting the zero flag but now we have to keep the comparison. That said, according to IACA we end up with 16 uops vs. 15 uops, but thanks to the new unscaled addressing mode, we have a throughput of 3.75 cycles vs. 3.9.
  • CoyoteBench/lpbench: Same as O3 - CINT2000/300.twolf/300.twolf.
  • Shootout-C++/lists: Noise, same binary.
  • TSVC/ControlFlow-flt/ControlFlow-flt: Same as O3 TSVC/ControlFlow-flt/ControlFlow-flt.
  • mafft/pairlocalalign: Same as CoyoteBench/lpbench.

rdar://problem/16731508

Thanks,
-Quentin

Diff Detail

Event Timeline

qcolombet updated this revision to Diff 9583.May 19 2014, 1:59 PM
qcolombet retitled this revision from to [LSR] Canonicalize reg1 + ... + regN into reg1 + ... + 1*regN.
qcolombet updated this object.
qcolombet edited the test plan for this revision. (Show Details)
qcolombet added a reviewer: atrick.
qcolombet added a subscriber: Unknown Object (MLST).

PS: Tested version was trunk r208630.

test/CodeGen/X86/avoid_complex_am.ll
25

Note: Since reg1 + <{0,+,8}> has the same cost as reg1 + <{0,+,1}> * 8 on x86, IIRC, in that example LSR chooses the second as the ImmCost is lower.

test/CodeGen/X86/masked-iv-safe.ll
8

Same here.

atrick accepted this revision.May 19 2014, 5:10 PM
atrick edited edge metadata.

Great work cleaning up this aspect of LSR!

lib/Transforms/Scalar/LoopStrengthReduce.cpp
839–841

I don't understand the name isAMCompletelyFolded(TTI, LU, F) in this case. It seems to be used to mean "can fold exactly 2 registers".

1454–1456

Word wrap.

This revision is now accepted and ready to land.May 19 2014, 5:10 PM
qcolombet added inline comments.May 19 2014, 5:26 PM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
839–841

I killed "can fold exactly 2 registers” (no longer needed[1]) and renamed the isLegalUse into isAMCompletelyFolded (isLegalUse has a different semantic now).
So, now isAMCompletelyFolded is used for every cost related decision.

The fact that it ends up in this specific position in the diff is indeed confusing but necessary because of its uses.

[1] Thanks to the canonicalization, we have now the "two registers" in BaseRegs and ScaledReg. The base register is built with the successive base adds and we add the scaled reg on top of that according to whether or not it is folded or not. More specifically this code:

 // Determine how many (unfolded) adds we'll need inside the loop.
size_t NumBaseParts = F.getNumRegs();
if (NumBaseParts > 1)
  // Do not count the base and a possible second register if the target
  // allows to fold 2 registers.
  NumBaseAdds += NumBaseParts - (1 + isAMCompletelyFolded(TTI, LU, F));
NumBaseAdds += (F.UnfoldedOffset != 0);
1454–1456

Good catch!
Looks like clang-format does not reconcile the comments when it splits them.

qcolombet updated this revision to Diff 9630.May 20 2014, 10:02 AM
qcolombet edited edge metadata.

Refine the comments on isAMCompletelyFolded.
Fix a think-o in the base adds computation.

Hi Andy,

Thanks again for the feedbacks and the review!

Right. I was looking at the use of this function in exactly this code. isAMCompletelyFolded seems to mean that we can use two registers in the addressing mode for free. If we have 2 baseregs + scaled reg, the code will assume we need an extra add.

Correct!

I guess “completely” folded means 1 base + 1 scaled + offset can be folded, but never anything else. That’s what I would expect, but the name is a tad misleading without any comments.

Exactly. This matches indeed the existing behavior. I have updated the comments to detail that.
What do you think?

I derived the naming from the original comment of isLegalUse and retrospectively, this is not great. If you have ideas I take them :).

Thanks,
-Quentin

The extra comments help. Thanks. LGTM.

qcolombet closed this revision.May 20 2014, 12:32 PM

Thanks!

Committed revision 209230.

This is a really nice improvement. Thanks, guys!

-Jim