Hi,
This patch introduces a canonical representation for the formulae used in loop strength reduce.
Thanks for your review/feedbacks.
- Context **
Loop strength reduce represents differently the formulae reg1 + reg2 + ... + regN and reg1 + reg2 + ... + 1*regN. Indeed, the first form keeps the list of registers reg1...regN within the BaseRegs field, whereas the second keeps the list of registers reg1...regN-1 within the BaseRegs field and regN in the ScaledReg field.
These two representations cannot live at the same time in an LSR instance (they are uniqued the same way and thus cannot be inserted together in the set of formulae) but yield different costs.
Moreover, at several places in our cost model we assume that ScaledReg == nullptr implies that only one base register is set, which is currently wrong.
For instance, an addressing mode using reg1 + reg2 will not query the target hook getScalingFactorCost, whereas reg1 + 1*reg2 will!
- Proposed Solution **
This patch introduces a canonical representation for the formulae.
Basically, as soon as a formula has more that one base register, the scaled register field is used for one of them. The register put into the scaled register is preferably a loop variant.
The proposed patch refactors how the formulae are built in order to produce such representation.
This yields a more accurate, but still perfectible, cost model.
Note: This patch will change the way the code is generated for x86 since even a simple reg1 + reg2 addressing mode is more expensive than a simple reg1 access for this target. To some extend this patch may change the code for other targets as the cost model is supposedly more accurate.
Note2: We could do some more refactoring, (see the FIXME for instance) but those would induce a change in behavior whereas the goal here was to not change the behavior if possible.
- Current Results **
I have benchmarked the patch on a Ivy Bridge machine fixed at 2900Mhz for both O3 and Os. On average I do not see any difference, but there are a few speed-ups.
Here are the results for the best of 10 runs for tests that run for at least 1 second. Smaller is better.
- O3 *
Benchmark_ID Reference Test Expansion Percent
CINT2000/186.crafty/186.crafty 4.2485 4.2097 0.99 -1%
CINT2000/253.perlbmk/253.perlbmk 5.7467 5.5281 0.96 -4%
CINT2000/300.twolf/300.twolf 3.2226 3.2407 1.01 +1%
Polybench/stencils/fdtd-apml/fdtd-apml 1.0065 1.0123 1.01 +1%
Shootout-C++/lists 10.8331 10.8998 1.01 +1%
Shootout-C++/sieve 1.9528 1.9906 1.02 +2%
Shootout/lists 5.0692 5.1081 1.01 +1%
Shootout/matrix 1.4662 1.4505 0.99 -1%
TSVC/ControlFlow-dbl/ControlFlow-dbl 3.629 3.6646 1.01 +1%
TSVC/ControlFlow-flt/ControlFlow-flt 3.1577 3.215 1.02 +2%
TSVC/CrossingThresholds-flt/CrossingThr 2.3061 2.2619 0.98 -2%
lemon/lemon 1.151 1.141 0.99 -1%
llubenchmark/llu 3.8547 3.8213 0.99 -1%
Min (13) - - 0.96 -
Max (13) - - 1.02 -
Sum (13) 48 48 1 +0%
A.Mean (13) - - 1 +0%
G.Mean 2 (13) - - 1 +0%
Regressions:
- CINT2000/300.twolf/300.twolf: The changes should not affect the performances. I have checked with IACA (Intel Architecture Code Analyzer) the few loops that are modified and the throughput and latency are identical.
I have looked at the encoding and the new encoding for this case is slightly more compact (incq uses 1-byte less encoding space than addq). Some alignment issues? Anyhow, this seems a side effect.
- Polybench/stencils/fdtd-apml: Slightly different spill placement. Average is better after the refactoring but the minimum is slightly better for the old implementation, though in the noise (< 0.006s).
- Shootout-C++/lists: Noise, same binary.
- Shootout-C++/sieve: Noise, same binary.
- Shootout/lists: Noise, same binary.
- TSVC/ControlFlow-dbl/ControlFlow-dbl: The code looks slightly better: With the refactoring we got rid of one complex addressing mode in a loop. I do not know why we are observing that! Looks like a side effect too.
- TSVC/ControlFlow-flt/ControlFlow-flt: Same as ControlFlow-dbl.
- Os *
Benchmark_ID Reference Test Expansion Percent
7zip/7zip-benchmark 8.1713 8.1235 0.99 -1%
ASC_Sequoia/AMGmk/AMGmk 7.0611 6.9804 0.99 -1%
CFP2006/447.dealII/447.dealII 15.8169 15.294 0.97 -3%
CINT2000/175.vpr/175.vpr 2.8925 2.8674 0.99 -1%
CINT2000/197.parser/197.parser 2.2631 2.2849 1.01 +1%
CINT2000/300.twolf/300.twolf 3.2271 3.1912 0.99 -1%
CINT2006/401.bzip2/401.bzip2 2.0734 2.0838 1.01 +1%
CINT2006/456.hmmer/456.hmmer 2.5935 2.5781 0.99 -1%
CINT2006/464.h264ref/464.h264ref 12.3972 12.4701 1.01 +1%
CoyoteBench/lpbench 2.9263 2.9418 1.01 +1%
SIBsim4/SIBsim4 2.7361 2.6705 0.98 -2%
Shootout-C++/lists 10.8231 10.9064 1.01 +1%
TSVC/ControlFlow-flt/ControlFlow-flt 3.2565 3.3259 1.02 +2%
TSVC/CrossingThresholds-dbl/CrossingThr 3.4094 3.2462 0.95 -5%
TSVC/CrossingThresholds-flt/CrossingThr 2.6827 2.4896 0.93 -7%
lemon/lemon 1.1847 1.0853 0.92 -8%
mafft/pairlocalalign 24.6637 24.9703 1.01 +1%
Min (17) - - 0.92 -
Max (17) - - 1.02 -
Sum (17) 108 108 0.99 +1%
A.Mean (17) - - 0.99 -1%
G.Mean 2 (17) - - 0.99 -1%
Regressions:
- CINT2000/197.parser/197.parser: Similar assembly, IACA gives the same throughput for the few loops that changed that I have checked.
- CINT2006/401.bzip2/401.bzip2: The picked solutions for the loops in blocksort.c are slightly different due to the new cost. This results in different register allocation and different spill code placement. I do not see any wrong here.
- CINT2006/464.h264ref/464.h264ref: Some of the loops (in dct_luma and dct_chroma for instance) avoid the cost of the 1* scaling factor but use a different comparison (i.e., increase in ImmCost). As a result, the comparison is now against 2 instead of 0. Therefore, we were previously able to remove the comparison since the addq was setting the zero flag but now we have to keep the comparison. That said, according to IACA we end up with 16 uops vs. 15 uops, but thanks to the new unscaled addressing mode, we have a throughput of 3.75 cycles vs. 3.9.
- CoyoteBench/lpbench: Same as O3 - CINT2000/300.twolf/300.twolf.
- Shootout-C++/lists: Noise, same binary.
- TSVC/ControlFlow-flt/ControlFlow-flt: Same as O3 TSVC/ControlFlow-flt/ControlFlow-flt.
- mafft/pairlocalalign: Same as CoyoteBench/lpbench.
rdar://problem/16731508
Thanks,
-Quentin
I don't understand the name isAMCompletelyFolded(TTI, LU, F) in this case. It seems to be used to mean "can fold exactly 2 registers".