This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Enable Upper bound unrolling universally
ClosedPublic

Authored by jaykang10 on Jul 14 2021, 9:31 AM.

Details

Summary

In the attached test, the LoopUnrollPass fails to unroll the loop because its trip count is not a constant: %0 = select i1 %cond, i32 2, i32 3.
In order to unroll the loop, we need to set UP.UpperBound = true in getUnrollingPreferences.
Currently, this is only enabled under certain target-specific conditions on AArch64. This patch enables it universally for the AArch64 target.
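
For reference, the mechanics of the change are small. Below is a minimal sketch of the hook in question, assuming the signature at the time of this revision and eliding the surrounding heuristics (an illustration, not the verbatim patch):

    // llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
    void AArch64TTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                                 TTI::UnrollingPreferences &UP) {
      // ... existing unrolling heuristics ...

      // Allow the unroller to use the maximum trip count when the exact trip
      // count is unknown. For the select above, the trip count is 2 or 3, so
      // the loop can be fully unrolled 3 times with early exits.
      UP.UpperBound = true;
    }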

The performance data from SPEC2017 is below.

Benchmark		diff (%)
500.perlbench_r		1.226130632
502.gcc_r		0.22529927
505.mcf_r		0.00061882
520.omnetpp_r		-0.094707343
523.xalancbmk_r		-0.125968029
525.x264_r		0.45726427
531.deepsjeng_r		0.326290873
541.leela_r		-0.047175519
548.exchange2_r		0.101954068
557.xz_r		0.275802637

Diff Detail

Event Timeline

jaykang10 created this revision.Jul 14 2021, 9:31 AM
jaykang10 requested review of this revision.Jul 14 2021, 9:31 AM
Herald added a project: Restricted Project.Jul 14 2021, 9:31 AM
jaykang10 updated this revision to Diff 358644.Jul 14 2021, 9:37 AM
dmgreen added inline comments.Jul 14 2021, 10:25 AM
llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll
43

Does the test need an unreachable in it? They (and undef) can make tests less reliable in the long run as other code in LLVM changes.

The perf data looks a bit thin, and I guess this is for one CPU, the N1? I'm not sure if we need some more data (and a different core) to confirm this is a good thing - or do we think this is enough?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1757

This is vague and, without the context of this patch, not very informative:

not dependant upon the conditions below.

I'd say just get rid of this part.

jaykang10 added inline comments.Jul 15 2021, 1:14 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1757

Yep, let me remove it.

llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll
43

Sorry, let me remove it.

jaykang10 updated this revision to Diff 358862.Jul 15 2021, 1:18 AM

Following comments from @SjoerdMeijer and @dmgreen, updated patch.

jaykang10 added a comment.EditedJul 15 2021, 3:51 AM

The perf data looks a bit thin, and I guess this is for one CPU, the N1? I'm not sure if we need some more data (and a different core) to confirm this is a good thing - or do we think this is enough?

The performance data from SPEC2006 for AArch64 is below.

Benchmark		diff (%)
400.perlbench		0.228714529
401.bzip2		-0.943491024
403.gcc		0.515920164
429.mcf		2.09023576
445.gobmk		0.771512067
456.hmmer		0.083101943
458.sjeng		0.203260432
462.libquantum		4.381950477
464.h264ref		0.300054596
471.omnetpp		2.604745332
473.astar		1.309848679
483.xalancbmk		1.884877644

It looks like there was a hiccup on the testing machine, so the performance data has been updated.

dmgreen accepted this revision.Jul 19 2021, 12:28 AM

Oh OK, they do look better. It's standard in LLVM to run the llvm-test-suite when reporting performance results, to test a wider range of cases. It can be quite noisy though, so you need to ensure the results are statistically significant.

This option sounds OK to me. I think it could probably be enabled universally as part of runtime unrolling. But if it is a separate option, I have no issues with enabling it for AArch64, and the results you have look OK. LGTM

This revision is now accepted and ready to land.Jul 19 2021, 12:28 AM
fhahn added a subscriber: fhahn.Jul 19 2021, 1:01 AM

I'm not sure it is a good idea to enable this universally after measuring on just a single core in a single configuration. From the data you shared, it is not clear which build settings were used (-O3, -Os, LTO, PGO?), and it would also be good to specify which metric you are showing (score, I guess, but it is not clear). For changes like this in general, I think it would be good to see data on as many impacted configurations as possible to rule out negative impacts. I am also curious what the impact is at -Os, and it would be good to check Geekbench as well.

Last time I measured this for Apple out-of-order cores, it was not really beneficial on a wider set of benchmarks, but things might have changed. I'd recommend only enabling it for the cores you measured it on (or cores you are confident are very similar).

Um... I think it is normally better to unroll more loops. At the least, it reduces the number of branches and creates opportunities for post/pre-indexed loads and stores. It could cause more register pressure and spill code, but that should be handled by the unroll cost model in the pass. In my opinion, it is better to enable upper-bound unrolling; if it causes performance degradation, the pass's cost model needs to account for that. I do not think it is good to add 'if' statements for particular combinations of conditions, of which there are many variants.
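
To make this concrete, here is a hypothetical source-level illustration (not taken from the patch) of what upper-bound unrolling does to a loop whose trip count is only known to be at most 3:

    // Before: the trip count is data-dependent, so normal full unrolling
    // cannot fire; only the upper bound (3) is known.
    int sum(const int *a, bool cond) {
      int n = cond ? 2 : 3;
      int s = 0;
      for (int i = 0; i < n; ++i)
        s += a[i];
      return s;
    }

    // After upper-bound unrolling, conceptually: straight-line code with an
    // exit test per iteration instead of a back-edge branch, and loads at
    // constant offsets that fold into addressing modes.
    //   s += a[0]; if (n == 1) return s;  // folds away: n is at least 2
    //   s += a[1]; if (n == 2) return s;
    //   s += a[2]; return s;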

Sorry @fhahn, I may be imagining the out-of-order situation wrongly. If possible, can you please share a case in which upper-bound unrolling makes performance worse on out-of-order cores? My thinking was that out-of-order cores do dynamic instruction scheduling in hardware, and while the unrolling could fill the instruction window with instructions from a smaller scope, the other benefits of a fully unrolled loop would bridge that gap... If I missed something, please let me know.

fhahn added a comment.Jul 21 2021, 1:45 AM

I think the impact may be highly specific to the uarch, hence it would be good to test the change on different CPUs. There may be very little benefit from unrolling on its own for CPUs that have enough execution units to hide the branch overhead, for example. As you mentioned, unrolling could also cause more spilling or cause some other optimisations not to trigger. I agree the latter are issues in LLVM and should be fixed.

But it would be good to iron out potential issues before landing the patch, which is why I think it would be good to also test with other build options (unless I missed something, I don't think there is any mention of the configuration you used to produce the numbers).

I can test the patch on some CPUs, but I will only be able to do so next week, as I don't have access to my devices at the moment. I'm not sure about other vendors, but if you have access to other Arm OOO cores, it might be worth double-checking those as well.

Yep, thanks for the good suggestion, @fhahn. I am testing this unrolling patch on other AArch64 machines. Once I have the results, I will share the details: configuration, scores, etc.

Matt added a subscriber: Matt.Jul 22 2021, 3:35 AM
jaykang10 added a comment.EditedJul 29 2021, 6:57 AM

I have checked the performance numbers from the SPEC benchmarks on a neoverse-n1 core.

spec2006 O3 lto	
Benchmark	score improvement(%)
400.perlbench	0.35011018
401.bzip2	-3.003732177
403.gcc	0.171077131
429.mcf	2.115844504
445.gobmk	0.201857868
456.hmmer	-0.143194012
458.sjeng	0.207366997
462.libquantum	3.935575739
464.h264ref	0.167041326
471.omnetpp	1.355350874
473.astar	0.733528789
483.xalancbmk	1.064585199
	
spec2006 Ofast	
Benchmark	score improvement(%)
400.perlbench	-0.213126131
401.bzip2	3.244142727
403.gcc	1.345302812
429.mcf	0.194343375
445.gobmk	0.766024658
456.hmmer	-0.096760298
458.sjeng	0.078527561
462.libquantum	0.214315254
464.h264ref	0.957874944
471.omnetpp	0.057501946
473.astar	0.296876709
483.xalancbmk	0.261506157
	
spec2017 O3 lto	
Benchmark	score improvement(%)
500.perlbench_r	1.094911941
502.gcc_r	0.559111539
505.mcf_r	0
520.omnetpp_r	-0.140907845
523.xalancbmk_r	0.564247335
525.x264_r	0.198455239
531.deepsjeng_r	-0.081793385
541.leela_r	0.038333017
548.exchange2_r	0.009064456
557.xz_r	0.167682904
	
spec2017 Ofast	
Benchmark	score improvement(%)
500.perlbench_r	0.708957995
502.gcc_r	0.871004714
505.mcf_r	0
520.omnetpp_r	1.416668547
523.xalancbmk_r	-0.30611
525.x264_r	0.20238427
531.deepsjeng_r	0.550866022
541.leela_r	0.095203276
548.exchange2_r	0.002682641
557.xz_r	0.896193368

There could be noise... Once I get performance numbers on another core, I will add them here.

jaykang10 retitled this revision from [AArch64] Enable Upper bound unrolling universally to [AArch64] Enable Upper bound unrolling for neoverse-n1 core.Jul 29 2021, 6:59 AM
jaykang10 retitled this revision from [AArch64] Enable Upper bound unrolling for neoverse-n1 core to [AArch64] Enable Upper bound unrolling universally.Jul 29 2021, 7:27 AM
dmgreen added a comment.EditedJul 29 2021, 8:32 AM

Do you have numbers for other benchmarks too? I believe that whether the option is beneficial - although it can depend on the sophistication of the CPU's branch predictor - depends more on the code being compiled. Some code is better suited to this type of unrolling, in some cases because it helps further simplification, or it just happens not to run into the drawbacks of the extra branches. It appears that SPEC is a win - I'm wondering if the llvm-test-suite, as a larger selection of tests, would show the same.
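
As a hypothetical illustration of the "further simplification" point (not from the patch): once each unrolled copy sees a constant induction variable, later passes can fold the body down.

    int poly(const int *c, int n) { // n is only known to be at most 3
      int s = 0;
      for (int i = 0; i < n; ++i)
        s += c[i] << i; // upper-bound unrolling plus constant folding turns
      return s;         // this into: s += c[0]; s += c[1] << 1; s += c[2] << 2;
    }                   // (each copy guarded by an exit test on n)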

Yep, let me try to get numbers from the llvm-test-suite too.

The results of the llvm-test-suite on neoverse-n1 are as below.

Metric: exec_time

Program                                        results_org results_mod diff 
 test-suite..._MemCmp<5, LessThanZero, None>   1464.48     1586.24      8.3%
 test-suite...mp<16, GreaterThanZero, First>   469.91      494.79       5.3%
 test-suite...ing-dbl/Equivalencing-dbl.test     1.54        1.62       5.0%
 test-suite...dbl/LoopRestructuring-dbl.test     2.52        2.62       4.0%
 test-suite...mCmp<5, GreaterThanZero, Last>   1529.02     1585.66      3.7%
 test-suite...ow-dbl/GlobalDataFlow-dbl.test     2.59        2.66       2.6%
 test-suite...BENCHMARK_asinf_autovec_float_   343.94      352.35       2.4%
 test-suite...netbench-crc/netbench-crc.test     0.62        0.63       2.2%
 test-suite...l/StatementReordering-dbl.test     2.67        2.73       2.1%
 test-suite...Raw.test:BM_HYDRO_1D_RAW/44217    39.04       39.86       2.1%
 test-suite...w.test:BM_INT_PREDICT_RAW/5001    35.27       35.93       1.9%
 test-suite...mbolics-dbl/Symbolics-dbl.test     1.74        1.77       1.8%
 test-suite...est:BM_ENERGY_CALC_LAMBDA/5001    62.91       64.01       1.7%
 test-suite...bl/CrossingThresholds-dbl.test     2.58        2.62       1.7%
 test-suite...-dbl/LinearDependence-dbl.test     2.75        2.79       1.6%
 Geomean difference                                                     nan%
         results_org    results_mod        diff
count  582.000000     582.000000     581.000000
mean   2669.685396    2665.482627   -0.000234  
std    28714.786308   28644.777094   0.009457  
min    0.606500       0.600200      -0.092831  
25%    3.190475       3.185000      -0.000765  
50%    131.649140     131.477139     0.000016  
75%    602.095967     602.268365     0.000851  
max    486980.720000  484782.360000  0.083140

On ThunderX2, the scores from SPEC2017 are as below.

spec2017		
Ofast		score improvement(%)
500.perlbench_r		0.794469107
502.gcc_r		-0.113128593
505.mcf_r		-0.299424705
520.omnetpp_r		-0.111756111
523.xalancbmk_r		-0.056942476
525.x264_r		-0.005589718
531.deepsjeng_r		0.006257447
541.leela_r		-0.064192085
557.xz_r		0.919346409
		
		
spec2017		
O3 lto		score improvement(%)
500.perlbench_r		-0.772206292
502.gcc_r		-0.015262341
505.mcf_r		0.517061187
520.omnetpp_r		0.456657306
523.xalancbmk_r		0.914423559
525.x264_r		-0.287806217
531.deepsjeng_r		0.254324778
541.leela_r		0.01651545
557.xz_r		-0.40528841

Thank you very much for sharing the additional numbers!

A few regressions jump out. Do you know if those are real?

  • SPEC2017 ThunderX2 -O3 -flto : 500.perlbench_r -0.772206292
  • SPEC2006 neoverse-n1 core. -O3 -flto: 401.bzip2 -3.003732177

I think by default compare.py only shows a small fixed number of results. You can show all of them by passing --all. You can filter out binaries unchanged between runs by using --filter-hash. IIUC, from the data here it looks like there are a few notable regressions in this test set. Do you know if those are noise or actual regressions?
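
For instance, a typical invocation might look like the following (assuming the llvm-test-suite checkout layout; the result-file names are hypothetical):

    utils/compare.py --all --filter-hash results_org.json results_mod.json

Here --filter-hash drops tests whose binaries hashed identically in both runs, so only code actually changed by the patch is compared.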

I have not looked into them in detail yet. Let me check next week. If possible, can you also share the data from your side, please?

fhahn added a comment.Aug 3 2021, 7:05 AM

So far I have run SPEC2017, and all changes look to be within the noise. Still waiting on Geekbench.

jaykang10 added a comment.EditedAug 16 2021, 4:32 AM

Thanks for sharing it.

  • SPEC2017 ThunderX2 -O3 -flto : 500.perlbench_r -0.772206292
  • SPEC2006 neoverse-n1 core. -O3 -flto: 401.bzip2 -3.003732177

I have run them multiple times again, and it looks like those regressions are not real. The results are as below.

SPEC2006 neoverse-n1 core. -O3 -flto: 401.bzip2: mean 0.32% score improvement
SPEC2017 ThunderX2 -O3 -flto : 500.perlbench_r: mean 0.65% score improvement

I think by default compare.py only shows a small fixed number of results. You can show all of them by passing --all. You can filter out binaries unchanged between runs by using --filter-hash. IIUC, from the data here it looks like there are a few notable regressions in this test set. Do you know if those are noise or actual regressions?

I have run it with -j 1 multiple times again, and it looks like those numbers are not real either... The results with --filter-hash are as below.

Metric: exec_time

/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py:308: RuntimeWarning: divide by zero encountered in log
  log_a = np.log(np.array(a, dtype=dtype))
Program                                        results.org results.mod diff  
 test-suite...ootout/Shootout-ackermann.test     0.00        0.01      100.0%
 test-suite.../Prolangs-C++/simul/simul.test     0.00        0.01      100.0%
 test-suite...out-C++/Shootout-C++-ary2.test     0.01        0.02      94.9% 
 test-suite...plications/d/make_dparser.test     0.01        0.02      66.1% 
 test-suite...ijndael/security-rijndael.test     0.02        0.03      60.3% 
 test-suite...marks/Stanford/Bubblesort.test     0.02        0.02      25.7% 
 test-suite...hmarks/VersaBench/bmm/bmm.test     1.75        2.01      14.9% 
 test-suite...patricia/network-patricia.test     0.06        0.07      11.9% 
 test-suite...rks/Olden/treeadd/treeadd.test     0.13        0.14      11.7% 
 test-suite...ing/covariance/covariance.test     1.40        1.54      10.6% 
 test-suite...-2d-imper/jacobi-2d-imper.test     0.12        0.13       9.6% 
 test-suite.../Applications/spiff/spiff.test     1.31        1.43       8.9% 
 test-suite...g/correlation/correlation.test     1.26        1.37       8.6% 
 test-suite...rks/FreeBench/mason/mason.test     0.09        0.10       8.2% 
 test-suite...enchmarks/Misc-C++/bigfib.test     0.15        0.16       8.0% 
 Geomean difference                                                     nan% 
         results.org    results.mod          diff
count  760.000000     760.000000     7.580000e+02
mean   2025.018981    2101.590727   -2.405934e-03
std    24798.482931   26215.053402   1.127594e-01
min    0.000000       0.000000      -1.000000e+00
25%    0.919290       0.921498      -6.456376e-04
50%    9.972822       9.972357       5.963245e-07
75%    449.831035     449.823480     1.218490e-03
max    479628.130000  507694.300000  1.000000e+00

I think the results look fine overall. Any objection to pushing this change?

@fhahn Can you share the Geekbench scores, please? If they look OK, I would like to push this change.

fhahn accepted this revision.Aug 20 2021, 2:33 AM

Looks like there are no notable changes to the Geekbench scores either.

LGTM, with a few nits about the test case.

llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll
5

It would be helpful to add a brief comment here to explain what the test is checking.

43

nit: maybe indent the case?

46

nit: can the name of the block be shortened/improved?

47

might also be good to add a call or something here so the loop has some observable effect

49

nit: maybe name this block latch/loop.latch to make it slightly easier to read?

Thanks for checking the Geekbench scores.

After updating the test, let me push this change.

llvm/test/Transforms/LoopUnroll/AArch64/unroll-upperbound.ll
5

Yep, let me add a comment.

43

Yep, let me change it.

46

Yep, let me change it.

47

Yep, let me change it.

49

Yep, let me change it.

jaykang10 updated this revision to Diff 367753.Aug 20 2021, 3:22 AM

Following comments from @fhahn, updated test.

This revision was landed with ongoing or failed builds.Aug 20 2021, 3:27 AM
This revision was automatically updated to reflect the committed changes.