Page MenuHomePhabricator

Cortex-A57 scheduling model for ARM backend (AArch32)
ClosedPublic

Authored by andrew.zhogin on Dec 28 2016, 9:26 PM.

Details

Summary

Implemented Cortex-A57 scheduling model: main code in ARMScheduleA57.td, ARMScheduleA57WriteRes.td.
Small changes in cpp/h files to support required scheduling predicates.
Scheduling model implemented according to http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
andrew.zhogin added subscribers: asl, llvm-commits.
rengolin retitled this revision from Cortex-A57 scheduling model for AMD backend (AArch32) to Cortex-A57 scheduling model for ARM backend (AArch32).
rovka edited edge metadata.Jan 11 2017, 8:56 AM

Hi,

This looks like a lot of work :) Could you please split it up into smaller patches, e.g. to separate the mechanical scheduling definitions from the more structural changes?

Thanks,
Diana

lib/Target/ARM/ARMScheduleA57.td
17 ↗(On Diff #82648)

Typo: superscalar

29 ↗(On Diff #82648)

Could you break this off into a separate patch? I.e. write an initial patch as if everything were r1p0 or later, and another patch with the r0px delta (or the other way around if it's more convenient for you).

86 ↗(On Diff #82648)

Cool, can you share this data?

102 ↗(On Diff #82648)

Isn't this fixed now?

lib/Target/ARM/ARMSubtarget.h
439 ↗(On Diff #82648)

Please don't add this function if it can be avoided. We're trying to get rid of these in the long run (in favor of using subtarget features for each specific behavior).

andrew.zhogin added inline comments.Jan 11 2017, 11:42 AM
lib/Target/ARM/ARMScheduleA57.td
86 ↗(On Diff #82648)

"Common description and scheduling model parameters taken from AArch64" (for the same processor, AArch64SchedA57.td). I will delete this comment.

andrew.zhogin edited edge metadata.

Small fixes according review comments: typo, commentary, obsolete TODO, isCortexA57() function removed.

Hi,

This looks like a lot of work :) Could you please split it up into smaller patches, e.g. to separate the mechanical scheduling definitions from the more structural changes?

Thanks,
Diana

I'm not quite understand what is "more structural changes". New functions in ARMBaseInstrInfo? They are required for scheduling predicates to handle situations like: "Load, register offset, plus", "Load, register offset, minus", "Load, scaled register offset, plus LSL2", ...
And they are not used by any side code.

For example:

// For "Load, register offset, minus" we need +1cyc, +1I
def A57WriteLdrAm3 : SchedWriteVariant<[
  SchedVar<IsLdrAm3NegRegOffPred, [A57Write_5cyc_1I_1L]>,
  SchedVar<NoSchedPred,           [A57Write_4cyc_1L]>
]>;
def : InstRW<[A57WriteLdrAm3], (instregex "LDR(H|SH|SB)$")>;
andrew.zhogin marked 3 inline comments as done.Jan 11 2017, 12:33 PM
andrew.zhogin added inline comments.
lib/Target/ARM/ARMScheduleA57.td
29 ↗(On Diff #82648)

Sure, I will.

jgreenhalgh added inline comments.
lib/Target/ARM/ARM.td
783 ↗(On Diff #84007)

I'm not sure that splitting the option like this is a good idea, certainly this would create an incompatibility with GCC, which does not recognise -mcpu=cortex-a57-r0px. In GCC -mcpu=cortex-a57 enables scheduling for all the optimized instruction pairs (e.g. MOV/MOVT), and uses the r0p0 latency values for the Advanced SIMD multiply accumulate instructions.

In your patch, this extra flag changes the scheduling of vector multiply, multiply accumulate, and the mov/movt instructions. While I can see that this can improve the resulting schedule in some circumstances, my opinion is that fragmenting the option like this is not worth the extra cost of carrying a special option.

rengolin added inline comments.Jan 12 2017, 8:30 AM
lib/Target/ARM/ARM.td
783 ↗(On Diff #84007)

James' point is valid even if you ignore GCC compatibility. Scheduler models are "best effort" at best, so splitting it like this makes no sense. If there are multiple revisions of the same core, then pick the most common and optimise for it.

The only case where you can split like that is for recognised new sub-architectures, like Swift, Kyro, Cyclone, which not only don't have to follow the same scheduling decisions, but are recognised names.

How many people know which A57 revision will go into their servers/mobile phones? How many companies will openly publicise that kind of information?

So, please, remove that processor model and its associated features and let's make this about vanilla A57. This will mean test and benchmark in many variants, but at least we'll get it right for the majority of people using it.

andrew.zhogin added inline comments.Jan 12 2017, 9:34 AM
lib/Target/ARM/ARM.td
783 ↗(On Diff #84007)

Sure.
Maybe it is better to keep SchedulingPredicate IsR1P0AndLaterPred only inside ARMScheduleA57.td and make it always false? No options/platforms. To keep the information about r0px/r1px scheduling differences for future use.

rengolin added inline comments.Jan 12 2017, 9:56 AM
lib/Target/ARM/ARM.td
783 ↗(On Diff #84007)

If this is only used by actual cores, I don't mind it staying there, as long as they have a clear purpose.

Removed r0px flag and specific cpu. IsR1P0AndLaterPred is now false (vanilla).
Added few scheduling tests.

andrew.zhogin marked 3 inline comments as done.Jan 13 2017, 7:11 AM
andrew.zhogin added inline comments.
lib/Target/ARM/ARMScheduleA57.td
29 ↗(On Diff #82648)

IsR1P0AndLaterPred is always false for now, run-time option removed.

andrew.zhogin marked 5 inline comments as done.Jan 15 2017, 11:03 PM

More optimal (compact) way to describe LDM/VLDM scheduling info using list<SchedWriteRes> (from ARMScheduleA9.td).

Hi,

I don't see anything wrong with this, but I'd like to get a second opinion from someone with more experience in the area.
Also, do you have / can you get any performance results with/without this patch (at least on the test-suite)?

Thanks,
Diana

test/CodeGen/ARM/cortex-a57-misched-alu.ll
1 ↗(On Diff #84852)

Shouldn't this and all other added test require asserts?

javed.absar edited edge metadata.Jan 22 2017, 9:08 AM

Hi.
Some inline comments. Also, could you please share some performance results that you have for this sched-model.
Best Regards
Javed

lib/Target/ARM/ARMScheduleA57.td
498 ↗(On Diff #84852)

Do we need sched-variant here since outcome is identical?

1214 ↗(On Diff #84852)

You may want to combine VREV* VSWP* VTBL .. into one instregex comma separated def, to make definitions shorter.
Similarly in other places too, where possible without compromising clarity

lib/Target/ARM/ARMScheduleA57WriteRes.td
19 ↗(On Diff #84852)

You may want to rephrase "issued down one " to "issued as follows: one to I pipe, six to S pipe and ...", for better clarity.

55 ↗(On Diff #84852)

You could shorten the following using someitng like -

foreach Lat = 3-20 in {

def A57Write_#Lat#cyc_1L : SchedWriteRes<[A57UnitL]> {   
   let Latency = Lat;

}

74 ↗(On Diff #84852)

Same here. You could shorten the following -
foreach Lat = 5-16 in {
def A57Write_#Lat#cyc_1S : SchedWriteRes<[A57UnitS]> {

let Latency = Lat;

}

243 ↗(On Diff #84852)

Same here. You could shorten with the following -
foreach Lat = 1-20 in {
def A57Write_#Lat#cyc_1L_1I : SchedWriteRes<[A57UnitL, A57UnitI]>
{
let Latency = Lat; let NumMicroOps = 2;
}

754 ↗(On Diff #84852)

Something seems amiss as its defined earlier there is only 1 processor-unit of type A57UnitS.

test/CodeGen/ARM/cortex-a57-misched-alu.ll
30 ↗(On Diff #84852)

This part below is not required I think as you have already checked the latencies here against the model.

I don't see anything wrong with this, but I'd like to get a second opinion from someone with more experience in the area.
Also, do you have / can you get any performance results with/without this patch (at least on the test-suite)?

No, I don't have access to Cortex-A57 system. Maybe, you can recommend some hosting with such processors?

rengolin edited edge metadata.Jan 22 2017, 10:44 AM

No, I don't have access to Cortex-A57 system. Maybe, you can recommend some hosting with such processors?

I don't know many cloud / hosting services, maybe someone else can chime in.

@kristof.beyls, do you have infrastructure to benchmark AArch32 on A57?

cheers,
--renato

ARMScheduleA57WriteRes.td cleaned up from unused scheduling classes (legacy from AArch64). Small corrections according to the review comments.

andrew.zhogin marked 9 inline comments as done.Jan 23 2017, 8:54 AM
andrew.zhogin added inline comments.
lib/Target/ARM/ARMScheduleA57.td
1214 ↗(On Diff #84852)

Well, I tried to keep structure from the documentation. 1 row from table = 1 InstRW def. To be easy to verify.

lib/Target/ARM/ARMScheduleA57WriteRes.td
754 ↗(On Diff #84852)

Unused scheduling classes deleted (legacy from AArch64).

test/CodeGen/ARM/cortex-a57-misched-alu.ll
30 ↗(On Diff #84852)

I did it to check scheduling units (A57UnitI or A57UnitM). Not just latency.

andrew.zhogin marked 6 inline comments as done.Jan 24 2017, 2:19 AM
kristof.beyls edited edge metadata.Jan 25 2017, 1:18 AM

No, I don't have access to Cortex-A57 system. Maybe, you can recommend some hosting with such processors?

I don't know many cloud / hosting services, maybe someone else can chime in.

@kristof.beyls, do you have infrastructure to benchmark AArch32 on A57?

cheers,
--renato

@andrew.zhogin : I tried to measure the impact of this patch on performance on a Cortex-A57 system, but with this patch, testing in Thumb mode, the cmake-based test-suite runs fail early during the configuration step, with the following error:

-- Check size of unsigned long - failed
CMake Error at /usr/share/cmake-3.5/Modules/TestBigEndian.cmake:51 (message):
  no suitable type found
Call Stack (most recent call first):
  CMakeLists.txt:115 (test_big_endian)

This doesn't happen when testing in ARM mode, or when testing without your patch applied.
I tested on r292764.
Could you try and see if you could reproduce this?
The lnt runtest test-suite command line I used looks as follows:

lnt runtest test-suite --sandbox SANDBOX --no-timestamp --test-suite /work/llvm-test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""'

Thanks,

Kristof

Added scheduling info for missed thumb instructions to have working CompleteModel=1.
All instructions supported according to the schedcover.py utility.

Added scheduling info for missed thumb instructions to have working CompleteModel=1.
All instructions supported according to the schedcover.py utility.

W00t!

andrew.zhogin added a comment.EditedJan 30 2017, 5:40 AM

@andrew.zhogin : I tried to measure the impact of this patch on performance on a Cortex-A57 system, but with this patch, testing in Thumb mode, the cmake-based test-suite runs fail early during the configuration step, with the following error:

-- Check size of unsigned long - failed
CMake Error at /usr/share/cmake-3.5/Modules/TestBigEndian.cmake:51 (message):
  no suitable type found
Call Stack (most recent call first):
  CMakeLists.txt:115 (test_big_endian)

This doesn't happen when testing in ARM mode, or when testing without your patch applied.
I tested on r292764.
Could you try and see if you could reproduce this?
The lnt runtest test-suite command line I used looks as follows:

lnt runtest test-suite --sandbox SANDBOX --no-timestamp --test-suite /work/llvm-test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""'

I have updated patch with full support of thumb instructions.
I'm not familiar with lit - such command just succeeded on my local computer (x86_64) and I wonder why? It runs tests at some VM or remotely?

**********
Testing Time: 264.70s
  Expected Passes    : 126
2017-01-30 13:11:51: submitting result to dummy instance

@andrew.zhogin : I tried to measure the impact of this patch on performance on a Cortex-A57 system, but with this patch, testing in Thumb mode, the cmake-based test-suite runs fail early during the configuration step, with the following error:

-- Check size of unsigned long - failed
CMake Error at /usr/share/cmake-3.5/Modules/TestBigEndian.cmake:51 (message):
  no suitable type found
Call Stack (most recent call first):
  CMakeLists.txt:115 (test_big_endian)

This doesn't happen when testing in ARM mode, or when testing without your patch applied.
I tested on r292764.
Could you try and see if you could reproduce this?
The lnt runtest test-suite command line I used looks as follows:

lnt runtest test-suite --sandbox SANDBOX --no-timestamp --test-suite /work/llvm-test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""'

I have updated patch with full support of thumb instructions.
I'm not familiar with lit - such command just succeeded on my local computer (x86_64) and I wonder why? It runs tests at some VM or remotely?

**********
Testing Time: 264.70s
  Expected Passes    : 126
2017-01-30 13:11:51: submitting result to dummy instance

Hi Andrew,

You'd need to run lnt (not lit) with the above command line on an AArch64 environment to be able to reproduce the issue.
On an x86_64 box, you probably tested code generation for x86_64, not AArch64.
Anyway, I've just kicked of another benchmarking run with your new patch. Will let you know the outcome.

@andrew.zhogin : I tried to measure the impact of this patch on performance on a Cortex-A57 system, but with this patch, testing in Thumb mode, the cmake-based test-suite runs fail early during the configuration step, with the following error:

-- Check size of unsigned long - failed
CMake Error at /usr/share/cmake-3.5/Modules/TestBigEndian.cmake:51 (message):
  no suitable type found
Call Stack (most recent call first):
  CMakeLists.txt:115 (test_big_endian)

This doesn't happen when testing in ARM mode, or when testing without your patch applied.
I tested on r292764.
Could you try and see if you could reproduce this?
The lnt runtest test-suite command line I used looks as follows:

lnt runtest test-suite --sandbox SANDBOX --no-timestamp --test-suite /work/llvm-test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""'

I have updated patch with full support of thumb instructions.
I'm not familiar with lit - such command just succeeded on my local computer (x86_64) and I wonder why? It runs tests at some VM or remotely?

**********
Testing Time: 264.70s
  Expected Passes    : 126
2017-01-30 13:11:51: submitting result to dummy instance

Hi Andrew,

You'd need to run lnt (not lit) with the above command line on an AArch64 environment to be able to reproduce the issue.
On an x86_64 box, you probably tested code generation for x86_64, not AArch64.
Anyway, I've just kicked of another benchmarking run with your new patch. Will let you know the outcome.

I'm afraid I still see the same failures for cmake-driven benchmark suites.
For non-cmake-driven benchmark suites, I see the following error message at compile time:

DefIdx 0 exceeds machine model writes for %R0<def> = tLDRi %R0<kill>, 0, pred:14, pred:%noreg; mem:LD4[@Reg](tbaa=!12)(dereferenceable)
 (Try with MCSchedModel.CompleteModel set to false)incomplete machine model
UNREACHABLE executed at /work/llvm-test/slave2/cross-build/build/llvm/lib/CodeGen/TargetSchedule.cpp:216!

I'm afraid I still see the same failures for cmake-driven benchmark suites.
For non-cmake-driven benchmark suites, I see the following error message at compile time:

DefIdx 0 exceeds machine model writes for %R0<def> = tLDRi %R0<kill>, 0, pred:14, pred:%noreg; mem:LD4[@Reg](tbaa=!12)(dereferenceable)
 (Try with MCSchedModel.CompleteModel set to false)incomplete machine model
UNREACHABLE executed at /work/llvm-test/slave2/cross-build/build/llvm/lib/CodeGen/TargetSchedule.cpp:216!

I don't understand - tLDRi must be covered by "tLDR" regexp here:

def : InstRW<[A57Write_4cyc_1L], (instregex "LDRi12", "LDRBi12",
  "LDRcp", "(t2|t)?LDRConstPool", "LDRLIT_ga_(pcrel|abs)",
  "PICLDR", "tLDR")>;

Are you sure using the updated patch and clang?
And yes, I used lnt really:

./lnt runtest test-suite --sandbox ~/fast/sandbox_arm --no-timestamp --test-suite ~/fast/test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""' --cc ~/fast/llvm_trunk.build/bin/clang

I'm afraid I still see the same failures for cmake-driven benchmark suites.
For non-cmake-driven benchmark suites, I see the following error message at compile time:

DefIdx 0 exceeds machine model writes for %R0<def> = tLDRi %R0<kill>, 0, pred:14, pred:%noreg; mem:LD4[@Reg](tbaa=!12)(dereferenceable)
 (Try with MCSchedModel.CompleteModel set to false)incomplete machine model
UNREACHABLE executed at /work/llvm-test/slave2/cross-build/build/llvm/lib/CodeGen/TargetSchedule.cpp:216!

I don't understand - tLDRi must be covered by "tLDR" regexp here:

def : InstRW<[A57Write_4cyc_1L], (instregex "LDRi12", "LDRBi12",
  "LDRcp", "(t2|t)?LDRConstPool", "LDRLIT_ga_(pcrel|abs)",
  "PICLDR", "tLDR")>;

Are you sure using the updated patch and clang?
And yes, I used lnt really:

./lnt runtest test-suite --sandbox ~/fast/sandbox_arm --no-timestamp --test-suite ~/fast/test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""' --cc ~/fast/llvm_trunk.build/bin/clang

I double checked, and indeed it seems that somehow I received the old version of the patch when re-downloading it from phabricator.
Anyway, I've downloaded the latest version of the patch again, checking I indeed received the latest version.
With that version, benchmarking passes.
As expected, there are lots of performance swings either way, but on geomean, across a large number of programs (both the ones in the test-suite and from proprietary benchmarks), I see 0.65% improvement for the benchmarks reporting execution time and a 0.35% improvement for the benchmarks reporting scores.
So, in summary, the patch results in an improvement on average.

FWIW, here are the programs with the biggest swings in the test-suite:

Regressions:
MultiSource/Benchmarks/FreeBench/analyzer/analyzer 17.26%
SingleSource/Benchmarks/McGill/queens 10.57%
MultiSource/Benchmarks/Ptrdist/ft/ft 10.49%
MultiSource/Applications/siod/siod 5.32%
SingleSource/Benchmarks/BenchmarkGame/fannkuch 5.26%
SingleSource/Benchmarks/Misc/mandel-2 2.68%
MultiSource/Benchmarks/VersaBench/8b10b/8b10b 2.30%
MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1 1.59%
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt 1.43%
MultiSource/Benchmarks/NPB-serial/is/is 1.34%
SingleSource/Benchmarks/Misc/perlin 1.19%
SingleSource/Benchmarks/Misc/matmul_f64_4x4 1.05%
MultiSource/Benchmarks/Ptrdist/bc/bc 1.00%

Improvements:
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding -26.47%
MultiSource/Applications/sgefa/sgefa -22.98%
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt -16.13%
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl -15.85%
SingleSource/Benchmarks/Shootout/random -13.33%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-random -13.33%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -8.33%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -8.02%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -7.28%
MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl -7.06%
MultiSource/Benchmarks/Trimaran/enc-rc4/enc-rc4 -6.82%
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt -6.67%
MultiSource/Applications/aha/aha -6.49%
MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt -6.38%
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl -6.34%
SingleSource/Benchmarks/Shootout/ary3 -6.32%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-ary3 -6.10%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -5.71%
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt -4.76%
MultiSource/Benchmarks/Olden/mst/mst -4.66%
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl -4.46%
SingleSource/Benchmarks/Misc/pi -4.01%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt -3.76%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky -3.60%
SingleSource/Benchmarks/CoyoteBench/huffbench -3.57%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-hash -3.41%
MultiSource/Benchmarks/Bullet/bullet -3.08%
SingleSource/Benchmarks/Misc/ReedSolomon -2.87%
SingleSource/Benchmarks/Misc/flops-2 -2.78%
SingleSource/Benchmarks/Misc-C++/mandel-text -2.46%
MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl -2.30%
MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt -2.01%
MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 -1.78%
MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk -1.76%
SingleSource/Benchmarks/Stanford/FloatMM -1.66%
SingleSource/Benchmarks/Misc-C++-EH/spirit -1.46%
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4 -1.22%
MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
MultiSource/Benchmarks/VersaBench/dbms/dbms -1.11%

With that version, benchmarking passes.
As expected, there are lots of performance swings either way, but on geomean, across a large number of programs (both the ones in the test-suite and from proprietary benchmarks), I see 0.65% improvement for the benchmarks reporting execution time and a 0.35% improvement for the benchmarks reporting scores.
So, in summary, the patch results in an improvement on average.

Thank you, Kristof!

I will look at major regressions.

FWIW, here are the programs with the biggest swings in the test-suite:

BTW, it is the results for thumb-mode, right?

FWIW, here are the programs with the biggest swings in the test-suite:

BTW, it is the results for thumb-mode, right?

Indeed, Thumb mode. Using -O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer as flags to be more precise.

Changes to fix the performance regressions:

  • Implemented FeatureCheapPredicableCPSR to disable +1 predication cost for CPSR-defining instructions and MispredictPenalty increased to 16. It tweaks if-conversion.
  • Added FeatureAvoidPartialCPSR in the ProcessorModel.

With those changes only 2 tests have significant stable regressions (with --exec-multisample 5):
SingleSource/Benchmarks/BenchmarkGame/fannkuch 6.19%
MultiSource/Benchmarks/Ptrdist/ft/ft 3.75%

Both tests have problem with unrolling of inner loops with small (dynamic) loop counts.

With PGO-compilation the regressions are gone.

Also fixed problem with atomic loads second argument.

Small fix for non-thumb mode UMULL second argument.

Performance Regressions - Execution Time Δ
SingleSource/Benchmarks/BenchmarkGame/fannkuch 5.42%
MultiSource/Benchmarks/VersaBench/bmm/bmm 4.75%
MultiSource/Benchmarks/Ptrdist/ft/ft 4.52%
SingleSource/Benchmarks/CoyoteBench/huffbench 3.99%
SingleSource/Benchmarks/Misc/mandel-2 1.78%

Performance Improvements - Execution Time Δ
MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk -42.58%
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding -40.24%
MultiSource/Applications/sgefa/sgefa -23.67%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-random -19.97%
SingleSource/Benchmarks/Shootout/Shootout-random -19.95%
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt -16.11%
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl -15.70%
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl -8.18%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -8.16%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -8.15%
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl -7.92%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -7.65%
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt -7.06%
MultiSource/Benchmarks/Trimaran/enc-rc4/enc-rc4 -6.87%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky -6.73%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-ary -6.63%
SingleSource/Benchmarks/Misc/pi -6.24%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -5.94%
MultiSource/Benchmarks/llubenchmark/llu -5.82%
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4 -5.80%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen -5.76%
MultiSource/Benchmarks/Prolangs-C++/ocean/ocean -5.39%
MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt -5.28%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl -4.88%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-moments -4.75%
SingleSource/Benchmarks/Misc/ReedSolomon -4.72%
SingleSource/Benchmarks/Polybench/stencils/adi/adi -4.69%
MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow -4.58%
MultiSource/Benchmarks/Olden/mst/mst -4.32%
SingleSource/Benchmarks/Polybench/stencils/jacobi-2d-imper/jacobi-2d-imper -4.23%

So I think the scheduler patch is ready to land.

I've also run the latest version of this patch with command line options '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer' on the test-suite and a range of other benchmarks.
I do see the same few performance regressions as reported by Andrew, for the same reasons; and many performance improvements.
Overall, I see a geomean performance improvement of 1.70% for the benchmarks reporting execution time and a 1.63% performance improvement for the benchmarks reporting scores.

In summary, the benchmark results suggest this should be committed.

I haven't looked at the actual code changes in this path though, so it would be good if someone else could look at those and give a final approval.

Other than that LGTM.

lib/Target/ARM/ARMScheduleA57.td
263 ↗(On Diff #93868)

Would it be possible to Alias some of these to Sched classes that are pre-defined in ARMSchedule.td e.g. SchedAlias A57Write_20cyc_1M to WriteMUL/WriteDIV. That way the model could be more compact.

andrew.zhogin added inline comments.May 1 2017, 9:37 AM
lib/Target/ARM/ARMScheduleA57.td
263 ↗(On Diff #93868)

I don't think it is a good idea. For example, SDIV/UDIV instructions are not binded to WriteDIV SchedWrite. They are binded to IIC_iDIV InstrItinClass (as I see in ARMInstrInfo.td).
That's why I prefer to have full target-specific model binded only to instruction names.
Am I missing something?

javed.absar added inline comments.May 2 2017, 1:14 AM
lib/Target/ARM/ARMScheduleA57.td
263 ↗(On Diff #93868)

They are bound to both IIC_iDIV and " Sched<[WriteDIV]>;" Please see last item in the list:

def SDIV : ADivA1I<0b001, (outs GPR:$Rd), (ins GPR:$Rn, GPR:$Rm), IIC_iDIV,

        "sdiv", "\t$Rd, $Rn, $Rm",
        [(set GPR:$Rd, (sdiv GPR:$Rn, GPR:$Rm))]>,
Requires<[IsARM, HasDivideInARM]>,

Sched<[WriteDIV]>;

Updated to the current repository state. Common SchedWrites/SchedReads aliased to target-specific. Deleted unnecessary InstRWs.

andrew.zhogin marked 3 inline comments as done.May 17 2017, 9:43 AM
andrew.zhogin added inline comments.
lib/Target/ARM/ARMScheduleA57.td
263 ↗(On Diff #93868)

Yes, I have updated code to the current repository state and implemented aliases.
Do you think this patch is ok now?

andrew.zhogin marked an inline comment as done.

Default bindings for new WriteVLD1-4 and WriteST1-WriteST4 sched types. Updated tests cortex-a57-misched-vldm.ll and cortex-a57-misched-vstm-wrback.ll.

javed.absar accepted this revision.May 31 2017, 5:37 AM

Approving as all actions requested are resolved and the overall geomean is positive. Thanks for this work!

This revision is now accepted and ready to land.May 31 2017, 5:37 AM
andrew.zhogin added a comment.EditedJun 1 2017, 1:59 AM

Thanks for the review.

I don't have commit access (waiting for an answer) so could you please commit the patch?

Or I will do it as soon as I receive commit rights.

Hi Andrew:

I built with the patch in a view to committing it on your behalf, as your requested. However, I found some tests failing (probably some trivial matching issue). Please have a look.

e.g.
test/CodeGen/ARM/cortex-a57-misched-vstm-wrback.ll:4:10: error: expected string not found in input
; CHECK: ** MI Scheduling **

^

<stdin>:1:1: note: scanning from here
Disabled scoreboard hazard recognizer

"-debug-only=misched" in tests corrected to "-debug-only=machine-scheduler". DEBUG_TYPE in MachineScheduler was changed recently.

I built with the patch in a view to committing it on your behalf, as your requested. However, I found some tests failing (probably some trivial matching issue). Please have a look.

I've fixed it. DEBUG_TYPE in MachineScheduler.cpp was changed last friday (misched -> machine-scheduler). Now tests are OK.
Sorry to bother you.

This revision was automatically updated to reflect the committed changes.