Implemented Cortex-A57 scheduling model: main code in ARMScheduleA57.td, ARMScheduleA57WriteRes.td.
Small changes in cpp/h files to support required scheduling predicates.
Scheduling model implemented according to http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf.
Details
Diff Detail
Event Timeline
Hi,
This looks like a lot of work :) Could you please split it up into smaller patches, e.g. to separate the mechanical scheduling definitions from the more structural changes?
Thanks,
Diana
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
18 | Typo: superscalar | |
30 | Could you break this off into a separate patch? I.e. write an initial patch as if everything were r1p0 or later, and another patch with the r0px delta (or the other way around if it's more convenient for you). | |
87 | Cool, can you share this data? | |
103 | Isn't this fixed now? | |
lib/Target/ARM/ARMSubtarget.h | ||
439 ↗ | (On Diff #82648) | Please don't add this function if it can be avoided. We're trying to get rid of these in the long run (in favor of using subtarget features for each specific behavior). |
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
87 | "Common description and scheduling model parameters taken from AArch64" (for the same processor, AArch64SchedA57.td). I will delete this comment. |
Small fixes according review comments: typo, commentary, obsolete TODO, isCortexA57() function removed.
I'm not quite understand what is "more structural changes". New functions in ARMBaseInstrInfo? They are required for scheduling predicates to handle situations like: "Load, register offset, plus", "Load, register offset, minus", "Load, scaled register offset, plus LSL2", ...
And they are not used by any side code.
For example:
// For "Load, register offset, minus" we need +1cyc, +1I def A57WriteLdrAm3 : SchedWriteVariant<[ SchedVar<IsLdrAm3NegRegOffPred, [A57Write_5cyc_1I_1L]>, SchedVar<NoSchedPred, [A57Write_4cyc_1L]> ]>; def : InstRW<[A57WriteLdrAm3], (instregex "LDR(H|SH|SB)$")>;
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
30 | Sure, I will. |
lib/Target/ARM/ARM.td | ||
---|---|---|
781 | I'm not sure that splitting the option like this is a good idea, certainly this would create an incompatibility with GCC, which does not recognise -mcpu=cortex-a57-r0px. In GCC -mcpu=cortex-a57 enables scheduling for all the optimized instruction pairs (e.g. MOV/MOVT), and uses the r0p0 latency values for the Advanced SIMD multiply accumulate instructions. In your patch, this extra flag changes the scheduling of vector multiply, multiply accumulate, and the mov/movt instructions. While I can see that this can improve the resulting schedule in some circumstances, my opinion is that fragmenting the option like this is not worth the extra cost of carrying a special option. |
lib/Target/ARM/ARM.td | ||
---|---|---|
781 | James' point is valid even if you ignore GCC compatibility. Scheduler models are "best effort" at best, so splitting it like this makes no sense. If there are multiple revisions of the same core, then pick the most common and optimise for it. The only case where you can split like that is for recognised new sub-architectures, like Swift, Kyro, Cyclone, which not only don't have to follow the same scheduling decisions, but are recognised names. How many people know which A57 revision will go into their servers/mobile phones? How many companies will openly publicise that kind of information? So, please, remove that processor model and its associated features and let's make this about vanilla A57. This will mean test and benchmark in many variants, but at least we'll get it right for the majority of people using it. |
lib/Target/ARM/ARM.td | ||
---|---|---|
781 | Sure. |
lib/Target/ARM/ARM.td | ||
---|---|---|
781 | If this is only used by actual cores, I don't mind it staying there, as long as they have a clear purpose. |
Removed r0px flag and specific cpu. IsR1P0AndLaterPred is now false (vanilla).
Added few scheduling tests.
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
30 | IsR1P0AndLaterPred is always false for now, run-time option removed. |
More optimal (compact) way to describe LDM/VLDM scheduling info using list<SchedWriteRes> (from ARMScheduleA9.td).
Hi,
I don't see anything wrong with this, but I'd like to get a second opinion from someone with more experience in the area.
Also, do you have / can you get any performance results with/without this patch (at least on the test-suite)?
Thanks,
Diana
test/CodeGen/ARM/cortex-a57-misched-alu.ll | ||
---|---|---|
2 | Shouldn't this and all other added test require asserts? |
Hi.
Some inline comments. Also, could you please share some performance results that you have for this sched-model.
Best Regards
Javed
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
499 | Do we need sched-variant here since outcome is identical? | |
1215 | You may want to combine VREV* VSWP* VTBL .. into one instregex comma separated def, to make definitions shorter. | |
lib/Target/ARM/ARMScheduleA57WriteRes.td | ||
20 | You may want to rephrase "issued down one " to "issued as follows: one to I pipe, six to S pipe and ...", for better clarity. | |
56 | You could shorten the following using someitng like - foreach Lat = 3-20 in { def A57Write_#Lat#cyc_1L : SchedWriteRes<[A57UnitL]> { let Latency = Lat; } | |
75 | Same here. You could shorten the following - let Latency = Lat; } | |
244 | Same here. You could shorten with the following - | |
755 | Something seems amiss as its defined earlier there is only 1 processor-unit of type A57UnitS. | |
test/CodeGen/ARM/cortex-a57-misched-alu.ll | ||
31 | This part below is not required I think as you have already checked the latencies here against the model. |
No, I don't have access to Cortex-A57 system. Maybe, you can recommend some hosting with such processors?
I don't know many cloud / hosting services, maybe someone else can chime in.
@kristof.beyls, do you have infrastructure to benchmark AArch32 on A57?
cheers,
--renato
ARMScheduleA57WriteRes.td cleaned up from unused scheduling classes (legacy from AArch64). Small corrections according to the review comments.
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
1215 | Well, I tried to keep structure from the documentation. 1 row from table = 1 InstRW def. To be easy to verify. | |
lib/Target/ARM/ARMScheduleA57WriteRes.td | ||
755 | Unused scheduling classes deleted (legacy from AArch64). | |
test/CodeGen/ARM/cortex-a57-misched-alu.ll | ||
31 | I did it to check scheduling units (A57UnitI or A57UnitM). Not just latency. |
@andrew.zhogin : I tried to measure the impact of this patch on performance on a Cortex-A57 system, but with this patch, testing in Thumb mode, the cmake-based test-suite runs fail early during the configuration step, with the following error:
-- Check size of unsigned long - failed CMake Error at /usr/share/cmake-3.5/Modules/TestBigEndian.cmake:51 (message): no suitable type found Call Stack (most recent call first): CMakeLists.txt:115 (test_big_endian)
This doesn't happen when testing in ARM mode, or when testing without your patch applied.
I tested on r292764.
Could you try and see if you could reproduce this?
The lnt runtest test-suite command line I used looks as follows:
lnt runtest test-suite --sandbox SANDBOX --no-timestamp --test-suite /work/llvm-test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""'
Thanks,
Kristof
Added scheduling info for missed thumb instructions to have working CompleteModel=1.
All instructions supported according to the schedcover.py utility.
@andrew.zhogin : I tried to measure the impact of this patch on performance on a Cortex-A57 system, but with this patch, testing in Thumb mode, the cmake-based test-suite runs fail early during the configuration step, with the following error:
-- Check size of unsigned long - failed CMake Error at /usr/share/cmake-3.5/Modules/TestBigEndian.cmake:51 (message): no suitable type found Call Stack (most recent call first): CMakeLists.txt:115 (test_big_endian)This doesn't happen when testing in ARM mode, or when testing without your patch applied.
I tested on r292764.
Could you try and see if you could reproduce this?
The lnt runtest test-suite command line I used looks as follows:lnt runtest test-suite --sandbox SANDBOX --no-timestamp --test-suite /work/llvm-test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""'
I have updated patch with full support of thumb instructions.
I'm not familiar with lit - such command just succeeded on my local computer (x86_64) and I wonder why? It runs tests at some VM or remotely?
********** Testing Time: 264.70s Expected Passes : 126 2017-01-30 13:11:51: submitting result to dummy instance
Hi Andrew,
You'd need to run lnt (not lit) with the above command line on an AArch64 environment to be able to reproduce the issue.
On an x86_64 box, you probably tested code generation for x86_64, not AArch64.
Anyway, I've just kicked of another benchmarking run with your new patch. Will let you know the outcome.
I'm afraid I still see the same failures for cmake-driven benchmark suites.
For non-cmake-driven benchmark suites, I see the following error message at compile time:
DefIdx 0 exceeds machine model writes for %R0<def> = tLDRi %R0<kill>, 0, pred:14, pred:%noreg; mem:LD4[@Reg](tbaa=!12)(dereferenceable) (Try with MCSchedModel.CompleteModel set to false)incomplete machine model UNREACHABLE executed at /work/llvm-test/slave2/cross-build/build/llvm/lib/CodeGen/TargetSchedule.cpp:216!
I don't understand - tLDRi must be covered by "tLDR" regexp here:
def : InstRW<[A57Write_4cyc_1L], (instregex "LDRi12", "LDRBi12", "LDRcp", "(t2|t)?LDRConstPool", "LDRLIT_ga_(pcrel|abs)", "PICLDR", "tLDR")>;
Are you sure using the updated patch and clang?
And yes, I used lnt really:
./lnt runtest test-suite --sandbox ~/fast/sandbox_arm --no-timestamp --test-suite ~/fast/test-suite --benchmarking-only --cppflags '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer ' --threads 1 --build-threads 6 --use-perf time --use-lit lit --exec-multisample 1 --only-test=SingleSource/Benchmarks --cmake-define 'CMAKE_C_FLAGS_RELEASE=""' --cmake-define 'CMAKE_CXX_FLAGS_RELEASE=""' --cc ~/fast/llvm_trunk.build/bin/clang
I double checked, and indeed it seems that somehow I received the old version of the patch when re-downloading it from phabricator.
Anyway, I've downloaded the latest version of the patch again, checking I indeed received the latest version.
With that version, benchmarking passes.
As expected, there are lots of performance swings either way, but on geomean, across a large number of programs (both the ones in the test-suite and from proprietary benchmarks), I see 0.65% improvement for the benchmarks reporting execution time and a 0.35% improvement for the benchmarks reporting scores.
So, in summary, the patch results in an improvement on average.
FWIW, here are the programs with the biggest swings in the test-suite:
Regressions:
MultiSource/Benchmarks/FreeBench/analyzer/analyzer 17.26%
SingleSource/Benchmarks/McGill/queens 10.57%
MultiSource/Benchmarks/Ptrdist/ft/ft 10.49%
MultiSource/Applications/siod/siod 5.32%
SingleSource/Benchmarks/BenchmarkGame/fannkuch 5.26%
SingleSource/Benchmarks/Misc/mandel-2 2.68%
MultiSource/Benchmarks/VersaBench/8b10b/8b10b 2.30%
MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1 1.59%
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt 1.43%
MultiSource/Benchmarks/NPB-serial/is/is 1.34%
SingleSource/Benchmarks/Misc/perlin 1.19%
SingleSource/Benchmarks/Misc/matmul_f64_4x4 1.05%
MultiSource/Benchmarks/Ptrdist/bc/bc 1.00%
Improvements:
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding -26.47%
MultiSource/Applications/sgefa/sgefa -22.98%
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt -16.13%
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl -15.85%
SingleSource/Benchmarks/Shootout/random -13.33%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-random -13.33%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -8.33%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -8.02%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -7.28%
MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl -7.06%
MultiSource/Benchmarks/Trimaran/enc-rc4/enc-rc4 -6.82%
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt -6.67%
MultiSource/Applications/aha/aha -6.49%
MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt -6.38%
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl -6.34%
SingleSource/Benchmarks/Shootout/ary3 -6.32%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-ary3 -6.10%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -5.71%
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt -4.76%
MultiSource/Benchmarks/Olden/mst/mst -4.66%
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl -4.46%
SingleSource/Benchmarks/Misc/pi -4.01%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt -3.76%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky -3.60%
SingleSource/Benchmarks/CoyoteBench/huffbench -3.57%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-hash -3.41%
MultiSource/Benchmarks/Bullet/bullet -3.08%
SingleSource/Benchmarks/Misc/ReedSolomon -2.87%
SingleSource/Benchmarks/Misc/flops-2 -2.78%
SingleSource/Benchmarks/Misc-C++/mandel-text -2.46%
MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl -2.30%
MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt -2.01%
MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 -1.78%
MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk -1.76%
SingleSource/Benchmarks/Stanford/FloatMM -1.66%
SingleSource/Benchmarks/Misc-C++-EH/spirit -1.46%
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4 -1.22%
MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
MultiSource/Benchmarks/VersaBench/dbms/dbms -1.11%
Indeed, Thumb mode. Using -O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer as flags to be more precise.
Changes to fix the performance regressions:
- Implemented FeatureCheapPredicableCPSR to disable +1 predication cost for CPSR-defining instructions and MispredictPenalty increased to 16. It tweaks if-conversion.
- Added FeatureAvoidPartialCPSR in the ProcessorModel.
With those changes only 2 tests have significant stable regressions (with --exec-multisample 5):
SingleSource/Benchmarks/BenchmarkGame/fannkuch 6.19%
MultiSource/Benchmarks/Ptrdist/ft/ft 3.75%
Both tests have problem with unrolling of inner loops with small (dynamic) loop counts.
With PGO-compilation the regressions are gone.
Also fixed problem with atomic loads second argument.
Performance Regressions - Execution Time Δ
SingleSource/Benchmarks/BenchmarkGame/fannkuch 5.42%
MultiSource/Benchmarks/VersaBench/bmm/bmm 4.75%
MultiSource/Benchmarks/Ptrdist/ft/ft 4.52%
SingleSource/Benchmarks/CoyoteBench/huffbench 3.99%
SingleSource/Benchmarks/Misc/mandel-2 1.78%
Performance Improvements - Execution Time Δ
MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk -42.58%
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding -40.24%
MultiSource/Applications/sgefa/sgefa -23.67%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-random -19.97%
SingleSource/Benchmarks/Shootout/Shootout-random -19.95%
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt -16.11%
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl -15.70%
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl -8.18%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -8.16%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -8.15%
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl -7.92%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -7.65%
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt -7.06%
MultiSource/Benchmarks/Trimaran/enc-rc4/enc-rc4 -6.87%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky -6.73%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-ary -6.63%
SingleSource/Benchmarks/Misc/pi -6.24%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -5.94%
MultiSource/Benchmarks/llubenchmark/llu -5.82%
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4 -5.80%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen -5.76%
MultiSource/Benchmarks/Prolangs-C++/ocean/ocean -5.39%
MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt -5.28%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl -4.88%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-moments -4.75%
SingleSource/Benchmarks/Misc/ReedSolomon -4.72%
SingleSource/Benchmarks/Polybench/stencils/adi/adi -4.69%
MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow -4.58%
MultiSource/Benchmarks/Olden/mst/mst -4.32%
SingleSource/Benchmarks/Polybench/stencils/jacobi-2d-imper/jacobi-2d-imper -4.23%
I've also run the latest version of this patch with command line options '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer' on the test-suite and a range of other benchmarks.
I do see the same few performance regressions as reported by Andrew, for the same reasons; and many performance improvements.
Overall, I see a geomean performance improvement of 1.70% for the benchmarks reporting execution time and a 1.63% performance improvement for the benchmarks reporting scores.
In summary, the benchmark results suggest this should be committed.
I haven't looked at the actual code changes in this path though, so it would be good if someone else could look at those and give a final approval.
Other than that LGTM.
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
264 | Would it be possible to Alias some of these to Sched classes that are pre-defined in ARMSchedule.td e.g. SchedAlias A57Write_20cyc_1M to WriteMUL/WriteDIV. That way the model could be more compact. |
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
264 | I don't think it is a good idea. For example, SDIV/UDIV instructions are not binded to WriteDIV SchedWrite. They are binded to IIC_iDIV InstrItinClass (as I see in ARMInstrInfo.td). |
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
264 | They are bound to both IIC_iDIV and " Sched<[WriteDIV]>;" Please see last item in the list: def SDIV : ADivA1I<0b001, (outs GPR:$Rd), (ins GPR:$Rn, GPR:$Rm), IIC_iDIV, "sdiv", "\t$Rd, $Rn, $Rm", [(set GPR:$Rd, (sdiv GPR:$Rn, GPR:$Rm))]>, Requires<[IsARM, HasDivideInARM]>, Sched<[WriteDIV]>; |
Updated to the current repository state. Common SchedWrites/SchedReads aliased to target-specific. Deleted unnecessary InstRWs.
lib/Target/ARM/ARMScheduleA57.td | ||
---|---|---|
264 | Yes, I have updated code to the current repository state and implemented aliases. |
Default bindings for new WriteVLD1-4 and WriteST1-WriteST4 sched types. Updated tests cortex-a57-misched-vldm.ll and cortex-a57-misched-vstm-wrback.ll.
Approving as all actions requested are resolved and the overall geomean is positive. Thanks for this work!
Thanks for the review.
I don't have commit access (waiting for an answer) so could you please commit the patch?
Or I will do it as soon as I receive commit rights.
Hi Andrew:
I built with the patch in a view to committing it on your behalf, as your requested. However, I found some tests failing (probably some trivial matching issue). Please have a look.
e.g.
test/CodeGen/ARM/cortex-a57-misched-vstm-wrback.ll:4:10: error: expected string not found in input
; CHECK: ** MI Scheduling **
^
<stdin>:1:1: note: scanning from here
Disabled scoreboard hazard recognizer
"-debug-only=misched" in tests corrected to "-debug-only=machine-scheduler". DEBUG_TYPE in MachineScheduler was changed recently.
I've fixed it. DEBUG_TYPE in MachineScheduler.cpp was changed last friday (misched -> machine-scheduler). Now tests are OK.
Sorry to bother you.
I'm not sure that splitting the option like this is a good idea, certainly this would create an incompatibility with GCC, which does not recognise -mcpu=cortex-a57-r0px. In GCC -mcpu=cortex-a57 enables scheduling for all the optimized instruction pairs (e.g. MOV/MOVT), and uses the r0p0 latency values for the Advanced SIMD multiply accumulate instructions.
In your patch, this extra flag changes the scheduling of vector multiply, multiply accumulate, and the mov/movt instructions. While I can see that this can improve the resulting schedule in some circumstances, my opinion is that fragmenting the option like this is not worth the extra cost of carrying a special option.