If we want to classify whether a core is out-of-order or not, using getSchedModel().isOutOfOrder() would be more appropriate than using Subtarget->isLikeA9().
Below is Sanjay's comment in D16836.
The ARM version is a bigger mess (it won't be NFC), so I'll let someone who knows that backend better make the identical change for that target.
After reading this, I realized that there is no target description file for Cortex-A15, so isLikeA9() can't simply be changed to isOutOfOrder().
bool isLikeA9() const { return isCortexA9() || isCortexA15() || isKrait(); }
ARMScheduleA9.td (lib/Target/ARM):    let MicroOpBufferSize = 56; // Based on available renamed registers.
ARMScheduleSwift.td (lib/Target/ARM): let MicroOpBufferSize = 45; // Based on NEON renamed registers.
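For context, as far as I remember, isOutOfOrder() on the scheduling model is derived directly from MicroOpBufferSize:

    // From llvm/include/llvm/MC/MCSchedule.h (quoted from memory, so double-check):
    bool isOutOfOrder() const { return MicroOpBufferSize > 1; }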
If we can't use isOutOfOrder(), which is based on MicroOpBufferSize, how about renaming isLikeA9 to isOutOfOrder?
But that could confuse people about which is the proper call: Subtarget->getSchedModel().isOutOfOrder() or Subtarget->isOutOfOrder().
I think the correct answer is to write the .td file and then apply isOutOfOrder(). But as far as I know, we are currently focusing on AArch64, not 32-bit ARM.
(We also have no plan to write a 32-bit ARM target description for the M1.)
James, do you have any thoughts on this? Please let me know.
Junmo.
I think this change does make sense, but I'll let James or Tim have a look, since this needs more than one pair of eyes.
Have you run any benchmarks for this? On both A9 and A15, as well as AArch32? Without concrete numbers, it's easy to assume the best and get the worst.
cheers,
--renato
I think it's ok that there's no sched model for A15 because:
// FIXME: A15 has currently the same Schedule model as A9.
But it's also true that:
// FIXME: A12 has currently the same Schedule model as A9
// FIXME: A17 has currently the same Schedule model as A9
So isLikeA9() is a *subset* of isOutOfOrder(). Therefore, the reason I said that this change would not be 'no functional change' is because A12 and A17 would now have 'PredictableSelectIsExpensive'. If A12 and A17 are in fact OoO designs, this is probably what we want.
This is the problem with having codegen predicates based on micro-arch models rather than features ("isLikeA9"); they tend to drift in meaning as newer designs are added. I've been trying to remove all of those in the x86 backend.
A12 == A17, and both are OoO. So isLikeA9, in this context, is the same as isOoO if you assume whoever created that method didn't know about A12/A17 (which I think is true). Also, the issue here is to predict the impact of selects, which is mostly problematic on OoO cores, and that is what isLikeA9 was trying (and failing) to map.
This is the problem with having codegen predicates based on micro-arch models rather than features ("isLikeA9"); they tend to drift in meaning as newer designs are added. I've been trying to remove all of those in the x86 backend.
But you're absolutely right that this is a horrible way of modelling features, and we should really move away, and hopefully delete that method ASAP.
Since the scheduler model is what matters here, this change is exactly what we need. Also, once we have scheduler models for A15 and A17, this code will still be correct, so it's also future-proof.
Junmo, please remove the NFC comment from the commit message. Also, do you have access to A9/A15 cores?
cheers,
--renato
Thanks for the detailed comment, Sanjay.
As you said, the cores group like this:
CortexA9Model: cortex-a9, a12, a15, a17, krait
SwiftModel:    swift, cyclone
isLikeA9():    cortex-a9, a15, krait
so with this change, the targets above will be affected by "PredictableSelectIsExpensive".
Hi, renato.
Junmo, please remove the NFC comment from the commit message.
Sure, I will.
I don't have access to A9/A15/A17 hardware. (What I have is only Cortex-A53/57 and x86 Core i-series machines.)
And I also don't have access to Swift/Cyclone. I have iPhone devices, but I don't know how to measure performance with the test-suite.
Thanks.
Right, if the behaviour changes for Swift or Cyclone, then we should think harder about this change. Someone from Apple should give an opinion on this before we take any action.
I don't have access to A9/A15/A17 hardware. (What I have is only Cortex-A53/57 and x86 Core i-series machines.)
I do. I'll run some benchmarks on A15 and will let you know. You run the benchmarks on A53/57. Feel free to run EEMBC or SPEC, if you have access, too.
@jmolloy, can you run this on an A9, just to be sure?
And I also don't have access to Swift/Cyclone.
Me neither, that's why someone from Apple should try it out, too.
But I don't know how to measure performance with the test-suite.
If you know how to run the test-suite (http://llvm.org/docs/lnt/quickstart.html), you can just add the following options to lnt:
--threads=1 --build-threads=$CPUS --use-perf --benchmarking-only --multisample=8
To explain the options:
- --threads=1 makes the benchmark runs less noisy.
- --build-threads should be set to the number of CPUs.
- --use-perf gives much more accurate results if you have perf installed (linux-tools).
- --benchmarking-only is the key option here: it runs only the tests marked as benchmarks.
- --multisample runs each benchmark N times (for ARM, you should run at least 3, but 5 is a good number, and 8 a better one).
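For example, combined with the quickstart command it would look something like this (the paths are placeholders for your own setup):

    lnt runtest nt --sandbox /tmp/sandbox \
        --cc /path/to/clang \
        --test-suite /path/to/test-suite \
        --threads=1 --build-threads=$CPUS --use-perf \
        --benchmarking-only --multisample=8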
cheers,
--renato
Hi renato.
But I don't know how to measure performance with the test-suite.
If you know how to run the test-suite (http://llvm.org/docs/lnt/quickstart.html), you can just add the following options to lnt: --threads=1 --build-threads=$CPUS --use-perf --benchmarking-only --multisample=8
Oh... what I wanted to say is that I don't know how to run the test-suite on Apple devices. I have an iPhone 6S, a 6, a 5, and a Mac mini, but I don't know how to run benchmarks or the test-suite on them.
Anyway, I really appreciate your detailed comments. Thanks. :)
You run the benchmarks on A53/57. Feel free to run EEMBC or SPEC, if you have access, too.
Cortex-A53/57 don't set a schedule model on the 32-bit ARM side, so they don't need to be tested: the default setting for PredictableSelectIsExpensive is false, and it stays false for them with this change.
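For reference, my understanding of why they are unaffected (based on my memory of the MCSchedule.h defaults, so treat this as an assumption):

    // With no schedule model attached, MicroOpBufferSize keeps its default of 0,
    // so getSchedModel().isOutOfOrder() == (MicroOpBufferSize > 1) == false,
    // which matches the current isLikeA9() result (also false) for A53/57.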
If I've misunderstood this, please let me know.
Junmo.
Hi all,
My twopenniworth on this: cortex-a15, a12, a17 and krait have been lazy and inherited all the behaviours of cortex-a9. Therefore it is fine to choose a behaviour for a9 that doesn't take into account those other cores. They can change if it's an issue for them.
I feel that this flag will cause improvements and regressions whatever way it is set. The only real way to do this is using PGO. Therefore I feel that it should be set to whatever the most appropriate setting is for the core without real regard to benchmark results. Someone's going to be unhappy either way.
Just my 2c.
James
Thanks for comment James.
// Prefer likely predicted branches to selects on out-of-order cores.
PredictableSelectIsExpensive = Subtarget->isLikeA9();
I think that, if we believe PredictableSelectIsExpensive is the right setting for out-of-order cores, we can set it from Subtarget->getSchedModel().isOutOfOrder().
And if some OoO core has to set this to false, due to regressions or whatever other reason, we can then add code that sets it to false for that specific core.
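Concretely, a minimal sketch of what I mean (the Swift opt-out below is purely hypothetical, just to show the shape of an override):

    // Prefer likely predicted branches to selects on out-of-order cores,
    // now keyed off the scheduling model instead of isLikeA9().
    PredictableSelectIsExpensive = Subtarget->getSchedModel().isOutOfOrder();

    // Hypothetical per-core opt-out, if some OoO core regresses:
    // if (Subtarget->isSwift())
    //   PredictableSelectIsExpensive = false;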
Junmo.
I agree that sounds like a sensible default, as long as that default can be easily overridden (I haven't looked at the code).
James
Good point.
I feel that this flag will cause improvements and regressions whatever way it is set. The only real way to do this is using PGO. Therefore I feel that it should be set to whatever the most appropriate setting is for the core without real regard to benchmark results. Someone's going to be unhappy either way.
I've run a quick benchmark on A15 by turning that flag on and off, and strangely, with the flag on, the overall performance is only slightly worse (0.5%), but consistently so, across all benchmarks.
Since my methodology was less than kosher, the impact is less than meaningful; and since the code looks better as it is than before, I think this should go in.
cheers,
-renato