According D125377, we order STP Q's by ascending address. While on some
targets, paired 128 bit loads and stores are slow, so the STP will split
into STRQ and STUR, so I hope these stores will also be ordered.
Also add subtarget feature ascend-store-address to control the aggressive order.
Details
Diff Detail
Unit Tests
Event Timeline
Do you have more details on where and when you expect this to be beneficial? It may not look like it from the review, but we did a fair amount of testing and benchmarking was done on D125377 on all kinds of CPUs (in-order vs out-of-order, little vs big, etc) to make sure that it was an improvement or benign on the cases we tried. I don't think I would be against this - so long as we had a decent reason to do so. It does constrain the scheduling graph though, so we shouldn't do so unnecessarily.
Thanks @dmgreen, base on the following test case , I'll gain benefits base on the tsv110 target (kunpeng 920)
BTW:I have another version, which only do some refactor(keep the logic unchanged), as we don't add new store instructions STRQ and STUR, see detail on https://reviews.llvm.org/D126700?id=433075, which is better ?
double array[0x8000]; void foo (double array[], int size) { for (int i=0; i<size; i++) { array[i] = 2.0; } return; } int main (){ for (int i=0; i<100000; i++) { foo (array, sizeof (array)/sizeof (array[0])); } return 0; }
OK I see, thanks. In this version on godbolt I only see ordered STPQ's, perhaps you have some downstream differences that would alter the codegen?
https://godbolt.org/z/1nbr7Eh8r
With this kind of thing we always have two options. Either we can do it unconditionally for all cpus, or we can add a subtarget tuning feature to steer the codegen for specific cpus. There are a number already like FeatureFuseAES that control fusing in the machine scheduler. Adding another to control a more aggressive ordering in AArch64PostRASchedStrategy sounds perfectly sensible to me.
The other option is to do this for all cpus. The general rule there is it should be something that we expect to be beneficial or benign on all/most cpus that would run -mcpu=generic code. I'm not strongly against that for this patch, we need to check that with benchmarks and I did see a case in testing which changed the code from:
STR q1,[sp,#0x40] LDP q1,q2,[x25,#-0x10] STR q0,[sp,#0x60] LDUR q0,[x24,#-0x10] FDIV v0.2D,v0.2D,v1.2D STP q2,q1,[sp,#0x20] ... => STR q1,[sp,#0x40] LDP q1,q2,[x25,#-0x10] STP q2,q1,[sp,#0x20] STR q0,[sp,#0x60] LDUR q0,[x24,#-0x10] FDIV v0.2D,v0.2D,v1.2D ...
Those clumped-together stores were apparently a little slower. Perhaps it was just unlucky, and it is only one case. I would guess that in-order memory operations are a little easier for prefetchers than ones that bobble around, but I don't have any evidence of that, and prefetchers are usually pretty sophisticated nowadays. It would take a complex case for something like that to matter, and complex cases can easily defy expectation.
So - do you have any other benchmark results, on any other cpus? Or as an alternative, do you think that adding a subtarget feature is an OK solution?
Many thanks @dmgreen for your detailed reply, and I found the newest upstream already sort the stores by ascending address for tsv110 target. while on my downstream, the assemble is a bit similar with cortex-a57 target. https://godbolt.org/z/bWeKq59ha
So it seems add a subtarget feature is a more reasonable, I'll try that.
A subtarget feature sounds like a good way to gate this, thanks! It makes sense to only do this for CPUs where it's proven to be profitable IMO.
llvm/lib/Target/AArch64/AArch64.td | ||
---|---|---|
219 ↗ | (On Diff #435447) | Capitalize Schedule. Maybe reword as "Schedule vector stores by ascending address". |
llvm/lib/Target/AArch64/AArch64InstrInfo.h | ||
106 | Can you add a /// doc string | |
497–498 | I think you can probably move these into AArch64InstrInfo, near to isPairedLdSt | |
llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp | ||
28 | Why do we need to check it is an immediate? |
Thanks for the update
llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp | ||
---|---|---|
28 | Why is this needed? |
llvm/lib/Target/AArch64/AArch64.td | ||
---|---|---|
219 ↗ | (On Diff #435447) | Thanks, apply your comment. |
llvm/lib/Target/AArch64/AArch64InstrInfo.h | ||
497–498 | Done, thanks! | |
llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp | ||
28 | this is because some IR used to match the str q0, [x8, :lo12:seed_lo] for example, and we can't get its offset | |
28 | if we don't check this, it will crash to use getImm() get the immediate with the above str q0, [x8, :lo12:seed_lo] |
Thanks. LGTM
llvm/lib/Target/AArch64/AArch64.td | ||
---|---|---|
219 ↗ | (On Diff #435447) | A q store is a store of a vector. This subtarget feature doesn't alter the scalar scores like x/w/s. |
llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp | ||
28 | OK I see, it is a TargetFlags. That makes sense. It doesn't look like we ever generate them for STQP. |
Can you add a /// doc string