This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Order more stores by ascending address
ClosedPublic

Authored by Allen on May 31 2022, 7:01 AM.

Details

Summary

According D125377, we order STP Q's by ascending address. While on some
targets, paired 128 bit loads and stores are slow, so the STP will split
into STRQ and STUR, so I hope these stores will also be ordered.
Also add subtarget feature ascend-store-address to control the aggressive order.

Diff Detail

Event Timeline

Allen created this revision.May 31 2022, 7:01 AM
Allen requested review of this revision.May 31 2022, 7:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 31 2022, 7:01 AM
Allen edited the summary of this revision. (Show Details)May 31 2022, 5:56 PM
Allen added a comment.Jun 1 2022, 9:08 AM
This comment was removed by Allen.
Allen updated this revision to Diff 433617.Jun 1 2022, 6:48 PM
Allen retitled this revision from [AArch64][NFC] Refactor order STP by ascending address to [MachineScheduler] Order more stores by ascending address.
Allen edited the summary of this revision. (Show Details)

Do you have more details on where and when you expect this to be beneficial? It may not look like it from the review, but we did a fair amount of testing and benchmarking was done on D125377 on all kinds of CPUs (in-order vs out-of-order, little vs big, etc) to make sure that it was an improvement or benign on the cases we tried. I don't think I would be against this - so long as we had a decent reason to do so. It does constrain the scheduling graph though, so we shouldn't do so unnecessarily.

Allen added a comment.Jun 6 2022, 5:53 AM

Do you have more details on where and when you expect this to be beneficial? It may not look like it from the review, but we did a fair amount of testing and benchmarking was done on D125377 on all kinds of CPUs (in-order vs out-of-order, little vs big, etc) to make sure that it was an improvement or benign on the cases we tried. I don't think I would be against this - so long as we had a decent reason to do so. It does constrain the scheduling graph though, so we shouldn't do so unnecessarily.

Thanks @dmgreen, base on the following test case , I'll gain benefits base on the tsv110 target (kunpeng 920)
BTW:I have another version, which only do some refactor(keep the logic unchanged), as we don't add new store instructions STRQ and STUR, see detail on https://reviews.llvm.org/D126700?id=433075, which is better ?

double array[0x8000];

void foo (double array[], int size) {
  for (int i=0; i<size; i++) {
     array[i] = 2.0;
  }
  return;
}

int main (){
   
  for (int i=0; i<100000; i++) {
    foo (array, sizeof (array)/sizeof (array[0]));  
  }
  return 0;
}

OK I see, thanks. In this version on godbolt I only see ordered STPQ's, perhaps you have some downstream differences that would alter the codegen?
https://godbolt.org/z/1nbr7Eh8r

With this kind of thing we always have two options. Either we can do it unconditionally for all cpus, or we can add a subtarget tuning feature to steer the codegen for specific cpus. There are a number already like FeatureFuseAES that control fusing in the machine scheduler. Adding another to control a more aggressive ordering in AArch64PostRASchedStrategy sounds perfectly sensible to me.

The other option is to do this for all cpus. The general rule there is it should be something that we expect to be beneficial or benign on all/most cpus that would run -mcpu=generic code. I'm not strongly against that for this patch, we need to check that with benchmarks and I did see a case in testing which changed the code from:

STR      q1,[sp,#0x40]     
LDP      q1,q2,[x25,#-0x10]
STR      q0,[sp,#0x60]     
LDUR     q0,[x24,#-0x10]   
FDIV     v0.2D,v0.2D,v1.2D 
STP      q2,q1,[sp,#0x20]  
...
=>
STR      q1,[sp,#0x40]     
LDP      q1,q2,[x25,#-0x10]
STP      q2,q1,[sp,#0x20]  
STR      q0,[sp,#0x60]     
LDUR     q0,[x24,#-0x10]   
FDIV     v0.2D,v0.2D,v1.2D 
...

Those clumped-together stores were apparently a little slower. Perhaps it was just unlucky, and it is only one case. I would guess that in-order memory operations are a little easier for prefetchers than ones that bobble around, but I don't have any evidence of that, and prefetchers are usually pretty sophisticated nowadays. It would take a complex case for something like that to matter, and complex cases can easily defy expectation.

So - do you have any other benchmark results, on any other cpus? Or as an alternative, do you think that adding a subtarget feature is an OK solution?

Allen added a comment.EditedJun 7 2022, 7:16 PM

OK I see, thanks. In this version on godbolt I only see ordered STPQ's, perhaps you have some downstream differences that would alter the codegen?
So - do you have any other benchmark results, on any other cpus? Or as an alternative, do you think that adding a subtarget feature is an OK solution?

Many thanks @dmgreen for your detailed reply, and I found the newest upstream already sort the stores by ascending address for tsv110 target. while on my downstream, the assemble is a bit similar with cortex-a57 target. https://godbolt.org/z/bWeKq59ha

So it seems add a subtarget feature is a more reasonable, I'll try that.

Allen updated this revision to Diff 435447.Jun 9 2022, 1:27 AM
Allen edited the summary of this revision. (Show Details)

Also add subtarget feature ascend-store-address to control the aggressive order.

fhahn added a comment.Jun 9 2022, 1:34 AM

A subtarget feature sounds like a good way to gate this, thanks! It makes sense to only do this for CPUs where it's proven to be profitable IMO.

dmgreen added inline comments.Jun 10 2022, 12:53 AM
llvm/lib/Target/AArch64/AArch64.td
219

Capitalize Schedule. Maybe reword as "Schedule vector stores by ascending address".

llvm/lib/Target/AArch64/AArch64InstrInfo.h
106

Can you add a /// doc string

504–505

I think you can probably move these into AArch64InstrInfo, near to isPairedLdSt

llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp
29

Why do we need to check it is an immediate?

Allen updated this revision to Diff 435848.EditedJun 10 2022, 3:01 AM

Adress review comment, thanks @dmgreen @fhahn

Thanks for the update

llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp
29

Why is this needed?

Allen marked 3 inline comments as done.Jun 13 2022, 1:42 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64.td
219

Thanks, apply your comment.

llvm/lib/Target/AArch64/AArch64InstrInfo.h
504–505

Done, thanks!

llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp
29

this is because some IR used to match the str q0, [x8, :lo12:seed_lo] for example, and we can't get its offset

29

if we don't check this, it will crash to use getImm() get the immediate with the above str q0, [x8, :lo12:seed_lo]

dmgreen accepted this revision.Jun 13 2022, 2:15 AM

Thanks. LGTM

llvm/lib/Target/AArch64/AArch64.td
219

A q store is a store of a vector. This subtarget feature doesn't alter the scalar scores like x/w/s.

llvm/lib/Target/AArch64/AArch64MachineScheduler.cpp
29

OK I see, it is a TargetFlags. That makes sense. It doesn't look like we ever generate them for STQP.

This revision is now accepted and ready to land.Jun 13 2022, 2:15 AM
This revision was landed with ongoing or failed builds.Jun 13 2022, 2:37 AM
This revision was automatically updated to reflect the committed changes.
Allen marked 2 inline comments as done.