This is an archive of the discontinued LLVM Phabricator instance.

[SVE][Analysis] Tune the cost model according to the tune-cpu attribute
ClosedPublic

Authored by david-arm on Sep 22 2021, 8:06 AM.

Details

Summary

This patch introduces a new function:

AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

  1. If the user has specified the CPU to tune for we use that, else
  2. If the target CPU was specified we use that, else
  3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.

New tests added here:

Analysis/CostModel/AArch64/sve-gather.ll
Analysis/CostModel/AArch64/sve-scatter.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Diff Detail

Event Timeline

david-arm created this revision.Sep 22 2021, 8:06 AM
david-arm requested review of this revision.Sep 22 2021, 8:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2021, 8:06 AM

Of course totally up to you:

  • MaxScale feels like None
  • you may want to add the Fujitsu A64FX
Matt added a subscriber: Matt.Sep 22 2021, 8:24 AM

Would this be used as the expected vector width the vectorizer optimizes around? If so 4 for generic sounds a bit high and it should maybe be closer to the average of the hardware we expect to be out in the field. Likely 2 I would suspect.

llvm/lib/Target/AArch64/AArch64Subtarget.cpp
371–373

These per-cpu values should probably be subtarget features or defined in AArch64Subtarget::initializeProperties like the other alignments and whatnot.

david-arm added inline comments.Sep 22 2021, 8:49 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
371–373

Hi @dmgreen, I think that's perfectly sensible when you only have a target-cpu feature. I originally tried doing it that way, but realised if target-cpu=generic and tune-cpu=neoverse-n1 then the subtarget won't have max vscale set, since we chose a generic subtarget.

dmgreen added inline comments.Sep 22 2021, 9:09 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
371–373

Tune-cpu has never worked in the AArch64 backend, as far as I understand. Like I said in the other patch the subtarget features do not distinguish between performance features and architecture features.

Having one feature work off tune-cpu whereas everything else in the backend works on target-cpu sounds wrong to me, without fixing it properly and making -mtune work as you would expect across the board. Which would be great but until then I would base everything off the cpu for consistency.

If you do want to make tune-cpu work, look at how the X86 backend does it. This shouldn't have to look into the Functions target-cpu and tune-cpu attributes, it should be initialized with the correct CPU and TuneCPU's passed in to the subtarget.

fhahn added a subscriber: fhahn.Sep 22 2021, 1:27 PM
sdesmalen added inline comments.Sep 23 2021, 1:47 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
359

nit: redundant comment, the code says the same thing, just shorter :)

378
  1. F is already known to be != nullptr.
  2. For consistency, can you do the same as above here, get the attribute and then check whether it's valid (as opposed to checking whether the function has the attribute.
david-arm updated this revision to Diff 375179.Sep 27 2021, 2:52 AM
david-arm edited the summary of this revision. (Show Details)
  • Rebased.
  • Now uses a VScaleForTuning property in AArch64Subtarget, which is defined according to the -mcpu/-mtune flags. By default VScaleForTuning is set to 2.
david-arm updated this revision to Diff 378181.Oct 8 2021, 5:27 AM
  • Fixed formatting issues.
david-arm marked 3 inline comments as done.Oct 8 2021, 7:21 AM

Gentle ping!

dmgreen added inline comments.Oct 11 2021, 9:55 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
198

I don't think these should be needed, as the above ARMProcFamily should be based on TuneCPU after D110258/D111551.

david-arm marked an inline comment as done.Oct 12 2021, 12:17 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
198

Hi @dmgreen, I was quite worried that D111551 will hold up this patch and I don't really see why it has to, therefore I have kept them independent. I was thinking that once D111551 lands then I can fix up this code and use ARMProcFamily instead. We'd really like to get the cost model changes in asap because it's preventing vectorisation on machines with SVE hardware.

I feel that D111551 is a nice thing to have, but not a requirement for the cost model. I also expect it will take a lot longer to get approved. :)

dmgreen added inline comments.Oct 12 2021, 6:30 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
198

I'm not sure I see why we need to rush with a suboptimal solution. SVE autovec was not enabled, and this patch just enabled some extra gather/scatters and reductions. D111551 doesn't look like it should take too long either, but maybe I'm being optimistic there.

If you want it to land sooner, you could just remove all the references to tune-cpu in this patch. That way it works with -mcpu, the same as every other tuning feature has in llvm for the last 10 year :)

It should then automatically become a tuning features when those other two patches are done.

david-arm updated this revision to Diff 380672.Oct 19 2021, 6:24 AM
david-arm marked an inline comment as done.

Left two small nits, but I'm otherwise happy with the patch.

llvm/lib/Target/AArch64/AArch64Subtarget.cpp
150

nit: seems unrelated?

llvm/lib/Target/AArch64/AArch64Subtarget.h
275

nit: This may be personal preference, but would VScaleForCPU be a more suitable name?

david-arm added inline comments.Oct 20 2021, 8:03 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
150

I had to split this out as a separate case to avoid the fallthrough into the NeoverseN2 case that's all.

llvm/lib/Target/AArch64/AArch64Subtarget.h
275

I don't mind really. I don't have a strong preference. I guess the reason I used VScaleForTuning was just to make it clear that this is not an indication of what the exact vscale will be at runtime, since it could be anything from 1 -> max vscale for the CPU. Rather it's just an indication of how we'd like to tune the vectoriser, optimisations and codegen.

However, happy to change it if you prefer it!

Sorry I missed the update. Yeah looks good. There should be a A510 now too I think if you rebase, which has an sve vector length of 128bits.

sdesmalen added inline comments.Oct 20 2021, 8:17 AM
llvm/lib/Target/AArch64/AArch64Subtarget.h
275

Okay fair enough, maybe 'VScaleForTuning' is more appropriate.

david-arm updated this revision to Diff 380984.Oct 20 2021, 9:08 AM
  • Updated vscale for cortex-a510.
  • Changed gather/scatter CHECK lines to refer to the VSCALE value rather than the CPU.
  • Added a cortex-a510 RUN line to the gather/scatter tests.
sdesmalen accepted this revision.Oct 20 2021, 9:16 AM

LGTM @david-arm!

Changed gather/scatter CHECK lines to refer to the VSCALE value rather than the CPU

That's a nice change, thanks.

This revision is now accepted and ready to land.Oct 20 2021, 9:16 AM