Page MenuHomePhabricator

[AArch64][SVE] More unpredicated ld1/st1 patterns for reg+reg addressing modes
ClosedPublic

Authored by efriedma on Apr 14 2021, 10:32 PM.

Details

Summary

In some cases, we can improve the generated code by using a load with the "wrong" element width: in particular, using ld1b/st1b when we see reg+reg without a shift.

Diff Detail

Event Timeline

efriedma created this revision.Apr 14 2021, 10:32 PM
efriedma requested review of this revision.Apr 14 2021, 10:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2021, 10:32 PM

Sounds sensible to me but wonder if there's a nicer way to create the patterns as it seems we're likely to have a significant number of patterns targeting the same instructions. Do you think it's possible to have something like load_8 match all things where LD1_B (reg+reg) can be used?

I'll multiclass the patterns so they don't repeat; maybe a little more verbose than the ideal, but probably good enough.

efriedma updated this revision to Diff 341620.Thu, Apr 29, 1:15 PM
efriedma edited the summary of this revision. (Show Details)

Updated to handle other relevant types.

Are we concerned about the reduced code quality shown by the splice tests? Is there a way to know if the basic block is, or within, a loop body? so that we can restrict the patterns to instances where we're confident the extra instructions are likely to be hoisted. I guess for the PTRUE case we already know there is a need for some kind of machine pass to remove larger element PTRUEs when a smaller element one is safe to use.

llvm/test/CodeGen/AArch64/named-vector-shuffles-sve.ll
1166–1169

This suggest we should limit the usage to only when there's a single use of the address? or more precisely not kick in unless we can guarantee the add/sub will be omitted.

The splice tests show two issues:

  1. We don't care about the number of uses of the address; this can lead to an extra loop-invariant instruction if one of the operands of the add is a small constant. This is a issue with all patterns using the am_sve_regreg_lsl0 matcher; see the generated code for splice_nxv16i1. This seems like an edge case... you specifically need a small constant offset, and the current formulation probably has lower latency. If you think it's worth addressing, I can try messing with the matcher.
  2. We generate an extra ptrue because of the element width. The ptrue thing seems like a general issue we'd want to address elsewhere; not sure it should block this.

Is there a way to know if the basic block is, or within, a loop body?

SelectionDAGISel has LoopInfo, so in theory, we could compute it? But I don't think we pass through the information at the moment.

Matt added a subscriber: Matt.Sat, May 1, 8:08 AM
paulwalker-arm accepted this revision.Sun, May 2, 2:34 AM

If you think it's worth addressing, I can try messing with the matcher.

That's OK, as you say, this is not a problem introduced by this patch and we have a good story for how to resolve the issues at a later date.

This revision is now accepted and ready to land.Sun, May 2, 2:34 AM
This revision was landed with ongoing or failed builds.Mon, May 3, 3:06 PM
This revision was automatically updated to reflect the committed changes.