In some cases, we can improve the generated code by using a load with the "wrong" element width: in particular, using ld1b/st1b when we see reg+reg without a shift.
Sounds sensible to me but wonder if there's a nicer way to create the patterns as it seems we're likely to have a significant number of patterns targeting the same instructions. Do you think it's possible to have something like load_8 match all things where LD1_B (reg+reg) can be used?
Are we concerned about the reduced code quality shown by the splice tests? Is there a way to know if the basic block is, or within, a loop body? so that we can restrict the patterns to instances where we're confident the extra instructions are likely to be hoisted. I guess for the PTRUE case we already know there is a need for some kind of machine pass to remove larger element PTRUEs when a smaller element one is safe to use.
This suggest we should limit the usage to only when there's a single use of the address? or more precisely not kick in unless we can guarantee the add/sub will be omitted.
The splice tests show two issues:
- We don't care about the number of uses of the address; this can lead to an extra loop-invariant instruction if one of the operands of the add is a small constant. This is a issue with all patterns using the am_sve_regreg_lsl0 matcher; see the generated code for splice_nxv16i1. This seems like an edge case... you specifically need a small constant offset, and the current formulation probably has lower latency. If you think it's worth addressing, I can try messing with the matcher.
- We generate an extra ptrue because of the element width. The ptrue thing seems like a general issue we'd want to address elsewhere; not sure it should block this.
Is there a way to know if the basic block is, or within, a loop body?
SelectionDAGISel has LoopInfo, so in theory, we could compute it? But I don't think we pass through the information at the moment.