Page MenuHomePhabricator

[AArch64][SVE] Optimise repeated floating-point complex patterns to splat
Changes PlannedPublic

Authored by MattDevereau on Sep 1 2022, 8:21 AM.



Repeated floating-point complex patterns in BUILD_VECTOR such as (f32 a, f32 b,
f32 a, f32 b) can be simplified to SPLAT_VECTOR(f64(a, b))

Diff Detail

Event Timeline

MattDevereau created this revision.Sep 1 2022, 8:21 AM
MattDevereau requested review of this revision.Sep 1 2022, 8:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2022, 8:21 AM
Matt added a subscriber: Matt.Sep 1 2022, 11:13 AM

As mentioned below I don't believe this transform is save outside of the context of how it's used in the AArch64ISD::DUPLANE128 case you care about. Hence my do it as a DAG combine suggestion.


ISD::BUILD_VECTOR has its own class BuildVectorSDNode that provides getRepeatedSequence(), which can probably help here and perhaps mean we can cover more cases?


This is not a good place for this transformation as it's really an optimisation rather than anything related to lowering. I think you want a DAG combine for AArch64ISD::DUPLANE128 because useWideSplatForBuildVectorRepeatedComplexPattern looks like a helper function rather than a correct optimisation in its own right.


Not sure if I'm sending you down a dead end but I'm wondering if instead of this logic we've reach the stage where it's preferable to enable the other AArch64ISD::DUPLANE## nodes for scalable vectors. That way you're essentially doing:

AArch64ISD::DUPLANE128(INSERT_SUBVECTOR(BUILD_VECTOR(f32 a, f32 b, f32 a, f32 b))) ->
AArch64ISD::DUPLANE64(INSERT_SUBVECTOR(BUILD_VECTOR(f32 a, f32 b, f32 a, f32 b)))

And then if necessary we add combines to remove the unused parts of the subvector.

The problem with ISD::SPLAT_VECTOR is that is does what you want for floating point types but if you had integer tests you may well see FPR<->GPR transitions you don't want. Of course you can use bitcast to ensure everything looks like a floating point value but that might not be the cleanest approach.

MattDevereau planned changes to this revision.Sep 26 2022, 10:13 AM
MattDevereau marked 2 inline comments as done.

@paulwalker-arm I've moved the new function to DAG combine, and have somewhat side-graded the current complex pattern matching logic to work with the nodes present after DAG combine, which are different from the nodes previously in ISel lowering. I'll focus on implementing your other suggestion of enabling scalable vectors for AArch64ISD::DUPLANE## nodes, feel free to ignore this review for the time being


I've moved this to be a DAG combine inside performDupLane128Combine


This worked perfectly when I tried it, although moving the new function inside DAGCombine meant that the ISD::BUILD_VECTOR node has already been lowered to an ISD::INSERT_SUBVECTOR node.
I've changed the code to something similar here but with ISD::INSERT_SUBVECTOR, if you also have a suggestion similar to the buildvector improvement then that would be optimal. Regardless, this might not be a big deal as I'm focusing on your other suggestion of exposing scalable vector types for AArch64ISD::DUPLANE## for now

MattDevereau requested review of this revision.Sep 28 2022, 6:51 AM
MattDevereau marked 2 inline comments as done.
peterwaller-arm requested changes to this revision.Thu, Nov 10, 4:08 AM
peterwaller-arm added inline comments.

Nothing here considers the insertion index of the insert_vector_elt, and although there could be a canonicalization which puts them in order, I'm not certain can rely on the order of the operations to give you the full sequence (e.g. what happens if some insertions are missing?).


This condition doesn't look right: it doesn't consider all of the elements of the sequence, only N/2+2 of them.


This considers Sequence[0] but not Sequence[1], so I believe it's getting the right codegen by coincidence.

Also worth highlighting in a comment that Sequence is in reversed order.


This kill indicates that q0 contains live data, and it's lost after the change. This could allow the compiler to reorder instructions ignorant of the fact that the data going into the mov via v0 is live. Looking at the code I can see you're not passing %x along, so that needs to be fixed.

This revision now requires changes to proceed.Thu, Nov 10, 4:08 AM

Added test dupq_f32_repeat_complex_unordered and dupq_f32_repeat_complex_rev_unordered which assert the simplifiction does not trigger when insert_vector_elt's indices are not incremental
Added test dupq_f32_repeat_complex_rev to assert the simplification takes place when starting from an insertelement into the final vector index.
Added tests dupq_f16_repeat_complex_omit_pair the simplification does not trigger when vector elements are unknown.
Added tests dupq_f16_repeat_complex_mismatched_front, dupq_f16_repeat_complex_mismatched_end and dupq_f16_repeat_complex_mismatched_middle to verify the simplifaction considers the full range of elements

MattDevereau added inline comments.Thu, Nov 10, 7:53 AM

I've added logic to check the ordering with NumElements. I've added tests dupq_f32_repeat_complex_unordered, dupq_f32_repeat_complex_rev_unorderedand dupq_f16_repeat_complex_omit_pair to assert this


I've changed this to RSequence.size() - 2;. I've added tests dupq_f16_repeat_complex_mismatched_front, dupq_f16_repeat_complex_mismatched_end and dupq_f16_repeat_complex_mismatched_middle to assert this.


I think the updated tests now reflect this with 2 kill messages:
; CHECK-NEXT: // kill: def $s0 killed $s0 def $z0
; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1

peterwaller-arm accepted this revision.Thu, Nov 10, 8:30 AM

One suggestion inline.

Accepting to remove my block and encourage other reviewers to weigh in if they have problems with the approach. Please hold off submission until late Monday as I would like to do one more pass over it before we push.


Would it be better to express this condition in terms of 'if (VecElTy not in (set of supported types))` so that it's impossible for new types to arise which would be broken? (Also makes it read as an explicit gate of supported types).

This revision is now accepted and ready to land.Thu, Nov 10, 8:30 AM

Hi @MattDevereau, Sorry if we've discussed this before but is there a reason this cannot be solved at the IR level as an instcombine?

MattDevereau planned changes to this revision.Thu, Nov 17, 5:36 AM

@paulwalker-arm I've moved this patch to an instcombine here: I'll mark this as changes planned for now and abandon it after a short period of time.