Repeated floating-point complex patterns in BUILD_VECTOR such as (f32 a, f32 b,
f32 a, f32 b) can be simplified to SPLAT_VECTOR(f64(a, b))
Details
Diff Detail
Event Timeline
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11340–11341 | As mentioned below I don't believe this transform is save outside of the context of how it's used in the AArch64ISD::DUPLANE128 case you care about. Hence my do it as a DAG combine suggestion. | |
11356–11362 | ISD::BUILD_VECTOR has its own class BuildVectorSDNode that provides getRepeatedSequence(), which can probably help here and perhaps mean we can cover more cases? | |
11361–11363 | This is not a good place for this transformation as it's really an optimisation rather than anything related to lowering. I think you want a DAG combine for AArch64ISD::DUPLANE128 because useWideSplatForBuildVectorRepeatedComplexPattern looks like a helper function rather than a correct optimisation in its own right. | |
11371–11382 | Not sure if I'm sending you down a dead end but I'm wondering if instead of this logic we've reach the stage where it's preferable to enable the other AArch64ISD::DUPLANE## nodes for scalable vectors. That way you're essentially doing: AArch64ISD::DUPLANE128(INSERT_SUBVECTOR(BUILD_VECTOR(f32 a, f32 b, f32 a, f32 b))) -> AArch64ISD::DUPLANE64(INSERT_SUBVECTOR(BUILD_VECTOR(f32 a, f32 b, f32 a, f32 b))) And then if necessary we add combines to remove the unused parts of the subvector. The problem with ISD::SPLAT_VECTOR is that is does what you want for floating point types but if you had integer tests you may well see FPR<->GPR transitions you don't want. Of course you can use bitcast to ensure everything looks like a floating point value but that might not be the cleanest approach. |
@paulwalker-arm I've moved the new function to DAG combine, and have somewhat side-graded the current complex pattern matching logic to work with the nodes present after DAG combine, which are different from the nodes previously in ISel lowering. I'll focus on implementing your other suggestion of enabling scalable vectors for AArch64ISD::DUPLANE## nodes, feel free to ignore this review for the time being
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11340–11341 | I've moved this to be a DAG combine inside performDupLane128Combine | |
11356–11362 | This worked perfectly when I tried it, although moving the new function inside DAGCombine meant that the ISD::BUILD_VECTOR node has already been lowered to an ISD::INSERT_SUBVECTOR node. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
20704 | Nothing here considers the insertion index of the insert_vector_elt, and although there could be a canonicalization which puts them in order, I'm not certain can rely on the order of the operations to give you the full sequence (e.g. what happens if some insertions are missing?). | |
20715 | This condition doesn't look right: it doesn't consider all of the elements of the sequence, only N/2+2 of them. | |
20729 | This considers Sequence[0] but not Sequence[1], so I believe it's getting the right codegen by coincidence. Also worth highlighting in a comment that Sequence is in reversed order. | |
llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll | ||
590 | This kill indicates that q0 contains live data, and it's lost after the change. This could allow the compiler to reorder instructions ignorant of the fact that the data going into the mov via v0 is live. Looking at the code I can see you're not passing %x along, so that needs to be fixed. |
Added test dupq_f32_repeat_complex_unordered and dupq_f32_repeat_complex_rev_unordered which assert the simplifiction does not trigger when insert_vector_elt's indices are not incremental
Added test dupq_f32_repeat_complex_rev to assert the simplification takes place when starting from an insertelement into the final vector index.
Added tests dupq_f16_repeat_complex_omit_pair the simplification does not trigger when vector elements are unknown.
Added tests dupq_f16_repeat_complex_mismatched_front, dupq_f16_repeat_complex_mismatched_end and dupq_f16_repeat_complex_mismatched_middle to verify the simplifaction considers the full range of elements
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
20704 | I've added logic to check the ordering with NumElements. I've added tests dupq_f32_repeat_complex_unordered, dupq_f32_repeat_complex_rev_unorderedand dupq_f16_repeat_complex_omit_pair to assert this | |
20715 | I've changed this to RSequence.size() - 2;. I've added tests dupq_f16_repeat_complex_mismatched_front, dupq_f16_repeat_complex_mismatched_end and dupq_f16_repeat_complex_mismatched_middle to assert this. | |
20729 | I think the updated tests now reflect this with 2 kill messages: |
One suggestion inline.
Accepting to remove my block and encourage other reviewers to weigh in if they have problems with the approach. Please hold off submission until late Monday as I would like to do one more pass over it before we push.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
20700 | Would it be better to express this condition in terms of 'if (VecElTy not in (set of supported types))` so that it's impossible for new types to arise which would be broken? (Also makes it read as an explicit gate of supported types). |
Hi @MattDevereau, Sorry if we've discussed this before but is there a reason this cannot be solved at the IR level as an instcombine?
@paulwalker-arm I've moved this patch to an instcombine here: https://reviews.llvm.org/D138203. I'll mark this as changes planned for now and abandon it after a short period of time.
As mentioned below I don't believe this transform is save outside of the context of how it's used in the AArch64ISD::DUPLANE128 case you care about. Hence my do it as a DAG combine suggestion.