This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Combine add and index_vector
Closed · Public

Authored by junparser on Apr 8 2021, 7:13 AM.

Details

Summary

This patch tries to combine the pattern add(index_vector(zero, step), dup(X)) into index_vector(X, step).
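For illustration, here is a minimal, hypothetical IR example of the kind of input this targets (the function name and value names are invented rather than taken from the patch's tests); after the combine, the splat, stepvector and add should select to a single index instruction such as index z0.s, w0, #1:

  declare <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()

  define <vscale x 4 x i32> @add_stepvector_splat(i32 %x) {
  entry:
    ; dup(%x): splat %x across all lanes.
    %ins = insertelement <vscale x 4 x i32> undef, i32 %x, i32 0
    %splat = shufflevector <vscale x 4 x i32> %ins, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    ; index_vector(zero, 1): the series 0, 1, 2, ...
    %step = call <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()
    ; add(index_vector(zero, 1), dup(%x)) is the pattern being folded to index_vector(%x, 1).
    %add = add <vscale x 4 x i32> %step, %splat
    ret <vscale x 4 x i32> %add
  }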

TestPlan: check-llvm

Diff Detail

Event Timeline

junparser created this revision.Apr 8 2021, 7:13 AM
junparser requested review of this revision.Apr 8 2021, 7:13 AM
Herald added a project: Restricted Project. · Apr 8 2021, 7:13 AM
junparser updated this revision to Diff 336096.Apr 8 2021, 7:15 AM

fix typo.

junparser updated this revision to Diff 336099.Apr 8 2021, 7:21 AM
This comment was removed by junparser.
junparser retitled this revision from [AArch64][SVE] Optimize index_vector with add to [AArch64][SVE] Combine add and index_vector.Apr 8 2021, 7:22 AM
junparser edited the summary of this revision.
Harbormaster completed remote builds in B97721: Diff 336096.
sdesmalen added inline comments.Apr 9 2021, 4:42 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13437 ↗(On Diff #336099)

Hi @junparser, there are a few things to take into account here:

  • Folding the add into the index_vector may be more expensive in practice: if index_vector(zero, step) is created once, subsequent adds may be cheaper than having multiple index_vector instructions which each need to calculate the expansion of the (very similar) series. It's probably best to check that the index_vector has only a single use (see the IR sketch below this comment for the multi-use case).
  • The form where it uses an add may come in useful when selecting an addressing mode for the gather/scatter operations, in which case: add(nxv4i32 dup(X), nxv4i32 index_vector(zero, step)) can be implemented with a scalar+vector addressing mode such as [x0, z0.s]. You may want to add a check that the result is not used as the address in a load/store operation.

This can probably be implemented quite easily in TableGen with some patterns that try to do the fold, but where the PatFrag itself has a condition that it only has a single use and that this use is not the address in a MemSDNode. See e.g. AArch64mul_p_oneuse where it also checks for a single use before selecting an MLA/MLS.

Because of how we have currently implemented the legalization of gathers/scatters, I don't think you can currently test the condition that the node is not used in a MemSDNode, but I'm not sure if that matters much.
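To illustrate the first point above, here is a hypothetical IR sketch (names invented, not from the patch's tests) in which the stepvector has two uses; folding both adds would materialise two index instructions, each recomputing the series, instead of one index plus two cheap vector adds:

  declare <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()

  define void @step_two_uses(i32 %x, i32 %y, <vscale x 4 x i32>* %pa, <vscale x 4 x i32>* %pb) {
  entry:
    ; One stepvector feeding two different adds.
    %step = call <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()
    %insx = insertelement <vscale x 4 x i32> undef, i32 %x, i32 0
    %splatx = shufflevector <vscale x 4 x i32> %insx, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    %insy = insertelement <vscale x 4 x i32> undef, i32 %y, i32 0
    %splaty = shufflevector <vscale x 4 x i32> %insy, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    ; A single-use check would keep the fold from rewriting both of these.
    %adda = add <vscale x 4 x i32> %step, %splatx
    %addb = add <vscale x 4 x i32> %step, %splaty
    store <vscale x 4 x i32> %adda, <vscale x 4 x i32>* %pa
    store <vscale x 4 x i32> %addb, <vscale x 4 x i32>* %pb
    ret void
  }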

What I'd like to understand better is the boundary between TableGen pattern matching and DAG combines; I've never figured this out. Using this case as an example, we can implement the feature in either place, but I don't know how to handle commutativity in TableGen (maybe SDNPCommutative? I'm not sure), or what effect each choice has on the next step: combining load/store with index_vector.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13437 ↗(On Diff #336099)

Hi @sdesmalen, it looks reasonable to me to check that the add has only one use. As for the addressing mode for gather/scatter, have we already implemented it? I haven't seen it in trunk. My next plan is to combine load/store with index_vector.

13437 ↗(On Diff #336099)

What I'd like to understand better is the boundary between TableGen pattern matching and DAG combines; I've never figured this out. Using this case as an example, we can implement the feature in either place, but I don't know how to handle commutativity in TableGen (maybe SDNPCommutative? I'm not sure), or what effect each choice has on the next step: combining load/store with index_vector.

There isn't a firm rule to follow re tablegen pattern vs manual dag combine, but I think the steer is that when the combine is needed after legalization and it's possible to specify as a pattern, a pattern is preferred. In this case, probably a pattern would be a clean way to implement it. I haven't used SDNPCommutative before, but that looks like it should do the trick.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13437 ↗(On Diff #336099)

The gather/scatter nodes are not implemented as patterns at the moment, only as custom dagcombine code. So adding a check that it isn't used in an addressing mode would currently have no effect, but it probably would if we change our gather/scatter implementation to not use the intrinsics, but use common nodes instead. This is something we want to do in the not-too-distant future. You can alternatively also leave a FIXME.

My next plan is to combine load/store with index_vector

Can you elaborate what you mean by that?

What I'd like to understand better is the boundary between TableGen pattern matching and DAG combines; I've never figured this out. Using this case as an example, we can implement the feature in either place, but I don't know how to handle commutativity in TableGen (maybe SDNPCommutative? I'm not sure), or what effect each choice has on the next step: combining load/store with index_vector.

There isn't a firm rule to follow re tablegen pattern vs manual dag combine, but I think the steer is that when the combine is needed after legalization and it's possible to specify as a pattern, a pattern is preferred. In this case, probably a pattern would be a clean way to implement it. I haven't used SDNPCommutative before, but that looks like it should do the trick.

Thanks for explaining this.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13437 ↗(On Diff #336099)

We already have nodes like GLD1_MERGE_ZERO and combine functions like performGLD1Combine, so rather than implementing it as a TableGen pattern, I'd prefer to do it here, and then do some more work in performGLD1Combine with index_vector; that's the plan.

Also, what is the difference between the scalar+vector and vector+index addressing modes? We can also use the vector+index addressing mode after this combination.

junparser updated this revision to Diff 337380.Apr 14 2021, 2:30 AM
junparser edited the summary of this revision.

Address comment.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13437 ↗(On Diff #336099)

Hi @junparser, please can you provide an example for what you're ultimately trying to achieve.

As Sander points out, the decision as to where things should live is not clear cut, and I've nothing against doing things within DAGCombine if it helps reduce isel pattern explosion. That said, I am very keen to cut down the need for duplicated combines due to the lack of a canonicalised DAG. With regards to the addressing modes, we should already have good support by looking for ADD based arithmetic used to compute addresses. Unless there's a good reason, I'd rather not have to duplicate this logic to handle INDEX_VECTOR.

Personally I see INDEX_VECTOR as a last stage optimisation whereby everything else has had a chance to remove the ADD (or MUL if we're talking about a stride) and thus the last option is to merge stray ADDs into the INDEX_VECTOR.

For what it's worth, the target specific INDEX_VECTOR existed before the common STEP_VECTOR nodes, so I think there's a good chance we'll put effort into ensuring any early uses of INDEX_VECTOR are first converted to STEP_VECTOR, to maximise the effectiveness of future common STEP_VECTOR combines.

junparser added inline comments.Apr 14 2021, 5:39 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13437 ↗(On Diff #336099)

Hi @junparser, please can you provide an example for what you're ultimately trying to achieve.

Combine add+index_vector here, then make the addressing-mode choice in selectGatherScatterAddrMode between scalar+vector and vector+index, and convert it back to index_vector + gather.

As Sander points out, the decision as to where things should live is not clear cut, and I've nothing against doing things within DAGCombine if it helps reduce isel pattern explosion. That said, I am very keen to cut down the need for duplicated combines due to the lack of a canonicalised DAG. With regards to the addressing modes, we should already have good support by looking for ADD based arithmetic used to compute addresses. Unless there's a good reason, I'd rather not have to duplicate this logic to handle INDEX_VECTOR.

Personally I see INDEX_VECTOR as a last stage optimisation whereby everything else has had a chance to remove the ADD (or MUL if we're talking about a stride) and thus the last option is to merge stray ADDs into the INDEX_VECTOR.

Sounds reasonable to me.

For what it's worth, the target specific INDEX_VECTOR existed before the common STEP_VECTOR nodes, so I think there's a good chance we'll put effort into ensuring any early uses of INDEX_VECTOR are first converted to STEP_VECTOR, to maximise the effectiveness of future common STEP_VECTOR combines.

I see we currently lower step_vector to index_vector. Do you mean that we should only generate index_vector during instruction selection, and only use step_vector before that? Hmm, maybe that's a good idea.

Address comments. Use TableGen pattern matching.

Thanks, that looks quite neat. Can you also add a few tests for when there is >1 use of the stepvector (e.g. using the stepvector in two adds), so we can check that the fold indeed doesn't happen?
One test for each of the instructions should be sufficient.

llvm/test/CodeGen/AArch64/sve-stepvector.ll
147

nit: add_stepvector_nxv8i8_2_commutative ?
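For reference, the renamed test might look roughly like this (hypothetical body; the actual test in the diff may differ); the only difference from the non-commutative variant is that the splat is the first add operand:

  declare <vscale x 8 x i8> @llvm.experimental.stepvector.nxv8i8()

  define <vscale x 8 x i8> @add_stepvector_nxv8i8_2_commutative(i8 %x) {
  entry:
    %ins = insertelement <vscale x 8 x i8> undef, i8 %x, i32 0
    %splat = shufflevector <vscale x 8 x i8> %ins, <vscale x 8 x i8> undef, <vscale x 8 x i32> zeroinitializer
    %step = call <vscale x 8 x i8> @llvm.experimental.stepvector.nxv8i8()
    ; Operands swapped relative to the existing test: splat + step instead of step + splat.
    %add = add <vscale x 8 x i8> %splat, %step
    ret <vscale x 8 x i8> %add
  }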

junparser updated this revision to Diff 337700.Apr 15 2021, 4:07 AM

Address comments.

junparser updated this revision to Diff 337712.Apr 15 2021, 4:30 AM

Update testcase.

Matt added a subscriber: Matt.Apr 17 2021, 8:41 AM
paulwalker-arm accepted this revision.Apr 19 2021, 3:18 AM
This revision is now accepted and ready to land.Apr 19 2021, 3:18 AM
sdesmalen accepted this revision.Apr 19 2021, 7:53 AM

Other than my comment on two missing tests, the patch looks good to me.

llvm/test/CodeGen/AArch64/sve-stepvector.ll
134

nit: I think you're still missing a test for the scalar+scalar case and one for imm+scalar (for start/stride respectively)
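A sketch of the kind of test this seems to ask for (hypothetical, and it assumes the patterns also fold a mul of the stepvector by a splat stride); the scalar+scalar case would ideally select to a single index z0.s, w0, w1:

  declare <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()

  ; scalar start + scalar stride: dup(%start) + dup(%stride) * stepvector
  define <vscale x 4 x i32> @index_scalar_scalar(i32 %start, i32 %stride) {
  entry:
    %ins.start = insertelement <vscale x 4 x i32> undef, i32 %start, i32 0
    %splat.start = shufflevector <vscale x 4 x i32> %ins.start, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    %ins.stride = insertelement <vscale x 4 x i32> undef, i32 %stride, i32 0
    %splat.stride = shufflevector <vscale x 4 x i32> %ins.stride, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    %step = call <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()
    %mul = mul <vscale x 4 x i32> %step, %splat.stride
    %add = add <vscale x 4 x i32> %mul, %splat.start
    ret <vscale x 4 x i32> %add
  }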

This revision was automatically updated to reflect the committed changes.