This is an archive of the discontinued LLVM Phabricator instance.

[LV] Only generate 1st part outside of vector region for VPInstruction.
Needs ReviewPublic

Authored by fhahn on Jun 24 2023, 3:43 AM.

Details

Reviewers
Ayal
gilr
rengolin
Summary

Update VPInstruction::execute to only generate code for the first part
if the recipe is outside the vector loop region (i.e. the pre-headers
for now).

This avoids generating unnecessary code in the pre-headers (see adjusted
tests) and going forward allows using VPInstructions more generally for
code-generation outside the vector loop.

To facilitate this, VPTransformState::set/get() have been extended with
versions that take a std::variant with either the Part or a VPIteration.

When getting a the result for a definition outside the vector loop
region, always return Part 0. This means the callers do not need to
check where the definition resides.

Diff Detail

Event Timeline

fhahn created this revision.Jun 24 2023, 3:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 24 2023, 3:43 AM
fhahn requested review of this revision.Jun 24 2023, 3:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 24 2023, 3:43 AM
Herald added a subscriber: vkmr. · View Herald Transcript
Ayal added a comment.Jun 26 2023, 8:15 AM

Update VPInstruction::execute to only generate code for the first part
if the recipe is outside the vector loop region (i.e. the pre-headers
for now).

If a recipe should generate code for first part only, depends in general on its uniformity rather than it being placed inside or outside the vector loop region.

Simpler and more consistent to register uniform values across all parts when defined, rather than check their uniformity during retrieval when used?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9650 ↗(On Diff #534188)

If a recipe should generate one invariant/uniform-across-UF IR value across all parts, regardless of its position being inside or outside the vectorized loop, it can do so while registering it for all parts, as done here.

9863 ↗(On Diff #534188)

Def->isDefinedOutsideVectorRegions()?

fhahn updated this revision to Diff 535716.Jun 29 2023, 4:29 AM
fhahn marked an inline comment as done.

Address latest comments, thanks!

Update VPInstruction::execute to only generate code for the first part
if the recipe is outside the vector loop region (i.e. the pre-headers
for now).

If a recipe should generate code for first part only, depends in general on its uniformity rather than it being placed inside or outside the vector loop region.

Simpler and more consistent to register uniform values across all parts when defined, rather than check their uniformity during retrieval when used?

Updated to generate the first part only and register the value accross all parts as suggested. For now, I updated the code to use isDefinedOutsideVectorRegions. This can be further extended to check for invariant/uniform-across-VF in the future.

fhahn marked an inline comment as done.Jun 29 2023, 4:31 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9650 ↗(On Diff #534188)

I updated the patch to set the value for all parts for values defined outside of the loop; to catch this cases we would need to either hoist the load out of the loop or add way to check for values uniform across all parts/invariance, which could be done in the future based on VPlan.

9863 ↗(On Diff #534188)

Adjusted, thanks!

Ayal added inline comments.Jun 30 2023, 7:51 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
1062

nit: this actually checks if a VPValue is "top level" - either live-in or defined outside any region. But non-vector regions currently reside inside vector regions, so this is equivalent.

llvm/lib/Transforms/Vectorize/VPlan.h
272 ↗(On Diff #535716)

Are these changes to the interface of State::get() still needed?

296 ↗(On Diff #535716)

Better keep code-gen and State::set() simple, modeling uniformity behavior explicitly and consistently in recipes, and only there?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
297

TODO: this replication should be taken out to the caller of generateInstruction for such VPInstructions known to be uniform but appearing inside the vector loop.

1164

Let's leave it for a recipe generating an invariant/uniform to set all parts, for consistency, as in Replicate Recipe?

llvm/lib/Transforms/Vectorize/VPlanValue.h
190–191

The last i.e. part above is obsolete?

Ayal added inline comments.Jun 30 2023, 7:53 AM
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
371–379

Worth separating the generation of Instruction(s) from setting the State, and be explicit about both, as in the above?

Can check hasResult() instead of num users to ensure that the generated instruction produces a value, or have generateInstruction() return null if the Instruction(s) it produces do not supply a result that can be used, as in stores and branches.

fhahn updated this revision to Diff 536323.Jun 30 2023, 10:58 AM
fhahn marked 8 inline comments as done.

Address latest comments, thanks!

Ayal added a comment.Jul 2 2023, 2:14 AM

Looks good to me, thanks!
May be good to reason about the VPlan recipes and/or show their dump for tests where changes in the IR are harder to track due to glitches in the diff.

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
375

The other assert is also/more important - if hasResult then GeneratedValue being set in State must not be null?

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
394

While we're here, vscale seems to provide additional opportunities above for deduplication in the preheader.

403

Suffice to have one instance of this {call-vscale;mul-by-32;sub;icmp;select} sequence defined in the preheader than four uniform replicas. But what's the reason for eliminating the {call-vscale;mul-by-8/16/24;add} non-identical sequences defined in the preheader, which are not uniform? Would be good to clarify if they all stem from the same recipe, and what the UF is.

Note that, in general, a recipe placed in the preheader could prepare a distinct value per each part. E.g., to initialize an add reduction with distinct starting value in first lane/part, rather than adding it to the sum at the end. But suffice to generate a uniform-across-UF value for a single part, and reuse it across all parts.

404–405

Fine - suffice to have one instance of this call-active-lane-mask defined in the preheader than four replicas.

406

Fine - there are no real changes inside vector.body, just a misalignment in the diff.

llvm/test/Transforms/LoopVectorize/scalable-first-order-recurrence.ll
205–206

Fine - suffice to have a single instance of this identical broadcasting {insert; shuffle} sequence in the preheader.

fhahn updated this revision to Diff 537041.Jul 4 2023, 4:03 AM

Update after recent changes to D154240

Ayal added inline comments.Jul 4 2023, 7:09 AM
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
403

Is this cleanup the expected behavior, presumably associated with (one? two?) VPInstruction::ActiveLaneMask placed in the preheader? May be helpful to dump the VPlan, although its (uniform or non-uniform) recipes are compressed for all parts.

The rest seems fine.