This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.
ClosedPublic

Authored by fhahn on May 4 2023, 2:07 PM.

Details

Summary

This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 2 benefits:

  1. the VPlan-based version is simpler; we don't need to implement special codegen for each supported instruction type like the IR based one.
  2. Removes a dependency on the cost-model after VPlan execution and
  3. Removes a use of getVPValue that uses underlying values after VPlan execution (See removed FIXME).

Depends on D149081.

Depends on D149079.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Ayal added inline comments.Oct 4 2023, 4:03 PM
llvm/lib/Transforms/Vectorize/VPlan.h
280 ↗(On Diff #553760)

How/Is this removal related?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
769

(Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.

770

nit: use LiveInInst or something similar rather than UI?

776
783

Set once before the loop for all live-ins to be truncated.

792

Any order other than depth first would also do, right?

803

(Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
Or migrate the analysis that builds MinBWs to run on VPlan.

804

nit: lookup.

809

Would be good to comment how memory and replicate cases are (not) processed.

815

Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs members correspond to Integer types, of distinct (smaller) size, whether live-in or not?

825

This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other?

OTOH, better handle Trunc here as well? Is it handled well below?

829

// SExt/Zext is redundant - stick with its operand.
?

836

Place assert earlier?

838–839
850

This means the size of all operands is equal to NewResSizeInBits, can this be?

854–855

nit: keep consistent with above.

863–865

nit: keep consistent with above.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
47

nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to InstCombine.

Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 557740.Oct 17 2023, 1:04 PM
fhahn marked 21 inline comments as done.

Address latest comments, apologies for the delay!

llvm/lib/Transforms/Vectorize/VPlan.h
280 ↗(On Diff #553760)

The last user of this function has been removed in the patch.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
759–760

Code has been moved to D159202

762–764

code has been moved to D159202

768

Wrapped and added comment, thanks!

770

Renamed, thanks!

771

Updated, thanks!

776

Adjusted, thanks!

777

Turned into assert, thanks!

783

hoisted, thanks!

792

Yes, I think the order doesn't matter here.

804

Done, thanks!

809

Added a comment, thanks!

815

Turned isIntegerTy into assert but retained size check as there entries where the sizes are the same (e.g. for truncs).

825

Thanks, changed to if. I don't think Trunc is handled explicitly in the latest version.

829

this check has been moved up and is not needed any longer.

836

moved up,, thanks!

838–839

adjusted, thanks!

850

There are cases where a Zext narrowed earlier is used as operand here, so the tie is already adjusted.

854–855

Adjusted, thanks!

863–865

reordered, thanks!

Ayal added a comment.Oct 21 2023, 2:02 PM

Various comments, also trying to reason about how this patch changes tests.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3564

Retain a comment explaining why replicate recipes are not truncated?

3599

Retain this comment regarding dropping wrapping flags?

3614

A Trunc is handled by shrinking its operand.

3639

(If nothing is done to the operands, what is the result extended too?)

llvm/lib/Transforms/Vectorize/VPlan.h
280 ↗(On Diff #553760)

Very well!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
760

Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited

Very well.

768

Suffice to ask if (!NewResSizeInBits)?

769

Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.

772

assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.

785

Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

Might be a potential follow-up, but we would still potentially updated MinBWs on each recipe replacement?

Sure, like updating any other property of a recipe when replaced.

788

Can skip phi's, none are included in MinBWs.

789

Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.

801

Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:

A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values, continue to feed and use values of the original type.
A pair of such instructions where one feeds the other will be added a zext-trunc pair between them which will later be folded.

B. Instructions that convert between two distinct types, continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?

C. Loads are ignored - excluded from MiinBWs? They could potentially be narrowed to load only the required bits, though its unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)

D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)

802

Suffice to ask if (!NewResSizeInBits)?

803

Thoughts about the above?

811

Should replicate recipes be handled next to handling widen memory recipes above?

815

nit: ResTy >> OldResTy, ResSizeInBits >> OldResSizeInBits

818

assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?"); here instead of below?

824

nit: VPC >> OldExt, Opc >> OldOpc?

825

Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and has its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?

828

Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.

829

?

833

nit: C >> NewCast?

If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?

850

Maybe worth a comment.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
47

Thoughts on the above?
Better truncate to minimal bitwidth asap, as it relies on IR information? Conceptually a scalar transform.
Does "as hints to InstCombine" below still hold?

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
9–10

hmm, we now spot the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, originally both TMP4 and TMP10.

15

Spotted and removed duplicate zext of WIDE_LOAD8.

55

BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?

55

BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?

55

This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b) checking that computing 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.

56

Both insertelement's now use poison.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
308 ↗(On Diff #557740)

We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>,
but fail to fold a similar one following the add-2's?

338 ↗(On Diff #557740)

We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it - missed MinBW/trunc-zext opportunity?

484 ↗(On Diff #557740)

Hmm, before we narrowed these two sufflevectors to operate on <16 x i8> and zext-trunc their result, now we let them operate on original <16 x i32> and truncate the result?

498 ↗(On Diff #557740)

Many zext-trunc pairs left to collect.

513 ↗(On Diff #557740)

Above trunc of TMP2 is redundant along with its zext in the ph.

520 ↗(On Diff #557740)

Above trunc of TMP4 is redundant along with its zext in the ph.

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334 ↗(On Diff #557740)

We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?

fhahn updated this revision to Diff 558116.Nov 16 2023, 2:15 PM
fhahn marked 29 inline comments as done.

Address comments and major simplification after moving cast folding to simplifyRecipes.

Hope all comments should be addressed, hope i didn't miss any.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3564

Retained when skipping VPReplicateRecipe.

3599

Done, thanks!

3639

It stays the same, there's no extend in that case.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
760

This has been updated to now use VPTypeAnalysis.

768

This code has now been removed; LiveIns are handled when truncating the other operands of an instruction; otherwise we leave the type info in an inconsistent state.

769

LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.

772

Turned into an assert, thanks!

788

There's an early continue now that skips phis and other unsupported recipes.

789

Nope, looks like this is not needed in the latest version.

801

The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.

802

Simplified, thanks!

803

I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.

811

We still need to count them for verification

fhahn added inline comments.Nov 16 2023, 2:15 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
815

Renamed, thanks!

818

Done, and also removed continue

824

This code is now gone, handled by recipe simplification.

828

Code is gone now

829

Code now gone.

833

Code gone now.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
55

The latest version avoids truncating the same value twice.

55

The latest version avoids truncating the same value twice.

55

still more work to do :)

Arguably the invariant instructions are artificial, in the regular pipeline, no invariant instructions should remain.

56

I think the use of undef is a leftover that wasn't updated; it should be poison.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
308 ↗(On Diff #557740)

folding now happens all in simplifyRecieps, should handle this now

484 ↗(On Diff #557740)

I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.

498 ↗(On Diff #557740)

Should be better cleaned up now

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334 ↗(On Diff #557740)

trunc/ext pairs should be better cleaned up in the latest version

Ayal added a comment.Nov 22 2023, 1:58 PM

Looks much simpler! Minor last nits.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
757

nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?

757–759

nit

766–769

?

792

But a (more) expensive RPOT order is needed, to handle defs before uses?

816

Is it possible for MinBWs not to contain Op's live-in IR value in this case?

fhahn updated this revision to Diff 558159.Nov 23 2023, 4:08 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

fhahn added inline comments.Nov 23 2023, 4:10 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
757

Updated, thanks!

757–759

Done thanks! This also limits the scope of TypeInfo to the range where it is valid. after `truncateToMinimalBitwidths, we would need to invalidate the info for the modified recipes otherwise. This can be done in the future.

766–769

Simplified , thanks!

792

The latest version should not need RPO, as the bit width of the results do not change for any user (previously they might due to early cast simplifications). Changed to depth first.

816

Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying assert.

Ayal added inline comments.Nov 29 2023, 9:23 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
757–759

Very well. Worth commenting that TypeInfo should not be used following truncateToMinimalBitwidths.

763

nit: ProcessedRecipesNum?

763

ProcessedTruncs is used outside ifdef below, move its definition out of ifdef here? Or is it meant to ensure truncated operands are counted once by ProcessedRecipes for debugging only? If an operand is truncated multiple times, all its truncations must be to the same size, because "MinBW should specify a consistent minimal bit width for all users(, and for all operands)"?

Worth explaining why processed truncs are recorded.

766

Should PH be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert them at the end of PH...

769

Shrunk operands are placed before R, but its extension is placed after - and calls for this make_early_inc_range, right?

785

Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.

785

Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe when the latter is created, supplementing its underlying inst.

790

Does OldResSizeInBits equal to the size of OldResTy, for the non-cast Widen or Select R?

807

Ins? Perhaps ProcessedTrunc?

808

Handle the simple if !ins.second /* Op already processed */ case first, potentially early-continuing?

Clearer to check if ProcessedTruncs.lookup(Op) or if ProcessedTruncs.contains(Op) and if so use ProcessedTruncs[Op], otherwise insert it?

811

nit: place simpler if !isLiveIn case first?

813–816

nit

821

Note that truncations of live-ins could also be inserted before R, thereby leaving the treatment of live-ins to debugging only, and leaving their LICM and commoning to a subsequent VPlan cleanup pass, along with trunc-zext foldings.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
47

WDYT on the above: should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 558186.Nov 29 2023, 11:36 AM
fhahn marked 15 inline comments as done.

Rebase and address latest comments, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
757–759

Sunk further into truncateToMinimualBitwidths

763

Changed to NumProcessedRecipes

763

It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Moved out of ifdef

766

Good point, there should be nothing to shrink in PH for now, as the analysis is for the loop body only, adjusted!

769

Yep

785

Do you mean updating the comment here or just a general note? We need to include the recipes in the count, otherwise the verification later will fail

790

Yes, I forgot to remove this use of IR getType. Updated to use TypeInfo.inferScalarType(ResultVPV) and then getScalarSizeInBits of the returned type.

807

Updated, thanks!

808

Early continue would mean duplicating the code to update the operands, I left things for now as is, including using insert. insert means we only need to lookup the insert-pos once, vs 2 lookups with separate lookup and then `[]. WDYT?

811

Done, thanks!

821

Yep, for now it is simpler and results in a smaller test diff to do it directly there as it is not only LICM but also very simple CSE

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
47

Sounds good, updated, thanks!

Ayal added inline comments.Nov 29 2023, 2:23 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
756–835

nit: redundant move of empty line?

763

Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Very well, may deserve a comment.

785

I mean we count casts as if they are processed, expecting they will be later, w/o checking that they actually do.

790

Ah, ok, wondered if using the size of the type of UI directly would be simpler?

795

Should be the same Ctx passed in as parameter?

808

OK, WDYT of the something as follows:

        auto [ProcessedIter, DidNotExist] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = DidNotExist ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!DidNotExist)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          Shrunk->insertBefore(&R);
        } else {
          PH->appendRecipe(Shrunk);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
47
Ayal added inline comments.Nov 30 2023, 12:05 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
808

Maybe IterIsEmpty would be a better name, to avoid double negation, as in:

        auto [ProcessedIter, IterIsEmpty] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = IterIsEmpty ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!IterIsEmpty)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
fhahn updated this revision to Diff 558195.Nov 30 2023, 3:12 AM
fhahn marked 6 inline comments as done.

Addressed latest comments, thanks!

fhahn added inline comments.Nov 30 2023, 5:14 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
756–835

changed back, thanks!

763

Added a comment to ProcessedTruncs definition.

785

They don't need handling explicitly, as redundant casts will be removed later. Expanded the comment slightly to

Also skip casts which do not need to be handled explicitly here, as redundant casts will be removed during recipe simplification.
790

It might be slightly simpler, but would mean this may lead to a crash further down the line, once we support recipes without underlying values/instructions (and we forget to update this line) and/or if some other transform adjusted the type. Left as is for now

795

Yes, fixed!

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
47

Fixed, thanks!

Ayal accepted this revision.Nov 30 2023, 5:22 AM

This looks good to me, thanks for accommodating!
Adding a minor redundancy spotted plus some test related notes.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
833

redundant - hoist above the early-continue.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
54

Trunc & insertelement LICM'd from vec.epilog.vector.body to vec.epilog.ph.

55

Duplicated TMP0 and TMP1 still here?

55

Still seeing duplicate TMP2 and TMP3?

55–56

ditto.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
308 ↗(On Diff #557740)

The one following the add-2's is also folded now.

338 ↗(On Diff #557740)

Other pair also folded now.

484 ↗(On Diff #557740)

Worth testing with a subsequent instCombine, to ensure pessimization is avoided?

498 ↗(On Diff #557740)

Indeed looks like it!

30 ↗(On Diff #558195)

Fold zext-trunc pair, several such cases follow.

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334 ↗(On Diff #557740)

Indeed!

This revision is now accepted and ready to land.Nov 30 2023, 5:22 AM
fhahn marked 2 inline comments as done.Dec 2 2023, 8:15 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
833

Fixed in the committed version, thanks!

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
55

They were due to redundant casts being added for Live-in values, fixed by checking in VPWidenCastRecipe::execute for now, with a FIXME to address this with explicit unrolling.