This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.
ClosedPublic

Authored by fhahn on May 4 2023, 2:07 PM.

Details

Summary

This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 2 benefits:

  1. the VPlan-based version is simpler; we don't need to implement special codegen for each supported instruction type like the IR based one.
  2. Removes a dependency on the cost-model after VPlan execution and
  3. Removes a use of getVPValue that uses underlying values after VPlan execution (See removed FIXME).

Depends on D149081.

Depends on D149079.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Ayal added inline comments.Oct 4 2023, 4:03 PM
llvm/lib/Transforms/Vectorize/VPlan.h
281

How/Is this removal related?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
974

(Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.

975

nit: use LiveInInst or something similar rather than UI?

981
988

Set once before the loop for all live-ins to be truncated.

997

Any order other than depth first would also do, right?

1008

(Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
Or migrate the analysis that builds MinBWs to run on VPlan.

1009

nit: lookup.

1014

Would be good to comment how memory and replicate cases are (not) processed.

1020

Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs members correspond to Integer types, of distinct (smaller) size, whether live-in or not?

1030

This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other?

OTOH, better handle Trunc here as well? Is it handled well below?

1034

// SExt/Zext is redundant - stick with its operand.
?

1041

Place assert earlier?

1043–1044
1055

This means the size of all operands is equal to NewResSizeInBits, can this be?

1059–1060

nit: keep consistent with above.

1068–1070

nit: keep consistent with above.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to InstCombine.

Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 557740.Oct 17 2023, 1:04 PM
fhahn marked 21 inline comments as done.

Address latest comments, apologies for the delay!

llvm/lib/Transforms/Vectorize/VPlan.h
281

The last user of this function has been removed in the patch.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
964–965

Code has been moved to D159202

967–969

code has been moved to D159202

973

Wrapped and added comment, thanks!

975

Renamed, thanks!

976

Updated, thanks!

981

Adjusted, thanks!

982

Turned into assert, thanks!

988

hoisted, thanks!

997

Yes, I think the order doesn't matter here.

1009

Done, thanks!

1014

Added a comment, thanks!

1020

Turned isIntegerTy into assert but retained size check as there entries where the sizes are the same (e.g. for truncs).

1030

Thanks, changed to if. I don't think Trunc is handled explicitly in the latest version.

1034

this check has been moved up and is not needed any longer.

1041

moved up,, thanks!

1043–1044

adjusted, thanks!

1055

There are cases where a Zext narrowed earlier is used as operand here, so the tie is already adjusted.

1059–1060

Adjusted, thanks!

1068–1070

reordered, thanks!

Ayal added a comment.Oct 21 2023, 2:02 PM

Various comments, also trying to reason about how this patch changes tests.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3439

Retain a comment explaining why replicate recipes are not truncated?

3478

Retain this comment regarding dropping wrapping flags?

3493

A Trunc is handled by shrinking its operand.

3518

(If nothing is done to the operands, what is the result extended too?)

llvm/lib/Transforms/Vectorize/VPlan.h
281

Very well!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755

Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited

Very well.

780

Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

Might be a potential follow-up, but we would still potentially updated MinBWs on each recipe replacement?

Sure, like updating any other property of a recipe when replaced.

796

Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:

A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values, continue to feed and use values of the original type.
A pair of such instructions where one feeds the other will be added a zext-trunc pair between them which will later be folded.

B. Instructions that convert between two distinct types, continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?

C. Loads are ignored - excluded from MiinBWs? They could potentially be narrowed to load only the required bits, though its unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)

D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)

973

Suffice to ask if (!NewResSizeInBits)?

974

Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.

977

assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.

993

Can skip phi's, none are included in MinBWs.

994

Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.

1007

Suffice to ask if (!NewResSizeInBits)?

1008

Thoughts about the above?

1016

Should replicate recipes be handled next to handling widen memory recipes above?

1020

nit: ResTy >> OldResTy, ResSizeInBits >> OldResSizeInBits

1023

assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?"); here instead of below?

1029

nit: VPC >> OldExt, Opc >> OldOpc?

1030

Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and has its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?

1033

Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.

1034

?

1038

nit: C >> NewCast?

If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?

1055

Maybe worth a comment.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

Thoughts on the above?
Better truncate to minimal bitwidth asap, as it relies on IR information? Conceptually a scalar transform.
Does "as hints to InstCombine" below still hold?

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
41–42

hmm, we now spot the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, originally both TMP4 and TMP10.

68

Spotted and removed duplicate zext of WIDE_LOAD8.

159

This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b) checking that computing 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.

167

BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?

168

Both insertelement's now use poison.

176

BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302

We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>,
but fail to fold a similar one following the add-2's?

330

We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it - missed MinBW/trunc-zext opportunity?

474–475

Hmm, before we narrowed these two sufflevectors to operate on <16 x i8> and zext-trunc their result, now we let them operate on original <16 x i32> and truncate the result?

487

Many zext-trunc pairs left to collect.

513

Above trunc of TMP2 is redundant along with its zext in the ph.

520

Above trunc of TMP4 is redundant along with its zext in the ph.

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?

fhahn updated this revision to Diff 558116.Nov 16 2023, 2:15 PM
fhahn marked 29 inline comments as done.

Address comments and major simplification after moving cast folding to simplifyRecipes.

Hope all comments should be addressed, hope i didn't miss any.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3439

Retained when skipping VPReplicateRecipe.

3478

Done, thanks!

3518

It stays the same, there's no extend in that case.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755

This has been updated to now use VPTypeAnalysis.

796

The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.

973

This code has now been removed; LiveIns are handled when truncating the other operands of an instruction; otherwise we leave the type info in an inconsistent state.

974

LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.

977

Turned into an assert, thanks!

993

There's an early continue now that skips phis and other unsupported recipes.

994

Nope, looks like this is not needed in the latest version.

1007

Simplified, thanks!

1008

I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.

1016

We still need to count them for verification

fhahn added inline comments.Nov 16 2023, 2:15 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1020

Renamed, thanks!

1023

Done, and also removed continue

1029

This code is now gone, handled by recipe simplification.

1033

Code is gone now

1034

Code now gone.

1038

Code gone now.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
159

still more work to do :)

Arguably the invariant instructions are artificial, in the regular pipeline, no invariant instructions should remain.

167

The latest version avoids truncating the same value twice.

168

I think the use of undef is a leftover that wasn't updated; it should be poison.

176

The latest version avoids truncating the same value twice.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302

folding now happens all in simplifyRecieps, should handle this now

474–475

I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.

487

Should be better cleaned up now

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

trunc/ext pairs should be better cleaned up in the latest version

Ayal added a comment.Nov 22 2023, 1:58 PM

Looks much simpler! Minor last nits.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
962

nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?

965–967

nit

971–974

?

997

But a (more) expensive RPOT order is needed, to handle defs before uses?

1021

Is it possible for MinBWs not to contain Op's live-in IR value in this case?

fhahn updated this revision to Diff 558159.Nov 23 2023, 4:08 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

fhahn added inline comments.Nov 23 2023, 4:10 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
962

Updated, thanks!

965–967

Done thanks! This also limits the scope of TypeInfo to the range where it is valid. after `truncateToMinimalBitwidths, we would need to invalidate the info for the modified recipes otherwise. This can be done in the future.

971–974

Simplified , thanks!

997

The latest version should not need RPO, as the bit width of the results do not change for any user (previously they might due to early cast simplifications). Changed to depth first.

1021

Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying assert.

Ayal added inline comments.Nov 29 2023, 9:23 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
780

Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe when the latter is created, supplementing its underlying inst.

965–967

Very well. Worth commenting that TypeInfo should not be used following truncateToMinimalBitwidths.

968

nit: ProcessedRecipesNum?

968

ProcessedTruncs is used outside ifdef below, move its definition out of ifdef here? Or is it meant to ensure truncated operands are counted once by ProcessedRecipes for debugging only? If an operand is truncated multiple times, all its truncations must be to the same size, because "MinBW should specify a consistent minimal bit width for all users(, and for all operands)"?

Worth explaining why processed truncs are recorded.

971

Should PH be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert them at the end of PH...

974

Shrunk operands are placed before R, but its extension is placed after - and calls for this make_early_inc_range, right?

990

Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.

995

Does OldResSizeInBits equal to the size of OldResTy, for the non-cast Widen or Select R?

1012

Ins? Perhaps ProcessedTrunc?

1013

Handle the simple if !ins.second /* Op already processed */ case first, potentially early-continuing?

Clearer to check if ProcessedTruncs.lookup(Op) or if ProcessedTruncs.contains(Op) and if so use ProcessedTruncs[Op], otherwise insert it?

1016

nit: place simpler if !isLiveIn case first?

1018–1021

nit

1026

Note that truncations of live-ins could also be inserted before R, thereby leaving the treatment of live-ins to debugging only, and leaving their LICM and commoning to a subsequent VPlan cleanup pass, along with trunc-zext foldings.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

WDYT on the above: should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 558186.Nov 29 2023, 11:36 AM
fhahn marked 15 inline comments as done.

Rebase and address latest comments, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
965–967

Sunk further into truncateToMinimualBitwidths

968

Changed to NumProcessedRecipes

968

It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Moved out of ifdef

971

Good point, there should be nothing to shrink in PH for now, as the analysis is for the loop body only, adjusted!

974

Yep

990

Do you mean updating the comment here or just a general note? We need to include the recipes in the count, otherwise the verification later will fail

995

Yes, I forgot to remove this use of IR getType. Updated to use TypeInfo.inferScalarType(ResultVPV) and then getScalarSizeInBits of the returned type.

1012

Updated, thanks!

1013

Early continue would mean duplicating the code to update the operands, I left things for now as is, including using insert. insert means we only need to lookup the insert-pos once, vs 2 lookups with separate lookup and then `[]. WDYT?

1016

Done, thanks!

1026

Yep, for now it is simpler and results in a smaller test diff to do it directly there as it is not only LICM but also very simple CSE

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

Sounds good, updated, thanks!

Ayal added inline comments.Nov 29 2023, 2:23 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
911

Should be the same Ctx passed in as parameter?

963–965

nit: redundant move of empty line?

968

Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Very well, may deserve a comment.

990

I mean we count casts as if they are processed, expecting they will be later, w/o checking that they actually do.

995

Ah, ok, wondered if using the size of the type of UI directly would be simpler?

1013

OK, WDYT of the something as follows:

        auto [ProcessedIter, DidNotExist] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = DidNotExist ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!DidNotExist)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          Shrunk->insertBefore(&R);
        } else {
          PH->appendRecipe(Shrunk);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
83
Ayal added inline comments.Nov 30 2023, 12:05 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1013

Maybe IterIsEmpty would be a better name, to avoid double negation, as in:

        auto [ProcessedIter, IterIsEmpty] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = IterIsEmpty ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!IterIsEmpty)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
fhahn updated this revision to Diff 558195.Nov 30 2023, 3:12 AM
fhahn marked 6 inline comments as done.

Addressed latest comments, thanks!

fhahn added inline comments.Nov 30 2023, 5:14 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
911

Yes, fixed!

963–965

changed back, thanks!

968

Added a comment to ProcessedTruncs definition.

990

They don't need handling explicitly, as redundant casts will be removed later. Expanded the comment slightly to

Also skip casts which do not need to be handled explicitly here, as redundant casts will be removed during recipe simplification.
995

It might be slightly simpler, but would mean this may lead to a crash further down the line, once we support recipes without underlying values/instructions (and we forget to update this line) and/or if some other transform adjusted the type. Left as is for now

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
83

Fixed, thanks!

Ayal accepted this revision.Nov 30 2023, 5:22 AM

This looks good to me, thanks for accommodating!
Adding a minor redundancy spotted plus some test related notes.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949

redundant - hoist above the early-continue.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167

Duplicated TMP0 and TMP1 still here?

176

Still seeing duplicate TMP2 and TMP3?

194–196

Trunc & insertelement LICM'd from vec.epilog.vector.body to vec.epilog.ph.

249–250

ditto.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
30–31

Fold zext-trunc pair, several such cases follow.

302

The one following the add-2's is also folded now.

330

Other pair also folded now.

474–475

Worth testing with a subsequent instCombine, to ensure pessimization is avoided?

487

Indeed looks like it!

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

Indeed!

This revision is now accepted and ready to land.Nov 30 2023, 5:22 AM
fhahn marked 2 inline comments as done.Dec 2 2023, 8:15 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949

Fixed in the committed version, thanks!

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167

They were due to redundant casts being added for Live-in values, fixed by checking in VPWidenCastRecipe::execute for now, with a FIXME to address this with explicit unrolling.