This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.
ClosedPublic

Authored by fhahn on May 4 2023, 2:07 PM.

Details

Summary

This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 2 benefits:

  1. the VPlan-based version is simpler; we don't need to implement special codegen for each supported instruction type like the IR based one.
  2. Removes a dependency on the cost-model after VPlan execution and
  3. Removes a use of getVPValue that uses underlying values after VPlan execution (See removed FIXME).

Depends on D149081.

Depends on D149079.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Ayal added inline comments.Oct 4 2023, 4:03 PM
llvm/lib/Transforms/Vectorize/VPlan.h
281

How/Is this removal related?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
980

(Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.

981

nit: use LiveInInst or something similar rather than UI?

987
994

Set once before the loop for all live-ins to be truncated.

1003

Any order other than depth first would also do, right?

1014

(Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
Or migrate the analysis that builds MinBWs to run on VPlan.

1015

nit: lookup.

1020

Would be good to comment how memory and replicate cases are (not) processed.

1026

Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs members correspond to Integer types, of distinct (smaller) size, whether live-in or not?

1036

This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other?

OTOH, better handle Trunc here as well? Is it handled well below?

1040

// SExt/Zext is redundant - stick with its operand.
?

1047

Place assert earlier?

1049–1050
1061

This means the size of all operands is equal to NewResSizeInBits, can this be?

1065–1066

nit: keep consistent with above.

1074–1076

nit: keep consistent with above.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to InstCombine.

Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 557740.Oct 17 2023, 1:04 PM
fhahn marked 21 inline comments as done.

Address latest comments, apologies for the delay!

llvm/lib/Transforms/Vectorize/VPlan.h
281

The last user of this function has been removed in the patch.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
970–971

Code has been moved to D159202

973–975

code has been moved to D159202

979

Wrapped and added comment, thanks!

981

Renamed, thanks!

982

Updated, thanks!

987

Adjusted, thanks!

988

Turned into assert, thanks!

994

hoisted, thanks!

1003

Yes, I think the order doesn't matter here.

1015

Done, thanks!

1020

Added a comment, thanks!

1026

Turned isIntegerTy into assert but retained size check as there entries where the sizes are the same (e.g. for truncs).

1036

Thanks, changed to if. I don't think Trunc is handled explicitly in the latest version.

1040

this check has been moved up and is not needed any longer.

1047

moved up,, thanks!

1049–1050

adjusted, thanks!

1061

There are cases where a Zext narrowed earlier is used as operand here, so the tie is already adjusted.

1065–1066

Adjusted, thanks!

1074–1076

reordered, thanks!

Ayal added a comment.Oct 21 2023, 2:02 PM

Various comments, also trying to reason about how this patch changes tests.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3439

Retain a comment explaining why replicate recipes are not truncated?

3478

Retain this comment regarding dropping wrapping flags?

3493

A Trunc is handled by shrinking its operand.

3518

(If nothing is done to the operands, what is the result extended too?)

llvm/lib/Transforms/Vectorize/VPlan.h
281

Very well!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755

Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited

Very well.

780

Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

Might be a potential follow-up, but we would still potentially updated MinBWs on each recipe replacement?

Sure, like updating any other property of a recipe when replaced.

796

Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:

A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values, continue to feed and use values of the original type.
A pair of such instructions where one feeds the other will be added a zext-trunc pair between them which will later be folded.

B. Instructions that convert between two distinct types, continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?

C. Loads are ignored - excluded from MiinBWs? They could potentially be narrowed to load only the required bits, though its unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)

D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)

979

Suffice to ask if (!NewResSizeInBits)?

980

Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.

983

assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.

999

Can skip phi's, none are included in MinBWs.

1000

Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.

1013

Suffice to ask if (!NewResSizeInBits)?

1014

Thoughts about the above?

1022

Should replicate recipes be handled next to handling widen memory recipes above?

1026

nit: ResTy >> OldResTy, ResSizeInBits >> OldResSizeInBits

1029

assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?"); here instead of below?

1035

nit: VPC >> OldExt, Opc >> OldOpc?

1036

Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and has its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?

1039

Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.

1040

?

1044

nit: C >> NewCast?

If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?

1061

Maybe worth a comment.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

Thoughts on the above?
Better truncate to minimal bitwidth asap, as it relies on IR information? Conceptually a scalar transform.
Does "as hints to InstCombine" below still hold?

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
41–42

hmm, we now spot the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, originally both TMP4 and TMP10.

68

Spotted and removed duplicate zext of WIDE_LOAD8.

159

This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b) checking that computing 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.

164

BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?

165

Both insertelement's now use poison.

172

BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302

We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>,
but fail to fold a similar one following the add-2's?

330

We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it - missed MinBW/trunc-zext opportunity?

474–475

Hmm, before we narrowed these two sufflevectors to operate on <16 x i8> and zext-trunc their result, now we let them operate on original <16 x i32> and truncate the result?

487

Many zext-trunc pairs left to collect.

513

Above trunc of TMP2 is redundant along with its zext in the ph.

520

Above trunc of TMP4 is redundant along with its zext in the ph.

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?

fhahn updated this revision to Diff 558116.Nov 16 2023, 2:15 PM
fhahn marked 29 inline comments as done.

Address comments and major simplification after moving cast folding to simplifyRecipes.

Hope all comments should be addressed, hope i didn't miss any.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3439

Retained when skipping VPReplicateRecipe.

3478

Done, thanks!

3518

It stays the same, there's no extend in that case.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755

This has been updated to now use VPTypeAnalysis.

796

The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.

979

This code has now been removed; LiveIns are handled when truncating the other operands of an instruction; otherwise we leave the type info in an inconsistent state.

980

LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.

983

Turned into an assert, thanks!

999

There's an early continue now that skips phis and other unsupported recipes.

1000

Nope, looks like this is not needed in the latest version.

1013

Simplified, thanks!

1014

I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.

1022

We still need to count them for verification

fhahn added inline comments.Nov 16 2023, 2:15 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1026

Renamed, thanks!

1029

Done, and also removed continue

1035

This code is now gone, handled by recipe simplification.

1039

Code is gone now

1040

Code now gone.

1044

Code gone now.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
159

still more work to do :)

Arguably the invariant instructions are artificial, in the regular pipeline, no invariant instructions should remain.

164

The latest version avoids truncating the same value twice.

165

I think the use of undef is a leftover that wasn't updated; it should be poison.

172

The latest version avoids truncating the same value twice.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302

folding now happens all in simplifyRecieps, should handle this now

474–475

I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.

487

Should be better cleaned up now

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

trunc/ext pairs should be better cleaned up in the latest version

Ayal added a comment.Nov 22 2023, 1:58 PM

Looks much simpler! Minor last nits.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
968

nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?

972–974

nit

977–980

?

1003

But a (more) expensive RPOT order is needed, to handle defs before uses?

1027

Is it possible for MinBWs not to contain Op's live-in IR value in this case?

fhahn updated this revision to Diff 558159.Nov 23 2023, 4:08 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

fhahn added inline comments.Nov 23 2023, 4:10 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
968

Updated, thanks!

972–974

Done thanks! This also limits the scope of TypeInfo to the range where it is valid. after `truncateToMinimalBitwidths, we would need to invalidate the info for the modified recipes otherwise. This can be done in the future.

977–980

Simplified , thanks!

1003

The latest version should not need RPO, as the bit width of the results do not change for any user (previously they might due to early cast simplifications). Changed to depth first.

1027

Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying assert.

Ayal added inline comments.Nov 29 2023, 9:23 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
780

Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe when the latter is created, supplementing its underlying inst.

972–974

Very well. Worth commenting that TypeInfo should not be used following truncateToMinimalBitwidths.

974

nit: ProcessedRecipesNum?

974

ProcessedTruncs is used outside ifdef below, move its definition out of ifdef here? Or is it meant to ensure truncated operands are counted once by ProcessedRecipes for debugging only? If an operand is truncated multiple times, all its truncations must be to the same size, because "MinBW should specify a consistent minimal bit width for all users(, and for all operands)"?

Worth explaining why processed truncs are recorded.

977

Should PH be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert them at the end of PH...

980

Shrunk operands are placed before R, but its extension is placed after - and calls for this make_early_inc_range, right?

996

Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.

1001

Does OldResSizeInBits equal to the size of OldResTy, for the non-cast Widen or Select R?

1018

Ins? Perhaps ProcessedTrunc?

1019

Handle the simple if !ins.second /* Op already processed */ case first, potentially early-continuing?

Clearer to check if ProcessedTruncs.lookup(Op) or if ProcessedTruncs.contains(Op) and if so use ProcessedTruncs[Op], otherwise insert it?

1022

nit: place simpler if !isLiveIn case first?

1024–1027

nit

1032

Note that truncations of live-ins could also be inserted before R, thereby leaving the treatment of live-ins to debugging only, and leaving their LICM and commoning to a subsequent VPlan cleanup pass, along with trunc-zext foldings.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

WDYT on the above: should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 558186.Nov 29 2023, 11:36 AM
fhahn marked 15 inline comments as done.

Rebase and address latest comments, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
972–974

Sunk further into truncateToMinimualBitwidths

974

Changed to NumProcessedRecipes

974

It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Moved out of ifdef

977

Good point, there should be nothing to shrink in PH for now, as the analysis is for the loop body only, adjusted!

980

Yep

996

Do you mean updating the comment here or just a general note? We need to include the recipes in the count, otherwise the verification later will fail

1001

Yes, I forgot to remove this use of IR getType. Updated to use TypeInfo.inferScalarType(ResultVPV) and then getScalarSizeInBits of the returned type.

1018

Updated, thanks!

1019

Early continue would mean duplicating the code to update the operands, I left things for now as is, including using insert. insert means we only need to lookup the insert-pos once, vs 2 lookups with separate lookup and then `[]. WDYT?

1022

Done, thanks!

1032

Yep, for now it is simpler and results in a smaller test diff to do it directly there as it is not only LICM but also very simple CSE

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71

Sounds good, updated, thanks!

Ayal added inline comments.Nov 29 2023, 2:23 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
911

Should be the same Ctx passed in as parameter?

969–972

nit: redundant move of empty line?

974

Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Very well, may deserve a comment.

996

I mean we count casts as if they are processed, expecting they will be later, w/o checking that they actually do.

1001

Ah, ok, wondered if using the size of the type of UI directly would be simpler?

1019

OK, WDYT of the something as follows:

        auto [ProcessedIter, DidNotExist] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = DidNotExist ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!DidNotExist)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          Shrunk->insertBefore(&R);
        } else {
          PH->appendRecipe(Shrunk);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
83
Ayal added inline comments.Nov 30 2023, 12:05 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1019

Maybe IterIsEmpty would be a better name, to avoid double negation, as in:

        auto [ProcessedIter, IterIsEmpty] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = IterIsEmpty ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!IterIsEmpty)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
fhahn updated this revision to Diff 558195.Nov 30 2023, 3:12 AM
fhahn marked 6 inline comments as done.

Addressed latest comments, thanks!

fhahn added inline comments.Nov 30 2023, 5:14 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
911

Yes, fixed!

969–972

changed back, thanks!

974

Added a comment to ProcessedTruncs definition.

996

They don't need handling explicitly, as redundant casts will be removed later. Expanded the comment slightly to

Also skip casts which do not need to be handled explicitly here, as redundant casts will be removed during recipe simplification.
1001

It might be slightly simpler, but would mean this may lead to a crash further down the line, once we support recipes without underlying values/instructions (and we forget to update this line) and/or if some other transform adjusted the type. Left as is for now

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
83

Fixed, thanks!

Ayal accepted this revision.Nov 30 2023, 5:22 AM

This looks good to me, thanks for accommodating!
Adding a minor redundancy spotted plus some test related notes.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949

redundant - hoist above the early-continue.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
164

Duplicated TMP0 and TMP1 still here?

172

Still seeing duplicate TMP2 and TMP3?

190–192

Trunc & insertelement LICM'd from vec.epilog.vector.body to vec.epilog.ph.

245–246

ditto.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
30–31

Fold zext-trunc pair, several such cases follow.

302

The one following the add-2's is also folded now.

330

Other pair also folded now.

474–475

Worth testing with a subsequent instCombine, to ensure pessimization is avoided?

487

Indeed looks like it!

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

Indeed!

This revision is now accepted and ready to land.Nov 30 2023, 5:22 AM
fhahn marked 2 inline comments as done.Dec 2 2023, 8:15 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949

Fixed in the committed version, thanks!

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
164

They were due to redundant casts being added for Live-in values, fixed by checking in VPWidenCastRecipe::execute for now, with a FIXME to address this with explicit unrolling.