This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.
ClosedPublic

Authored by fhahn on May 4 2023, 2:07 PM.

Details

Summary

This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 2 benefits:

  1. the VPlan-based version is simpler; we don't need to implement special codegen for each supported instruction type like the IR based one.
  2. Removes a dependency on the cost-model after VPlan execution and
  3. Removes a use of getVPValue that uses underlying values after VPlan execution (See removed FIXME).

Depends on D149081.

Depends on D149079.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Ayal added inline comments.Oct 4 2023, 4:03 PM
llvm/lib/Transforms/Vectorize/VPlan.h
280

How/Is this removal related?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
884

(Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.

885

nit: use LiveInInst or something similar rather than UI?

891
898

Set once before the loop for all live-ins to be truncated.

907

Any order other than depth first would also do, right?

918

(Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
Or migrate the analysis that builds MinBWs to run on VPlan.

919

nit: lookup.

924

Would be good to comment how memory and replicate cases are (not) processed.

930

Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs members correspond to Integer types, of distinct (smaller) size, whether live-in or not?

940

This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other?

OTOH, better handle Trunc here as well? Is it handled well below?

944

// SExt/Zext is redundant - stick with its operand.
?

951

Place assert earlier?

953–954
965

This means the size of all operands is equal to NewResSizeInBits, can this be?

969–970

nit: keep consistent with above.

978–980

nit: keep consistent with above.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to InstCombine.

Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 557740.Oct 17 2023, 1:04 PM
fhahn marked 21 inline comments as done.

Address latest comments, apologies for the delay!

llvm/lib/Transforms/Vectorize/VPlan.h
280

The last user of this function has been removed in the patch.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
874–875

Code has been moved to D159202

877–879

code has been moved to D159202

883

Wrapped and added comment, thanks!

885

Renamed, thanks!

886

Updated, thanks!

891

Adjusted, thanks!

892

Turned into assert, thanks!

898

hoisted, thanks!

907

Yes, I think the order doesn't matter here.

919

Done, thanks!

924

Added a comment, thanks!

930

Turned isIntegerTy into assert but retained size check as there entries where the sizes are the same (e.g. for truncs).

940

Thanks, changed to if. I don't think Trunc is handled explicitly in the latest version.

944

this check has been moved up and is not needed any longer.

951

moved up,, thanks!

953–954

adjusted, thanks!

965

There are cases where a Zext narrowed earlier is used as operand here, so the tie is already adjusted.

969–970

Adjusted, thanks!

978–980

reordered, thanks!

Ayal added a comment.Oct 21 2023, 2:02 PM

Various comments, also trying to reason about how this patch changes tests.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3396

Retain a comment explaining why replicate recipes are not truncated?

3435

Retain this comment regarding dropping wrapping flags?

3450

A Trunc is handled by shrinking its operand.

3475

(If nothing is done to the operands, what is the result extended too?)

llvm/lib/Transforms/Vectorize/VPlan.h
280

Very well!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
773

Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited

Very well.

798

Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

Might be a potential follow-up, but we would still potentially updated MinBWs on each recipe replacement?

Sure, like updating any other property of a recipe when replaced.

814

Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:

A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values, continue to feed and use values of the original type.
A pair of such instructions where one feeds the other will be added a zext-trunc pair between them which will later be folded.

B. Instructions that convert between two distinct types, continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?

C. Loads are ignored - excluded from MiinBWs? They could potentially be narrowed to load only the required bits, though its unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)

D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)

883

Suffice to ask if (!NewResSizeInBits)?

884

Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.

887

assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.

903

Can skip phi's, none are included in MinBWs.

904

Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.

917

Suffice to ask if (!NewResSizeInBits)?

918

Thoughts about the above?

926

Should replicate recipes be handled next to handling widen memory recipes above?

930

nit: ResTy >> OldResTy, ResSizeInBits >> OldResSizeInBits

933

assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?"); here instead of below?

939

nit: VPC >> OldExt, Opc >> OldOpc?

940

Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and has its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?

943

Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.

944

?

948

nit: C >> NewCast?

If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?

965

Maybe worth a comment.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Thoughts on the above?
Better truncate to minimal bitwidth asap, as it relies on IR information? Conceptually a scalar transform.
Does "as hints to InstCombine" below still hold?

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
41

hmm, we now spot the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, originally both TMP4 and TMP10.

74

Spotted and removed duplicate zext of WIDE_LOAD8.

159

This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b) checking that computing 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.

167

BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?

168

Both insertelement's now use poison.

174

BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
308

We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>,
but fail to fold a similar one following the add-2's?

338

We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it - missed MinBW/trunc-zext opportunity?

484

Hmm, before we narrowed these two sufflevectors to operate on <16 x i8> and zext-trunc their result, now we let them operate on original <16 x i32> and truncate the result?

498

Many zext-trunc pairs left to collect.

513

Above trunc of TMP2 is redundant along with its zext in the ph.

520

Above trunc of TMP4 is redundant along with its zext in the ph.

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?

fhahn updated this revision to Diff 558116.Nov 16 2023, 2:15 PM
fhahn marked 29 inline comments as done.

Address comments and major simplification after moving cast folding to simplifyRecipes.

Hope all comments should be addressed, hope i didn't miss any.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3396

Retained when skipping VPReplicateRecipe.

3435

Done, thanks!

3475

It stays the same, there's no extend in that case.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
773

This has been updated to now use VPTypeAnalysis.

814

The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.

883

This code has now been removed; LiveIns are handled when truncating the other operands of an instruction; otherwise we leave the type info in an inconsistent state.

884

LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.

887

Turned into an assert, thanks!

903

There's an early continue now that skips phis and other unsupported recipes.

904

Nope, looks like this is not needed in the latest version.

917

Simplified, thanks!

918

I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.

926

We still need to count them for verification

fhahn added inline comments.Nov 16 2023, 2:15 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
930

Renamed, thanks!

933

Done, and also removed continue

939

This code is now gone, handled by recipe simplification.

943

Code is gone now

944

Code now gone.

948

Code gone now.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
159

still more work to do :)

Arguably the invariant instructions are artificial, in the regular pipeline, no invariant instructions should remain.

167

The latest version avoids truncating the same value twice.

168

I think the use of undef is a leftover that wasn't updated; it should be poison.

174

The latest version avoids truncating the same value twice.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
308

folding now happens all in simplifyRecieps, should handle this now

484

I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.

498

Should be better cleaned up now

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

trunc/ext pairs should be better cleaned up in the latest version

Ayal added a comment.Nov 22 2023, 1:58 PM

Looks much simpler! Minor last nits.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
872

nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?

881–884

?

907

But a (more) expensive RPOT order is needed, to handle defs before uses?

931

Is it possible for MinBWs not to contain Op's live-in IR value in this case?

989–991

nit

fhahn updated this revision to Diff 558159.Nov 23 2023, 4:08 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

fhahn added inline comments.Nov 23 2023, 4:10 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
872

Updated, thanks!

881–884

Simplified , thanks!

907

The latest version should not need RPO, as the bit width of the results do not change for any user (previously they might due to early cast simplifications). Changed to depth first.

931

Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying assert.

989–991

Done thanks! This also limits the scope of TypeInfo to the range where it is valid. after `truncateToMinimalBitwidths, we would need to invalidate the info for the modified recipes otherwise. This can be done in the future.

Ayal added inline comments.Nov 29 2023, 9:23 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
798

Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe when the latter is created, supplementing its underlying inst.

878

nit: ProcessedRecipesNum?

878

ProcessedTruncs is used outside ifdef below, move its definition out of ifdef here? Or is it meant to ensure truncated operands are counted once by ProcessedRecipes for debugging only? If an operand is truncated multiple times, all its truncations must be to the same size, because "MinBW should specify a consistent minimal bit width for all users(, and for all operands)"?

Worth explaining why processed truncs are recorded.

881

Should PH be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert them at the end of PH...

884

Shrunk operands are placed before R, but its extension is placed after - and calls for this make_early_inc_range, right?

900

Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.

905

Does OldResSizeInBits equal to the size of OldResTy, for the non-cast Widen or Select R?

922

Ins? Perhaps ProcessedTrunc?

923

Handle the simple if !ins.second /* Op already processed */ case first, potentially early-continuing?

Clearer to check if ProcessedTruncs.lookup(Op) or if ProcessedTruncs.contains(Op) and if so use ProcessedTruncs[Op], otherwise insert it?

926

nit: place simpler if !isLiveIn case first?

928–931

nit

936

Note that truncations of live-ins could also be inserted before R, thereby leaving the treatment of live-ins to debugging only, and leaving their LICM and commoning to a subsequent VPlan cleanup pass, along with trunc-zext foldings.

989–991

Very well. Worth commenting that TypeInfo should not be used following truncateToMinimalBitwidths.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

WDYT on the above: should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 558186.Nov 29 2023, 11:36 AM
fhahn marked 15 inline comments as done.

Rebase and address latest comments, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
878

Changed to NumProcessedRecipes

878

It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Moved out of ifdef

881

Good point, there should be nothing to shrink in PH for now, as the analysis is for the loop body only, adjusted!

884

Yep

900

Do you mean updating the comment here or just a general note? We need to include the recipes in the count, otherwise the verification later will fail

905

Yes, I forgot to remove this use of IR getType. Updated to use TypeInfo.inferScalarType(ResultVPV) and then getScalarSizeInBits of the returned type.

922

Updated, thanks!

923

Early continue would mean duplicating the code to update the operands, I left things for now as is, including using insert. insert means we only need to lookup the insert-pos once, vs 2 lookups with separate lookup and then `[]. WDYT?

926

Done, thanks!

936

Yep, for now it is simpler and results in a smaller test diff to do it directly there as it is not only LICM but also very simple CSE

989–991

Sunk further into truncateToMinimualBitwidths

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Sounds good, updated, thanks!

Ayal added inline comments.Nov 29 2023, 2:23 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
878

Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Very well, may deserve a comment.

900

I mean we count casts as if they are processed, expecting they will be later, w/o checking that they actually do.

905

Ah, ok, wondered if using the size of the type of UI directly would be simpler?

909

Should be the same Ctx passed in as parameter?

923

OK, WDYT of the something as follows:

        auto [ProcessedIter, DidNotExist] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = DidNotExist ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!DidNotExist)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          Shrunk->insertBefore(&R);
        } else {
          PH->appendRecipe(Shrunk);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
987–994

nit: redundant move of empty line?

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
84
Ayal added inline comments.Nov 30 2023, 12:05 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
923

Maybe IterIsEmpty would be a better name, to avoid double negation, as in:

        auto [ProcessedIter, IterIsEmpty] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = IterIsEmpty ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!IterIsEmpty)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
fhahn updated this revision to Diff 558195.Nov 30 2023, 3:12 AM
fhahn marked 6 inline comments as done.

Addressed latest comments, thanks!

fhahn added inline comments.Nov 30 2023, 5:14 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
878

Added a comment to ProcessedTruncs definition.

900

They don't need handling explicitly, as redundant casts will be removed later. Expanded the comment slightly to

Also skip casts which do not need to be handled explicitly here, as redundant casts will be removed during recipe simplification.
905

It might be slightly simpler, but would mean this may lead to a crash further down the line, once we support recipes without underlying values/instructions (and we forget to update this line) and/or if some other transform adjusted the type. Left as is for now

909

Yes, fixed!

987–994

changed back, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
84

Fixed, thanks!

Ayal accepted this revision.Nov 30 2023, 5:22 AM

This looks good to me, thanks for accommodating!
Adding a minor redundancy spotted plus some test related notes.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
947

redundant - hoist above the early-continue.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167

Duplicated TMP0 and TMP1 still here?

174

Still seeing duplicate TMP2 and TMP3?

196–197

Trunc & insertelement LICM'd from vec.epilog.vector.body to vec.epilog.ph.

249–250

ditto.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
30–34

Fold zext-trunc pair, several such cases follow.

308

The one following the add-2's is also folded now.

338

Other pair also folded now.

484

Worth testing with a subsequent instCombine, to ensure pessimization is avoided?

498

Indeed looks like it!

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

Indeed!

This revision is now accepted and ready to land.Nov 30 2023, 5:22 AM
fhahn marked 2 inline comments as done.Dec 2 2023, 8:15 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
947

Fixed in the committed version, thanks!

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167

They were due to redundant casts being added for Live-in values, fixed by checking in VPWidenCastRecipe::execute for now, with a FIXME to address this with explicit unrolling.