This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.
ClosedPublic

Authored by fhahn on May 4 2023, 2:07 PM.

Details

Summary

This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 2 benefits:

  1. the VPlan-based version is simpler; we don't need to implement special codegen for each supported instruction type like the IR based one.
  2. Removes a dependency on the cost-model after VPlan execution and
  3. Removes a use of getVPValue that uses underlying values after VPlan execution (See removed FIXME).

Depends on D149081.

Depends on D149079.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Ayal added inline comments.Oct 4 2023, 4:03 PM
llvm/lib/Transforms/Vectorize/VPlan.h
281

How/Is this removal related?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
886

(Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.

887

nit: use LiveInInst or something similar rather than UI?

893
900

Set once before the loop for all live-ins to be truncated.

909

Any order other than depth first would also do, right?

920

(Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
Or migrate the analysis that builds MinBWs to run on VPlan.

921

nit: lookup.

926

Would be good to comment how memory and replicate cases are (not) processed.

932

Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs members correspond to Integer types, of distinct (smaller) size, whether live-in or not?

942

This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other?

OTOH, better handle Trunc here as well? Is it handled well below?

946

// SExt/Zext is redundant - stick with its operand.
?

953

Place assert earlier?

955–956
967

This means the size of all operands is equal to NewResSizeInBits, can this be?

971–972

nit: keep consistent with above.

980–982

nit: keep consistent with above.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to InstCombine.

Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 557740.Oct 17 2023, 1:04 PM
fhahn marked 21 inline comments as done.

Address latest comments, apologies for the delay!

llvm/lib/Transforms/Vectorize/VPlan.h
281

The last user of this function has been removed in the patch.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876–877

Code has been moved to D159202

879–881

code has been moved to D159202

885

Wrapped and added comment, thanks!

887

Renamed, thanks!

888

Updated, thanks!

893

Adjusted, thanks!

894

Turned into assert, thanks!

900

hoisted, thanks!

909

Yes, I think the order doesn't matter here.

921

Done, thanks!

926

Added a comment, thanks!

932

Turned isIntegerTy into assert but retained size check as there entries where the sizes are the same (e.g. for truncs).

942

Thanks, changed to if. I don't think Trunc is handled explicitly in the latest version.

946

this check has been moved up and is not needed any longer.

953

moved up,, thanks!

955–956

adjusted, thanks!

967

There are cases where a Zext narrowed earlier is used as operand here, so the tie is already adjusted.

971–972

Adjusted, thanks!

980–982

reordered, thanks!

Ayal added a comment.Oct 21 2023, 2:02 PM

Various comments, also trying to reason about how this patch changes tests.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3443

Retain a comment explaining why replicate recipes are not truncated?

3482

Retain this comment regarding dropping wrapping flags?

3497

A Trunc is handled by shrinking its operand.

3522

(If nothing is done to the operands, what is the result extended too?)

llvm/lib/Transforms/Vectorize/VPlan.h
281

Very well!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755

Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited

Very well.

780

Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

Might be a potential follow-up, but we would still potentially updated MinBWs on each recipe replacement?

Sure, like updating any other property of a recipe when replaced.

796

Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:

A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values, continue to feed and use values of the original type.
A pair of such instructions where one feeds the other will be added a zext-trunc pair between them which will later be folded.

B. Instructions that convert between two distinct types, continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?

C. Loads are ignored - excluded from MiinBWs? They could potentially be narrowed to load only the required bits, though its unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)

D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)

885

Suffice to ask if (!NewResSizeInBits)?

886

Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.

889

assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.

905

Can skip phi's, none are included in MinBWs.

906

Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.

919

Suffice to ask if (!NewResSizeInBits)?

920

Thoughts about the above?

928

Should replicate recipes be handled next to handling widen memory recipes above?

932

nit: ResTy >> OldResTy, ResSizeInBits >> OldResSizeInBits

935

assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?"); here instead of below?

941

nit: VPC >> OldExt, Opc >> OldOpc?

942

Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and has its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?

945

Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.

946

?

950

nit: C >> NewCast?

If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?

967

Maybe worth a comment.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Thoughts on the above?
Better truncate to minimal bitwidth asap, as it relies on IR information? Conceptually a scalar transform.
Does "as hints to InstCombine" below still hold?

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
41–42

hmm, we now spot the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, originally both TMP4 and TMP10.

68

Spotted and removed duplicate zext of WIDE_LOAD8.

159

This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b) checking that computing 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.

167

BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?

168

Both insertelement's now use poison.

176

BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302

We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>,
but fail to fold a similar one following the add-2's?

330

We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it - missed MinBW/trunc-zext opportunity?

474–475

Hmm, before we narrowed these two sufflevectors to operate on <16 x i8> and zext-trunc their result, now we let them operate on original <16 x i32> and truncate the result?

487

Many zext-trunc pairs left to collect.

513

Above trunc of TMP2 is redundant along with its zext in the ph.

520

Above trunc of TMP4 is redundant along with its zext in the ph.

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?

fhahn updated this revision to Diff 558116.Nov 16 2023, 2:15 PM
fhahn marked 29 inline comments as done.

Address comments and major simplification after moving cast folding to simplifyRecipes.

Hope all comments should be addressed, hope i didn't miss any.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3443

Retained when skipping VPReplicateRecipe.

3482

Done, thanks!

3522

It stays the same, there's no extend in that case.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755

This has been updated to now use VPTypeAnalysis.

796

The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.

885

This code has now been removed; LiveIns are handled when truncating the other operands of an instruction; otherwise we leave the type info in an inconsistent state.

886

LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.

889

Turned into an assert, thanks!

905

There's an early continue now that skips phis and other unsupported recipes.

906

Nope, looks like this is not needed in the latest version.

919

Simplified, thanks!

920

I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.

928

We still need to count them for verification

fhahn added inline comments.Nov 16 2023, 2:15 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
932

Renamed, thanks!

935

Done, and also removed continue

941

This code is now gone, handled by recipe simplification.

945

Code is gone now

946

Code now gone.

950

Code gone now.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
159

still more work to do :)

Arguably the invariant instructions are artificial, in the regular pipeline, no invariant instructions should remain.

167

The latest version avoids truncating the same value twice.

168

I think the use of undef is a leftover that wasn't updated; it should be poison.

176

The latest version avoids truncating the same value twice.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302

folding now happens all in simplifyRecieps, should handle this now

474–475

I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.

487

Should be better cleaned up now

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

trunc/ext pairs should be better cleaned up in the latest version

Ayal added a comment.Nov 22 2023, 1:58 PM

Looks much simpler! Minor last nits.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
874

nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?

883–886

?

909

But a (more) expensive RPOT order is needed, to handle defs before uses?

933

Is it possible for MinBWs not to contain Op's live-in IR value in this case?

967–969

nit

fhahn updated this revision to Diff 558159.Nov 23 2023, 4:08 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

fhahn added inline comments.Nov 23 2023, 4:10 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
874

Updated, thanks!

883–886

Simplified , thanks!

909

The latest version should not need RPO, as the bit width of the results do not change for any user (previously they might due to early cast simplifications). Changed to depth first.

933

Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying assert.

967–969

Done thanks! This also limits the scope of TypeInfo to the range where it is valid. after `truncateToMinimalBitwidths, we would need to invalidate the info for the modified recipes otherwise. This can be done in the future.

Ayal added inline comments.Nov 29 2023, 9:23 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
780

Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe when the latter is created, supplementing its underlying inst.

880

nit: ProcessedRecipesNum?

880

ProcessedTruncs is used outside ifdef below, move its definition out of ifdef here? Or is it meant to ensure truncated operands are counted once by ProcessedRecipes for debugging only? If an operand is truncated multiple times, all its truncations must be to the same size, because "MinBW should specify a consistent minimal bit width for all users(, and for all operands)"?

Worth explaining why processed truncs are recorded.

883

Should PH be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert them at the end of PH...

886

Shrunk operands are placed before R, but its extension is placed after - and calls for this make_early_inc_range, right?

902

Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.

907

Does OldResSizeInBits equal to the size of OldResTy, for the non-cast Widen or Select R?

924

Ins? Perhaps ProcessedTrunc?

925

Handle the simple if !ins.second /* Op already processed */ case first, potentially early-continuing?

Clearer to check if ProcessedTruncs.lookup(Op) or if ProcessedTruncs.contains(Op) and if so use ProcessedTruncs[Op], otherwise insert it?

928

nit: place simpler if !isLiveIn case first?

930–933

nit

938

Note that truncations of live-ins could also be inserted before R, thereby leaving the treatment of live-ins to debugging only, and leaving their LICM and commoning to a subsequent VPlan cleanup pass, along with trunc-zext foldings.

967–969

Very well. Worth commenting that TypeInfo should not be used following truncateToMinimalBitwidths.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

WDYT on the above: should the caller of optimize() precede its call with a direct call to trunctateToMinimalBitwidth(), rather than pass MinBWs to optimize()?

fhahn updated this revision to Diff 558186.Nov 29 2023, 11:36 AM
fhahn marked 15 inline comments as done.

Rebase and address latest comments, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
880

Changed to NumProcessedRecipes

880

It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Moved out of ifdef

883

Good point, there should be nothing to shrink in PH for now, as the analysis is for the loop body only, adjusted!

886

Yep

902

Do you mean updating the comment here or just a general note? We need to include the recipes in the count, otherwise the verification later will fail

907

Yes, I forgot to remove this use of IR getType. Updated to use TypeInfo.inferScalarType(ResultVPV) and then getScalarSizeInBits of the returned type.

924

Updated, thanks!

925

Early continue would mean duplicating the code to update the operands, I left things for now as is, including using insert. insert means we only need to lookup the insert-pos once, vs 2 lookups with separate lookup and then `[]. WDYT?

928

Done, thanks!

938

Yep, for now it is simpler and results in a smaller test diff to do it directly there as it is not only LICM but also very simple CSE

967–969

Sunk further into truncateToMinimualBitwidths

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Sounds good, updated, thanks!

Ayal added inline comments.Nov 29 2023, 2:23 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
880

Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Very well, may deserve a comment.

902

I mean we count casts as if they are processed, expecting they will be later, w/o checking that they actually do.

907

Ah, ok, wondered if using the size of the type of UI directly would be simpler?

911

Should be the same Ctx passed in as parameter?

925

OK, WDYT of the something as follows:

        auto [ProcessedIter, DidNotExist] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = DidNotExist ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!DidNotExist)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          Shrunk->insertBefore(&R);
        } else {
          PH->appendRecipe(Shrunk);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
965–971

nit: redundant move of empty line?

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
84
Ayal added inline comments.Nov 30 2023, 12:05 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
925

Maybe IterIsEmpty would be a better name, to avoid double negation, as in:

        auto [ProcessedIter, IterIsEmpty] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = IterIsEmpty ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!IterIsEmpty)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
fhahn updated this revision to Diff 558195.Nov 30 2023, 3:12 AM
fhahn marked 6 inline comments as done.

Addressed latest comments, thanks!

fhahn added inline comments.Nov 30 2023, 5:14 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
880

Added a comment to ProcessedTruncs definition.

902

They don't need handling explicitly, as redundant casts will be removed later. Expanded the comment slightly to

Also skip casts which do not need to be handled explicitly here, as redundant casts will be removed during recipe simplification.
907

It might be slightly simpler, but would mean this may lead to a crash further down the line, once we support recipes without underlying values/instructions (and we forget to update this line) and/or if some other transform adjusted the type. Left as is for now

911

Yes, fixed!

965–971

changed back, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
84

Fixed, thanks!

Ayal accepted this revision.Nov 30 2023, 5:22 AM

This looks good to me, thanks for accommodating!
Adding a minor redundancy spotted plus some test related notes.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949

redundant - hoist above the early-continue.

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167

Duplicated TMP0 and TMP1 still here?

176

Still seeing duplicate TMP2 and TMP3?

194–196

Trunc & insertelement LICM'd from vec.epilog.vector.body to vec.epilog.ph.

249–250

ditto.

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
30–31

Fold zext-trunc pair, several such cases follow.

302

The one following the add-2's is also folded now.

330

Other pair also folded now.

474–475

Worth testing with a subsequent instCombine, to ensure pessimization is avoided?

487

Indeed looks like it!

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334

Indeed!

This revision is now accepted and ready to land.Nov 30 2023, 5:22 AM
fhahn marked 2 inline comments as done.Dec 2 2023, 8:15 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949

Fixed in the committed version, thanks!

llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167

They were due to redundant casts being added for Live-in values, fixed by checking in VPWidenCastRecipe::execute for now, with a FIXME to address this with explicit unrolling.