This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1
ClosedPublic

Authored by spatel on Nov 28 2022, 3:33 PM.

Details

Summary

This is the main patch for converting a truncated scalar that is inserted into a vector to bitcast+shuffle. We could go either way on patterns like this, but this direction will allow collapsing a pair of these sequences on the motivating example from issue #17113.

The patch is split into 3 parts to make it easier to see the progression of tests diffs. We allow inserting/shuffling into a different size vector for flexibility, so there are several test variations. The length-changing is handled by shortening/padding the shuffle mask with undef elements.

In part 1, handle the basic pattern:

inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask

Proof for the endian-dependency behaving as expected:
https://alive2.llvm.org/ce/z/BsA7yC

The TODO items for handling shifts and insert into an arbitrary base vector value are implemented as follow-ups.

Diff Detail

Event Timeline

spatel created this revision.Nov 28 2022, 3:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 3:33 PM
spatel requested review of this revision.Nov 28 2022, 3:33 PM
RKSimon accepted this revision.Nov 30 2022, 3:40 AM

LGTM with one (very) minor

llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1518

Add a head comment describing the fold.

This revision is now accepted and ready to land.Nov 30 2022, 3:40 AM

Why is a trunc, and an one-use at that, a requirement?
It seems like looking through trunc should either be a follow-up for this fold, or for shuffle combining.

llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1525

Why is a trunc, and an one-use at that, a requirement?
Please at least add a FIXME.

Looking into follow-ups, it really seems like there should be part 0 without trunc handling.
If not, could you please explain why we must see a trunc?

Looking into follow-ups, it really seems like there should be part 0 without trunc handling.
If not, could you please explain why we must see a trunc?

@lebedev.ri Do you mean this pattern?

define <8 x i16> @src(i64 %a0) {
  %v = insertelement <2 x i64> poison, i64 %a0, i32 0
  %r = bitcast <2 x i64> %v to <8 x i16>
  ret <8 x i16> %r
}
define <8 x i16> @tgt(i64 %a0) {
  %v = bitcast i64 %a0 to <4 x i16>
  %r = shufflevector <4 x i16> %v, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x i16> %r
}

@lebedev.ri Do you mean this pattern?

define <8 x i16> @src(i64 %a0) {
  %v = insertelement <2 x i64> poison, i64 %a0, i32 0
  %r = bitcast <2 x i64> %v to <8 x i16>
  ret <8 x i16> %r
}
define <8 x i16> @tgt(i64 %a0) {
  %v = bitcast i64 %a0 to <4 x i16>
  %r = shufflevector <4 x i16> %v, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x i16> %r
}

I guess?

@lebedev.ri Do you mean this pattern?

define <8 x i16> @src(i64 %a0) {
  %v = insertelement <2 x i64> poison, i64 %a0, i32 0
  %r = bitcast <2 x i64> %v to <8 x i16>
  ret <8 x i16> %r
}
define <8 x i16> @tgt(i64 %a0) {
  %v = bitcast i64 %a0 to <4 x i16>
  %r = shufflevector <4 x i16> %v, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x i16> %r
}

I guess?

That's another related potentially missing canonicalization, but the pattern doesn't end at the insertelement. Without the trunc, there's just a single insertelement inst, so we can't do anything without a trunc?

llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1525

The one-use is there because the fold is minimally creating 2 instructions: bitcast + shuffle.
It gets more complicated when we match the shift in the next patch part.
In the final patch, we can reduce instructions in some cases, but others will gain an instruction. If we don't have the one-use limit, that could mean we'd end up +2 instructions. I don't want to risk a regression from that possibility.

spatel updated this revision to Diff 478962.Nov 30 2022, 8:05 AM

Patch updated:

  1. Added documentation comment for the general transform.
  2. Added TODO for one-use restriction.
lebedev.ri accepted this revision.Nov 30 2022, 8:57 AM

Patch updated:

  1. Added documentation comment for the general transform.
  2. Added TODO for one-use restriction.

Actually, i guess i was thinking about canonicalizing all insertelements @ constant idx into shuffles
(+the shuffle to convert the scalar into an appropriately-sized vector),
but i guess that is almost borderline bad?
So LG i suppose.

Patch updated:

  1. Added documentation comment for the general transform.
  2. Added TODO for one-use restriction.

Actually, i guess i was thinking about canonicalizing all insertelements @ constant idx into shuffles
(+the shuffle to convert the scalar into an appropriately-sized vector),
but i guess that is almost borderline bad?
So LG i suppose.

Ah, I see. Yes, we could turn all insert-to-constant-index into shuffles after bitcasting a scalar to a <1 x N> type. The insertelt is the direct representation though, so that's a tough sell. TBH even with a trunc, it's a borderline canonicalization; it could go either way. I think the backend is fine either way too, but I'll watch for fallout.

This revision was landed with ongoing or failed builds.Nov 30 2022, 10:23 AM
This revision was automatically updated to reflect the committed changes.

FYI, I tracked an SVE miscompile down to this change https://github.com/llvm/llvm-project/issues/59357. However, so far it looks like this change just made a pre-existing problem more obvious.

We'll follow up with you if that's not the case.