This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 3
Abandoned · Public

Authored by spatel on Nov 28 2022, 3:34 PM.

Details

Summary

This enhances the folds from part 1 and part 2 to allow insertion into an arbitrary vector. This means we form a select-shuffle (no cross-lane movement is allowed).

Example proofs with endian diffs:
https://alive2.llvm.org/ce/z/Mqfgt8
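For illustration, on a little-endian target the enhanced fold rewrites an insert into an arbitrary base vector roughly like this (a hypothetical example in the spirit of the proofs above, not taken from the patch's tests; the shuffle mask depends on endianness and the insert index):

  %t = trunc i32 %x to i16
  %r = insertelement <2 x i16> %v, i16 %t, i64 0

-->

  %vec.x = bitcast i32 %x to <2 x i16>
  %r = shufflevector <2 x i16> %vec.x, <2 x i16> %v, <2 x i32> <i32 0, i32 3>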

We can create a select-shuffle for all targets because targets are expected to be able to lower select-shuffles reasonably. This transform could be generalized further if it was implemented in a target-specific pass (with a cost/legality model).

The transform can result in more instructions than we started with (in the case where the vector is wider or narrower than the scalar), but I think that's a reasonable trade-off to make the canonicalization more consistent.
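For example (hypothetical, little-endian; the exact IR the patch emits may differ), when the scalar is narrower than the destination vector, the rewrite needs a length-changing shuffle in addition to the select-shuffle, so two instructions become three:

  %t = trunc i64 %x to i16
  %r = insertelement <8 x i16> %v, i16 %t, i64 0

-->

  %vec.x = bitcast i64 %x to <4 x i16>
  %wide = shufflevector <4 x i16> %vec.x, <4 x i16> poison, <8 x i32> <i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %r = shufflevector <8 x i16> %wide, <8 x i16> %v, <8 x i32> <i32 0, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>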

This allows removing a pair of instructions from the motivating example in issue #17113, but it is still not the ideal IR/codegen.

Diff Detail

Event Timeline

spatel created this revision. Nov 28 2022, 3:34 PM
Herald added a project: Restricted Project. · View Herald Transcript · Nov 28 2022, 3:34 PM
spatel requested review of this revision. Nov 28 2022, 3:34 PM
spatel updated this revision to Diff 479017. Nov 30 2022, 10:32 AM

Rebased on top of part 1 ( a4c466766db7 ).

We can create a select-shuffle for all targets because targets are expected to be able to lower select-shuffles reasonably.

A perhaps minor point, and I don't think I have objections to the patch, but I have always considered select-shuffles to be somewhat of an x86-ism. I believe x86 has a special set of instructions for handling them, where the mask is stored as part of the instruction. As far as I understand there usually isn't a truly generic way to lower them efficiently (I'd be interested if there was!), and the worst case needs to resort to either lane moves or a constant mask + and/or. If it's only a single lane, like in all the tests here, then it would just be an extract+insert, which is simpler.

Generally I would consider shuffles to be complex operations that often have a fairly high cost. Insert, trunc, and bitcast are all usually simple.
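For reference, a single-lane select-shuffle and an extract+insert pair are interchangeable in IR; a hypothetical example (not from the patch's tests):

  %r = shufflevector <4 x i16> %a, <4 x i16> %b, <4 x i32> <i32 4, i32 1, i32 2, i32 3>

; is equivalent to:

  %e = extractelement <4 x i16> %b, i64 0
  %r = insertelement <4 x i16> %a, i16 %e, i64 0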

spatel added a comment. Dec 1 2022, 5:58 AM

We can create a select-shuffle for all targets because targets are expected to be able to lower select-shuffles reasonably.

A perhaps minor point, and I don't think I have objections to the patch, but I have always considered select-shuffles to be somewhat of an x86-ism. I believe x86 has a special set of instructions for handling them, where the mask is stored as part of the instruction. As far as I understand there usually isn't a truly generic way to lower them efficiently (I'd be interested if there was!), and the worst case needs to resort to either lane moves or a constant mask + and/or. If it's only a single lane, like in all the tests here, then it would just be an extract+insert, which is simpler.

Generally I would consider shuffles to be complex operations that often have a fairly high cost. Insert, trunc, and bitcast are all usually simple.

That's a good point. x86 does have limited specialized select-shuffles (blends in x86 lingo) depending on which level of SSE/AVX is implemented. Most other SIMD targets have a vector bitwise select (bsl on AArch64 IIRC).
But yes, in the cases here "select-shuffle" is actually an over-specification/misnomer because we're only inserting a single element (what started as the scalar value) into the base vector.

I tried pushing a couple of tests through AArch64 codegen, and see diffs like this:

lsr	x8, x0, #48
mov	v0.h[3], w8
->
fmov	d1, x0
mov	v0.h[3], v1.h[3]

Does that seem neutral? If not, we could try harder to fold back to an insertelt in codegen or convert to a target-dependent transform in VectorCombine instead of a generic fold here.

I tried pushing a couple of tests through AArch64 codegen, and see diffs like this:

lsr	x8, x0, #48
mov	v0.h[3], w8
->
fmov	d1, x0
mov	v0.h[3], v1.h[3]

Does that seem neutral? If not, we could try harder to fold back to an insertelt in codegen or convert to a target-dependent transform in VectorCombine instead of a generic fold here.

That would come down to the difference between shift (cheap) and lane mov (should be cheapish too). I don't think there's a lot in it.

https://godbolt.org/z/haP87afo9 has some other cases from the tests here. A bitcast can be awkward if it secretly includes an extend, which is more difficult than it should be for MVE, where most vectors are assumed to be 128-bit. We've had problems in the past with instcombine transforming shuffles where it isn't helpful, and I think we still have some. Like I said, I don't want to block anything, but this doesn't seem very general, and it might be better in the backend or as a cost-modelled fold. (I'm not sure we have sensible costs for bitcasts though. They don't often come up from the vectorizers.)

So I think the question is - do we treat this as a canonicalization issue (in which case this code can remain in InstCombine) or do we make it a cost-driven fold and move it to VectorCombine?

Also, which option will make it easier to address the remaining missing GVN handling?

spatel added a comment. Dec 7 2022, 9:01 AM

Also, which option will make it easier to address the remaining missing GVN handling?

Having it in InstCombine will definitely be easier than VectorCombine with respect to phase ordering/dependencies on other passes.
In the motivating example, we don't get the folding opportunity until late because it requires inlining to see the pattern. That means we wouldn't do this until near the very end of optimization (and so no subsequent GVN).

I wasn't aware of the SDAG shuffle problems that @dmgreen noted for Thumb/MVE, so I was looking at that a bit closer. Even without this patch, we've already uncovered some awful codegen with the earlier folds like:

define <8 x i16> @low_index_longer_length_poison_basevec_i64(i64 %x) {
  %t = trunc i64 %x to i16
  %r = insertelement <8 x i16> poison, i16 %t, i64 0
  ret <8 x i16> %r
}

$ llc -o - -mtriple=thumbv8.1-m.main -mattr=+mve.fp -float-abi=hard
	vmov.16	q0[0], r0

-->

define <8 x i16> @low_index_longer_length_poison_basevec_i64(i64 %x) {
  %vec.x = bitcast i64 %x to <4 x i16>
  %r = shufflevector <4 x i16> %vec.x, <4 x i16> poison, <8 x i32> <i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x i16> %r
}

	sub	sp, #8
	strd	r0, r1, [sp]
	mov	r0, sp
	vldrh.u32	q0, [r0]
	vmov	r2, r3, d0
	vmov	r0, r1, d1
	vmov.16	q0[0], r2
	vmov.16	q0[1], r3
	vmov.16	q0[2], r0
	vmov.16	q0[3], r1
	add	sp, #8

The reason for that is what seems like a bug in SelectionDAGBuilder. It creates these nodes for the bitcast + shuffle sequence:

Creating new node: t5: i64 = build_pair t2, t4
Creating new node: t6: v4i16 = bitcast t5
Creating new node: t7: v4i16 = undef
Creating new node: t8: v8i16 = concat_vectors t6, undef:v4i16

But that's discarding information - the upper 48 bits of the build_pair are zapped to undef by the shuffle in IR, but that fact is lost in the translation to concat_vectors.
I'll try to fix that.
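To illustrate the lost information (an IR-level sketch, not the actual SDAG nodes): the original shuffle only demands lane 0 of the bitcast, while the concat_vectors translation behaves like a shuffle that keeps all four lanes live:

  ; what the IR shuffle demands - only lane 0 of %vec.x
  %r = shufflevector <4 x i16> %vec.x, <4 x i16> poison, <8 x i32> <i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>

  ; what the concat_vectors translation models - all four lanes of %vec.x
  %r = shufflevector <4 x i16> %vec.x, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>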

The good news is that potential regressions like above have been in main for almost a week now, and I haven't seen any bug reports/complaints yet.
So maybe this kind of IR pattern doesn't happen much in real code where it would be noticed.

The good news is that potential regressions like above have been in main for almost a week now, and I haven't seen any bug reports/complaints yet.
So maybe this kind of IR pattern doesn't happen much in real code where it would be noticed.

I had regressions from those patches reported from two places. I was looking into fixing those in the backend using a combine of splat(bitcast(buildvector)) or splat(bitcast(scalar_to_vector)) to splat (it was having trouble getting the buildvector legal types correct). Transforming trunc+insert to bitcast+shuffle feels like a bit of a strange canonicalization to me. We can probably fix it up in the backend (and the only regressions I've seen have both been unsimplified splats), but trunc+insert seems simpler.
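For context, the splat(bitcast(scalar_to_vector)) shape mentioned here looks roughly like this at the IR level (a hypothetical little-endian sketch; the actual combine operates on SDAG nodes):

  %v = insertelement <2 x i32> poison, i32 %x, i64 0
  %b = bitcast <2 x i32> %v to <4 x i16>
  %s = shufflevector <4 x i16> %b, <4 x i16> poison, <4 x i32> zeroinitializer

-->

  %t = trunc i32 %x to i16
  %i = insertelement <4 x i16> poison, i16 %t, i64 0
  %s = shufflevector <4 x i16> %i, <4 x i16> poison, <4 x i32> zeroinitializer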

The good news is that potential regressions like above have been in main for almost a week now, and I haven't seen any bug reports/complaints yet.
So maybe this kind of IR pattern doesn't happen much in real code where it would be noticed.

I had regressions from those patches reported from two places. I was looking into fixing those in the backend using a combine of splat(bitcast(buildvector)) or splat(bitcast(scalar_to_vector)) to splat (it was having trouble getting the buildvector legal types correct). Transforming trunc+insert to bitcast+shuffle feels like a bit of a strange canonicalization to me. We can probably fix it up in the backend (and the only regressions I've seen have both been unsimplified splats), but trunc+insert seems simpler.

Thanks for the update - so there has been some fallout.

I agree that trunc+insert is simpler in the basic case. The challenge is that we haven't found any other way to solve the motivating bug. Leaving it to the backend is too late, so we need to convert a chain of inserts into a shuffle in IR to get the optimal result.
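As a hypothetical illustration of such a chain (little-endian; not the exact IR from the issue), pieces of one scalar are inserted lane by lane, and the ideal result is a single bitcast (or a shuffle of it when only some lanes come from the scalar):

  %lo = trunc i32 %x to i16
  %sh = lshr i32 %x, 16
  %hi = trunc i32 %sh to i16
  %v0 = insertelement <2 x i16> poison, i16 %lo, i64 0
  %v1 = insertelement <2 x i16> %v0, i16 %hi, i64 1

-->

  %v1 = bitcast i32 %x to <2 x i16>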

This line of patches got us at least partially there. If we convert back to insert at SDAG builder/combine time, that seems like it could mitigate the problems. If that's not feasible, then I think we should revert the preceding patches in this set.

Thanks for the update - so there has been some fallout.

I agree that trunc+insert is simpler in the basic case. The challenge is that we haven't found any other way to solve the motivating bug. Leaving it to the backend is too late, so we need to convert a chain of inserts into a shuffle in IR to get the optimal result.

This line of patches got us at least partially there. If we convert back to insert at SDAG builder/combine time, that seems like it could mitigate the problems. If that's not feasible, then I think we should revert the preceding patches in this set.

Yeah I don't know of a better way to fix it, I'm afraid. There is a quick fix for the regressions I ran into in D139611.

spatel added a comment. Dec 8 2022, 4:59 AM

Thanks for the update - so there has been some fallout.

I agree that trunc+insert is simpler in the basic case. The challenge is that we haven't found any other way to solve the motivating bug. Leaving it to the backend is too late, so we need to convert a chain of inserts into a shuffle in IR to get the optimal result.

This line of patches got us at least partially there. If we convert back to insert at SDAG builder/combine time, that seems like it could mitigate the problems. If that's not feasible, then I think we should revert the preceding patches in this set.

Yeah I don't know of a better way to fix it, I'm afraid. There is a quick fix for the regressions I ran into in D139611.

After thinking this over again, we should be able to add a more specific peephole that finds the common source op by looking through shifts and casts:
https://alive2.llvm.org/ce/z/_4iTEu
It's bigger than the typical pattern match, but it's not that bad. It could start off very narrow and be generalized in a few steps.
That avoids creating a shuffle, so it sidesteps the backend problems noted here. The question of whether we should canonicalize in the opposite direction from this patch is still open.
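A hypothetical instance of the kind of pattern such a peephole could match (little-endian; not necessarily the exact shape in the proof link): the inserted scalar and the base vector trace back to the same source value, so the insert is redundant:

  %vec = bitcast i32 %x to <2 x i16>
  %sh = lshr i32 %x, 16
  %t = trunc i32 %sh to i16
  %r = insertelement <2 x i16> %vec, i16 %t, i64 1

-->

  %r = bitcast i32 %x to <2 x i16>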

After thinking this over again, we should be able to add a more specific peephole that finds the common source op by looking through shifts and casts:
https://alive2.llvm.org/ce/z/_4iTEu
It's bigger than the typical pattern match, but it's not that bad. It could start off very narrow and be generalized in a few steps.
That avoids creating a shuffle, so it sidesteps the backend problems noted here. The question of whether we should canonicalize in the opposite direction from this patch is still open.

I don't want to block any work; don't consider me to have any real objection. Some of this feels a little X86-shaped for instcombine, but the motivating example in #17113 seems like it would apply to any architecture, and I haven't seen any cases that can't be fixed up in the backend.

spatel abandoned this revision. Dec 8 2022, 12:47 PM

After thinking this over again, we should be able to add a more specific peephole that finds the common source op by looking through shifts and casts:
https://alive2.llvm.org/ce/z/_4iTEu
It's bigger than the typical pattern match, but it's not that bad. It could start off very narrow and be generalized in a few steps.
That avoids creating a shuffle, so it sidesteps the backend problems noted here. The question of whether we should canonicalize in the opposite direction from this patch is still open.

I don't want to block any work; don't consider me to have any real objection. Some of this feels a little X86-shaped for instcombine, but the motivating example in #17113 seems like it would apply to any architecture, and I haven't seen any cases that can't be fixed up in the backend.

I agree that this is on the borderline - we're not ready to canonicalize to shuffles more generally. I've reverted the earlier patches and put up an alternative:
D139668