Legalize t3: v2i16 = bitcast i32 with (v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0)
Fix https://github.com/llvm/llvm-project/issues/61638
Differential D147678
[LegalizeTypes][AArch64] Use scalar_to_vector to eliminate bitcast Allen on Apr 5 2023, 8:32 PM. Authored by
Details Legalize t3: v2i16 = bitcast i32 with (v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0) Fix https://github.com/llvm/llvm-project/issues/61638 NOTE: Don't touch getPreferredVectorAction like X86 as this will touch
too many test cases.
Diff Detail Event TimelineComment Actions My gut feeling is this isn't a valid transformation. Looking at the output from: define <2 x i16> @bitcast_from_int(i32 %word) { %ret = bitcast i32 %word to <2 x i16> ret <2 x i16> %ret } define i16 @extract_lo(<2 x i16> %vec) { %ret = extractelement <2 x i16> %vec, i64 0 ret i16 %ret } define i16 @extract_hi(<2 x i16> %vec) { %ret = extractelement <2 x i16> %vec, i64 1 ret i16 %ret } bitcast_from_int: // @bitcast_from_int sub sp, sp, #16 .cfi_def_cfa_offset 16 add x8, sp, #12 str w0, [sp, #12] ld1 { v0.h }[0], [x8] orr x8, x8, #0x2 ld1 { v0.h }[2], [x8] add sp, sp, #16 ret extract_lo: // @extract_lo fmov w0, s0 ret extract_hi: // @extract_hi mov w0, v0.s[1] ret Shows <2 x i16> is represented as an unpacked <2 x i32> vector with the valid bits being the bottom 16-bits within each 32bit element. This means the ISD::ANY_EXTEND within the DAG your matching is critical and this gets dropped when replacing with ISD::SCALAR_TO_VECTOR. Ultimately it looks like ISD::BITCAST for smaller than legal vectors types is not a nop unless the src/dst is a load/store in which case the extension can be removed. The poor code is the result of type legalisation so perhaps you'd be better off implementing custom legalisation for ISD::BITCAST involving 32-bit vector types. I recall we did something similar for SVE to handle its unpacked vector types (see ReplaceBITCASTResults). Comment Actions Thanks for your advise, I'll try that.
@@ -22680,6 +22680,11 @@ AArch64TargetLowering::getPreferredVectorAction(MVT VT) const { VT == MVT::v1f32) return TypeWidenVector; + // We prefer to widen v4i16 to v2i32 instead of to promote. + // TODO: Add more types. + if (VT == MVT::v2i16) + return TypeWidenVector; + return TargetLoweringBase::getPreferredVectorAction(VT); }
Comment Actions Thanks for doing this. That's great. The v2i8 version doesn't look too efficient with it moving back and forth, but they all look better than going via the stack. LGTM, thanks. Comment Actions I add a comment on https://github.com/llvm/llvm-project/issues/61638#issuecomment-1495316614** , these issue can produce efficient assemble (only a fmov s0, w0** as expected), but there is too many tests affected. Comment Actions I've an extra nit but otherwise looks good.
Comment Actions The <2 x i16> across a call boundary needs to be treated like a <2 x i32> (as that is a legal type). To do differently would be an ABI break, which would be a lot of work. There are certainly places where we could be better for small vector types, but I don't think a fmov s0, w0 would be valid on it's own. My comment was more about the bitcast i16 %word to <2 x i8> case which is moving back and forth between gpr and vectors more than it needs to. That can be improved in other patches though. By the way it might be worth making sure the tests cover BE too. I believe they are OK, but it would be a good idea to make sure it is tested. |
SCALAR_TO_VECTOR from i32 to v2i16 implicitly discards the high 16 bits of the input. (The other lane is UNDEF. UNDEF might happen to be what you want in simple cases, but it won't be in general.)
Something like (v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0) should have the semantics you want, and lower to something reasonably efficient.