Download Raw Diff

Details

Reviewers

dmgreen
david-arm
sdesmalen
nikic
spatel
paulwalker-arm
efriedma

Commits

rG43b2df03e842: [LegalizeTypes][AArch64] Use scalar_to_vector to eliminate bitcast

Summary

Legalize t3: v2i16 = bitcast i32
with            (v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0)

Fix https://github.com/llvm/llvm-project/issues/61638

NOTE: Don't touch getPreferredVectorAction like X86 as this will touch too many test cases.

Diff Detail

Event Timeline

Allen created this revision.Apr 5 2023, 8:32 PM

Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2023, 8:32 PM

Herald added subscribers: StephenFan, pengfei, hiraditya, kristof.beyls. · View Herald Transcript

Allen requested review of this revision.Apr 5 2023, 8:32 PM

Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2023, 8:32 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B223913: Diff 511261.Apr 5 2023, 8:40 PM

My gut feeling is this isn't a valid transformation. Looking at the output from:

define <2 x i16> @bitcast_from_int(i32 %word) {
  %ret = bitcast i32 %word to <2 x i16>
  ret <2 x i16> %ret
}

define i16 @extract_lo(<2 x i16> %vec) {
  %ret = extractelement <2 x i16> %vec, i64 0
  ret i16 %ret
}

define i16 @extract_hi(<2 x i16> %vec) {
  %ret = extractelement <2 x i16> %vec, i64 1
  ret i16 %ret
}

bitcast_from_int:                       // @bitcast_from_int
	sub	sp, sp, #16
	.cfi_def_cfa_offset 16
	add	x8, sp, #12
	str	w0, [sp, #12]
	ld1	{ v0.h }[0], [x8]
	orr	x8, x8, #0x2
	ld1	{ v0.h }[2], [x8]
	add	sp, sp, #16
	ret

extract_lo:                             // @extract_lo
	fmov	w0, s0
	ret

extract_hi:                             // @extract_hi
	mov	w0, v0.s[1]
	ret

Shows <2 x i16> is represented as an unpacked <2 x i32> vector with the valid bits being the bottom 16-bits within each 32bit element. This means the ISD::ANY_EXTEND within the DAG your matching is critical and this gets dropped when replacing with ISD::SCALAR_TO_VECTOR. Ultimately it looks like ISD::BITCAST for smaller than legal vectors types is not a nop unless the src/dst is a load/store in which case the extension can be removed. The poor code is the result of type legalisation so perhaps you'd be better off implementing custom legalisation for ISD::BITCAST involving 32-bit vector types. I recall we did something similar for SVE to handle its unpacked vector types (see ReplaceBITCASTResults).

Thanks for your advise, I'll try that.

In current case, I also check N->getValueType(0).getScalarType() == N00.getNode()->getValueType(0), so we only address the case A --> VecB --> VecA, as the scalar type of VecA is same to A, so we can ignore the difference between VecB and VecA on the laid out ?

In addition, do you mean we “ prefer to widen v4i16 to v2i32” similar to x86 as following ? (I test it , it works fine, but there is too many case need to update, so I want to make sure I have understood what you mean rightly for off implementing custom legalisation for ISD::BITCAST)

@@ -22680,6 +22680,11 @@ AArch64TargetLowering::getPreferredVectorAction(MVT VT) const {
       VT == MVT::v1f32)
     return TypeWidenVector;
 
+  // We prefer to widen v4i16 to v2i32 instead of to promote.
+  // TODO: Add more types.
+  if (VT == MVT::v2i16)
+    return TypeWidenVector;
+
   return TargetLoweringBase::getPreferredVectorAction(VT);
 }

Matt added a subscriber: Matt.Apr 11 2023, 11:45 AM

Define custom legalisation for ISD::BITCAST as comment

Allen updated this revision to Diff 530152.Jun 9 2023, 9:42 PM

Harbormaster completed remote builds in B237916: Diff 530152.Jun 9 2023, 10:29 PM

efriedma added a subscriber: efriedma.Jun 9 2023, 11:06 PM

efriedma added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22895	SCALAR_TO_VECTOR from i32 to v2i16 implicitly discards the high 16 bits of the input. (The other lane is UNDEF. UNDEF might happen to be what you want in simple cases, but it won't be in general.) Something like `(v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0)` should have the semantics you want, and lower to something reasonably efficient.

(v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0)

Harbormaster completed remote builds in B237927: Diff 530172.Jun 10 2023, 1:17 AM

Allen edited the summary of this revision. (Show Details)Jun 10 2023, 2:47 AM

Allen added a reviewer: efriedma.

Allen added inline comments.Jun 10 2023, 2:52 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22895	Thanks, apply your comment. The extra instruction ushll v0.4s, v0.4h, #0 for (v4i32 any_extend v4i16) will clear the other lane. t2: i32,ch = CopyFromReg t0, Register:i32 %0 t8: v2i32 = scalar_to_vector t2 t9: v4i16 = bitcast t8 t20: v4i32 = any_extend t9 t21: v2i32 = extract_subvector t20, Constant:i64<0>

Is it worth adding other types like i16->v2i8 and i32->v4i8 too?

Add new biscast i16->v2i8 and i32->v4i8

Harbormaster completed remote builds in B238371: Diff 530746.Jun 12 2023, 8:38 PM

Thanks for doing this. That's great.

The v2i8 version doesn't look too efficient with it moving back and forth, but they all look better than going via the stack. LGTM, thanks.

This revision is now accepted and ready to land.Jun 13 2023, 3:07 AM

In D147678#4416884, @dmgreen wrote:

Thanks for doing this. That's great.

The v2i8 version doesn't look too efficient with it moving back and forth, but they all look better than going via the stack. LGTM, thanks.

I add a comment on https://github.com/llvm/llvm-project/issues/61638#issuecomment-1495316614** , these issue can produce efficient assemble (only a fmov s0, w0** as expected), but there is too many tests affected.
so I'm not sure if this is a direction for further optimization?

I've an extra nit but otherwise looks good.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22894–22897	Is this necessary given it is a requirement of `ISD::BITCAST`? By which I mean you shouldn't need to manually verify the DAG is valid here? I'd expect `getNode()` to catch such errors.
22902	Please can you use `getVectorIdxConstant` here rather than hardwiring `MVT::i64`?

Fix comments
a) delete assert
b) use getVectorIdxConstant

Allen marked 3 inline comments as done.Jun 13 2023, 6:14 AM

Allen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22894–22897	deleted assert, thanks
22902	Apply your comment, thanks

In D147678#4416899, @Allen wrote:

I add a comment on https://github.com/llvm/llvm-project/issues/61638#issuecomment-1495316614** , these issue can produce efficient assemble (only a fmov s0, w0** as expected), but there is too many tests affected.
so I'm not sure if this is a direction for further optimization?

The <2 x i16> across a call boundary needs to be treated like a <2 x i32> (as that is a legal type). To do differently would be an ABI break, which would be a lot of work. There are certainly places where we could be better for small vector types, but I don't think a fmov s0, w0 would be valid on it's own. My comment was more about the bitcast i16 %word to <2 x i8> case which is moving back and forth between gpr and vectors more than it needs to. That can be improved in other patches though.

By the way it might be worth making sure the tests cover BE too. I believe they are OK, but it would be a good idea to make sure it is tested.