This is an archive of the discontinued LLVM Phabricator instance.

[LegalizeTypes][AArch64] Use scalar_to_vector to eliminate bitcast
ClosedPublic

Authored by Allen on Apr 5 2023, 8:32 PM.

Details

Summary
Legalize t3: v2i16 = bitcast i32
with            (v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0)

Fix https://github.com/llvm/llvm-project/issues/61638

NOTE: Don't touch getPreferredVectorAction like X86 as this will touch too many test cases.

Diff Detail

Event Timeline

Allen created this revision.Apr 5 2023, 8:32 PM
Allen requested review of this revision.Apr 5 2023, 8:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2023, 8:32 PM

My gut feeling is this isn't a valid transformation. Looking at the output from:

define <2 x i16> @bitcast_from_int(i32 %word) {
  %ret = bitcast i32 %word to <2 x i16>
  ret <2 x i16> %ret
}

define i16 @extract_lo(<2 x i16> %vec) {
  %ret = extractelement <2 x i16> %vec, i64 0
  ret i16 %ret
}

define i16 @extract_hi(<2 x i16> %vec) {
  %ret = extractelement <2 x i16> %vec, i64 1
  ret i16 %ret
}
bitcast_from_int:                       // @bitcast_from_int
	sub	sp, sp, #16
	.cfi_def_cfa_offset 16
	add	x8, sp, #12
	str	w0, [sp, #12]
	ld1	{ v0.h }[0], [x8]
	orr	x8, x8, #0x2
	ld1	{ v0.h }[2], [x8]
	add	sp, sp, #16
	ret

extract_lo:                             // @extract_lo
	fmov	w0, s0
	ret

extract_hi:                             // @extract_hi
	mov	w0, v0.s[1]
	ret

Shows <2 x i16> is represented as an unpacked <2 x i32> vector with the valid bits being the bottom 16-bits within each 32bit element. This means the ISD::ANY_EXTEND within the DAG your matching is critical and this gets dropped when replacing with ISD::SCALAR_TO_VECTOR. Ultimately it looks like ISD::BITCAST for smaller than legal vectors types is not a nop unless the src/dst is a load/store in which case the extension can be removed. The poor code is the result of type legalisation so perhaps you'd be better off implementing custom legalisation for ISD::BITCAST involving 32-bit vector types. I recall we did something similar for SVE to handle its unpacked vector types (see ReplaceBITCASTResults).

Allen added a comment.EditedApr 6 2023, 8:42 PM

Thanks for your advise, I'll try that.

  • In current case, I also check N->getValueType(0).getScalarType() == N00.getNode()->getValueType(0), so we only address the case A --> VecB --> VecA, as the scalar type of VecA is same to A, so we can ignore the difference between VecB and VecA on the laid out ?
  • In addition, do you mean we “ prefer to widen v4i16 to v2i32” similar to x86 as following ? (I test it , it works fine, but there is too many case need to update, so I want to make sure I have understood what you mean rightly for off implementing custom legalisation for ISD::BITCAST)
@@ -22680,6 +22680,11 @@ AArch64TargetLowering::getPreferredVectorAction(MVT VT) const {
       VT == MVT::v1f32)
     return TypeWidenVector;
 
+  // We prefer to widen v4i16 to v2i32 instead of to promote.
+  // TODO: Add more types.
+  if (VT == MVT::v2i16)
+    return TypeWidenVector;
+
   return TargetLoweringBase::getPreferredVectorAction(VT);
 }
Matt added a subscriber: Matt.Apr 11 2023, 11:45 AM
Allen updated this revision to Diff 530145.Jun 9 2023, 9:11 PM
Allen retitled this revision from [DAGCombiner][AArch64] Use scalar_to_vector to eliminate bitcast to [LegalizeTypes][AArch64] Use scalar_to_vector to eliminate bitcast.
Allen edited the summary of this revision. (Show Details)

Define custom legalisation for ISD::BITCAST as comment

Allen updated this revision to Diff 530152.Jun 9 2023, 9:42 PM
efriedma added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22895

SCALAR_TO_VECTOR from i32 to v2i16 implicitly discards the high 16 bits of the input. (The other lane is UNDEF. UNDEF might happen to be what you want in simple cases, but it won't be in general.)

Something like (v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0) should have the semantics you want, and lower to something reasonably efficient.

Allen updated this revision to Diff 530172.Jun 10 2023, 12:13 AM

(v2i16 extract_subvector (v4i16 bitcast (v2i32 scalar_to_vector (i32 in))), 0)

Allen edited the summary of this revision. (Show Details)Jun 10 2023, 2:47 AM
Allen added a reviewer: efriedma.
Allen added inline comments.Jun 10 2023, 2:52 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22895

Thanks, apply your comment.
The extra instruction ushll v0.4s, v0.4h, #0 for (v4i32 any_extend v4i16) will clear the other lane.

        t2: i32,ch = CopyFromReg t0, Register:i32 %0
      t8: v2i32 = scalar_to_vector t2
    t9: v4i16 = bitcast t8
  t20: v4i32 = any_extend t9
t21: v2i32 = extract_subvector t20, Constant:i64<0>

Is it worth adding other types like i16->v2i8 and i32->v4i8 too?

Allen updated this revision to Diff 530746.Jun 12 2023, 6:59 PM

Add new biscast i16->v2i8 and i32->v4i8

dmgreen accepted this revision.Jun 13 2023, 3:07 AM

Thanks for doing this. That's great.

The v2i8 version doesn't look too efficient with it moving back and forth, but they all look better than going via the stack. LGTM, thanks.

This revision is now accepted and ready to land.Jun 13 2023, 3:07 AM
Allen added a comment.EditedJun 13 2023, 3:16 AM

Thanks for doing this. That's great.

The v2i8 version doesn't look too efficient with it moving back and forth, but they all look better than going via the stack. LGTM, thanks.

I add a comment on https://github.com/llvm/llvm-project/issues/61638#issuecomment-1495316614** , these issue can produce efficient assemble (only a fmov s0, w0** as expected), but there is too many tests affected.
so I'm not sure if this is a direction for further optimization?

paulwalker-arm accepted this revision.Jun 13 2023, 5:24 AM

I've an extra nit but otherwise looks good.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22894–22897

Is this necessary given it is a requirement of ISD::BITCAST? By which I mean you shouldn't need to manually verify the DAG is valid here? I'd expect getNode() to catch such errors.

22902

Please can you use getVectorIdxConstant here rather than hardwiring MVT::i64?

Allen updated this revision to Diff 530882.Jun 13 2023, 6:12 AM

Fix comments
a) delete assert
b) use getVectorIdxConstant

Allen marked 3 inline comments as done.Jun 13 2023, 6:14 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22894–22897

deleted assert, thanks

22902

Apply your comment, thanks

I add a comment on https://github.com/llvm/llvm-project/issues/61638#issuecomment-1495316614** , these issue can produce efficient assemble (only a fmov s0, w0** as expected), but there is too many tests affected.
so I'm not sure if this is a direction for further optimization?

The <2 x i16> across a call boundary needs to be treated like a <2 x i32> (as that is a legal type). To do differently would be an ABI break, which would be a lot of work. There are certainly places where we could be better for small vector types, but I don't think a fmov s0, w0 would be valid on it's own. My comment was more about the bitcast i16 %word to <2 x i8> case which is moving back and forth between gpr and vectors more than it needs to. That can be improved in other patches though.

By the way it might be worth making sure the tests cover BE too. I believe they are OK, but it would be a good idea to make sure it is tested.

Allen updated this revision to Diff 530896.Jun 13 2023, 7:03 AM
Allen marked 2 inline comments as done.

rebase to cover BE

Allen updated this revision to Diff 531307.Jun 14 2023, 6:41 AM

rebase as the changes of tests

This revision was landed with ongoing or failed builds.Jun 14 2023, 8:34 AM
This revision was automatically updated to reflect the committed changes.