This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] turn extended vecreduce bigger than v16i8 into udot/sdot
ClosedPublic

Authored by zjaffal on Jan 13 2023, 7:23 AM.

Details

Summary

We can do this by breaking vecreduce into v16i8 vectors generating udot/sdot and concatenating them.

Diff Detail

Event Timeline

zjaffal created this revision.Jan 13 2023, 7:23 AM
zjaffal requested review of this revision.Jan 13 2023, 7:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 13 2023, 7:23 AM

Sounds like a nice improvement

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15206

I think this should be something like !(IsValidElementCount && IsValidSize).
It is worth adding a v4i8 test if one doesn't exist already:

define i32 @src(ptr %p, i32 %b) {
entry:
  %a64 = load <4 x i8>, ptr %p
  %a65 = sext <4 x i8> %a64 to <4 x i32>
  %a66 = mul nsw <4 x i32> %a65, %a65
  %a67 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %a66)
  %a = add i32 %a67, %b
  ret i32 %a
}
15248

DotOpcode can be moved out of the loop, and commoned with the version above. Zeroes can be moved up too.

zjaffal added inline comments.Jan 16 2023, 2:21 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15206

I think this should be something like !(IsValidElementCount && IsValidSize).

but then we won't cover the case for v8i8 or shall we change is validElementCount to be
Op0VT.getVectorNumElements() % 16 == 0; || Op0VT.getVectorNumElements() % 8 == 0;

dmgreen added inline comments.Jan 16 2023, 3:03 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15206

Sorry - I meant with the Op0VT != MVT::v8i8 too. The condition as written here will bail out if both !IsValidElementCount and !IsValidSize, but it seems like it should be bailing if one of them is false. So:

if (Op0VT != MVT::v8i8 && (!IsValidElementCount || !IsValidSize))

It could also do bool IsValidElementCount = Op0VT == MVT::v8i8 || Op0VT.getVectorNumElements() % 16 == 0; and then check that if (!IsValidElementCount || !IsValidSize), if you think that is cleaner.

zjaffal updated this revision to Diff 489500.Jan 16 2023, 4:15 AM
  1. Update IsValidElementCount to check for vectors as multiples of 8.
  2. Move Zero and DotOpcode outside the for loop

Do you have any tests for cases that are a multiple of 8 but not of 16, like <24 x ...? And can you make sure we have the load <4 x i8> test case?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15206

Using Op0VT.getVectorNumElements() % 16 == 0 || Op0VT.getVectorNumElements() == 8; would be simpler if you did not care about <24 x types (which might be better as a 16+8, not 3*8).

15237

Some of these types/constants might be incorrect for multiples of 8?

Do you have any tests for cases that are a multiple of 8 but not of 16, like <24 x ...? And can you make sure we have the load <4 x i8> test case?

I will add tests for <24 x ..> and <4 x i8>

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15206

I will add the v4i8 test on a separate patch. We don't support it at the moment

15248

Shall we add test cases for vectors that are multiples of 8?
for example v24i8

dmgreen added inline comments.Jan 16 2023, 7:50 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15206

We don't need to support it with this fold necessarily. It was just hitting issues with the original version of this patch, so would be good to make sure we have the test coverage for it.

15248

Yeah that sounds useful. Thanks.

zjaffal updated this revision to Diff 489778.Jan 17 2023, 5:16 AM

Add support for vectors larger than v16i8 that are a multiple of v8i8

I added additional tests in D141922

zjaffal marked 7 inline comments as done.Jan 17 2023, 5:18 AM
dmgreen added inline comments.Jan 18 2023, 3:33 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15235

vectos -> vectors

15247

Should this be 4? Or 2

(If 4 is correct it can be pulled up out of the if. But I think 2 might be a better value. I'm not 100% sure what happens when the operands and the type of a concat don't match up).

15249

I += 1 instead of I += Offset?

zjaffal updated this revision to Diff 490147.Jan 18 2023, 7:17 AM
  1. Fix Offset for extract vector
  2. Fix spelling mistakes

Thanks - the patch looks pretty good to me now. The widths that are a multiple 8, but not of 16 (like 24 and 40), whilst they wont be super common, are not doing as well as they could be. Could they instead general v16 "chunks", until there is an v8 remainder? So split v40 into v16+v16+v8. It should hopefully reduce the amount of shuffling and extra instructions in the test_udot_v24i8_nomla like cases, as well as being less s/udot's in total.

Matt added a subscriber: Matt.Jan 23 2023, 12:57 PM
zjaffal updated this revision to Diff 492703.Jan 27 2023, 4:33 AM

Generate v16 checks and then seperate the reminder in v8 chucks this should improve the test case for v24i8

Thanks. The results are looking better now, if we can clean up the code a little then this looks good to me.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15237

I don't think this needs floor

15254

This can only ever be 0 or 1, so probably doesn't need the loop. Hopefully this can simplify things a little, as we won't need to concat v8 vectors.

15277

They would need to be 0's I think. Would it be better and simpler to just return vecreduce.add(v16s) + vecreduce.add(v8)?

zjaffal updated this revision to Diff 492743.Jan 27 2023, 6:51 AM

Simplify the case where we generate v16 then a v8 vector

dmgreen added inline comments.Jan 29 2023, 2:22 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15257

DotOpcode -> ISD::VECREDUCE_ADD I think?

15267

-> I * 16

zjaffal added inline comments.Jan 29 2023, 7:41 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15257

yeah that is correct I will change it now

15267

this is for extracting the v8 chuck I think it should be 8 not 16

zjaffal updated this revision to Diff 493304.Jan 30 2023, 6:46 AM

Fix offset value and instruction opcode for VecReudceAdd16

dmgreen accepted this revision.Jan 31 2023, 12:09 AM

Thanks for the updates. This looks very general now. I you agree with the comment about the offset, then this LGTM with that change.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15267

I was thinking this should be 16 because we have extracted I lots of v16 chunk above, and so should be going that far into the original vector. For example with 24x case should be getting this from offset 16 into the vector. (So in the tests it should be from an ldr d after the existing ldr q. They currently use an ext from the vector).

This revision is now accepted and ready to land.Jan 31 2023, 12:09 AM

Thanks for the updates. This looks very general now. I you agree with the comment about the offset, then this LGTM with that change.

I will change it before I land the patch then

This revision was landed with ongoing or failed builds.Jan 31 2023, 2:24 AM
This revision was automatically updated to reflect the committed changes.