This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve SAD pattern
ClosedPublic

Authored by jaykang10 on Jun 10 2021, 9:26 AM.

Details

Summary

Given a vecreduce_add node, detect the pattern below and convert it to a node sequence with UABDL, UABD and UADDLP.

i32 vecreduce_add(
  v16i32 abs(
    v16i32 sub(
      v16i32 zero_extend(v16i8 a), v16i32 zero_extend(v16i8 b))))

=>

i32 vecreduce_add(
  v4i32 UADDLP(
    v8i16 add(
      v8i16 zext(
        v8i8 UABD low8:v16i8 a, low8:v16i8 b),
      v8i16 zext(
        v8i8 UABD high8:v16i8 a, high8:v16i8 b))))

For example, the updated pattern improves the assembly output as below.
The source LLVM IR:

define i32 @test_sad_v16i8(i8* nocapture readonly %a, i8* nocapture readonly %b) {
entry:
  %0 = bitcast i8* %a to <16 x i8>*
  %1 = load <16 x i8>, <16 x i8>* %0
  %2 = zext <16 x i8> %1 to <16 x i32>
  %3 = bitcast i8* %b to <16 x i8>*
  %4 = load <16 x i8>, <16 x i8>* %3
  %5 = zext <16 x i8> %4 to <16 x i32>
  %6 = sub nsw <16 x i32> %5, %2
  %7 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %6, i1 true)
  %8 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %7) 
  ret i32 %8
}

The assembly output from the original pattern:

	ldr	q0, [x0]
	ldr	q1, [x1]
	uabd	v0.16b, v1.16b, v0.16b
	ushll2	v1.8h, v0.16b, #0
	ushll	v0.8h, v0.8b, #0
	uaddl2	v2.4s, v0.8h, v1.8h
	uaddl	v0.4s, v0.4h, v1.4h
	add	v0.4s, v0.4s, v2.4s
	addv	s0, v0.4s
	fmov	w0, s0
	ret

The assembly output from the updated pattern:

	ldr	q0, [x0]
	ldr	q1, [x1]
	uabdl	v2.8h, v1.8b, v0.8b
	uabal2	v2.8h, v1.16b, v0.16b
	uaddlp	v0.4s, v2.8h
	addv	s0, v0.4s
	fmov	w0, s0
	ret
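
As a point of reference, all three versions compute the same scalar value: the sum of absolute differences (SAD) of two 16-byte vectors. A minimal C++ model of that semantics (illustrative only, not part of the patch):

#include <cstdint>
#include <cstdlib>

// Scalar model of the combined pattern: zero-extend both inputs to i32,
// subtract, take the absolute value, and accumulate, exactly as the
// vecreduce_add(abs(sub(zext, zext))) tree describes.
uint32_t sad_v16i8(const uint8_t a[16], const uint8_t b[16]) {
  uint32_t sum = 0;
  for (int i = 0; i < 16; ++i) {
    int32_t diff = (int32_t)a[i] - (int32_t)b[i];
    sum += (uint32_t)std::abs(diff);
  }
  return sum;
}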


Event Timeline

jaykang10 created this revision. Jun 10 2021, 9:26 AM
jaykang10 requested review of this revision. Jun 10 2021, 9:26 AM
Herald added a project: Restricted Project. Jun 10 2021, 9:26 AM

Nice patch. It's a big combine, but we certainly have bigger elsewhere. Sometimes it's possible to do things like this as smaller pieces that add up to the same thing. Speaking of which, would it be possible to combine (in a separate patch, maybe as a tablegen pattern) addv (uaddlp x) -> uaddlv x?

Would it make sense to add signed handling too? If I'm not mistaken, if it has sext's as opposed to zext's, it can turn into addv(uaddlp(sabd, sabd2)). The sabd's produce an unsigned value from signed inputs, so the remaining extensions are zext's, not sext's, and it doesn't need a SADDLP node. That may not be correct though, so make sure you check it.
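
The key fact behind this can be sanity-checked in a few lines of C++ (an illustrative check, not part of the patch): for any two signed 8-bit values, |a - b| <= 255, so the sabd result always fits in an unsigned byte and the widening that follows really is a zero-extension.

#include <cassert>
#include <cstdlib>

// Exhaustive check: |a - b| <= 255 for all signed 8-bit a and b, so the
// result of sabd fits in an unsigned byte and any further widening is a
// zero-extension rather than a sign-extension.
int main() {
  for (int a = -128; a <= 127; ++a)
    for (int b = -128; b <= 127; ++b)
      assert(std::abs(a - b) <= 255);
  return 0;
}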

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12004

This doesn't appear to use UADALP. Is that meant as a shorthand for UADA + UADDLP?

12011

This is mostly a style comment, so feel free to ignore if you like. The Opcodes are only used once in the checks, so they could be checked inline without the variables. The "ABSOp0" and "SUB" nodes could also be a single variable called "SUB", as we check their opcodes straight away.

llvm/test/CodeGen/AArch64/neon-sad.ll
7

It can be good to pre-commit the tests, so that the review just shows the differences. It makes it easier to see what the patch does.

Nice patch. It's a big combine, but we certainly have bigger elsewhere. Sometimes it's possible to do things like this as smaller pieces that add up to the same thing.

Thanks for your kind comments @dmgreen.
I agree with you. I tried to do this combine with smaller pieces, but the v16i32 type caused more nodes: the type legalizer splits the v16i32 operations into multiple v4i32 and v4i16 operations, because the AArch64 target has no native register class for v16i32. To avoid that type legalization, I perform the combine on the vecreduce_add node.
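
The combine has roughly the following shape; this is a simplified, hypothetical sketch of the matching half only (the committed code in AArch64ISelLowering.cpp also handles other cases and performs the actual node building):

// Simplified sketch: match the whole vecreduce_add(abs(sub(zext, zext)))
// tree in one DAG combine, before the type legalizer can split the
// illegal v16i32 operations.
static SDValue performVecReduceAddCombine(SDNode *N, SelectionDAG &DAG) {
  SDValue Abs = N->getOperand(0);
  if (Abs.getOpcode() != ISD::ABS || Abs.getValueType() != MVT::v16i32)
    return SDValue();
  SDValue Sub = Abs.getOperand(0);
  if (Sub.getOpcode() != ISD::SUB)
    return SDValue();
  SDValue Ext0 = Sub.getOperand(0);
  SDValue Ext1 = Sub.getOperand(1);
  if (Ext0.getOpcode() != ISD::ZERO_EXTEND ||
      Ext1.getOpcode() != ISD::ZERO_EXTEND ||
      Ext0.getOperand(0).getValueType() != MVT::v16i8)
    return SDValue();
  // ... emit UABD on the low and high v8i8 halves, zero-extend and add
  // the halves to v8i16, widen pairwise with UADDLP to v4i32, and feed
  // that back into a vecreduce_add ...
  return SDValue();
}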

Speaking of which, would it be possible to combine (in a separate patch, maybe as a tablegen pattern) addv (uaddlp x) -> uaddlv x?

That is a good point! I have tried the pattern below following your suggestion, and it seems to work. If you are OK with it, let me add it to this patch.

let AddedComplexity = 10 in { 
def : Pat<(i32 (extractelt
                 (v4i32 (AArch64uaddv (v4i32 (AArch64uaddlp (v8i16 V128:$op))))),
                 (i64 0))),
          (UADDLVv8i16v V128:$op)>;
}

Would it make sense to add signed handling too? If I'm not mistaken, if it has sext's as opposed to zext's, it can turn into addv(uaddlp(sabd, sabd2)). The sabd's produce an unsigned value from signed inputs, so the remaining extensions are zext's, not sext's, and it doesn't need a SADDLP node. That may not be correct though, so make sure you check it.

Yep, I agree with you. Let me update the code to support the signed case.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12004

Sorry for the typo. Let me update the name.

12011

Yep, let me update it.

llvm/test/CodeGen/AArch64/neon-sad.ll
7

Yep, let me pre-commit this test.

jaykang10 updated this revision to Diff 351416. Jun 11 2021, 5:32 AM

Following comments from @dmgreen, updated code.

jaykang10 updated this revision to Diff 351424. Jun 11 2021, 5:52 AM

Fixed typo

That is a good point! I have tried the pattern below following your suggestion, and it seems to work. If you are OK with it, let me add it to this patch.

let AddedComplexity = 10 in { 
def : Pat<(i32 (extractelt
                 (v4i32 (AArch64uaddv (v4i32 (AArch64uaddlp (v8i16 V128:$op))))),
                 (i64 0))),
          (UADDLVv8i16v V128:$op)>;
}

Do you mind doing this as a new patch? It does feel logically separable. If we can test them, it would be good to add the various other sizes too. And, I'm not sure about this, but maybe it doesn't need to start from the extract, and could produce an INSERT_SUBREG like some of the other patterns do (like the ones from SIMDAcrossLanesIntrinsic). That might remove the need for the AddedComplexity, and the INSERT_SUBREG / EXTRACT_SUBREG should all get cleared up later in the pipeline.

Matt added a subscriber: Matt. Jun 11 2021, 10:23 AM

Do you mind doing this as a new patch? It does feel logically separable.

Yep, let me create a new patch for it.

jaykang10 updated this revision to Diff 351792. Jun 14 2021, 1:11 AM

Following comments from @dmgreen, updated code.

@dmgreen If possible, can we push this patch please? After that, I will create a new patch for the TableGen pattern.

dmgreen accepted this revision. Jun 14 2021, 6:33 AM

Thanks. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12073–12074

Just use UABAL directly? It would seem simpler that way. Same for the other SDValues in this function, which seem to be copied more than they need to be.
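
As a hypothetical illustration of the point (made-up variable names, not the patch's code), the value can be fed straight into its single use instead of being rebound first:

// Rather than copying the accumulated value into a second SDValue:
//   SDValue UABAL = DAG.getNode(...);
//   SDValue Sum = UABAL;
//   SDValue Uaddlp = DAG.getNode(AArch64ISD::UADDLP, DL, MVT::v4i32, Sum);
// use it directly:
//   SDValue Uaddlp = DAG.getNode(AArch64ISD::UADDLP, DL, MVT::v4i32, UABAL);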

This revision is now accepted and ready to land. Jun 14 2021, 6:33 AM
jaykang10 updated this revision to Diff 351873. Jun 14 2021, 7:46 AM

Following comments from @dmgreen, updated code.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12073–12074

Yep, let me update it.

This revision was landed with ongoing or failed builds. Jun 14 2021, 7:53 AM
This revision was automatically updated to reflect the committed changes.