This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Select lower sub,abs pattern to sabd on AArch64
Closed, Public

Authored by karthikthecool on Dec 26 2014, 7:38 AM.

Details

Reviewers

t.p.northover
jmolloy

Summary

Hi,
The following code:

int c[4],b[4],a[4];
void fn() {
  c[0] = abs(a[0]-b[0]);
  c[1] = abs(a[1]-b[1]);
  c[2] = abs(a[2]-b[2]);
  c[3] = abs(a[3]-b[3]);
}

is compiled to:

fn:                                     // @fn
// BB#0:

adrp x8, a
add x8, x8, :lo12:a
adrp x9, b
add x9, x9, :lo12:b
ldr q0, [x8]
ldr q1, [x9]
sub v0.4s, v0.4s, v1.4s
abs v0.4s, v0.4s
adrp x8, c
add x8, x8, :lo12:c
str q0, [x8]
ret
The sequence:

sub	v0.4s, v0.4s, v1.4s
abs	v0.4s, v0.4s

can be lowered further to a single instruction on AArch64:

sabd	v0.4s, v0.4s, v1.4s

This patch adds patterns to the .td file that match this sequence and generate the sabd instruction.
Please let me know if this is good to commit.

Thanks and Regards
Karthik Bhat
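
As a rough illustration of the equivalence the summary relies on, the following C sketch models both instruction sequences one lane at a time. The function names (`sub_then_abs`, `sabd_model`, `models_agree`, `self_check`) are hypothetical and not part of the patch. The sketch assumes the subtraction does not overflow, which matches the `sub nsw` used in the test IR, and it models sabd as an absolute difference computed in a wider type.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 4 };

/* Scalar stand-in for: sub v0.4s, v0.4s, v1.4s ; abs v0.4s, v0.4s
   Assumes a[i] - b[i] does not overflow and is not INT32_MIN. */
static void sub_then_abs(const int32_t a[LANES], const int32_t b[LANES],
                         int32_t out[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        int32_t diff = a[i] - b[i];
        out[i] = diff < 0 ? -diff : diff;
    }
}

/* Scalar stand-in for: sabd v0.4s, v0.4s, v1.4s
   Computes |a[i] - b[i]| in a wider type, then narrows the result. */
static void sabd_model(const int32_t a[LANES], const int32_t b[LANES],
                       int32_t out[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        int64_t wide = (int64_t)a[i] - (int64_t)b[i];
        if (wide < 0) wide = -wide;
        out[i] = (int32_t)wide; /* in range for the inputs used here */
    }
}

/* Returns 1 when both models produce identical lanes. */
static int models_agree(const int32_t a[LANES], const int32_t b[LANES]) {
    int32_t two_insn[LANES], one_insn[LANES];
    sub_then_abs(a, b, two_insn);
    sabd_model(a, b, one_insn);
    for (int i = 0; i < LANES; ++i)
        if (two_insn[i] != one_insn[i]) return 0;
    return 1;
}

/* Small self-check over inputs that do not overflow. */
static void self_check(void) {
    const int32_t a[LANES] = {5, -7, 100, 0};
    const int32_t b[LANES] = {9, -2, -50, 0};
    assert(models_agree(a, b));
}
```

Calling `self_check()` from a `main` function exercises both models on the same inputs.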

Diff Detail

Repository: rL LLVM

Event Timeline

karthikthecool updated this revision to Diff 17637. Dec 26 2014, 7:38 AM

karthikthecool retitled this revision from to [AArch64] Select lower sub,abs pattern to sabd on AArch64.

karthikthecool updated this object.

karthikthecool edited the test plan for this revision.

karthikthecool added reviewers: t.p.northover, jmolloy.

karthikthecool set the repository for this revision to rL LLVM.

karthikthecool added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Dec 26 2014, 7:38 AM

Updated the patch, as SABD subtracts the second vector operand with the first and places the absolute value of the difference in the destination.
The pattern in the patch was incorrectly subtracting the first vector operand with the second.
Updated the patch to generate the correct instruction, i.e.

sub	v0.4s, v0.4s, v1.4s
abs	v0.4s, v0.4s

can be lowered further to a single instruction on AArch64:

sabd	v0.4s, v1.4s, v0.4s and *NOT* sabd	v0.4s, v0.4s, v1.4s

Please let me know if this is good to commit.
Thanks and Regards
Karthik Bhat

Hi Karthik,
Your original diff had it right the first time.

sub v0.4s, v0.4s, v1.4s => v0.4s - v1.4s, with the result of the subtraction stored in v0.4s
abs v0.4s, v0.4s

Shouldn't it be SABDv4i32 V128:$Rn, V128:$Rm ?

Thanks Jyoti. Yes, you are right; I got confused by the wording in the instruction manual. The first version of the patch was correct, so I am reverting to it.
Please let me know if you have any other comments.
Thanks and Regards
Karthik Bhat
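
To make the operand order discussed above concrete, here is a minimal one-lane C sketch. The helper names (`sabd_lane`, `sub_abs_lane`, `check_operand_order`) are hypothetical, and the model assumes the subtraction does not overflow. It reads "SABD Vd, Vn, Vm" as |Vn - Vm|, so passing the operands in the same order as the sub mirrors the original sequence. Because an absolute difference is symmetric, the swapped order gives the same lane value in this model.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-lane model: SABD Vd, Vn, Vm yields |Vn - Vm|. */
static int32_t sabd_lane(int32_t vn, int32_t vm) {
    int64_t wide = (int64_t)vn - (int64_t)vm;  /* "subtracts Vm from Vn" */
    return (int32_t)(wide < 0 ? -wide : wide); /* in range for the inputs used here */
}

/* One-lane model of sub followed by abs, operands in the order the sub used.
   Assumes rn - rm does not overflow and is not INT32_MIN. */
static int32_t sub_abs_lane(int32_t rn, int32_t rm) {
    int32_t diff = rn - rm;
    return diff < 0 ? -diff : diff;
}

static void check_operand_order(void) {
    /* Same operand order as the sub: matches the two-instruction sequence. */
    assert(sabd_lane(3, 10) == sub_abs_lane(3, 10));
    /* Swapped operand order: the absolute difference is unchanged. */
    assert(sabd_lane(10, 3) == sabd_lane(3, 10));
}
```

Keeping `$Rn, $Rm` in the same order as the matched sub, as in the first version of the patch, is the more direct mapping.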

A generic name for the testcase would be better.
We need to add checks for the complete machine instructions in the test output rather than just the mnemonic, since we have added different patterns to handle the various data types.
Could you please modify the test to check these?
While the above changes are needed, I would prefer that we get a go-ahead from either of the reviewers before committing.

Hi Jyoti,
Updated the test cases as per the review comments to check the exact instruction being generated.
Merged test case with D6791.
Please let me know if you have any other comments or if it is good to commit.
Thanks and Regards
Karthik Bhat

karthikthecool mentioned this in D6791: [AArch64] Select lower fsub,fabs pattern to fabd on AArch64. Jan 5 2015, 2:21 AM

Hi Karthik,

This also looks fine to me. As I mentioned on your previous Phab revision, I'd have written the testcase using function parameters rather than globals as it's a little easier to read. Perhaps that'd be worth thinking about for future revisions.

LGTM.

James

This revision is now accepted and ready to land. Jan 5 2015, 2:53 AM

Hi James,
Thanks for the input. It makes sense to use function arguments instead of global variables. Submitted as r225165 after modifying the test case as per the comments.
Thanks!

Revision Contents

Path	Size
lib/Target/AArch64/AArch64InstrInfo.td	27 lines
test/CodeGen/AArch64/arm64-neon-simd-vabs.ll	107 lines

Diff 17790

lib/Target/AArch64/AArch64InstrInfo.td

	defm BIT : SIMDLogicalThreeVectorTied<1, 0b10, "bit", AArch64bit>;
	defm BSL : SIMDLogicalThreeVectorTied<1, 0b01, "bsl",
	    TriOpFrag<(or (and node:$LHS, node:$MHS), (and (vnot node:$LHS), node:$RHS))>>;
	defm EOR : SIMDLogicalThreeVector<1, 0b00, "eor", xor>;
	defm ORN : SIMDLogicalThreeVector<0, 0b11, "orn",
	    BinOpFrag<(or node:$LHS, (vnot node:$RHS))> >;
	defm ORR : SIMDLogicalThreeVector<0, 0b10, "orr", or>;

				// SABD Vd.<T>, Vn.<T>, Vm.<T> Subtracts the elements of Vm from the corresponding
				// elements of Vn, and places the absolute values of the results in the elements of Vd.
				def : Pat<(xor (v8i8 (AArch64vashr (v8i8(sub V64:$Rn, V64:$Rm)), (i32 7))),
				(v8i8 (add (v8i8(sub V64:$Rn, V64:$Rm)),
				(AArch64vashr (v8i8(sub V64:$Rn, V64:$Rm)), (i32 7))))),
				(SABDv8i8 V64:$Rn, V64:$Rm)>;
				def : Pat<(xor (v4i16 (AArch64vashr (v4i16(sub V64:$Rn, V64:$Rm)), (i32 15))),
				(v4i16 (add (v4i16(sub V64:$Rn, V64:$Rm)),
				(AArch64vashr (v4i16(sub V64:$Rn, V64:$Rm)), (i32 15))))),
				(SABDv4i16 V64:$Rn, V64:$Rm)>;
				def : Pat<(xor (v2i32 (AArch64vashr (v2i32(sub V64:$Rn, V64:$Rm)), (i32 31))),
				(v2i32 (add (v2i32(sub V64:$Rn, V64:$Rm)),
				(AArch64vashr (v2i32(sub V64:$Rn, V64:$Rm)), (i32 31))))),
				(SABDv2i32 V64:$Rn, V64:$Rm)>;
				def : Pat<(xor (v16i8 (AArch64vashr (v16i8(sub V128:$Rn, V128:$Rm)), (i32 7))),
				(v16i8 (add (v16i8(sub V128:$Rn, V128:$Rm)),
				(AArch64vashr (v16i8(sub V128:$Rn, V128:$Rm)), (i32 7))))),
				(SABDv16i8 V128:$Rn, V128:$Rm)>;
				def : Pat<(xor (v8i16 (AArch64vashr (v8i16(sub V128:$Rn, V128:$Rm)), (i32 15))),
				(v8i16 (add (v8i16(sub V128:$Rn, V128:$Rm)),
				(AArch64vashr (v8i16(sub V128:$Rn, V128:$Rm)), (i32 15))))),
				(SABDv8i16 V128:$Rn, V128:$Rm)>;
				def : Pat<(xor (v4i32 (AArch64vashr (v4i32(sub V128:$Rn, V128:$Rm)), (i32 31))),
				(v4i32 (add (v4i32(sub V128:$Rn, V128:$Rm)),
				(AArch64vashr (v4i32(sub V128:$Rn, V128:$Rm)), (i32 31))))),
				(SABDv4i32 V128:$Rn, V128:$Rm)>;

	def : Pat<(AArch64bsl (v8i8 V64:$Rd), V64:$Rn, V64:$Rm),
	    (BSLv8i8 V64:$Rd, V64:$Rn, V64:$Rm)>;
	def : Pat<(AArch64bsl (v4i16 V64:$Rd), V64:$Rn, V64:$Rm),
	    (BSLv8i8 V64:$Rd, V64:$Rn, V64:$Rm)>;
	def : Pat<(AArch64bsl (v2i32 V64:$Rd), V64:$Rn, V64:$Rm),
	    (BSLv8i8 V64:$Rd, V64:$Rn, V64:$Rm)>;
	def : Pat<(AArch64bsl (v1i64 V64:$Rd), V64:$Rn, V64:$Rm),
	    (BSLv8i8 V64:$Rd, V64:$Rn, V64:$Rm)>;
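
The SABD patterns above match a shift/add/xor form of absolute value applied to the difference, while the test IR expresses the same operation as a compare and select. The following C sketch shows, for one 32-bit lane, that the two forms agree. The helper names (`abs_select`, `abs_shift_xor`, `check_abs_forms`) are hypothetical. The mask is written as a conditional to stay portable in C; it stands in for the arithmetic shift right by 31 in the pattern. Inputs are assumed not to be INT32_MIN.

```c
#include <assert.h>
#include <stdint.h>

/* Select form, as in the test IR: (d > -1) ? d : (0 - d). */
static int32_t abs_select(int32_t d) {
    return d > -1 ? d : 0 - d;
}

/* Shift/add/xor form, as in the patterns: xor(mask, add(d, mask)),
   where mask is what an arithmetic shift right by 31 produces:
   0 for non-negative d, -1 (all ones) for negative d. */
static int32_t abs_shift_xor(int32_t d) {
    int32_t mask = d < 0 ? -1 : 0;
    return mask ^ (d + mask);
}

/* Compares both forms over a small range of differences. */
static void check_abs_forms(void) {
    for (int32_t d = -1000; d <= 1000; ++d)
        assert(abs_select(d) == abs_shift_xor(d));
}
```

For the 8-bit and 16-bit element types the same idea applies with shift amounts of 7 and 15, which is why each pattern carries its own immediate.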

test/CodeGen/AArch64/arm64-neon-simd-vabs.ll

				; RUN: llc -mtriple=aarch64-none-linux-gnu < %s | FileCheck %s
				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				@a = common global [4 x i32] zeroinitializer
				@b = common global [4 x i32] zeroinitializer
				@c = common global [4 x i32] zeroinitializer

				; CHECK: testv4i32
				; CHECK: sabd v0.4s, v0.4s, v1.4s
				define void @testv4i32() {
				%1 = load <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*)
				%2 = load <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*)
				%3 = sub nsw <4 x i32> %1, %2
				%4 = icmp sgt <4 x i32> %3, <i32 -1, i32 -1, i32 -1, i32 -1>
				%5 = sub <4 x i32> zeroinitializer, %3
				%6 = select <4 x i1> %4, <4 x i32> %3, <4 x i32> %5
				store <4 x i32> %6, <4 x i32>* bitcast ([4 x i32]* @c to <4 x i32>*)
				ret void
				}

				@d = common global [2 x i32] zeroinitializer
				@e = common global [2 x i32] zeroinitializer
				@f = common global [2 x i32] zeroinitializer

				; CHECK: testv2i32
				; CHECK: sabd v0.2s, v0.2s, v1.2s
				define void @testv2i32() {
				%1 = load <2 x i32>* bitcast ([2 x i32]* @d to <2 x i32>*)
				%2 = load <2 x i32>* bitcast ([2 x i32]* @e to <2 x i32>*)
				%3 = sub nsw <2 x i32> %1, %2
				%4 = icmp sgt <2 x i32> %3, <i32 -1, i32 -1>
				%5 = sub <2 x i32> zeroinitializer, %3
				%6 = select <2 x i1> %4, <2 x i32> %3, <2 x i32> %5
				store <2 x i32> %6, <2 x i32>* bitcast ([2 x i32]* @f to <2 x i32>*)
				ret void
				}

				@g = common global [8 x i16] zeroinitializer
				@h = common global [8 x i16] zeroinitializer
				@i = common global [8 x i16] zeroinitializer

				; CHECK: testv8i16
				; CHECK: sabd v0.8h, v0.8h, v1.8h
				define void @testv8i16() {
				%1 = load <8 x i16>* bitcast ([8 x i16]* @g to <8 x i16>*)
				%2 = load <8 x i16>* bitcast ([8 x i16]* @h to <8 x i16>*)
				%3 = sub nsw <8 x i16> %1, %2
				%4 = icmp sgt <8 x i16> %3, <i16 -1, i16 -1,i16 -1, i16 -1,i16 -1, i16 -1,i16 -1, i16 -1>
				%5 = sub <8 x i16> zeroinitializer, %3
				%6 = select <8 x i1> %4, <8 x i16> %3, <8 x i16> %5
				store <8 x i16> %6, <8 x i16>* bitcast ([8 x i16]* @i to <8 x i16>*)
				ret void
				}

				@j = common global [4 x i16] zeroinitializer
				@k = common global [4 x i16] zeroinitializer
				@l = common global [4 x i16] zeroinitializer

				; CHECK: testv4i16
				; CHECK: sabd
				define void @testv4i16() {
				%1 = load <4 x i16>* bitcast ([4 x i16]* @j to <4 x i16>*)
				%2 = load <4 x i16>* bitcast ([4 x i16]* @k to <4 x i16>*)
				%3 = sub nsw <4 x i16> %1, %2
				%4 = icmp sgt <4 x i16> %3, <i16 -1, i16 -1,i16 -1, i16 -1>
				%5 = sub <4 x i16> zeroinitializer, %3
				%6 = select <4 x i1> %4, <4 x i16> %3, <4 x i16> %5
				store <4 x i16> %6, <4 x i16>* bitcast ([4 x i16]* @l to <4 x i16>*)
				ret void
				}

				@m = common global [16 x i8] zeroinitializer
				@n = common global [16 x i8] zeroinitializer
				@o = common global [16 x i8] zeroinitializer

				; CHECK: testv16i8
				; CHECK: sabd v0.16b, v0.16b, v1.16b
				define void @testv16i8() {
				%1 = load <16 x i8>* bitcast ([16 x i8]* @m to <16 x i8>*)
				%2 = load <16 x i8>* bitcast ([16 x i8]* @n to <16 x i8>*)
				%3 = sub nsw <16 x i8> %1, %2
				%4 = icmp sgt <16 x i8> %3, <i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1>
				%5 = sub <16 x i8> zeroinitializer, %3
				%6 = select <16 x i1> %4, <16 x i8> %3, <16 x i8> %5
				store <16 x i8> %6, <16 x i8>* bitcast ([16 x i8]* @o to <16 x i8>*)
				ret void
				}

				@p = common global [8 x i8] zeroinitializer
				@q = common global [8 x i8] zeroinitializer
				@r = common global [8 x i8] zeroinitializer

				; CHECK: testv8i8
				; CHECK: sabd v0.8b, v0.8b, v1.8b
				define void @testv8i8() {
				%1 = load <8 x i8>* bitcast ([8 x i8]* @p to <8 x i8>*)
				%2 = load <8 x i8>* bitcast ([8 x i8]* @q to <8 x i8>*)
				%3 = sub nsw <8 x i8> %1, %2
				%4 = icmp sgt <8 x i8> %3, <i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1,i8 -1, i8 -1>
				%5 = sub <8 x i8> zeroinitializer, %3
				%6 = select <8 x i1> %4, <8 x i8> %3, <8 x i8> %5
				store <8 x i8> %6, <8 x i8>* bitcast ([8 x i8]* @r to <8 x i8>*)
				ret void
				}