This is an archive of the discontinued LLVM Phabricator instance.

Add 64 bit pattern matching for PSADBW
ClosedPublic

Authored by oren_ben_simhon on Apr 2 2017, 5:33 AM.

Details

Summary

PSADBW pattern currently supports the 32 bit IR pattern and only GLT (greather than) comparison.
The patch extends the pattern to catch also 64 bit IR pattern and includes all other comparison types (not only GLT).

Diff Detail

Repository
rL LLVM

Event Timeline

oren_ben_simhon created this revision.Apr 2 2017, 5:33 AM
mkuper added inline comments.Apr 2 2017, 10:43 AM
lib/Target/X86/X86ISelLowering.cpp
29045

instrcutions -> instructions

29226

This looks a bit suspicious. I suggest you split the patch in two - one to extend to arbitrary integer types, the other for this.
The larger integer types part is basically good to go, but I'd like to understand how the sign extend case happens better.

test/CodeGen/X86/sad_variations.ll
7

Any reason the test cases have to be so large? Can you match a single psadbw instead of several?

oren_ben_simhon marked 2 inline comments as done.Apr 3 2017, 12:11 AM
oren_ben_simhon added inline comments.
lib/Target/X86/X86ISelLowering.cpp
29226

Unfortunately, the larger integer types change will have no effect if i don't skip the sign extend instruction, so i prefer to keep it together.

I attached a ppt

that explains the mathematical prove behind removing skipping the sign extend.

Basically the logic i try to explain is:

  1. The only difference between the 64 bit IR version and 32 bit IR version is that the 64 bit IR version has sign extend.
  2. I explain that the sign extend is actually a zero extend.
  3. I explain that there is no mathematical difference between 64/32 bit results.
  4. Since 32 bit version of psadbw pattern is already proved to be mathematical correct, I conclude that the pattern with sign extend matches psadbw instruction.

The last function in the *.ll file demonstrates the pattern.
Let me know if it makes sense to you.

Implemented comments posted until 03/04 (Thank You Michael)

mkuper added inline comments.Apr 3 2017, 12:22 AM
lib/Target/X86/X86ISelLowering.cpp
29226

I haven't really looking at the slides yet, but regardless - I don't understand why you say the change will have no effect.
It'll have an effect if the final extension is a zext, not a sext. I don't see any reason why you'd have, specifically, sexts for 64-bit. I mean, your lit tests happen to have a sext, but you can just as well replace it with a zext.
Can you explain why the two changes (32 -> 64 and zext -> sext) are not orthogonal?

mkuper edited edge metadata.Apr 3 2017, 12:25 AM

**haven't really looked.

Ohhh, I think I got it now. Sorry for the noise, let me think about it. :-)

lib/Target/X86/X86ISelLowering.cpp
29226

You are correct. I can just as well replace it with a zext (I will update the code).
The zext/sext only happens when moving to 64 bit. That is why i consider it holistic solution and not orthogonal.
If you still feel like it is necessary to break the solution into two step review, let me know and I will create the two reviews.

mkuper added inline comments.Apr 3 2017, 12:39 AM
lib/Target/X86/X86ISelLowering.cpp
29226

I think you'll need an extra check for the zext as well. In fact, you probably just want to match ZEXT/SEXT/ANYEXT here (assuming it's correct).
IIUC, the case that's really equivalent to the i32 one is something that extends to i64 directly. without any extension after the select - something like:

%0 = bitcast i8* %cur to <8 x i8>*
%1 = load <8 x i8>, <8 x i8>* %0, align 1
%2 = zext <8 x i8> %1 to <8 x i64>
%3 = bitcast i8* %ref to <8 x i8>*
%4 = load <8 x i8>, <8 x i8>* %3, align 1
%5 = zext <8 x i8> %4 to <8 x i64>
%6 = sub nsw <8 x i64> %2, %5
%7 = icmp slt <8 x i64> %6, zeroinitializer
%8 = sub nsw <8 x i64> zeroinitializer, %6
%9 = select <8 x i1> %7, <8 x i64> %8, <8 x i64> %6
%rdx.shuf = shufflevector <8 x i64> %9, <8 x i64> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>

Right?

Added support for zext (instead of just sext) and added a test case (Thank you Michael)

oren_ben_simhon added inline comments.Apr 3 2017, 4:13 AM
lib/Target/X86/X86ISelLowering.cpp
29226

I will add any_extend as you suggested.

I have to agree with you that the pattern that you specified is the more expected pattern.
However for some reason, the vectorizer chooses to do a second zero extend to i64 after the icmp.

Anyway, the pattern you suggested will be matched and I added the test case to the test file.

This is the example of the C code which is the base to all my lit tests:

#define ABS_GT(X)    (((X)>0)?(X):-(X))
unsigned int sad8_c(const unsigned char * const cur,
	   const unsigned char * const ref,
	   const unsigned int stride) {
	unsigned int sad = 0;
	unsigned int j;
	unsigned char const *ptr_cur = cur;
	unsigned char const *ptr_ref = ref;
	for (j = 0; j < 8; j++) {
		sad += ABS_GT(ptr_cur[0] - ptr_ref[0]);
		sad += ABS_GT(ptr_cur[1] - ptr_ref[1]);
		sad += ABS_GT(ptr_cur[2] - ptr_ref[2]);
		sad += ABS_GT(ptr_cur[3] - ptr_ref[3]);
		sad += ABS_GT(ptr_cur[4] - ptr_ref[4]);
		sad += ABS_GT(ptr_cur[5] - ptr_ref[5]);
		sad += ABS_GT(ptr_cur[6] - ptr_ref[6]);
		sad += ABS_GT(ptr_cur[7] - ptr_ref[7]);
		ptr_cur += stride;
		ptr_ref += stride;
	}
	return sad;
}

Added new function to test file and added ANY_EXTEND option as well (Thanks again Michael)

oren_ben_simhon marked an inline comment as done.Apr 3 2017, 5:46 AM
mkuper accepted this revision.Apr 3 2017, 10:23 AM

LGTM

This revision is now accepted and ready to land.Apr 3 2017, 10:23 AM
This revision was automatically updated to reflect the committed changes.