This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Do not use vtrn for vectorshuffle if the order is reversed
Closed Public

Authored by jketema on Aug 25 2015, 2:26 PM.

Details

Reviewers

rengolin
ab
aemerson
jmolloy

Summary

The checks in isVTRNMask and isVTRN_v_undef_Mask should also verify that, when both halves of the vectorshuffle mask are used, the elements of the lower and upper half occur in the correct order. Without this check, the code assumes that it is correct to use vector transpose (vtrn) for masks such as <1, 1, 0, 0> and <1, 3, 0, 2>, but in these cases the transpose actually generates the shuffles <0, 0, 1, 1> and <0, 2, 1, 3>, which is incorrect.
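The difference between the old and the patched check can be illustrated with a small stand-alone model of isVTRNMask. This is only a sketch: the function name matchesVtrn, the Patched flag, and the use of std::vector<int> in place of ArrayRef<int> and EVT are assumptions made for illustration and are not part of the patch. It models only the loop shown in the diff, not the code that follows it, and it does not cover isVTRN_v_undef_Mask.

```cpp
#include <vector>

// Hypothetical stand-alone model of the isVTRNMask loop (illustration only).
// M is the shuffle mask, with -1 marking an undef lane. NumElts is the number
// of elements in one input vector and is assumed to be even.
static bool matchesVtrn(const std::vector<int> &M, unsigned NumElts,
                        bool Patched) {
  if (M.size() != NumElts && M.size() != NumElts * 2)
    return false;

  for (unsigned i = 0; i < M.size(); i += NumElts) {
    unsigned WhichResult;
    if (Patched && M.size() == NumElts * 2)
      WhichResult = i / NumElts;       // lower half: result 0, upper half: 1
    else
      WhichResult = M[i] == 0 ? 0 : 1; // old behaviour: guess from first lane

    for (unsigned j = 0; j < NumElts; j += 2) {
      if ((M[i + j] >= 0 && (unsigned)M[i + j] != j + WhichResult) ||
          (M[i + j + 1] >= 0 &&
           (unsigned)M[i + j + 1] != j + NumElts + WhichResult))
        return false;
    }
  }
  return true;
}
```

Under this model, matchesVtrn({1, 3, 0, 2}, 2, false) accepts the reversed mask, while the same call with Patched set to true rejects it. The correctly ordered mask <0, 2, 1, 3> is accepted in both modes.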

Event Timeline

jketema updated this revision to Diff 33120. Aug 25 2015, 2:26 PM

jketema retitled this revision to [ARM] Do not use vtrn for vectorshuffle if the order is reversed.

jketema updated this object.

jketema added reviewers: rengolin, aemerson.

jketema added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson. Aug 25 2015, 2:26 PM

Bump

LGTM with the additional suggested testcase (if it does make sense), thanks!

I'm surprised we didn't see this earlier, I guess it was exposed by r240118?

lib/Target/ARM/ARMISelLowering.cpp
5056–5059	Looking at this again, I realize: isn't this a tad too conservative? What happens when you have: <-1, 4, 2, 6, 1, 5, 3, 7> More importantly: could this break, say with: <-1, 5, 3, 7, 1, 5, 3, 7> which isn't a vtrn, but looks like it will match? After this patch the '>=' should catch this, I think. If so, could you add a testcase?

This revision is now accepted and ready to land. Sep 1 2015, 6:33 PM

jketema added inline comments. Sep 2 2015, 3:30 AM

lib/Target/ARM/ARMISelLowering.cpp
5056–5059	The first one still generates a vtrn as expected, using both the upper and lower part. The second one also generates a vtrn but uses the upper part twice and not the lower part. Curiously, there don't seem to be any test cases for this behavior.

I've added two test cases based on the suggestion by Ahmed.

jketema added inline comments. Sep 2 2015, 4:20 AM

lib/Target/ARM/ARMISelLowering.cpp
5056–5059	I've added two test cases. Do these look ok?

I think the new tests generated vtrn because v8i32 had to be split. After that, there are no double-width shuffles anymore, so we'll just match (and later CSE) two identical shuffles. Can you try with v8i16 = v4i16,v4i16?

Also, from what I found, the <-1, 4, ...> also isn't tested, so it'd be great if you added it too.
I expected it to fail, because M[i] == 0 is false (when i == 0), so we will try and fail to match the upper result (WhichResult == 1) in the lower lanes (i == 0).

The investigation is much appreciated, thanks again!

Is this ok now with the added tests?

No: the added tests that generate <8 x i32> aren't testing the patched code: <8 x i32> isn't legal on ARM, so the tests' shufflevectors will be split into 2 separate <4 x i32> shufflevectors. Please test using (for instance) <4 x i16> and <8 x i16> instead of <4 x i32> and <8 x i32>.

This revision now requires changes to proceed. Sep 3 2015, 3:48 PM

Update tests.

Hi Ahmed,

I've updated the tests. Thanks for pointing out the issue with the type sizes; I had not realized this. Your comments also made me realize that there's an easier way to test for the incorrect vectors, as you can see.

I agree that the case where only the upper or lower half is used is indeed a bit conservative. Replacing

WhichResult = M[i] == 0 ? 0 : 1;

with

for (int k = i; k < i + NumElts; ++k) {
  if (M[k] >= 0) {
    WhichResult = (unsigned) M[k] % 2;
    break;
  }
}

could probably solve this, and would accept <-1, 4, 2, 6, 1, 5, 3, 7>. What do you think?
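The proposal above works because of parity: in a vtrn mask, even lanes hold j + WhichResult and odd lanes hold j + NumElts + WhichResult, with j and NumElts both even, so every defined lane has the same parity as WhichResult. A minimal sketch of that idea as a stand-alone helper follows. The name pickWhichResult and the std::vector<int> parameter are hypothetical, and keeping the original assignment as the default value is an assumption about how the replacement was meant to be applied.

```cpp
#include <vector>

// Hypothetical helper, not part of the patch: choose WhichResult for the mask
// half that starts at lane i. M holds shuffle indices, with -1 marking an
// undef lane. The caller must ensure that i + NumElts <= M.size().
static unsigned pickWhichResult(const std::vector<int> &M, unsigned i,
                                unsigned NumElts) {
  // Assumption: keep the original guess as the default. It only survives when
  // every lane in this half is undef, in which case it evaluates to 1.
  unsigned WhichResult = M[i] == 0 ? 0 : 1;
  for (unsigned k = i; k < i + NumElts; ++k) {
    if (M[k] >= 0) {
      // The first defined lane decides: its parity equals WhichResult.
      WhichResult = (unsigned)M[k] % 2;
      break;
    }
  }
  return WhichResult;
}
```

For the mask <-1, 4, 2, 6, 1, 5, 3, 7> with NumElts of 4, this helper yields 0 for the lower half and 1 for the upper half, so the leading undef lane no longer forces the wrong result to be tried.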

Bump

Nice! We still can't catch <-1, 4, 2, 6> though, right? If so, a FIXME would be enough.

LGTM, thanks!

test/CodeGen/ARM/vtrn.ll
379	No need to check BB#0? (Same below)

This revision is now accepted and ready to land. Sep 8 2015, 5:42 PM

Hi Ahmed,

I've added a FIXME comment, as <-1, 4, 2, 6> is indeed still rejected.

The check for @ BB#0 seems necessary, otherwise the tests fail by matching vtrn in the .type line. If there's a better way to not match on this line, please let me know.

If this is all ok now, please commit on my behalf.

Committed on Jeroen's behalf in rL247254.

This revision is now accepted and ready to land. Sep 10 2015, 1:44 AM

Thanks for committing, James.

Revision Contents

Path	Size
lib/Target/ARM/ARMISelLowering.cpp	14 lines
test/CodeGen/ARM/vtrn.ll	28 lines
test/CodeGen/ARM/vuzp.ll	12 lines
test/CodeGen/ARM/vzip.ll	10 lines

Diff 33999

lib/Target/ARM/ARMISelLowering.cpp

@@ 5,044 lines not shown @@ static bool isVTRNMask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
   unsigned EltSz = VT.getVectorElementType().getSizeInBits();
   if (EltSz == 64)
     return false;

   unsigned NumElts = VT.getVectorNumElements();
   if (M.size() != NumElts && M.size() != NumElts*2)
     return false;

-  // If the mask is twice as long as the result then we need to check the upper
-  // and lower parts of the mask
+  // If the mask is twice as long as the input vector then we need to check the
+  // upper and lower parts of the mask with a matching value for WhichResult
   for (unsigned i = 0; i < M.size(); i += NumElts) {
-    WhichResult = M[i] == 0 ? 0 : 1;
+    if (M.size() == NumElts * 2)
+      WhichResult = i / NumElts;
+    else
+      WhichResult = M[i] == 0 ? 0 : 1;
     for (unsigned j = 0; j < NumElts; j += 2) {
       if ((M[i+j] >= 0 && (unsigned) M[i+j] != j + WhichResult) ||
           (M[i+j+1] >= 0 && (unsigned) M[i+j+1] != j + NumElts + WhichResult))
         return false;
     }
   }

   if (M.size() == NumElts*2)
@@ 10 lines not shown @@ static bool isVTRN_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult){
   if (EltSz == 64)
     return false;

   unsigned NumElts = VT.getVectorNumElements();
   if (M.size() != NumElts && M.size() != NumElts*2)
     return false;

   for (unsigned i = 0; i < M.size(); i += NumElts) {
-    WhichResult = M[i] == 0 ? 0 : 1;
+    if (M.size() == NumElts * 2)
+      WhichResult = i / NumElts;
+    else
+      WhichResult = M[i] == 0 ? 0 : 1;
     for (unsigned j = 0; j < NumElts; j += 2) {
       if ((M[i+j] >= 0 && (unsigned) M[i+j] != j + WhichResult) ||
           (M[i+j+1] >= 0 && (unsigned) M[i+j+1] != j + WhichResult))
         return false;
     }
   }

   if (M.size() == NumElts*2)
@@ 6,784 lines not shown @@

test/CodeGen/ARM/vtrn.ll

@@ 365 lines not shown @@ define <8 x i8> @vtrn_mismatched_builvector1(<8 x i8> %tr0, <8 x i8> %tr1,
   ; CHECK: vbsl
   %cmp2_load = load <4 x i8>, <4 x i8> * %cmp2_ptr, align 4
   %cmp2 = trunc <4 x i8> %cmp2_load to <4 x i1>
   %c0 = icmp ult <4 x i32> %cmp0, %cmp1
   %c = shufflevector <4 x i1> %c0, <4 x i1> %cmp2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
   %rv = select <8 x i1> %c, <8 x i8> %tr0, <8 x i8> %tr1
   ret <8 x i8> %rv
 }
+
+; Negative test that should not generate a vtrn
+define void @lower_twice_no_vtrn(<4 x i16>* %A, <4 x i16>* %B, <8 x i16>* %C) {
+entry:
+  ; CHECK-LABEL: lower_twice_no_vtrn
+  ; CHECK: @ BB#0:
+  ; CHECK-NOT: vtrn
+  ; CHECK: mov pc, lr
+  %tmp1 = load <4 x i16>, <4 x i16>* %A
+  %tmp2 = load <4 x i16>, <4 x i16>* %B
+  %0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 undef, i32 5, i32 3, i32 7, i32 1, i32 5, i32 3, i32 7>
+  store <8 x i16> %0, <8 x i16>* %C
+  ret void
+}
+
+; Negative test that should not generate a vtrn
+define void @upper_twice_no_vtrn(<4 x i16>* %A, <4 x i16>* %B, <8 x i16>* %C) {
+entry:
+  ; CHECK-LABEL: upper_twice_no_vtrn
+  ; CHECK: @ BB#0:
+  ; CHECK-NOT: vtrn
+  ; CHECK: mov pc, lr
+  %tmp1 = load <4 x i16>, <4 x i16>* %A
+  %tmp2 = load <4 x i16>, <4 x i16>* %B
+  %0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 0, i32 undef, i32 2, i32 6, i32 0, i32 4, i32 2, i32 6>
+  store <8 x i16> %0, <8 x i16>* %C
+  ret void
+}

test/CodeGen/ARM/vuzp.ll

@@ 280 lines not shown @@ entry:
   ; CHECK-NOT: vtrn
   ; CHECK: vuzp
   %tmp1 = load <2 x i32>, <2 x i32>* %A
   %tmp2 = load <2 x i32>, <2 x i32>* %B
   %0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 0, i32 0, i32 1, i32 3>
   ret <4 x i32> %0
 }
+
+define void @vuzp_rev_shufflemask_vtrn(<2 x i32>* %A, <2 x i32>* %B, <4 x i32>* %C) {
+entry:
+  ; CHECK-LABEL: vuzp_rev_shufflemask_vtrn
+  ; CHECK-NOT: vtrn
+  ; CHECK: vuzp
+  %tmp1 = load <2 x i32>, <2 x i32>* %A
+  %tmp2 = load <2 x i32>, <2 x i32>* %B
+  %0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
+  store <4 x i32> %0, <4 x i32>* %C
+  ret void
+}

 define <8 x i8> @vuzp_trunc(<8 x i8> %in0, <8 x i8> %in1, <8 x i32> %cmp0, <8 x i32> %cmp1) {
 ; In order to create the select we need to truncate the vcgt result from a vector of i32 to a vector of i8.
 ; This results in a build_vector with mismatched types. We will generate two vmovn.i32 instructions to
 ; truncate from i32 to i16 and one vuzp to perform the final truncation for i8.
 ; CHECK-LABEL: vuzp_trunc
 ; CHECK: vmovn.i32
 ; CHECK: vmovn.i32
 ; CHECK: vuzp
@@ 64 lines not shown @@

test/CodeGen/ARM/vzip.ll

@@ 289 lines not shown @@ entry:
   ; CHECK-LABEL: vzip_lower_shufflemask_vuzp
   ; CHECK-NOT: vuzp
   ; CHECK: vzip
   %tmp1 = load <2 x i32>, <2 x i32>* %A
   %0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp1, <4 x i32> <i32 0, i32 2, i32 1, i32 0>
   ret <4 x i32> %0
 }
+
+define void @vzip_undef_rev_shufflemask_vtrn(<2 x i32>* %A, <4 x i32>* %B) {
+entry:
+  ; CHECK-LABEL: vzip_undef_rev_shufflemask_vtrn
+  ; CHECK-NOT: vtrn
+  ; CHECK: vzip
+  %tmp1 = load <2 x i32>, <2 x i32>* %A
+  %0 = shufflevector <2 x i32> %tmp1, <2 x i32> undef, <4 x i32> <i32 1, i32 1, i32 0, i32 0>
+  store <4 x i32> %0, <4 x i32>* %B
+  ret void
+}