This is an archive of the discontinued LLVM Phabricator instance.

fix for PR20354 - Miscompile of fabs due to vectorization
AbandonedPublic

Authored by spatel on Jul 22 2014, 3:07 PM.


Details

Reviewers

rengolin
chandlerc
nadav

Summary

In PR20354 ( http://llvm.org/bugs/show_bug.cgi?id=20354 ), we're miscompiling a vector fabs operation.

This patch corrects that case and allows optimization of vector fabs ops via sign bit twiddling rather than using FP instructions. It also changes the logic in visitFNEG to allow vector fneg ops to be optimized in a similar way.
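As a rough illustration of the sign-bit approach on a single value (this is not the patch's code; it assumes IEEE-754 binary32 `float`, and the helper names are hypothetical), fabs clears the sign bit with AND and fneg flips it with XOR:

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer (the "bitcast" step).
// Assumes float is IEEE-754 binary32.
inline std::uint32_t bitsOf(float F) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof Bits);
  return Bits;
}

// Reinterpret a 32-bit integer as a float (the bitcast back).
inline float floatOf(std::uint32_t Bits) {
  float F;
  std::memcpy(&F, &Bits, sizeof F);
  return F;
}

// Sign bit of a binary32 value.
const std::uint32_t SignBit = 0x80000000u;

// fabs(x): clear the sign bit with an integer AND.
inline float fabsViaBits(float F) { return floatOf(bitsOf(F) & ~SignBit); }

// fneg(x): flip the sign bit with an integer XOR.
inline float fnegViaBits(float F) { return floatOf(bitsOf(F) ^ SignBit); }
```

Neither helper performs floating-point arithmetic, which is why the transformed code does not need FP instructions when the value already lives in an integer.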

The fabs and fneg cases are similar enough that we should probably refactor the code to reduce duplication, but I don't want to add that complication to this patch.

This patch breaks an existing ARM testcase in test/CodeGen/ARM/2009-10-21-InvalidFNeg.ll. That test was expecting use of VFPU/NEON, but now we don't even need to touch the FPU. I think that's universally better for any ARM target?

I've added testcases to the existing X86 tests for vec_fabs and vec_fneg. Also added a FIXME to the existing tests in vec_fneg.ll because they don't have any checks.
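The constants checked in the new X86 tests can be derived by hand. The sketch below uses hypothetical helper names and is not part of the patch. It models a `<2 x float>` as an `i64` and relies on the 32-bit x86 convention that an `i64` is returned with its low half in EAX and its high half in EDX:

```cpp
#include <cstdint>

// The two 32-bit halves of an i64 return value on 32-bit x86.
struct RegPair {
  std::uint32_t Eax; // low 32 bits
  std::uint32_t Edx; // high 32 bits
};

inline RegPair splitI64(std::uint64_t V) {
  return RegPair{static_cast<std::uint32_t>(V),
                 static_cast<std::uint32_t>(V >> 32)};
}

// Per-lane fabs on two packed binary32 values: clear bits 31 and 63.
inline RegPair expectedFabsV2F32(std::uint64_t Bits) {
  return splitI64(Bits & 0x7fffffff7fffffffULL);
}

// Per-lane fneg on two packed binary32 values: flip bits 31 and 63.
inline RegPair expectedFnegV2F32(std::uint64_t Bits) {
  return splitI64(Bits ^ 0x8000000080000000ULL);
}
```

For the input 0xffffffff00000000, fabs gives EAX = 0 and EDX = 0x7fffffff (2147483647), while fneg gives EAX = 0x80000000 and EDX = 0x7fffffff. As a signed 32-bit immediate, 0x80000000 is printed as -2147483648.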


Event Timeline

spatel updated this revision to Diff 11787. Jul 22 2014, 3:07 PM

spatel retitled this revision from to fix for PR20354 - Miscompile of fabs due to vectorization.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: rengolin, chandlerc, nadav.

spatel added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Jul 22 2014, 3:07 PM

rengolin added inline comments. Jul 28 2014, 9:28 AM

test/CodeGen/ARM/2009-10-21-InvalidFNeg.ll
Line 14: I'm not sure what was the purpose of this test, but you can't just assume that it's ok. Can you share the resulting code? Currently, it produces this:

    add r1, sp, #36
    add r0, r0, #48
    vld1.32 {d16[0]}, [r1:32]
    add r1, r1, #4
    vld1.32 {d16[1]}, [r1:32]
    add r1, sp, #44
    vld1.32 {d17[0]}, [r1:32]
    add r1, r1, #4
    vld1.32 {d17[1]}, [r1:32]
    vneg.f32 q8, q8
    vst1.64 {d16, d17}, [r0:128]
    bx lr

Which I agree, doesn't look like a clear winner, but it might be a side effect of the original intention...

Pinging this vigorously as it is a miscompile fix that has gone without review for a week... Adding more folks.

This optimization itself looks useful, and the code is good, but surely this is hiding a bug? If so, the optimization should be committed later, once we know the real cause of the bug.

You're causing more code to hit the DAG combine here, i.e., the vector case will now do so. But to have had a miscompile the code must have been missing this combine, and hitting something later which is going wrong.

Is there another DAG combine, or even part of legalization which is wrong? I'd expect it to be common code if both ARM and X86 exhibit the same bug.

Hi Pete and Renato -

Thank you very much for the feedback. I won't be back at an LLVM dev machine until next week, so I can't generate the new code for the ARM test at the moment.

The cause of the bug for fabs in PR20354 is that we are incorrectly guarding against vector operands, so we produce a bit mask where only the *first* high bit of a vector was masked off: 0x7fffffffffffffff. We need to mask off each vector element's high bit: 0x7fffffff7fffffff. I'll try to make this clearer by adding some comments.
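The difference between the two masks can be reproduced with plain 64-bit integers. This is a minimal sketch with hypothetical helper names, not the DAGCombiner code; it models a `<2 x float>` bitcast to `i64`:

```cpp
#include <cstdint>

// Hypothetical helper: repeat a LaneBits-wide pattern across 64 bits.
// Precondition: LaneBits is a nonzero divisor of 64.
inline std::uint64_t splat64(std::uint64_t Lane, unsigned LaneBits) {
  std::uint64_t Result = 0;
  for (unsigned Shift = 0; Shift < 64; Shift += LaneBits)
    Result |= Lane << Shift;
  return Result;
}

// Mask derived from the width of the whole integer: clears only bit 63.
inline std::uint64_t wholeIntFabsMask() {
  return ~(std::uint64_t(1) << 63); // 0x7fffffffffffffff
}

// Mask derived from the 32-bit element width, then splatted:
// clears the sign bit of every lane (bits 31 and 63).
inline std::uint64_t perLaneFabsMask() {
  return splat64(0x7fffffffu, 32); // 0x7fffffff7fffffff
}
```

With the input 0x00000000ffffffff, the whole-integer mask leaves the low lane's sign bit set (the result is still 0x00000000ffffffff), whereas the per-lane mask yields 0x000000007fffffff.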

I think we can just fix the bug in FABS by changing the 'if' check in visitFABS to match what already exists in visitFNEG. I.e., this part of the check:
!N0.getOperand(0).getValueType().isVector()

would become:
!VT.isVector()

Of course, this will not generate the optimal code for vectors, but it should avoid the miscompilation.

Hi Sanjay

Oh yeah, I see the fix you've made for the mask, and I totally agree that fix is correct, and required.

My point is that because of !VT.isVector(), we shouldn't even be hitting the code inside that branch right now. So if you're getting miscompilations, it's due to bad code elsewhere.

Or is the point that you're only getting miscompilations *after* you remove !VT.isVector(), and that's what you're fixing?

Thanks.

Hi Pete -
We're getting a miscompilation from the existing / unchanged code for visitFABS. I've tried to combine a bug fix and two optimizations into this single patch, and I think that was a mistake because that's causing confusion.

Let me abandon this patch, submit a new patch that *only* fixes the bug for FABS, follow that up with a patch that optimizes the vector case for FABS, and finally submit a final patch that optimizes the vector case for FNEG.

Thanks for the clarification and for spinning off the other work into its own bug.

The remainder in this bug LGTM given the fix in the other patch.

Thanks,
Pete

I'm still worried about the ARM changes. Not that it looks wrong, just that the test doesn't tell me the whole picture. Can you share the resulting asm, please?

--renato

Hi Renato -

Certainly, I will put the ARM code change into the forthcoming patch for FNEG and cc you. Ok to do that in the new patch rather than putting anything more in this abandoned patch?

I have the FABS optimization patch proposal ready right now, so I will add you as a reviewer on that too. I hope the extra comments that I've included there will make it more obvious how the code is expected to change.

Sure, I only realised it was abandoned after I commented... ;)

--renato

Revision Contents

Path                                          Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp      40 lines
test/CodeGen/ARM/2009-10-21-InvalidFNeg.ll    10 lines
test/CodeGen/X86/vec_fabs.ll                  26 lines
test/CodeGen/X86/vec_fneg.ll                  28 lines

Diff 11787

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

Context not available.
                         &DAG.getTarget().Options))
     return GetNegatedExpression(N0, DAG, LegalOperations);

-  // Transform fneg(bitconvert(x)) -> bitconvert(x^sign) to avoid loading
+  // Transform fneg(bitconvert(x)) -> bitconvert(x ^ sign) to avoid loading
   // constant pool values.
-  if (!TLI.isFNegFree(VT) && N0.getOpcode() == ISD::BITCAST &&
-      !VT.isVector() &&
-      N0.getNode()->hasOneUse() &&
-      N0.getOperand(0).getValueType().isInteger()) {
+  if (!TLI.isFNegFree(VT) &&
+      N0.getOpcode() == ISD::BITCAST &&
+      N0.getNode()->hasOneUse()) {
     SDValue Int = N0.getOperand(0);
     EVT IntVT = Int.getValueType();
     if (IntVT.isInteger() && !IntVT.isVector()) {
+      APInt SignMask;
+      if (N0.getValueType().isVector()) {
+        SignMask = APInt::getSignBit(N0.getValueType().getScalarSizeInBits());
+        SignMask = APInt::getSplat(IntVT.getSizeInBits(), SignMask);
+      } else {
+        SignMask = APInt::getSignBit(IntVT.getSizeInBits());
+      }
       Int = DAG.getNode(ISD::XOR, SDLoc(N0), IntVT, Int,
-                        DAG.getConstant(APInt::getSignBit(IntVT.getSizeInBits()), IntVT));
+                        DAG.getConstant(SignMask, IntVT));
       AddToWorklist(Int.getNode());
-      return DAG.getNode(ISD::BITCAST, SDLoc(N),
-                         VT, Int);
+      return DAG.getNode(ISD::BITCAST, SDLoc(N), VT, Int);
     }
   }

Context not available.
   if (N0.getOpcode() == ISD::FNEG || N0.getOpcode() == ISD::FCOPYSIGN)
     return DAG.getNode(ISD::FABS, SDLoc(N), VT, N0.getOperand(0));

-  // Transform fabs(bitconvert(x)) -> bitconvert(x&~sign) to avoid loading
+  // Transform fabs(bitconvert(x)) -> bitconvert(x & ~sign) to avoid loading
   // constant pool values.
   if (!TLI.isFAbsFree(VT) &&
-      N0.getOpcode() == ISD::BITCAST && N0.getNode()->hasOneUse() &&
-      N0.getOperand(0).getValueType().isInteger() &&
-      !N0.getOperand(0).getValueType().isVector()) {
+      N0.getOpcode() == ISD::BITCAST &&
+      N0.getNode()->hasOneUse()) {
     SDValue Int = N0.getOperand(0);
     EVT IntVT = Int.getValueType();
     if (IntVT.isInteger() && !IntVT.isVector()) {
+      APInt SignMask;
+      if (N0.getValueType().isVector()) {
+        SignMask = ~APInt::getSignBit(N0.getValueType().getScalarSizeInBits());
+        SignMask = APInt::getSplat(IntVT.getSizeInBits(), SignMask);
+      } else {
+        SignMask = ~APInt::getSignBit(IntVT.getSizeInBits());
+      }
       Int = DAG.getNode(ISD::AND, SDLoc(N0), IntVT, Int,
-                        DAG.getConstant(~APInt::getSignBit(IntVT.getSizeInBits()), IntVT));
+                        DAG.getConstant(SignMask, IntVT));
       AddToWorklist(Int.getNode());
-      return DAG.getNode(ISD::BITCAST, SDLoc(N),
-                         N->getValueType(0), Int);
+      return DAG.getNode(ISD::BITCAST, SDLoc(N), N->getValueType(0), Int);
     }
   }

Context not available.

test/CodeGen/ARM/2009-10-21-InvalidFNeg.ll

-; RUN: llc -mcpu=cortex-a8 -mattr=+neon < %s | grep vneg
+; RUN: llc %s -mcpu=cortex-a8 -mattr=+neon -o - | FileCheck %s

 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64"
 target triple = "armv7-eabi"

Context not available.
 %fff = type { %struct.vec_float4 }
 %struct.vec_float4 = type { <4 x float> }

+; As part of the optimization from PR20354, this test should no longer use any VFP/NEON.
+; CHECK-LABEL: foo
+; CHECK: eor {{.}}, {{.}}, #-2147483648
+; CHECK: eor {{.}}, {{.}}, #-2147483648
+; CHECK: eor {{.}}, {{.}}, #-2147483648
+; CHECK: eor {{.}}, {{.}}, #-2147483648
+
 define linkonce_odr arm_aapcs_vfpcc void @foo(%eee* noalias sret %agg.result, i64 %tfrm.0.0, i64 %tfrm.0.1, i64 %tfrm.0.2, i64 %tfrm.0.3, i64 %tfrm.0.4, i64 %tfrm.0.5, i64 %tfrm.0.6, i64 %tfrm.0.7) nounwind noinline {
 entry:
   %tmp104 = zext i64 %tfrm.0.2 to i512 ; <i512> [#uses=1]
Context not available.

test/CodeGen/X86/vec_fabs.ll

Context not available.
   ret <8 x float> %t
 }
 declare <8 x float> @llvm.fabs.v8f32(<8 x float> %p)

+; The following 2 tests are related to PR20354 ( http://llvm.org/bugs/show_bug.cgi?id=20354 ).
+; If we're bitcasting to and from FP values, we can avoid the FPU entirely.
+
+; Make sure that we're only turning off the sign bit of each float value.
+
+; CHECK-LABEL: fabs_v2f32_1
+; CHECK: xorl %eax, %eax
+; CHECK: movl $2147483647, %edx
+define i64 @fabs_v2f32_1() {
+  %ones = bitcast i64 18446744069414584320 to <2 x float> ; 0xffffffff00000000
+  %fabs = call <2 x float> @llvm.fabs.v2f32(<2 x float> %ones)
+  %ret = bitcast <2 x float> %fabs to i64
+  ret i64 %ret
+}
+
+; CHECK-LABEL: fabs_v2f32_2
+; CHECK: movl $2147483647, %eax
+; CHECK: xorl %edx, %edx
+define i64 @fabs_v2f32_2() {
+  %ones = bitcast i64 4294967295 to <2 x float> ; 0x00000000ffffffff
+  %fabs = call <2 x float> @llvm.fabs.v2f32(<2 x float> %ones)
+  %ret = bitcast <2 x float> %fabs to i64
+  ret i64 %ret
+}
+declare <2 x float> @llvm.fabs.v2f32(<2 x float> %p)
Context not available.

test/CodeGen/X86/vec_fneg.ll

-; RUN: llc < %s -march=x86 -mattr=+sse2
+; RUN: llc < %s -march=x86 | FileCheck %s

+; FIXME: The following 2 tests don't have any checks!
 define <4 x float> @t1(<4 x float> %Q) {
   %tmp15 = fsub <4 x float> < float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00 >, %Q
   ret <4 x float> %tmp15
Context not available.
   %tmp15 = fsub <4 x float> zeroinitializer, %Q
   ret <4 x float> %tmp15
 }

+; The following 2 tests are related to PR20354 ( http://llvm.org/bugs/show_bug.cgi?id=20354 ).
+; If we're bitcasting to and from FP values, we can avoid the FPU entirely.
+
+; Make sure that we're only flipping the sign bit of each float.
+
+; CHECK-LABEL: fneg_v2f32_1
+; CHECK: movl $-2147483648, %eax
+; CHECK: movl $2147483647, %edx
+define i64 @fneg_v2f32_1() {
+  %ones = bitcast i64 18446744069414584320 to <2 x float> ; 0xffffffff00000000
+  %fneg = fsub <2 x float> <float -0.0, float -0.0>, %ones
+  %ret = bitcast <2 x float> %fneg to i64
+  ret i64 %ret
+}
+
+; CHECK-LABEL: fneg_v2f32_2
+; CHECK: movl $2147483647, %eax
+; CHECK: movl $-2147483648, %edx
+define i64 @fneg_v2f32_2() {
+  %ones = bitcast i64 4294967295 to <2 x float> ; 0x00000000ffffffff
+  %fneg = fsub <2 x float> <float -0.0, float -0.0>, %ones
+  %ret = bitcast <2 x float> %fneg to i64
+  ret i64 %ret
+}
Context not available.