This is an archive of the discontinued LLVM Phabricator instance.

don't transform splats of vector FP (PR20358)
AbandonedPublic

Authored by spatel on Jul 18 2014, 11:20 AM.

Details

Reviewers

RKSimon
aschwaighofer
scanon
hfinkel
dexonsmith

Summary

This patch addresses the performance and security problem shown here:
http://llvm.org/bugs/show_bug.cgi?id=20358

Because of unintended operations on denorms, transforming shuffle ops on FP vectors can cause a severe performance loss. Don't do that unless we're working with unsafe algebra (i.e., -ffast-math), where it's likely that denorms are being ignored or flushed.
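To illustrate the concern, here is a minimal C sketch (not part of the patch) that models a 4-lane splat followed by a multiply, and the reordered form that multiplies first and splats afterwards. The helper names (`counted_mul`, `splat_then_mul`, `mul_then_splat`, `extra_subnormal_mults`) and the counter are hypothetical, introduced only for this illustration. The sketch assumes default FP settings, with no flush-to-zero and no -ffast-math.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

#define LANES 4

/* Hypothetical counter: multiplies that saw a subnormal operand. */
static int subnormal_mults = 0;

static float counted_mul(float a, float b) {
    if (fpclassify(a) == FP_SUBNORMAL || fpclassify(b) == FP_SUBNORMAL)
        subnormal_mults++;
    return a * b;
}

/* Source order: splat lane 0 of each input, then multiply. */
static void splat_then_mul(const float *p1, const float *p2, float *out) {
    for (int i = 0; i < LANES; i++)
        out[i] = counted_mul(p1[0], p2[0]);
}

/* Reordered: multiply every lane, then splat lane 0 of the product. */
static void mul_then_splat(const float *p1, const float *p2, float *out) {
    float prod[LANES];
    for (int i = 0; i < LANES; i++)
        prod[i] = counted_mul(p1[i], p2[i]);
    for (int i = 0; i < LANES; i++)
        out[i] = prod[0];
}

/* Returns how many subnormal multiplies the reordering adds for these
   inputs. Assumes lane 0 is not NaN, so both orders compare equal. */
static int extra_subnormal_mults(const float *p1, const float *p2) {
    float a[LANES], b[LANES];
    subnormal_mults = 0;
    splat_then_mul(p1, p2, a);
    int before = subnormal_mults;
    subnormal_mults = 0;
    mul_then_splat(p1, p2, b);
    int after = subnormal_mults;
    for (int i = 0; i < LANES; i++)
        assert(a[i] == b[i]); /* same result values either way */
    return after - before;
}
```

With lane 0 holding normal values and lanes 1 to 3 of one input holding subnormals, both orders produce the same result vector, but only the reordered form multiplies the subnormal lanes.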

Diff Detail

Event Timeline

spatel updated this revision to Diff 11652. Jul 18 2014, 11:20 AM

spatel retitled this revision to don't transform splats of vector FP (PR20358).

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: aschwaighofer, hfinkel, dexonsmith.

spatel added a subscriber: Unknown Object (MLST).

Patch updated:
It's likely that denorms are not a concern in a -ffast-math setting, so limit this patch to fire (i.e., bail out of the transform) only on FP ops that are not 'fast'.

As noted in the bug report, there are now known security exploits related to denorm execution (!), so there's another reason for this patch besides the potentially giant performance loss:
http://cseweb.ucsd.edu/~dkohlbre/papers/subnormal.pdf

Also - and I may be going for a world record here - it's been ~490 days, so...

Ping. :)

This might penalize programs that don't traffic in denormals (most programs as I understand it). Did you measure the performance impact?

Did the bug report originate from a real program where this is an issue?

Though, I agree that putting a denormal on the execution path where it was not before seems like bad form.

In D4583#294622, @aschwaighofer wrote:

This might penalize programs that don't traffic in denormals (most programs as I understand it). Did you measure the performance impact?

I agree with the rarity of denormals assessment. Although I would alter it slightly to "most programs don't think they traffic in denormals." Every once in a while, you run into a performance or profile mystery that seems inexplicable until you discover that some op has dipped into denorm territory. It's hard enough to spot that when the program actually causes it. If the compiler is introducing the denormal hit, that's a debugging problem I don't think anyone ever wants to see.

So yes, there is a potential perf loss: we're not eliminating one splat instruction with this patch. I view this as a small price to pay for not introducing performance unpredictability into a program. That said, I don't see any perf differences using test-suite on x86-64 with this change.

Did the bug report originate from a real program where this is an issue?

No. But I hope I never see that real program. :)

Also, I don't know what denorm penalties look like outside of x86 and some old PPC chips. If there are targets where denorm ops are free/cheap, that would be a good argument to make this target-specific.

I just connected the dots between this case and other recent speculative execution bugs/patches:
https://llvm.org/bugs/show_bug.cgi?id=24343
https://llvm.org/bugs/show_bug.cgi?id=24818
https://llvm.org/bugs/show_bug.cgi?id=25572
http://reviews.llvm.org/D12882
http://reviews.llvm.org/D13297
http://reviews.llvm.org/D14630

This splat is just a special (and probably extremely rare) case of the general problem:

#define DENORM ((float)(1.0e-39))
float foo(float x, float y) {
   if (x > DENORM) return x * y;
   else return y;
}

$ ./clang -O2 denorm.c -S -o - -emit-llvm

define float @foo(float %x, float %y) #0 {
  %cmp = fcmp ogt float %x, 0x37D5C73000000000
  %mul = fmul float %x, %y    <--- crazy expensive op that we wanted to avoid just got executed
  %retval.0 = select i1 %cmp, float %mul, float %y
  ret float %retval.0
}

Unless there is objection, I'll abandon this patch. If we really want to solve this, we'd have to add some global flag (-fno-speculative-fp-math?) or instruction-level annotation to include FP ops under !isSafeToSpeculativelyExecute().
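A minimal C sketch of that general problem, outside LLVM: the branchy form only multiplies when the guard passes, while the select form (what the IR above computes) always multiplies. The names `tracked_fmul`, `foo_branch`, `foo_select`, and `demo`, and the counter, are hypothetical and exist only for this illustration. It assumes default FP settings with no flush-to-zero.

```c
#include <assert.h>
#include <math.h>

#define DENORM ((float)(1.0e-39))

/* Hypothetical counter: multiplies that saw a subnormal operand. */
static int subnormal_fmuls = 0;

static float tracked_fmul(float a, float b) {
    if (fpclassify(a) == FP_SUBNORMAL || fpclassify(b) == FP_SUBNORMAL)
        subnormal_fmuls++;
    return a * b;
}

/* As written in the source: the multiply is guarded by the compare. */
static float foo_branch(float x, float y) {
    if (x > DENORM)
        return tracked_fmul(x, y);
    return y;
}

/* As optimized: the multiply is speculated, then a select picks a value. */
static float foo_select(float x, float y) {
    float mul = tracked_fmul(x, y);
    return (x > DENORM) ? mul : y;
}

/* Small self-check for one subnormal input below the guard. */
static void demo(void) {
    const float tiny = 1.0e-40f; /* subnormal and not greater than DENORM */
    subnormal_fmuls = 0;
    float r1 = foo_branch(tiny, 2.0f);
    assert(subnormal_fmuls == 0); /* guarded multiply never ran */
    float r2 = foo_select(tiny, 2.0f);
    assert(subnormal_fmuls == 1); /* speculated multiply ran on tiny */
    assert(r1 == r2);             /* same returned value either way */
}
```

Both forms return the same value; the only difference is whether the multiply on the subnormal input is executed at all.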

In D4583#294643, @spatel wrote:
Unless there is objection, I'll abandon this patch. If we really want to solve this, we'd have to add some global flag (-fno-speculative-fp-math?) or instruction-level annotation to include FP ops under !isSafeToSpeculativelyExecute().

I agree. Also, this seems related to Sergey Dmitrouk's work on enabling FP env access, etc.

In D4583#295524, @hfinkel wrote:

I agree. Also, this seems related to Sergey Dmitrouk's work on enabling FP env access, etc.

Thanks, Hal. I wasn't aware of that work.
cc'ing Sergey on this thread for one more thing to consider in the design.

spatel abandoned this revision. Nov 24 2015, 8:02 AM

spatel mentioned this in rL253997: [InstCombine] fix propagation of fast-math-flags. Nov 24 2015, 9:54 AM

Revision Contents

Path                                                   Size
lib/Transforms/InstCombine/InstCombineAddSub.cpp       6 lines
lib/Transforms/InstCombine/InstCombineMulDivRem.cpp    9 lines
lib/Transforms/InstCombine/InstructionCombining.cpp    5 lines
test/Transforms/InstCombine/pr20059.ll                 16 lines
test/Transforms/InstCombine/vec_shuffle.ll             43 lines

Diff 11652

lib/Transforms/InstCombine/InstCombineAddSub.cpp

Context not available.

 Instruction *InstCombiner::visitFAdd(BinaryOperator &I) {
   bool Changed = SimplifyAssociativeOrCommutative(I);
   Value *LHS = I.getOperand(0), *RHS = I.getOperand(1);

-  if (Value *V = SimplifyVectorOp(I))
-    return ReplaceInstUsesWith(I, V);
-
   if (Value *V = SimplifyFAddInst(LHS, RHS, I.getFastMathFlags(), DL))
     return ReplaceInstUsesWith(I, V);

   if (isa<Constant>(RHS)) {
     if (isa<PHINode>(LHS))
Context not available.
 }

 Instruction *InstCombiner::visitFSub(BinaryOperator &I) {
   Value *Op0 = I.getOperand(0), *Op1 = I.getOperand(1);

-  if (Value *V = SimplifyVectorOp(I))
-    return ReplaceInstUsesWith(I, V);
-
   if (Value *V = SimplifyFSubInst(Op0, Op1, I.getFastMathFlags(), DL))
     return ReplaceInstUsesWith(I, V);

   if (isa<Constant>(Op0))
     if (SelectInst *SI = dyn_cast<SelectInst>(Op1))
Context not available.

lib/Transforms/InstCombine/InstCombineMulDivRem.cpp

Context not available.

 Instruction *InstCombiner::visitFMul(BinaryOperator &I) {
   bool Changed = SimplifyAssociativeOrCommutative(I);
   Value *Op0 = I.getOperand(0), *Op1 = I.getOperand(1);

-  if (Value *V = SimplifyVectorOp(I))
-    return ReplaceInstUsesWith(I, V);
-
   if (isa<Constant>(Op0))
     std::swap(Op0, Op1);

   if (Value *V = SimplifyFMulInst(Op0, Op1, I.getFastMathFlags(), DL))
     return ReplaceInstUsesWith(I, V);
Context not available.
 }

 Instruction *InstCombiner::visitFDiv(BinaryOperator &I) {
   Value *Op0 = I.getOperand(0), *Op1 = I.getOperand(1);

-  if (Value *V = SimplifyVectorOp(I))
-    return ReplaceInstUsesWith(I, V);
-
   if (Value *V = SimplifyFDivInst(Op0, Op1, DL))
     return ReplaceInstUsesWith(I, V);

   if (isa<Constant>(Op0))
     if (SelectInst *SI = dyn_cast<SelectInst>(Op1))
Context not available.
 }

 Instruction *InstCombiner::visitFRem(BinaryOperator &I) {
   Value *Op0 = I.getOperand(0), *Op1 = I.getOperand(1);

-  if (Value *V = SimplifyVectorOp(I))
-    return ReplaceInstUsesWith(I, V);
-
   if (Value *V = SimplifyFRemInst(Op0, Op1, DL))
     return ReplaceInstUsesWith(I, V);

   // Handle cases involving: rem X, (select Cond, Y, Z)
   if (isa<SelectInst>(Op1) && SimplifyDivRemOfSelect(I))
Context not available.

lib/Transforms/InstCombine/InstructionCombining.cpp

Context not available.
 /// \return Pointer to node that must replace the original binary operator, or
 /// null pointer if no transformation was made.
 Value *InstCombiner::SimplifyVectorOp(BinaryOperator &Inst) {
   if (!Inst.getType()->isVectorTy()) return nullptr;

+  // It is potentially harmful to operate on unknown FP vector elements.
+  // Eg, FP ops using denormals can take over 10x longer than normals.
+  assert(!Inst.getType()->isFPOrFPVectorTy() &&
+         "Attempting to transform Vector FP operation to use unknown elements");
+
   // It may not be safe to reorder shuffles and things like div, urem, etc.
   // because we may trap when executing those ops on unknown vector elements.
   // See PR20059.
   if (!isSafeToSpeculativelyExecute(&Inst, DL)) return nullptr;

Context not available.

test/Transforms/InstCombine/pr20059.ll

-; RUN: opt -S -instcombine < %s | FileCheck %s
-
-; In PR20059 ( http://llvm.org/pr20059 ), shufflevector operations are reordered/removed
-; for an srem operation. This is not a valid optimization because it may cause a trap
-; on div-by-zero.
-
-; CHECK-LABEL: @do_not_reorder
-; CHECK: %splat1 = shufflevector <4 x i32> %p1, <4 x i32> undef, <4 x i32> zeroinitializer
-; CHECK-NEXT: %splat2 = shufflevector <4 x i32> %p2, <4 x i32> undef, <4 x i32> zeroinitializer
-; CHECK-NEXT: %retval = srem <4 x i32> %splat1, %splat2
-define <4 x i32> @do_not_reorder(<4 x i32> %p1, <4 x i32> %p2) {
-  %splat1 = shufflevector <4 x i32> %p1, <4 x i32> undef, <4 x i32> zeroinitializer
-  %splat2 = shufflevector <4 x i32> %p2, <4 x i32> undef, <4 x i32> zeroinitializer
-  %retval = srem <4 x i32> %splat1, %splat2
-  ret <4 x i32> %retval
-}

test/Transforms/InstCombine/vec_shuffle.ll

Context not available.
                         <4 x i32> <i32 1, i32 2, i32 3, i32 0>
   %r = add nuw <4 x i32> %t1, %t2
   ret <4 x i32> %r
 }

-define <4 x float> @shuffle_17fsub(<4 x float> %v1, <4 x float> %v2) nounwind uwtable {
-; CHECK-LABEL: @shuffle_17fsub(
-; CHECK-NOT: shufflevector
-; CHECK: fsub <4 x float> %v1, %v2
-; CHECK: shufflevector
-  %t1 = shufflevector <4 x float> %v1, <4 x float> zeroinitializer,
-                      <4 x i32> <i32 1, i32 2, i32 3, i32 0>
-  %t2 = shufflevector <4 x float> %v2, <4 x float> zeroinitializer,
-                      <4 x i32> <i32 1, i32 2, i32 3, i32 0>
-  %r = fsub <4 x float> %t1, %t2
-  ret <4 x float> %r
-}
-
 define <4 x i32> @shuffle_17addconst(<4 x i32> %v1, <4 x i32> %v2) {
 ; CHECK-LABEL: @shuffle_17addconst(
 ; CHECK-NOT: shufflevector
 ; CHECK: [[VAR1:%[a-zA-Z0-9.]+]] = add <4 x i32> %v1, <i32 4, i32 1, i32 2, i32 3>
 ; CHECK: [[VAR2:%[a-zA-Z0-9.]+]] = shufflevector <4 x i32> [[VAR1]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
Context not available.
 ; CHECK: and
   %mask01.i = shufflevector <4 x i32> %__mask, <4 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>
   %masked_new.i.i.i = and <4 x i32> bitcast (<2 x i64> <i64 ptrtoint (<4 x i32> (<4 x i32>)* @pr20114 to i64), i64 ptrtoint (<4 x i32> (<4 x i32>)* @pr20114 to i64)> to <4 x i32>), %mask01.i
   ret <4 x i32> %masked_new.i.i.i
 }

+; In PR20059 ( http://llvm.org/pr20059 ), shufflevector operations are reordered/removed
+; for an srem operation. This is not a valid optimization because it may cause a trap
+; on div-by-zero.
+
+define <4 x i32> @pr20059(<4 x i32> %p1, <4 x i32> %p2) {
+; CHECK-LABEL: @pr20059
+; CHECK: %splat1 = shufflevector <4 x i32> %p1, <4 x i32> undef, <4 x i32> zeroinitializer
+; CHECK-NEXT: %splat2 = shufflevector <4 x i32> %p2, <4 x i32> undef, <4 x i32> zeroinitializer
+; CHECK-NEXT: %retval = srem <4 x i32> %splat1, %splat2
+  %splat1 = shufflevector <4 x i32> %p1, <4 x i32> undef, <4 x i32> zeroinitializer
+  %splat2 = shufflevector <4 x i32> %p2, <4 x i32> undef, <4 x i32> zeroinitializer
+  %retval = srem <4 x i32> %splat1, %splat2
+  ret <4 x i32> %retval
+}
+
+; In PR20358 ( http://llvm.org/pr20358 ), shufflevector operations are reordered/removed
+; for an FP mul operation. This may not be a profitable optimization because it may
+; cause operations on denormals.
+
+define <4 x float> @pr20358(<4 x float> %p1, <4 x float> %p2) {
+; CHECK-LABEL: @pr20358
+; CHECK: %splat1 = shufflevector <4 x float> %p1, <4 x float> undef, <4 x i32> zeroinitializer
+; CHECK-NEXT: %splat2 = shufflevector <4 x float> %p2, <4 x float> undef, <4 x i32> zeroinitializer
+; CHECK-NEXT: %retval = fmul <4 x float> %splat1, %splat2
+  %splat1 = shufflevector <4 x float> %p1, <4 x float> undef, <4 x i32> zeroinitializer
+  %splat2 = shufflevector <4 x float> %p2, <4 x float> undef, <4 x i32> zeroinitializer
+  %retval = fmul <4 x float> %splat1, %splat2
+  ret <4 x float> %retval
+}
Context not available.