This is an archive of the discontinued LLVM Phabricator instance.

don't transform splats of vector FP (PR20358)
Abandoned, Public

Authored by spatel on Jul 18 2014, 11:20 AM.

Details

Reviewers

RKSimon
aschwaighofer
scanon
hfinkel
dexonsmith

Summary

This patch addresses the performance and security problem shown here:
http://llvm.org/bugs/show_bug.cgi?id=20358

Transforming shuffle ops on FP vectors can make the program operate on denormals that it never asked to touch, which may cause a severe performance loss. Don't do the transform unless we're working with unsafe algebra (i.e., -ffast-math), where it's likely that denorms are being ignored or flushed.
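
To make the concern concrete, here is a minimal sketch in plain C (not LLVM code) that models a 4-lane splat followed by a subtract, and the reordered form that subtracts first and splats afterwards. The helper names (splat, splat_then_sub, sub_then_splat, subnormal_operand_lanes) and the choice of lane 0 are assumptions made for illustration; the real transform works on shufflevector instructions in LLVM IR.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define LANES 4

/* Hypothetical helper: number of lanes in which a binary FP op on (a, b)
   would read at least one subnormal operand. */
static int subnormal_operand_lanes(const float *a, const float *b) {
    int n = 0;
    for (size_t i = 0; i < LANES; ++i)
        if (fpclassify(a[i]) == FP_SUBNORMAL || fpclassify(b[i]) == FP_SUBNORMAL)
            ++n;
    return n;
}

/* Splat: broadcast one lane of v into every lane of out. */
static void splat(const float *v, size_t lane, float *out) {
    assert(lane < LANES);
    for (size_t i = 0; i < LANES; ++i)
        out[i] = v[lane];
}

/* Order written by the program: splat both inputs, then subtract.
   Returns how many lanes of the subtract saw a subnormal operand. */
static int splat_then_sub(const float *v1, const float *v2, float *r) {
    float t1[LANES], t2[LANES];
    splat(v1, 0, t1);
    splat(v2, 0, t2);
    for (size_t i = 0; i < LANES; ++i)
        r[i] = t1[i] - t2[i];
    return subnormal_operand_lanes(t1, t2);
}

/* Reordered form: subtract the full vectors, then splat the result.
   The subtract now also runs on lanes 1..3, which the program never used. */
static int sub_then_splat(const float *v1, const float *v2, float *r) {
    float d[LANES];
    for (size_t i = 0; i < LANES; ++i)
        d[i] = v1[i] - v2[i];
    splat(d, 0, r);
    return subnormal_operand_lanes(v1, v2);
}
```

With v1 = {3.0f, 1.0e-39f, 0.0f, 2.0f} and v2 = {1.0f, 5.0f, 1.0e-40f, 2.0f}, both orders produce 2.0f in every lane, but only the reordered form performs subtracts in lanes that hold subnormal values (two of them here). The results match; the cost profile may not.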

Event Timeline

spatel updated this revision to Diff 11652. Jul 18 2014, 11:20 AM

spatel retitled this revision from to don't transform splats of vector FP (PR20358).

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: aschwaighofer, hfinkel, dexonsmith.

spatel added a subscriber: Unknown Object (MLST).

Patch updated:
It's likely that denorms are not a concern in a -ffast-math setting, so limit this patch to only fire (i.e., bail out of the transform) on FP ops that are not 'fast'.

As noted in the bug report, there are now known security exploits related to denorm execution (!), so there's another reason for this patch besides the potentially giant performance loss:
http://cseweb.ucsd.edu/~dkohlbre/papers/subnormal.pdf

Also - and I may be going for a world-record here - it's been ~490 days, so...

Ping. :)

This might penalize programs that don't traffic in denormals (most programs as I understand it). Did you measure the performance impact?

Did the bug report originate from a real program where this is an issue?

Though, I agree that putting a denormal on the execution path where it was not before seems like bad form.

In D4583#294622, @aschwaighofer wrote:

This might penalize programs that don't traffic in denormals (most programs as I understand it). Did you measure the performance impact?

I agree with the rarity of denormals assessment. Although I would alter it slightly to "most programs don't think they traffic in denormals." Every once in a while, you run into a performance or profile mystery that seems inexplicable until you discover that some op has dipped into denorm territory. It's hard enough to spot that when the program actually causes it. If the compiler is introducing the denormal hit, that's a debugging problem I don't think anyone ever wants to see.

So yes, there is a potential perf loss: we're not eliminating one splat instruction with this patch. I view this as a small price to pay for not introducing performance unpredictability into a program. That said, I don't see any perf differences using test-suite on x86-64 with this change.

Did the bug report originate from a real program where this is an issue?

No. But I hope I never see that real program. :)

Also, I don't know what denorm penalties look like outside of x86 and some old PPC chips. If there are targets where denorm ops are free/cheap, that would be a good argument to make this target-specific.

I just connected the dots between this case and other recent speculative execution bugs/patches:
https://llvm.org/bugs/show_bug.cgi?id=24343
https://llvm.org/bugs/show_bug.cgi?id=24818
https://llvm.org/bugs/show_bug.cgi?id=25572
http://reviews.llvm.org/D12882
http://reviews.llvm.org/D13297
http://reviews.llvm.org/D14630

This splat is just a special (and probably extremely rare) case of the general problem:

#define DENORM ((float)(1.0e-39))
float foo(float x, float y) {
   if (x > DENORM) return x * y;
   else return y;
}

$ ./clang -O2 denorm.c -S -o - -emit-llvm

define float @foo(float %x, float %y) #0 {
  %cmp = fcmp ogt float %x, 0x37D5C73000000000
  %mul = fmul float %x, %y    <--- crazy expensive op that we wanted to avoid just got executed
  %retval.0 = select i1 %cmp, float %mul, float %y
  ret float %retval.0
}

Unless there is objection, I'll abandon this patch. If we really want to solve this, we'd have to add some global flag (-fno-speculative-fp-math?) or instruction-level annotation to include FP ops under !isSafeToSpeculativelyExecute().
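
The gap between the source and the optimized IR above can be modeled in C as follows. This is only a sketch: counted_fmul, foo_branch, foo_select, and the fmul_on_subnormal counter are hypothetical names introduced here to make the speculated multiply observable; none of them are part of the patch or of LLVM.

```c
#include <assert.h>
#include <math.h>

#define DENORM ((float)(1.0e-39))

/* Hypothetical counter: how many multiplies read a subnormal operand. */
static int fmul_on_subnormal = 0;

/* Multiply that records whether either operand was subnormal. */
static float counted_fmul(float x, float y) {
    if (fpclassify(x) == FP_SUBNORMAL || fpclassify(y) == FP_SUBNORMAL)
        ++fmul_on_subnormal;
    return x * y;
}

/* Sanity check: 1.0e-39 is below FLT_MIN, so as a float it is subnormal. */
static void check_denorm_is_subnormal(void) {
    assert(fpclassify(DENORM) == FP_SUBNORMAL);
}

/* Source order: the multiply only runs when the guard passes. */
static float foo_branch(float x, float y) {
    if (x > DENORM)
        return counted_fmul(x, y);
    return y;
}

/* Shape of the optimized IR: the multiply runs unconditionally and a
   select picks between its result and y afterwards. */
static float foo_select(float x, float y) {
    int cmp = x > DENORM;
    float mul = counted_fmul(x, y);
    return cmp ? mul : y;
}
```

For x = 5.0e-40f and y = 3.0f, both versions return 3.0f, but only foo_select bumps the counter: the multiply on the subnormal x was executed and its result discarded. For ordinary inputs such as x = 2.0f the two versions are indistinguishable.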

In D4583#294643, @spatel wrote:
Unless there is objection, I'll abandon this patch. If we really want to solve this, we'd have to add some global flag (-fno-speculative-fp-math?) or instruction-level annotation to include FP ops under !isSafeToSpeculativelyExecute().

I agree. Also, this seems related to Sergey Dmitrouk's work on enabling FP env access, etc.

In D4583#295524, @hfinkel wrote:

I agree. Also, this seems related to Sergey Dmitrouk's work on enabling FP env access, etc.

Thanks, Hal. I wasn't aware of that work.
cc'ing Sergey on this thread for one more thing to consider in the design.

spatel abandoned this revision. Nov 24 2015, 8:02 AM

spatel mentioned this in rL253997: [InstCombine] fix propagation of fast-math-flags. Nov 24 2015, 9:54 AM

Revision Contents

Path                                                  Size
lib/Transforms/InstCombine/InstructionCombining.cpp   15 lines
test/Transforms/InstCombine/vec_shuffle.ll            25 lines

Diff 40871

lib/Transforms/InstCombine/InstructionCombining.cpp

(1,239 preceding lines not shown; context: do {)
     Ancestor = Ancestor->user_back();
   } while (1);
 }

 /// \brief Creates node of binary operation with the same attributes as the
 /// specified one but with other operands.
 static Value *CreateBinOpAsGiven(BinaryOperator &Inst, Value *LHS, Value *RHS,
                                  InstCombiner::BuilderTy *B) {
+  // FIXME: Propagate fast-math-flags.
   Value *BORes = B->CreateBinOp(Inst.getOpcode(), LHS, RHS);
   if (BinaryOperator *NewBO = dyn_cast<BinaryOperator>(BORes)) {
     if (isa<OverflowingBinaryOperator>(NewBO)) {
       NewBO->setHasNoSignedWrap(Inst.hasNoSignedWrap());
       NewBO->setHasNoUnsignedWrap(Inst.hasNoUnsignedWrap());
     }
     if (isa<PossiblyExactOperator>(NewBO))
       NewBO->setIsExact(Inst.isExact());
(9 lines not shown; context: Value *InstCombiner::SimplifyVectorOp(BinaryOperator &Inst) {)
   if (!Inst.getType()->isVectorTy()) return nullptr;

   // It may not be safe to reorder shuffles and things like div, urem, etc.
   // because we may trap when executing those ops on unknown vector elements.
   // See PR20059.
   if (!isSafeToSpeculativelyExecute(&Inst))
     return nullptr;

+  // The shuffle transformations below may create floating-point math operations
+  // on values not specified by the program. Those values may include denormals.
+  // Operating on denormals may be extremely expensive or dangerous (PR20358).
+  // If unsafe algebra is not allowed, bail out to avoid that possibility.
+
+  // TODO: There is no direct connection between unsafe algebra and denormal
+  // handling, but it is likely that an unsafe FP environment is also an
+  // environment where denormals are automatically flushed to zero. If support
+  // for detecting/changing the FP environment is added, this check should be
+  // improved to directly query the settings for denormals.
+
+  if (isa<FPMathOperator>(Inst) && !Inst.hasUnsafeAlgebra())
+    return nullptr;
+
   unsigned VWidth = cast<VectorType>(Inst.getType())->getNumElements();
   Value *LHS = Inst.getOperand(0), *RHS = Inst.getOperand(1);
   assert(cast<VectorType>(LHS->getType())->getNumElements() == VWidth);
   assert(cast<VectorType>(RHS->getType())->getNumElements() == VWidth);

   // If both arguments of binary operation are shuffles, which use the same
   // mask and shuffle within a single vector, it is worthwhile to move the
   // shuffle after binary operation:
(1,892 following lines not shown)

test/Transforms/InstCombine/vec_shuffle.ll

(304 preceding lines not shown; context: ; CHECK: shufflevector)
   %t1 = shufflevector <4 x i32> %v1, <4 x i32> zeroinitializer,
                       <4 x i32> <i32 1, i32 2, i32 3, i32 0>
   %t2 = shufflevector <4 x i32> %v2, <4 x i32> zeroinitializer,
                       <4 x i32> <i32 1, i32 2, i32 3, i32 0>
   %r = add nuw <4 x i32> %t1, %t2
   ret <4 x i32> %r
 }

+; If the FP operation is 'fast', hoist it to eliminate a shuffle.
+
 define <4 x float> @shuffle_17fsub(<4 x float> %v1, <4 x float> %v2) nounwind uwtable {
 ; CHECK-LABEL: @shuffle_17fsub(
-; CHECK-NOT: shufflevector
-; CHECK: fsub <4 x float> %v1, %v2
-; CHECK: shufflevector
+; CHECK-NEXT: fsub <4 x float> %v1, %v2
+; CHECK-NEXT: shufflevector
+; CHECK-NEXT: ret <4 x float>
+  %t1 = shufflevector <4 x float> %v1, <4 x float> zeroinitializer,
+                      <4 x i32> <i32 1, i32 2, i32 3, i32 0>
+  %t2 = shufflevector <4 x float> %v2, <4 x float> zeroinitializer,
+                      <4 x i32> <i32 1, i32 2, i32 3, i32 0>
+  %r = fsub fast <4 x float> %t1, %t2
+  ret <4 x float> %r
+}
+
+; If the FP operation is not 'fast', do not risk operating on denormals:
+; https://llvm.org/bugs/show_bug.cgi?id=20358
+
+define <4 x float> @pr20358(<4 x float> %v1, <4 x float> %v2) nounwind uwtable {
+; CHECK-LABEL: @pr20358(
+; CHECK-NEXT: %t1 = shufflevector <4 x float> %v1, <4 x float> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
+; CHECK-NEXT: %t2 = shufflevector <4 x float> %v2, <4 x float> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
+; CHECK-NEXT: %r = fsub <4 x float> %t1, %t2
+; CHECK-NEXT: ret <4 x float> %r
   %t1 = shufflevector <4 x float> %v1, <4 x float> zeroinitializer,
                       <4 x i32> <i32 1, i32 2, i32 3, i32 0>
   %t2 = shufflevector <4 x float> %v2, <4 x float> zeroinitializer,
                       <4 x i32> <i32 1, i32 2, i32 3, i32 0>
   %r = fsub <4 x float> %t1, %t2
   ret <4 x float> %r
 }

(114 following lines not shown)