Details

Reviewers

spatel
RKSimon
zvi
craig.topper
igorb
hfinkel

Commits

rG16b20d2fc5df: [X86][X86 intrinsics]Folding cmp(sub(a,b),0) into cmp(a,b) optimization
rL300422: [X86][X86 intrinsics]Folding cmp(sub(a,b),0) into cmp(a,b) optimization

Summary

This patch is one of two related reviews and is based on https://reviews.llvm.org/D31396.

This patch adds a new optimization (folding cmp(sub(a,b),0) into cmp(a,b))
to the instCombineCall pass and was written specifically for the X86 CMP intrinsics.

Diff Detail

Event Timeline

m_zuckerman created this revision.Mar 27 2017, 8:42 AM

m_zuckerman added a parent revision: D31396: [X86][LLVM][Canonical Compare Intrinsics] Creating a canonical representation for X86 CMP intrinsics.

Please make sure you always include llvm-commits as a subscriber in future patches.

lib/Transforms/InstCombine/InstCombineCalls.cpp
2401	Pull out these repeated calls to getFastMathFlags: FastMathFlags FMFs = I->getFastMathFlags();
test/Transforms/InstCombine/X86FsubCmpCombine.ll
2	Regenerate with utils/update_test_checks.py
6	maybe split these into one test per intrinsic? also do tests for safe algebra (or maybe one of the other fmfs each per test).
13	clean this up if possible

m_zuckerman updated this revision to Diff 93545.Mar 30 2017, 2:41 PM

m_zuckerman marked 4 inline comments as done.

If this transform is valid (cc'ing @scanon), then should we also do this for general IR?

define i1 @fcmpsub(double %x, double %y) {
  %sub = fsub nnan ninf nsz double %x, %y
  %cmp = fcmp nnan ninf nsz ugt double %sub, 0.0
  ret i1 %cmp
}

define i1 @fcmpsub(double %x, double %y) {
  %cmp = fcmp nnan ninf nsz ugt double %x, %y
  ret i1 %cmp
}

lib/Transforms/InstCombine/InstCombineCalls.cpp
2401	Don't we need to check the FMF of the intrinsic too?
2402–2403	Currently, unsafeAlgebra implies all of the other FMF bits, so checking that bit is redundant here. If we change the definition of unsafeAlgebra in the future (there was a proposal for this recently), then this check will be wrong. Either way, remove the unsafeAlgebra predicate (unless I'm misunderstanding the constraints of this transform).

(x-y) == 0 --> x == y does not require nsz (zeros of any sign compare equal), nor does it require nnan (if x or y is NaN, both comparisons are false). It *does* require ninf (because inf-inf = NaN), and it also requires that subnormals are not flushed by the CPU. There's been some churn around flushing recently, Hal may have thoughts (+Hal).
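The three cases named in the comment above can be checked with ordinary scalar doubles. The following is only a minimal sketch in standard C++, not code from the patch: the helper names (eqViaSub, eqDirect, foldPreservesResult, checkSpecialCases) are hypothetical, and it assumes the program is built without fast-math options so that IEEE 754 semantics apply.

```cpp
#include <cassert>
#include <limits>

// Equality as written before the fold: cmp(sub(x, y), 0).
bool eqViaSub(double x, double y) { return (x - y) == 0.0; }

// Equality as written after the fold: cmp(x, y).
bool eqDirect(double x, double y) { return x == y; }

// True when both forms give the same answer for the pair (x, y).
bool foldPreservesResult(double x, double y) {
  return eqViaSub(x, y) == eqDirect(x, y);
}

// Exercises the special cases discussed in the review comment.
void checkSpecialCases() {
  const double inf = std::numeric_limits<double>::infinity();
  const double nan = std::numeric_limits<double>::quiet_NaN();

  // Signed zeros compare equal in both forms, so nsz is not needed.
  assert(foldPreservesResult(0.0, -0.0));
  // A NaN operand makes both comparisons false, so nnan is not needed.
  assert(foldPreservesResult(nan, 1.0));
  // inf - inf is NaN, so the forms disagree unless ninf rules this out.
  assert(!foldPreservesResult(inf, inf));
}
```

Calling checkSpecialCases() from a test driver should pass on a conforming IEEE 754 implementation; only the equal-infinities pair separates the two forms.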

In D31398#715296, @spatel wrote:
If this transform is valid (cc'ing @scanon), then should we also do this for general IR?
define i1 @fcmpsub(double %x, double %y) {
  %sub = fsub nnan ninf nsz double %x, %y
  %cmp = fcmp nnan ninf nsz ugt double %sub, 0.0
  ret i1 %cmp
}

define i1 @fcmpsub(double %x, double %y) {
  %cmp = fcmp nnan ninf nsz ugt double %x, %y
  ret i1 %cmp
}

You are absolutely right: your transform is valid, and we will do it after this patch.
Since the intrinsics are lowered with generic IR, my patch is still valid, and we will need both for a complete solution.

lib/Transforms/InstCombine/InstCombineCalls.cpp
2401	No, we don't need to check; this is implied by the flags on the sub. According to http://llvm.org/docs/LangRef.html#fast-math-flags, optimizations can assume that the arguments and the result behave as the flags promise. Since the compare uses the result and splits it back into two arguments (the same arguments as in the sub), we are still working under the earlier assumption. We can keep relying on those assumptions as long as we work with the same arguments or the same result.

m_zuckerman updated this revision to Diff 93747.Apr 1 2017, 7:07 AM

In D31398#716002, @m_zuckerman wrote:

You are absolutely right: your transform is valid, and we will do it after this patch.
Since the intrinsics are lowered with generic IR, my patch is still valid, and we will need both for a complete solution.

We need to be conservative for the general case...speaking from experience :). As @scanon mentioned, we need some way to tell whether denorms are flushed to zero or not. I think this patch is safe currently because we assume the default FP ENV, and on x86 that would not have DAZ/FTZ.
We don't need nsz or nnan for this fold (see @scanon comment).
I'd prefer to put all of the tests in one file since they are just variants of the same fold.

In D31398#716031, @spatel wrote:

In D31398#716002, @m_zuckerman wrote:

You are absolutely right: your transform is valid, and we will do it after this patch.
Since the intrinsics are lowered with generic IR, my patch is still valid, and we will need both for a complete solution.

We need to be conservative for the general case...speaking from experience :). As @scanon mentioned, we need some way to tell whether denorms are flushed to zero or not. I think this patch is safe currently because we assume the default FP ENV, and on x86 that would not have DAZ/FTZ.

We don't need nsz or nnan for this fold (see @scanon comment).

I'd prefer to put all of the tests in one file since they are just variants of the same fold.

1. I agree with you: in the general case we need to be cautious, but this transformation is feasible.
For X86, this transformation is safe from all perspectives (as you wrote). We know the status of the flags and we know the target.

3. I am with you on that.

m_zuckerman added a reviewer: llvm-commits.Apr 2 2017, 4:45 AM

m_zuckerman removed a reviewer: llvm-commits.

m_zuckerman updated this revision to Diff 93795.Apr 2 2017, 9:15 AM

spatel added inline comments.Apr 3 2017, 7:39 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
2396	You can use 'auto *' with dyn_cast because the type is obvious.
2398–2400	This comment should specify the non-obvious constraints that we've discussed here:
// This fold requires NINF because inf minus inf is nan.
// NSZ is not needed because zeros of any sign are equal for both compares.
// NNAN is not needed because nans compare the same for both compares.
// FMF are not needed on the compare intrinsic because...
test/Transforms/InstCombine/X86FsubCmpCombine.ll
7	Misspelling in function name. Also, as suggested earlier, I'd really prefer to have one test per intrinsic rather than everything in one function. It makes it a lot easier to see the simple pattern that's getting folded.

m_zuckerman updated this revision to Diff 94199.Apr 5 2017, 4:38 AM

Ping

LGTM - see inline comments for a couple of cleanups.

lib/Transforms/InstCombine/InstCombineCalls.cpp
2396	You missed this nit.
2407–2410	Giving local names to the operands doesn't add value here IMO, but if you want to do that, I prefer to use "A" and "B" to match the formula in the comment.

This revision is now accepted and ready to land.Apr 10 2017, 4:36 PM

In D31398#715307, @scanon wrote:

(x-y) == 0 --> x == y does not require nsz (zeros of any sign compare equal), nor does it require nnan (if x or y is NaN, both comparisons are false). It *does* require ninf (because inf-inf = NaN), and it also requires that subnormals are not flushed by the CPU. There's been some churn around flushing recently, Hal may have thoughts (+Hal).

We still assume standard denormal handling. There's ongoing work to add intrinsics for operations that are sensitive to this behavior (and rounding modes, etc.), but that shouldn't affect this.
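To make the subnormal concern concrete, here is a rough sketch in standard C++. It does not touch the real FTZ/DAZ control bits; flushSubnormal is a hypothetical software model of a unit that flushes subnormal results to zero, and the other helper names are likewise invented for illustration. It assumes the default floating-point environment with gradual underflow.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical model of flush-to-zero: a subnormal result becomes a zero
// of the same sign. Real hardware does this via control-register bits.
double flushSubnormal(double v) {
  return std::fpclassify(v) == FP_SUBNORMAL ? std::copysign(0.0, v) : v;
}

// cmp(sub(x, y), 0) with gradual underflow (the default environment).
bool eqViaSubGradual(double x, double y) { return (x - y) == 0.0; }

// cmp(sub(x, y), 0) when the subtraction result is flushed to zero.
bool eqViaSubFlushed(double x, double y) {
  return flushSubnormal(x - y) == 0.0;
}

void checkFlushingCase() {
  const double tiny = std::numeric_limits<double>::min(); // smallest normal
  const double x = 1.5 * tiny;
  const double y = tiny;

  assert(x != y);                                 // direct compare: not equal
  assert(std::fpclassify(x - y) == FP_SUBNORMAL); // difference is subnormal
  assert(!eqViaSubGradual(x, y));                 // matches x == y
  assert(eqViaSubFlushed(x, y));                  // flushing changes the answer
}
```

Under the flushing model the subtract-then-compare form reports equality for two distinct inputs, which is why the fold is only equivalent when standard denormal handling is assumed.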

Hi,
I made a small modification to the code, since I am going to retire from the canonical compare representation review.
I added the ability to fold cmp(0, fsub(a,b)) in addition to the reviewed cmp(fsub(a,b),0).

Thanks,
Michael Zuckerman

spatel added inline comments.Apr 14 2017, 7:26 AM

Transforms/InstCombine/InstCombineCalls.cpp

2348–2366 ↗

(On Diff #95159)

There are many ways to deal with commuted patterns, and a 2-loop is my least favorite. Would you consider using std::swap and matchers instead? Something like:

Value *Arg0 = II->getArgOperand(0);
Value *Arg1 = II->getArgOperand(1);
bool Arg0IsZero = match(Arg0, m_Zero());
if (Arg0IsZero)
  std::swap(Arg0, Arg1);
Value *A, *B;
if ((match(Arg0, m_OneUse(m_FSub(m_Value(A), m_Value(B)))) &&
     match(Arg1, m_Zero()) &&
     cast<Instruction>(Arg0)->getFastMathFlags().noInfs())) {
  if (Arg0IsZero)
    std::swap(A, B);
  II->setArgOperand(0, A);
  II->setArgOperand(1, B);
  return II;
}
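The second std::swap in the suggestion above exists because moving the zero to the left-hand side reverses the roles of A and B. A small numeric sketch of that relationship follows, in plain C++ and with a "less than" predicate chosen purely for illustration (the intrinsic takes its predicate as an immediate operand); the function names are hypothetical, and the sample values are finite so the ninf requirement is respected.

```cpp
#include <cassert>

// cmp(sub(a, b), 0) with a "less than" predicate.
bool subLtZero(double a, double b) { return (a - b) < 0.0; }

// The commuted form: cmp(0, sub(a, b)) with the same predicate.
bool zeroLtSub(double a, double b) { return 0.0 < (a - b); }

// The folded compare on the original operands.
bool lt(double lhs, double rhs) { return lhs < rhs; }

// Checks both orientations of the fold over a few finite samples.
bool checkCommutedFold() {
  const double samples[] = {-3.5, -1.0, 0.0, 0.25, 1.0, 7.0};
  for (double a : samples) {
    for (double b : samples) {
      // Zero on the right: the operands keep their order, cmp(a, b).
      if (subLtZero(a, b) != lt(a, b))
        return false;
      // Zero on the left: the operands are swapped, cmp(b, a).
      if (zeroLtSub(a, b) != lt(b, a))
        return false;
    }
  }
  return true;
}
```

For a symmetric predicate such as equality the swap makes no difference, but for ordered predicates like this one it is what keeps the folded compare equivalent.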

I agree with you that your code is more elegant.

LGTM.

Thanks

Closed by commit rL300422: [X86][X86 intrinsics]Folding cmp(sub(a,b),0) into cmp(a,b) optimization (authored by mzuckerm).Apr 16 2017, 6:39 AM

This revision was automatically updated to reflect the committed changes.

Diff 94199

lib/Transforms/InstCombine/InstCombineCalls.cpp

       return II;
     break;
   }
   case Intrinsic::x86_avx512_mask_cmp_pd_128:
   case Intrinsic::x86_avx512_mask_cmp_pd_256:
   case Intrinsic::x86_avx512_mask_cmp_pd_512:
   case Intrinsic::x86_avx512_mask_cmp_ps_128:
   case Intrinsic::x86_avx512_mask_cmp_ps_256:
-  case Intrinsic::x86_avx512_mask_cmp_ps_512:
+  case Intrinsic::x86_avx512_mask_cmp_ps_512: {
     if(X86CreateCanonicalCMP(II))
       return II;
+    // Folding cmp(sub(a,b),0) into cmp(a,b)
+    if (Instruction *I = dyn_cast<Instruction>(II->getArgOperand(0))) {
+      if (I->getOpcode() == Instruction::FSub && I->hasOneUse()) {
+        // This fold requires only the NINF(not +/- inf) since inf minus
+        // inf is nan.
+        // NSZ(No Signed Zeros) is not needed because zeros of any sign are
+        // equal for both compares.
+        // NNAN is not needed because nans compare the same for both compares.
+        // The compare intrinsic uses the above assumptions and therefore
+        // doesn't require additional flags.
+        FastMathFlags FMFs = I->getFastMathFlags();
+        if (FMFs.noInfs() && isa<ConstantAggregateZero>((II->getArgOperand(1)))) {
+          Value *LHS = I->getOperand(0);
+          Value *RHS = I->getOperand(1);
+          II->setArgOperand(0, LHS);
+          II->setArgOperand(1, RHS);
+          return II;
+        }
+      }
+    }
     break;
+  }

   case Intrinsic::x86_avx512_mask_add_ps_512:
   case Intrinsic::x86_avx512_mask_div_ps_512:
   case Intrinsic::x86_avx512_mask_mul_ps_512:
   case Intrinsic::x86_avx512_mask_sub_ps_512:
   case Intrinsic::x86_avx512_mask_add_pd_512:
   case Intrinsic::x86_avx512_mask_div_pd_512:
   case Intrinsic::x86_avx512_mask_mul_pd_512:

test/Transforms/InstCombine/X86FsubCmpCombine.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -instcombine -S | FileCheck %s

				; The test checks the folding of cmp(sub(a,b),0) into cmp(a,b).

				define i8 @sub_compare_foldingPD128_safe(<2 x double> %a, <2 x double> %aa){
				; CHECK-LABEL: @sub_compare_foldingPD128_safe(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SUB_SAFE:%.*]] = fsub <2 x double> [[A:%.*]], [[AA:%.*]]
				; CHECK-NEXT: [[TMP0:%.*]] = tail call i8 @llvm.x86.avx512.mask.cmp.pd.128(<2 x double> [[SUB_SAFE]], <2 x double> zeroinitializer, i32 5, i8 -1)
				; CHECK-NEXT: ret i8 [[TMP0]]
				;
				entry:
				%sub.safe = fsub <2 x double> %a, %aa
				%0 = tail call i8 @llvm.x86.avx512.mask.cmp.pd.128(<2 x double> %sub.safe , <2 x double> zeroinitializer, i32 5, i8 -1)
				ret i8 %0
				}


				define i8 @sub_compare_foldingPD128(<2 x double> %a, <2 x double> %aa){
				; CHECK-LABEL: @sub_compare_foldingPD128(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = tail call i8 @llvm.x86.avx512.mask.cmp.pd.128(<2 x double> [[A:%.*]], <2 x double> [[AA:%.*]], i32 5, i8 -1)
				; CHECK-NEXT: ret i8 [[TMP0]]
				;
				entry:
				%sub.i = fsub ninf <2 x double> %a, %aa
				%0 = tail call i8 @llvm.x86.avx512.mask.cmp.pd.128(<2 x double> %sub.i , <2 x double> zeroinitializer, i32 5, i8 -1)
				ret i8 %0
				}


				define i8 @sub_compare_foldingPD256(<4 x double> %a, <4 x double> %aa){
				; CHECK-LABEL: @sub_compare_foldingPD256(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = tail call i8 @llvm.x86.avx512.mask.cmp.pd.256(<4 x double> [[A:%.*]], <4 x double> [[AA:%.*]], i32 5, i8 -1)
				; CHECK-NEXT: ret i8 [[TMP0]]
				;
				entry:
				%sub.i1 = fsub ninf <4 x double> %a, %aa
				%0 = tail call i8 @llvm.x86.avx512.mask.cmp.pd.256(<4 x double> %sub.i1, <4 x double> zeroinitializer, i32 5, i8 -1)
				ret i8 %0
				}


				define i8 @sub_compare_foldingPD512(<8 x double> %a, <8 x double> %aa){
				; CHECK-LABEL: @sub_compare_foldingPD512(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = tail call i8 @llvm.x86.avx512.mask.cmp.pd.512(<8 x double> [[A:%.*]], <8 x double> [[AA:%.*]], i32 11, i8 -1, i32 4)
				; CHECK-NEXT: ret i8 [[TMP0]]
				;
				entry:
				%sub.i2 = fsub ninf <8 x double> %a, %aa
				%0 = tail call i8 @llvm.x86.avx512.mask.cmp.pd.512(<8 x double> %sub.i2, <8 x double> zeroinitializer, i32 11, i8 -1, i32 4)
				ret i8 %0
				}


				define i8 @sub_compare_foldingPS128(<4 x float> %a, <4 x float> %aa){
				; CHECK-LABEL: @sub_compare_foldingPS128(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = tail call i8 @llvm.x86.avx512.mask.cmp.ps.128(<4 x float> [[A:%.*]], <4 x float> [[AA:%.*]], i32 12, i8 -1)
				; CHECK-NEXT: ret i8 [[TMP0]]
				;
				entry:
				%sub.i3 = fsub ninf <4 x float> %a, %aa
				%0 = tail call i8 @llvm.x86.avx512.mask.cmp.ps.128(<4 x float> %sub.i3, <4 x float> zeroinitializer, i32 12, i8 -1)
				ret i8 %0
				}


				define i8 @sub_compare_foldingPS256(<8 x float> %a, <8 x float> %aa){
				; CHECK-LABEL: @sub_compare_foldingPS256(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = tail call i8 @llvm.x86.avx512.mask.cmp.ps.256(<8 x float> [[A:%.*]], <8 x float> [[AA:%.*]], i32 5, i8 -1)
				; CHECK-NEXT: ret i8 [[TMP0]]
				;
				entry:
				%sub.i4 = fsub ninf <8 x float> %a, %aa
				%0 = tail call i8 @llvm.x86.avx512.mask.cmp.ps.256(<8 x float> %sub.i4, <8 x float> zeroinitializer, i32 5, i8 -1)
				ret i8 %0
				}


				define i16 @sub_compare_foldingPS512(<16 x float> %a, <16 x float> %aa){
				; CHECK-LABEL: @sub_compare_foldingPS512(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = tail call i16 @llvm.x86.avx512.mask.cmp.ps.512(<16 x float> [[A:%.*]], <16 x float> [[AA:%.*]], i32 11, i16 -1, i32 4)
				; CHECK-NEXT: ret i16 [[TMP0]]
				;
				entry:
				%sub.i5 = fsub ninf <16 x float> %a, %aa
				%0 = tail call i16 @llvm.x86.avx512.mask.cmp.ps.512(<16 x float> %sub.i5, <16 x float> zeroinitializer, i32 11, i16 -1, i32 4)
				ret i16 %0
				}

				declare i8 @llvm.x86.avx512.mask.cmp.pd.128(<2 x double>, <2 x double>, i32, i8)
				declare i8 @llvm.x86.avx512.mask.cmp.pd.256(<4 x double>, <4 x double>, i32, i8)
				declare i8 @llvm.x86.avx512.mask.cmp.pd.512(<8 x double>, <8 x double>, i32, i8, i32)
				declare i8 @llvm.x86.avx512.mask.cmp.ps.128(<4 x float>, <4 x float>, i32, i8)
				declare i8 @llvm.x86.avx512.mask.cmp.ps.256(<8 x float>, <8 x float>, i32, i8)
				declare i16 @llvm.x86.avx512.mask.cmp.ps.512(<16 x float>, <16 x float>, i32, i16, i32)

This is an archive of the discontinued LLVM Phabricator instance.

[X86][X86 intrinsics]Folding cmp(sub(a,b),0) into cmp(a,b) optimization
ClosedPublic
