Details

Reviewers

majnemer
efriedma
gberry
RKSimon
craig.topper
arsenm

Commits

rG84a238dd6204: [DAGCombine] Transform (fadd A, (fmul B, -2.0)) -> (fsub A, (fadd B, B)).
rL302153: [DAGCombine] Transform (fadd A, (fmul B, -2.0)) -> (fsub A, (fadd B, B)).

Summary

PTAL

Chad

Diff Detail

Event Timeline

mcrosier created this revision. Apr 27 2017, 9:19 AM

I would guess we should prefer fmul as the canonical form, for the same reason we prefer "shl %x, 1" over "add %x, %x": we optimize instructions where hasOneUse() is true more aggressively. I guess it doesn't make a big difference either way, though.

Either way, please add a testcase for "fadd %x, %x", to confirm that we canonicalize both fadd and fmul to the same form.

-Add test case to check fadd X, X is canonical form, per Eli's request.

FWIW, I could also fix this in the DAG combiner. The particular case I care about looks like 'a * b - 2.0 * c', but the expression is transformed to 'a * b + -2.0 * c' before we hit the DAG combine of interest (which only expects a +2.0).
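The rewrite discussed here rests on two exact floating-point identities: multiplying by -2.0 only changes the sign and exponent, so `b * -2.0` equals `-(b + b)`, and adding a negated value equals subtracting it. Below is a minimal C++ sketch that compares the two forms bit for bit, assuming IEEE-754 binary64 `double` and default (non-fast-math) compiler settings. The names `sameBits`, `addMulNeg2`, `subAddSelf`, and `formsAgreeOnSamples` are hypothetical helpers for illustration, not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Compare two doubles by bit pattern so that +0.0 and -0.0 are distinguished.
bool sameBits(double X, double Y) {
  std::uint64_t XB, YB;
  std::memcpy(&XB, &X, sizeof XB);
  std::memcpy(&YB, &Y, sizeof YB);
  return XB == YB;
}

// Original form: (fadd A, (fmul B, -2.0))
double addMulNeg2(double A, double B) { return A + B * -2.0; }

// Rewritten form: (fsub A, (fadd B, B))
double subAddSelf(double A, double B) { return A - (B + B); }

// Returns true if both forms agree bit for bit on a small grid of finite
// inputs, including signed zeros, a subnormal, and values that overflow
// when doubled. Infinities are left out so that no NaN is produced.
bool formsAgreeOnSamples() {
  const double Samples[] = {0.0,  -0.0,  1.0,   -1.5,
                            0.1,  1e308, -1e308,
                            std::numeric_limits<double>::denorm_min()};
  for (double A : Samples)
    for (double B : Samples)
      if (!sameBits(addMulNeg2(A, B), subAddSelf(A, B)))
        return false;
  return true;
}
```

A sample grid like this is only a sanity check, not a proof. The argument for exactness is the one given in the lead-in.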

Another data-point: reassociate currently prefers to canonicalize "fadd %x, %x", to "fmul %x, 2". We don't want to fight back and forth in IR.

In D32596#740344, @efriedma wrote:

Another data-point: reassociate currently prefers to canonicalize "fadd %x, %x", to "fmul %x, 2". We don't want to fight back and forth in IR.

Thanks, Eli. I'll work on extending the DAG combine in that case. Does that sound reasonable?

-Rewrite as a DAG combine.

Herald added a subscriber: nhaehnle. Apr 28 2017, 11:32 AM

Not sure if we need a target hook for this...? Replacing one instruction with two might not always be a good idea.

Otherwise looks fine.

This is worse for AMDGPU because it requires a larger instruction encoding for f16/f32. For f64 this is better

In D32596#743761, @arsenm wrote:

This is worse for AMDGPU because it requires a larger instruction encoding for f16/f32. For f64 this is better

Specifically if the user has a neg source modifier. Otherwise it is always worse

spatel added a subscriber: spatel. May 2 2017, 3:12 PM

RKSimon added inline comments. May 2 2017, 3:20 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9583	Only do this if isFNegFree()?

spatel added inline comments. May 2 2017, 3:21 PM

test/CodeGen/X86/fmul-combines.ll
21–22	What we don't see in this check, but you probably know or can infer: x86 doesn't have an 'fneg' op for SSE/AVX (they ran out of transistors?). So we load the 128-bit sign-bit mask from memory:
xorps LCPI1_0(%rip), %xmm0
It's also true that the mul version would load a '2.0', but this adds an extra op, and I don't think that's good for any x86 target. There's one other reason this may not be good: there are actually CPUs (hello, Jaguar!) that have faster FP multiplies than FP adds.
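The `xorps` idiom in this comment negates a value by flipping its sign bit with a mask. Here is a minimal scalar sketch of the same idea in C++, assuming a 32-bit IEEE-754 `float`. The function name `negateViaXor` is hypothetical, and the literal mask merely stands in for the constant that the compiler loads from the constant pool.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Negate a float by XOR-ing its sign bit, the scalar analogue of applying
// `xorps` with a sign-bit mask to an SSE register.
float negateViaXor(float X) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "this sketch expects a 32-bit float");
  std::uint32_t Bits;
  std::memcpy(&Bits, &X, sizeof Bits);
  Bits ^= 0x80000000u; // flip only the sign bit; exponent and mantissa stay
  std::memcpy(&X, &Bits, sizeof X);
  return X;
}
```

Because only the sign bit changes, this also maps +0.0 to -0.0. That is why the mask has to come from memory or be materialized in a register, which is the extra cost the comment points at.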

Address reviewers' feedback by narrowing the transform so that the negation is "free".

@efriedma: This transform now replaces 2 instructions with 2 instructions in the worst case. For AArch64 it actually replaces 3 with 2 because we now avoid materializing the -2.0. This is probably true for other targets as well.
@arsenm: I don't believe this new version causes the instruction encoding to change in size, but please correct me if I'm wrong.
@spatel: This should address your comment w.r.t. X86. If you wish, I can add a target hook to predicate this transform on whether the target supports fast multiplication. Please let me know if that's still a concern.

Thank you everyone for your feedback. Very much appreciated.

Herald added a subscriber: wdng. May 3 2017, 9:39 AM

In D32596#744882, @mcrosier wrote:

@arsenm: I don't believe this new version causes the instruction encoding to change in size, but please correct me if I'm wrong.

Yes, this is always better if it's really an fsub since the user then doesn't matter

This revision is now accepted and ready to land. May 3 2017, 10:14 AM

In D32596#744944, @arsenm wrote:

In D32596#744882, @mcrosier wrote:

@arsenm: I don't believe this new version causes the instruction encoding to change in size, but please correct me if I'm wrong.

Yes, this is always better if it's really an fsub since the user then doesn't matter

Great. I'll wait for others' feedback before committing.

efriedma added inline comments. May 3 2017, 12:20 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9468	hasOneUse()?

mcrosier added inline comments. May 3 2017, 12:48 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9468	Very good point! One second.

-Check for a single use, so that we know the fmul will be folded away.
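To illustrate why the single-use check matters, here is a toy model of the combine in C++. It is a sketch under stated assumptions, not the code from this patch: `Node`, `Graph`, `Op`, and `combineFAdd` are hypothetical stand-ins for LLVM's SelectionDAG types, and only the operand order (fadd A, (fmul B, -2.0)) is matched. If the fmul has another user it stays alive after the rewrite, so the fold would add an instruction instead of replacing one.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-in for a DAG node; this is not LLVM's SDNode.
enum class Op { Leaf, Const, FAdd, FSub, FMul };

struct Node {
  Op Opcode = Op::Leaf;
  double Value = 0.0; // only meaningful for Op::Const
  Node *LHS = nullptr;
  Node *RHS = nullptr;
  int NumUses = 0; // number of operand slots that reference this node
};

struct Graph {
  std::vector<std::unique_ptr<Node>> Nodes;

  Node *make(Op O, double V, Node *L, Node *R) {
    Nodes.push_back(std::make_unique<Node>());
    Node *N = Nodes.back().get();
    N->Opcode = O;
    N->Value = V;
    N->LHS = L;
    N->RHS = R;
    return N;
  }
  Node *leaf() { return make(Op::Leaf, 0.0, nullptr, nullptr); }
  Node *constant(double V) { return make(Op::Const, V, nullptr, nullptr); }
  Node *binary(Op O, Node *L, Node *R) {
    ++L->NumUses;
    ++R->NumUses;
    return make(O, 0.0, L, R);
  }
};

// (fadd A, (fmul B, -2.0)) -> (fsub A, (fadd B, B)), guarded by a one-use
// check on the fmul. Returns nullptr when the pattern does not apply.
Node *combineFAdd(Graph &G, Node *N) {
  if (N->Opcode != Op::FAdd)
    return nullptr;
  Node *A = N->LHS;
  Node *Mul = N->RHS;
  if (Mul->Opcode != Op::FMul || Mul->NumUses != 1)
    return nullptr; // fmul is shared: it would not be folded away
  Node *C = Mul->RHS;
  if (C->Opcode != Op::Const || C->Value != -2.0)
    return nullptr;
  Node *B = Mul->LHS;
  return G.binary(Op::FSub, A, G.binary(Op::FAdd, B, B));
}
```

In this model a second user of the fmul makes `combineFAdd` return `nullptr`, which mirrors the intent stated above: the rewrite only pays off when the original fmul disappears.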

LGTM.

My only comment is that you may be missing additional cases where the fneg could be folded away, but perhaps those can be fixed in a follow up change.

In D32596#745247, @gberry wrote:

My only comment is that you may be missing additional cases where the fneg could be folded away, but perhaps those can be fixed in a follow up change.

Yes, I'll investigate once this patch lands.

In D32596#745262, @mcrosier wrote:

In D32596#745247, @gberry wrote:

My only comment is that you may be missing additional cases where the fneg could be folded away, but perhaps those can be fixed in a follow up change.

Yes, I'll investigate once this patch lands.

Seems like isNegatibleForFree() would be the place to recognize patterns like this, and then we'd have a corresponding special case for -2.0 in GetNegatedExpression().

As written, this transform should be good for all x86 because it removes a constant load, so no objections, but...
I'm confused about our handling of FP folds. We're saying that this is a universally good (all targets and no relaxed FP needed) codegen fold, but we don't want it in IR/InstCombine because we prefer constants there.

Some tests to think about below. We'll fold the first 3 in DAGCombiner after this patch (universally afaict), but InstCombine does nothing with those. Should InstCombine fold fnegs into constants and fsub -> fadd?

The last case is transformed partially in InstCombine (div -> mul), but DAGCombiner does nothing with that. It's ok to not have a DAG fold for that because nothing this late is producing an fdiv?

define float @add_mul_neg2(float %a, float %b) {
  %mul = fmul float %b, -2.0
  %add = fadd float %a, %mul
  ret float %add
}

define float @sub_mul_neg2(float %a, float %b) {
  %mul = fmul float %b, -2.0
  %sub = fsub float %a, %mul
  ret float %sub
}

define float @mul_mul_neg2(float %a, float %b) {
  %mul = fmul float %b, -2.0
  %neg = fsub float -0.0, %a
  %mul2 = fmul float %neg, %mul
  ret float %mul2
}

define float @sub_div_neghalf(float %a, float %b) {
  %div = fdiv float %b, -0.5
  %sub = fsub float %a, %div
  ret float %sub
}

Closed by commit rL302153: [DAGCombine] Transform (fadd A, (fmul B, -2.0)) -> (fsub A, (fadd B, B)). (authored by mcrosier). May 4 2017, 7:28 AM

This revision was automatically updated to reflect the committed changes.

In D32596#745330, @spatel wrote:

Seems like isNegatibleForFree() would be the place to recognize patterns like this, and then we'd have a corresponding special case for -2.0 in GetNegatedExpression().

I'll investigate this suggestion. Thanks.

As written, this transform should be good for all x86 because it removes a constant load, so no objections, but...

Okay, good.

I'm confused about our handling of FP folds. We're saying that this is a universally good (all targets and no relaxed FP needed) codegen fold, but we don't want it in IR/InstCombine because we prefer constants there.

As Eli pointed out, the reassociation pass prefers the fmul with constant (to expose additional factoring opportunities, IIRC). He also pointed out that we should prefer fmul X, 2.0 as the canonical form, for the same reason we prefer "shl %x, 1" over "add %x, %x": we optimize instructions where hasOneUse() is true more aggressively. Given these two implications I went ahead and implemented this as a DAG combine. However, I think it might make sense to have InstCombine canonicalize to the mul with constant form as well.

Some tests to think about below. We'll fold the first 3 in DAGCombiner after this patch (universally afaict), but InstCombine does nothing with those. Should InstCombine fold fnegs into constants and fsub -> fadd?

The last case is transformed partially in InstCombine (div -> mul), but DAGCombiner does nothing with that. It's ok to not have a DAG fold for that because nothing this late is producing an fdiv?

While InstCombine does nothing, the reassociation pass will canonicalize to

(fsub A, (fmul/fdiv B, -C)) -> (fadd A, (fmul/fdiv B, C))
(fadd A, (fmul/fdiv B, -C)) -> (fsub A, (fmul/fdiv B, C))

where C is a constant and

(fsub A, (fdiv b, -0.5)) -> (fadd A, (fmul b, 2.0))

for the last test.
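The last rewrite above, (fsub A, (fdiv b, -0.5)) -> (fadd A, (fmul b, 2.0)), is value-preserving because dividing by -0.5 is an exact scaling by -2. Here is a minimal C++ sketch that checks the two forms bit for bit, assuming IEEE-754 binary64 `double` and no fast-math flags. The names `identicalBits`, `subDivNegHalf`, and `addMul2` are hypothetical and only mirror the `sub_div_neghalf` test case.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bitwise comparison, so that the sign of zero is taken into account.
bool identicalBits(double X, double Y) {
  std::uint64_t XB, YB;
  std::memcpy(&XB, &X, sizeof XB);
  std::memcpy(&YB, &Y, sizeof YB);
  return XB == YB;
}

// Source form: (fsub A, (fdiv B, -0.5))
double subDivNegHalf(double A, double B) { return A - B / -0.5; }

// Canonicalized form: (fadd A, (fmul B, 2.0))
double addMul2(double A, double B) { return A + B * 2.0; }
```

For example, with A = 1.0 and B = 3.0 both functions compute 1.0 + 6.0.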

We could make InstCombine prefer these forms as well and that might make a difference, but it might not. My guess is it probably doesn't matter, but I'll experiment.

In D32596#746288, @mcrosier wrote:
While InstCombine does nothing, the reassociation pass will canonicalize to
(fsub A, (fmul/fdiv B, -C)) -> (fadd A, (fmul/fdiv B, C))
(fadd A, (fmul/fdiv B, -C)) -> (fsub A, (fmul/fdiv B, C))

Thanks for checking those out. I haven't looked at the reassociation pass very much. I'm surprised to see it flip a constant's sign and create an fsub rather than fadd. Any ideas why that is a good thing to do?

In D32596#746314, @spatel wrote:
In D32596#746288, @mcrosier wrote:
While InstCombine does nothing, the reassociation pass will canonicalize to
(fsub A, (fmul/fdiv B, -C)) -> (fadd A, (fmul/fdiv B, C))
(fadd A, (fmul/fdiv B, -C)) -> (fsub A, (fmul/fdiv B, C))
Thanks for checking those out. I haven't looked at the reassociation pass very much. I'm surprised to see it flip a constant's sign and create an fsub rather than fadd. Any ideas why that is a good thing to do?

AFAICT reassociation is trying to force all constants to be positive so it can increase the opportunities for factorization. This should also allow CSE and GVN to eliminate more duplicate expressions (per D4904 and D5363).

In D32596#745247, @gberry wrote:

My only comment is that you may be missing additional cases where the fneg could be folded away, but perhaps those can be fixed in a follow up change.

Here's at least one case we're missing: https://bugs.llvm.org/show_bug.cgi?id=32939

mcrosier mentioned this in D39830: [DAGCombine] Transform (A + -2.0*B*C) -> (A - (B+B)*C). Nov 10 2017, 9:21 AM

Diff 97131

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

(9,459 lines not shown; enclosing context: SDValue DAGCombiner::visitFSUB(SDNode *N) {)
    if (Options.NoSignedZerosFPMath || N->getFlags()->hasNoSignedZeros()) {
      // (fsub 0, B) -> -B
      if (N0CFP && N0CFP->isZero()) {
        if (isNegatibleForFree(N1, LegalOperations, TLI, &Options))
          return GetNegatedExpression(N1, DAG, LegalOperations);
        if (!LegalOperations || TLI.isOperationLegal(ISD::FNEG, VT))
          return DAG.getNode(ISD::FNEG, DL, VT, N1, Flags);
      }
    }
  [inline comment by efriedma: hasOneUse()?]
  [inline reply by mcrosier: Very good point! One second.]

    // If 'unsafe math' is enabled, fold lots of things.
    if (Options.UnsafeFPMath) {
      // (fsub A, 0) -> A
      if (N1CFP && N1CFP->isZero())
        return N0;

      // (fsub x, x) -> 0.0
(97 lines not shown; enclosing context: if (N0.getOpcode() == ISD::FADD &&)
          return DAG.getNode(ISD::FMUL, DL, VT, N0.getOperand(0), MulConsts, Flags);
        }
      }

    // fold (fmul X, 2.0) -> (fadd X, X)
    if (N1CFP && N1CFP->isExactlyValue(+2.0))
      return DAG.getNode(ISD::FADD, DL, VT, N0, N0, Flags);

+   // fold (fmul X, -2.0) -> (fneg (fadd X, X))
+   if (N1CFP && N1CFP->isExactlyValue(-2.0))
  [inline comment by RKSimon: Only do this if isFNegFree()?]
+     if (!LegalOperations || TLI.isOperationLegal(ISD::FNEG, VT)) {
+       SDValue Add = DAG.getNode(ISD::FADD, DL, VT, N0, N0, Flags);
+       return DAG.getNode(ISD::FNEG, DL, VT, Add);
+     }
+
    // fold (fmul X, -1.0) -> (fneg X)
    if (N1CFP && N1CFP->isExactlyValue(-1.0))
      if (!LegalOperations || TLI.isOperationLegal(ISD::FNEG, VT))
        return DAG.getNode(ISD::FNEG, DL, VT, N0);

    // fold (fmul (fneg X), (fneg Y)) -> (fmul X, Y)
    if (char LHSNeg = isNegatibleForFree(N0, LegalOperations, TLI, &Options)) {
      if (char RHSNeg = isNegatibleForFree(N1, LegalOperations, TLI, &Options)) {
(6,793 lines not shown)

test/CodeGen/AArch64/fmul-combines.ll

This file was added.

; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -verify-machineinstrs | FileCheck %s

; CHECK-LABEL: test1:
; CHECK: fadd s0, s0, s0
; CHECK: fneg s0, s0
define float @test1(float %x) {
  %y = fmul float %x, -2.0
  ret float %y
}

; CHECK-LABEL: test2:
; CHECK: fadd d0, d0, d0
; CHECK: fneg d0, d0
define double @test2(double %x) {
  %y = fmul double %x, -2.0
  ret double %y
}

; a * b - 2.0 * c
; CHECK-LABEL: test3:
; CHECK: fmul d0, d0, d1
; CHECK: fadd d1, d2, d2
; CHECK: fsub d0, d0, d1
define double @test3(double %a, double %b, double %d) {
entry:
  %mul = fmul double %a, %b
  %mul1 = fmul double %d, 2.000000e+00
  %sub = fsub double %mul, %mul1
  ret double %sub
}

test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll

(78 lines not shown; enclosing context: define amdgpu_kernel void @multiple_use_fadd_multi_fmad_f32(float addrspace(1)* %out, float %x, float %y, float %z) #0 {)
    %mad0 = fadd fast float %mul2, %y
    %mad1 = fadd fast float %mul2, %z
    store volatile float %mad0, float addrspace(1)* %out
    store volatile float %mad1, float addrspace(1)* %out.gep.1
    ret void
  }

  ; GCN-LABEL: {{^}}fmul_x2_xn2_f32:
- ; GCN: v_mul_f32_e64 [[TMP0:v[0-9]+]], [[X:s[0-9]+]], -4.0
- ; GCN: v_mul_f32_e32 [[RESULT:v[0-9]+]], [[X]], [[TMP0]]
+ ; GCN: v_add_f32_e64 [[TMP0:v[0-9]+]], [[X:s[0-9]+]], [[X]]
+ ; GCN: v_mul_f32_e64 [[RESULT:v[0-9]+]], -[[TMP0]], [[TMP0]]
  ; GCN: buffer_store_dword [[RESULT]]
  define amdgpu_kernel void @fmul_x2_xn2_f32(float addrspace(1)* %out, float %x, float %y) #0 {
    %out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1
    %mul2 = fmul fast float %x, 2.0
    %muln2 = fmul fast float %x, -2.0
    %mul = fmul fast float %mul2, %muln2
    store volatile float %mul, float addrspace(1)* %out
    ret void
(98 lines not shown; enclosing context: define amdgpu_kernel void @multiple_use_fadd_multi_fmad_f16(half addrspace(1)* %out, i16 zeroext %x.arg, i16 zeroext %y.arg, i16 zeroext %z.arg) #0 {)
    %mad0 = fadd fast half %mul2, %y
    %mad1 = fadd fast half %mul2, %z
    store volatile half %mad0, half addrspace(1)* %out
    store volatile half %mad1, half addrspace(1)* %out.gep.1
    ret void
  }

  ; GCN-LABEL: {{^}}fmul_x2_xn2_f16:
- ; GCN: v_mul_f16_e64 [[TMP0:v[0-9]+]], [[X:s[0-9]+]], -4.0
- ; GCN: v_mul_f16_e32 [[RESULT:v[0-9]+]], [[X]], [[TMP0]]
+ ; GCN: v_add_f16_e64 [[TMP0:v[0-9]+]], [[X:s[0-9]+]], [[X]]
+ ; GCN: v_mul_f16_e64 [[RESULT:v[0-9]+]], -[[TMP0]], [[TMP0]]
  ; GCN: buffer_store_short [[RESULT]]
  define amdgpu_kernel void @fmul_x2_xn2_f16(half addrspace(1)* %out, i16 zeroext %x.arg, i16 zeroext %y.arg) #0 {
    %x = bitcast i16 %x.arg to half
    %y = bitcast i16 %y.arg to half
    %out.gep.1 = getelementptr half, half addrspace(1)* %out, i32 1
    %mul2 = fmul fast half %x, 2.0
    %muln2 = fmul fast half %x, -2.0
    %mul = fmul fast half %mul2, %muln2
(22 lines not shown)

test/CodeGen/AMDGPU/fmuladd.f32.ll

(185 lines not shown)

  ; GCN-LABEL: {{^}}fmuladd_neg_2.0_a_b_f32
  ; GCN: {{buffer|flat}}_load_dword [[R1:v[0-9]+]],
  ; GCN: {{buffer|flat}}_load_dword [[R2:v[0-9]+]],
  ; GCN-FLUSH: v_mac_f32_e32 [[R2]], -2.0, [[R1]]

  ; GCN-DENORM-FASTFMA: v_fma_f32 [[RESULT:v[0-9]+]], [[R1]], -2.0, [[R2]]

- ; GCN-DENORM-SLOWFMA: v_mul_f32_e32 [[TMP:v[0-9]+]], -2.0, [[R1]]
- ; GCN-DENORM-SLOWFMA: v_add_f32_e32 [[RESULT:v[0-9]+]], [[R2]], [[TMP]]
+ ; GCN-DENORM-SLOWFMA: v_add_f32_e32 [[TMP:v[0-9]+]], [[R1]], [[R1]]
+ ; GCN-DENORM-SLOWFMA: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP]], [[R2]]

  ; SI-DENORM: buffer_store_dword [[RESULT]]
  ; VI-DENORM: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, [[RESULT]]
  define amdgpu_kernel void @fmuladd_neg_2.0_a_b_f32(float addrspace(1)* %out, float addrspace(1)* %in) #0 {
    %tid = call i32 @llvm.amdgcn.workitem.id.x()
    %gep.0 = getelementptr float, float addrspace(1)* %out, i32 %tid
    %gep.1 = getelementptr float, float addrspace(1)* %gep.0, i32 1
    %gep.out = getelementptr float, float addrspace(1)* %out, i32 %tid
(42 lines not shown)
  ; GCN: {{buffer|flat}}_load_dword [[R2:v[0-9]+]],

  ; GCN-FLUSH: v_mac_f32_e32 [[R2]], -2.0, [[R1]]
  ; SI-FLUSH: buffer_store_dword [[R2]]
  ; VI-FLUSH: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, [[R2]]

  ; GCN-DENORM-FASTFMA: v_fma_f32 [[RESULT:v[0-9]+]], -[[R1]], 2.0, [[R2]]

- ; GCN-DENORM-SLOWFMA: v_mul_f32_e32 [[TMP:v[0-9]+]], -2.0, [[R1]]
- ; GCN-DENORM-SLOWFMA: v_add_f32_e32 [[RESULT:v[0-9]+]], [[TMP]], [[R2]]
+ ; GCN-DENORM-SLOWFMA: v_add_f32_e32 [[TMP:v[0-9]+]], [[R1]], [[R1]]
+ ; GCN-DENORM-SLOWFMA: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP]], [[R2]]

  ; SI-DENORM: buffer_store_dword [[RESULT]]
  ; VI-DENORM: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, [[RESULT]]
  define amdgpu_kernel void @fmuladd_2.0_neg_a_b_f32(float addrspace(1)* %out, float addrspace(1)* %in) #0 {
    %tid = call i32 @llvm.amdgcn.workitem.id.x()
    %gep.0 = getelementptr float, float addrspace(1)* %out, i32 %tid
    %gep.1 = getelementptr float, float addrspace(1)* %gep.0, i32 1
    %gep.out = getelementptr float, float addrspace(1)* %out, i32 %tid
(320 lines not shown)

test/CodeGen/X86/fmul-combines.ll

(11 lines not shown)
  ; CHECK-LABEL: fmul2_v4f32:
  ; CHECK: addps %xmm0, %xmm0
  ; CHECK-NEXT: retq
  define <4 x float> @fmul2_v4f32(<4 x float> %x) {
    %y = fmul <4 x float> %x, <float 2.0, float 2.0, float 2.0, float 2.0>
    ret <4 x float> %y
  }

+ ; CHECK-LABEL: fmulneg2_v4f32:
+ ; CHECK: addps %xmm0, %xmm0
+ ; CHECK: xorps
  [inline comment by spatel on lines 21–22; full text appears in the event timeline]
+ ; CHECK-NEXT: retq
+ define <4 x float> @fmulneg2_v4f32(<4 x float> %x) {
+   %y = fmul <4 x float> %x, <float -2.0, float -2.0, float -2.0, float -2.0>
+   ret <4 x float> %y
+ }
+
  ; CHECK-LABEL: constant_fold_fmul_v4f32:
  ; CHECK: movaps
  ; CHECK-NEXT: ret
  define <4 x float> @constant_fold_fmul_v4f32(<4 x float> %x) {
    %y = fmul <4 x float> <float 4.0, float 4.0, float 4.0, float 4.0>, <float 2.0, float 2.0, float 2.0, float 2.0>
    ret <4 x float> %y
  }

(152 lines not shown)

This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Transform (fadd A, (fmul B, -2.0)) -> (fsub A, (fadd B, B)).
ClosedPublic

