
[DAGCombine] Transform (fadd A, (fmul B, -2.0)) -> (fsub A, (fadd B, B)).
ClosedPublic

Authored by mcrosier on Apr 27 2017, 9:19 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

mcrosier created this revision.Apr 27 2017, 9:19 AM
efriedma edited edge metadata.Apr 27 2017, 11:34 AM

I would guess we should prefer fmul as the canonical form, for the same reason we prefer "shl %x, 1" over "add %x, %x": we optimize instructions where hasOneUse() is true more aggressively. I guess it doesn't make a big difference either way, though.

Either way, please add a testcase for "fadd %x, %x", to confirm that we canonicalize both fadd and fmul to the same form.

mcrosier updated this revision to Diff 96982.Apr 27 2017, 12:32 PM

-Add test case to check fadd X, X is canonical form, per Eli's request.

mcrosier added a comment.EditedApr 27 2017, 12:49 PM

FWIW, I could also fix this in the DAG combiner. The particular case I care about looks like 'a * b - 2.0 * c', but the expression is transformed to 'a * b + -2.0 * c' before we hit the DAG combine of interest (which only expects a +2.0).

> Another data-point: reassociate currently prefers to canonicalize "fadd %x, %x", to "fmul %x, 2". We don't want to fight back and forth in IR.

Thanks, Eli. I'll work on extending the DAG combine in that case. Does that sound reasonable?

mcrosier updated this revision to Diff 97131.Apr 28 2017, 11:32 AM
mcrosier retitled this revision from [InstCombine] Transform fmul X, +2.0 --> fadd X, X. to [DAGCombine] Transform (fmul X, -2.0) --> (fneg (fadd X, X))..
mcrosier edited the summary of this revision. (Show Details)

-Rewrite as a DAG combine.

Not sure if we need a target hook for this...? Replacing one instruction with two might not always be a good idea.

Otherwise looks fine.

arsenm edited edge metadata.May 2 2017, 12:59 PM

This is worse for AMDGPU because it requires a larger instruction encoding for f16/f32. For f64 this is better.

arsenm added a comment.May 2 2017, 1:05 PM

> This is worse for AMDGPU because it requires a larger instruction encoding for f16/f32. For f64 this is better.

Specifically, if the user has a neg source modifier. Otherwise it is always worse.

spatel added a subscriber: spatel.May 2 2017, 3:12 PM
RKSimon added inline comments.May 2 2017, 3:20 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9583 ↗(On Diff #97131)

Only do this if isFNegFree()?

spatel added inline comments.May 2 2017, 3:21 PM
test/CodeGen/X86/fmul-combines.ll
21–22 ↗(On Diff #97131)

What we don't see in this check, but you probably know or can infer: x86 doesn't have an 'fneg' op for SSE/AVX (they ran out of transistors?).

So we load the 128-bit sign-bit mask from memory:

xorps	LCPI1_0(%rip), %xmm0

It's also true that the mul version would load a '2.0', but this adds an extra op, and I don't think that's good for any x86 target.

There's one other reason this may not be good: there are actually CPUs (hello, Jaguar!) that have faster FP multiplies than FP adds.

mcrosier updated this revision to Diff 97667.EditedMay 3 2017, 9:39 AM
mcrosier retitled this revision from [DAGCombine] Transform (fmul X, -2.0) --> (fneg (fadd X, X)). to [DAGCombine] Transform (fadd A, (fmul B, -2.0)) -> (fsub A, (fadd B, B))..

Address reviewers' feedback by narrowing the transform so that the negation is "free".

@efriedma: This transform now replaces 2 instructions with 2 instructions in the worst case. For AArch64 it actually replaces 3 with 2, because we now avoid materializing the -2.0. That's probably true for other targets as well.
@arsenm: I don't believe this new version causes the instruction encoding to change in size, but please correct me if I'm wrong.
@spatel: This should address your comment w.r.t. X86. If you wish, I can add a target hook to predicate this transform on whether the target has fast multiplication. Please let me know if that's still a concern.

Thank you everyone for your feedback. Very much appreciated.

arsenm accepted this revision.May 3 2017, 10:14 AM

> @arsenm: I don't believe this new version causes the instruction encoding to change in size, but please correct me if I'm wrong.

Yes, this is always better if it's really an fsub, since the user then doesn't matter.

This revision is now accepted and ready to land.May 3 2017, 10:14 AM

Great. I'll wait for others' feedback before committing.

efriedma added inline comments.May 3 2017, 12:20 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9468 ↗(On Diff #97667)

hasOneUse()?

mcrosier added inline comments.May 3 2017, 12:48 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9468 ↗(On Diff #97667)

Very good point! One second.

mcrosier updated this revision to Diff 97713.May 3 2017, 12:50 PM

-Check for a single use, so that we know the fmul will be folded away.

efriedma accepted this revision.May 3 2017, 1:12 PM

LGTM.

gberry edited edge metadata.May 3 2017, 1:15 PM

My only comment is that you may be missing additional cases where the fneg could be folded away, but perhaps those can be fixed in a follow-up change.

> My only comment is that you may be missing additional cases where the fneg could be folded away, but perhaps those can be fixed in a follow-up change.

Yes, I'll investigate once this patch lands.

spatel added a comment.May 3 2017, 2:29 PM

Seems like isNegatibleForFree() would be the place to recognize patterns like this, and then we'd have a corresponding special case for -2.0 in GetNegatedExpression().

As written, this transform should be good for all x86 because it removes a constant load, so no objections, but...
I'm confused about our handling of FP folds. We're saying that this is a universally good (all targets and no relaxed FP needed) codegen fold, but we don't want it in IR/InstCombine because we prefer constants there.

Some tests to think about below. We'll fold the first 3 in DAGCombiner after this patch (universally afaict), but InstCombine does nothing with those. Should InstCombine fold fnegs into constants and fsub -> fadd?

The last case is transformed partially in InstCombine (div -> mul), but DAGCombiner does nothing with that. It's ok to not have a DAG fold for that because nothing this late is producing an fdiv?

define float @add_mul_neg2(float %a, float %b) {
  %mul = fmul float %b, -2.0
  %add = fadd float %a, %mul
  ret float %add
}

define float @sub_mul_neg2(float %a, float %b) {
  %mul = fmul float %b, -2.0
  %sub = fsub float %a, %mul
  ret float %sub
}

define float @mul_mul_neg2(float %a, float %b) {
  %mul = fmul float %b, -2.0
  %neg = fsub float -0.0, %a
  %mul2 = fmul float %neg, %mul
  ret float %mul2
}

define float @sub_div_neghalf(float %a, float %b) {
  %div = fdiv float %b, -0.5
  %sub = fsub float %a, %div
  ret float %sub
}
This revision was automatically updated to reflect the committed changes.

> Seems like isNegatibleForFree() would be the place to recognize patterns like this, and then we'd have a corresponding special case for -2.0 in GetNegatedExpression().

I'll investigate this suggestion. Thanks.

> As written, this transform should be good for all x86 because it removes a constant load, so no objections, but...

Okay, good.

> I'm confused about our handling of FP folds. We're saying that this is a universally good (all targets and no relaxed FP needed) codegen fold, but we don't want it in IR/InstCombine because we prefer constants there.

As Eli pointed out, the reassociation pass prefers the fmul with constant (to expose additional factoring opportunities, IIRC). He also pointed out that we should prefer fmul X, 2.0 as the canonical form, for the same reason we prefer "shl %x, 1" over "add %x, %x": we optimize instructions where hasOneUse() is true more aggressively. Given these two implications I went ahead and implemented this as a DAG combine. However, I think it might make sense to have InstCombine canonicalize to the mul with constant form as well.

> Some tests to think about below. We'll fold the first 3 in DAGCombiner after this patch (universally afaict), but InstCombine does nothing with those. Should InstCombine fold fnegs into constants and fsub -> fadd?

> The last case is transformed partially in InstCombine (div -> mul), but DAGCombiner does nothing with that. It's ok to not have a DAG fold for that because nothing this late is producing an fdiv?

While InstCombine does nothing, the reassociation pass will canonicalize to

(fsub A, (fmul/fdiv B, -C)) -> (fadd A, (fmul/fdiv B, C))
(fadd A, (fmul/fdiv B, -C)) -> (fsub A, (fmul/fdiv B, C))

where C is a constant and

(fsub A, (fdiv b, -0.5)) -> (fadd A, (fmul b, 2.0))

for the last test.

We could make InstCombine prefer these forms as well and that might make a difference, but it might not. My guess is it probably doesn't matter, but I'll experiment.

> While InstCombine does nothing, the reassociation pass will canonicalize to
>
> (fsub A, (fmul/fdiv B, -C)) -> (fadd A, (fmul/fdiv B, C))
> (fadd A, (fmul/fdiv B, -C)) -> (fsub A, (fmul/fdiv B, C))

Thanks for checking those out. I haven't looked at the reassociation pass very much. I'm surprised to see it flip a constant's sign and create an fsub rather than fadd. Any ideas why that is a good thing to do?

> Thanks for checking those out. I haven't looked at the reassociation pass very much. I'm surprised to see it flip a constant's sign and create an fsub rather than fadd. Any ideas why that is a good thing to do?

AFAICT, reassociation is trying to force all constants to be positive so it can increase the opportunities for factorization. This should also allow CSE and GVN to eliminate more duplicate expressions (per D4904 and D5363).

> My only comment is that you may be missing additional cases where the fneg could be folded away, but perhaps those can be fixed in a follow-up change.

Here's at least one case we're missing: https://bugs.llvm.org/show_bug.cgi?id=32939