This is an archive of the discontinued LLVM Phabricator instance.

propagate IR-level fast-math-flags to DAG nodes
ClosedPublic

Authored by spatel on Apr 8 2015, 10:50 AM.

Details

Summary

This patch is an attempt to introduce the minimum plumbing necessary to use IR-level fast-math-flags (FMF) in the backend along with one usage and test case (reciprocal estimate).

History:

  1. IR-level FMF was added around 2012: http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-October/054999.html
  2. Integer IR optimization flags (nsw, nuw, exact) were extended to the backend in June 2014: http://reviews.llvm.org/rL210467

Motivation:

  1. We have customers who would like to see clang have more flexible behavior with respect to FP reciprocal approximations. This means that -freciprocal-math is honored as a separate optimization setting from -ffast-math (see existing gcc behavior for a starting point).

This is broken in the:
a. front-end: -freciprocal-math flag is ignored
b. the IR optimizer: UnsafeAlgebra implies AllowReciprocal
c. the backend: IR FMF flags are dropped entirely

  1. Longer-term and more fundamentally: one of the goals for IR-level FMF was to allow mixing of strict and relaxed FP math code. This is impossible without backend support. The problem is exacerbated by LTO.

My initial draft of this patch left the nsw/nuw/exact changes mostly as-is, but it quickly became clear that maintaining the APIs for those and adding FMF on the side made everything worse. So I grouped all of the optimization flags into one class and went from there. The flags and node classes are copied directly from the IR FMF class and FPMathOperator. We should obviously unify the backend with the IR optimizer on those flag bits, but I think that can be a follow-on patch? The goal of this initial patch is to not break any existing functionality or change the IR code while adding optional processing of FMF to the backend.

I'm promising to continue work on this to convert much more of the existing backend code to use FMF wherever possible.

Diff Detail

Event Timeline

spatel updated this revision to Diff 23427.Apr 8 2015, 10:50 AM
spatel retitled this revision from to propagate IR-level fast-math-flags to DAG nodes.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: hfinkel, andreadb, rengolin.
spatel added a subscriber: Unknown Object (MLST).
alexr added a subscriber: alexr.Apr 8 2015, 11:18 AM
andreadb edited edge metadata.Apr 9 2015, 4:52 AM

Hi Sanjay,

I only had a couple of minor comments/questions. In general I think this is a nice initial patch.
If I remember correctly, the initial support for nsw/nuw/exact flags in SelectionDAG was added by Marcello Maggioni. In case, I suggest you to add him in cc to the code review (I don't know if he is still working on this..).

include/llvm/CodeGen/SelectionDAG.h
1230–1231

My understanding is that flags can only be present on binary operations. This is also the reason why originally nsw/nuw/exact were only added to BinarySDNode.
Is there a reason why BinarySDNode should extend from SDNode? At the moment, 'Ops' is always expected to have two SDValues.

include/llvm/CodeGen/SelectionDAGNodes.h
971–977

This is ok. However, what about having something like this?

setNoUnsignedWrap() {
  Flags |= NoUnsignedWrap;
}

clearNoUnsignedWrap() {
  Flags ^= NoUnsignedWrap;
}

Also it is a shame that most of this code is repeated in 'class SDNodeWithFlags'. I wonder if there is a better design that allows to delegate the 'bit manipulation part' as much as possible to 'SDNodeFlags'...

Adding Marcello to reviewers.

spatel added inline comments.Apr 9 2015, 7:55 AM
include/llvm/CodeGen/SelectionDAG.h
1230–1231

It is correct that we only have flags on binary ops today, but I was considering the case of adding FMF to intrinsics or libcalls ( https://llvm.org/bugs/show_bug.cgi?id=21290 ). I was also thinking about the FMA case and other nodes that don't exist in IR. I think we'll need to propagate the flags to more than binops eventually, so...

I agree that this patch is hacky in that it only updates the binop API, but it was the minimal change.

spatel added inline comments.Apr 9 2015, 8:04 AM
include/llvm/CodeGen/SelectionDAGNodes.h
971–977

I agree completely, but I didn't see an immediate solution, so I just went with the lazy approach: copy and paste!

I'm open to any suggestions for improvement on the interface, and eventually (soon) I would unify the IR and SDNode versions of the flags, so we get the benefits in both places.

spatel added a comment.Apr 9 2015, 8:22 AM

re: the -freciprocal-math flag

"see existing gcc behavior for a starting point"

I should've taken my own advice more seriously. :)

It turns out that -freciprocal-math is only used for a reassociation optimization on divisions. Specifically, this one:
https://llvm.org/bugs/show_bug.cgi?id=16218

With gcc, all of the machine-level hackery that we've put into generating hardware estimate instructions is controlled by a flag that doesn't even exist in clang yet: -mrecip
https://llvm.org/bugs/show_bug.cgi?id=20912

So the end result is that this patch's recip change is wrong. Update to at least that part of the patch coming soon.

I'm assuming that we should not diverge from gcc's user-visible behavior without good reason, so we'll still make -freciprocal-math cause 'arcp' generation in IR ( http://reviews.llvm.org/rL234493 ), but 'arcp' has nothing to do with HW estimate codegen. For that, we need -mrecip with its multitude of options:
https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/i386-and-x86-64-Options.html#index-mrecip_003dopt-1627

spatel added a comment.Apr 9 2015, 9:57 AM

Delayed by another case of 'how does anything work...ever?':
https://llvm.org/bugs/show_bug.cgi?id=23172

spatel updated this revision to Diff 23503.Apr 9 2015, 10:38 AM
spatel edited edge metadata.

Patch updated:

  1. Corrected DAGCombiner 'arcp' optimization to only trigger for the reassociation division case, not the estimate cases.
  2. Updated PowerPC test case that uses that optimization (x86 still doesn't have it); worked around PR23172 by putting the 'arcp' versions of the tests before the unsafe-fp-math versions.

Hello, I'm the "Marcello Maggioni" of the original backend flags by the way :P

I'm very interested in what is going on here and I think is about time we have "Node-level" Fast Math flags in the backend! This is a very nice start!

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
3614–3615

Does it even make sense to have BinarySDNode around at all considering the main function generating these now generates standard SDNodes?

spatel added inline comments.Apr 10 2015, 1:58 PM
lib/CodeGen/SelectionDAG/SelectionDAG.cpp
3614–3615

Hi Marcello -

Thanks for looking at this patch.

I got rid of "BinaryWithFlagsSDNode", but "BinarySDNode" had some uses elsewhere along with "UnarySDNode" and "TernarySDNode".

The comment for BinarySDNode says:

/// This class is used for two-operand SDNodes.  This is solely
/// to allow co-allocation of node operands with the node itself.

So it's just there for malloc efficiency? If everybody agrees that it has no value, then I can delete it, but I think that should be a follow-on patch.

Over all as far as I'm concerned the approach here is pretty sound in my opinion and is definitely an improvement over everything we have at the moment.

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
3614–3615

Yeah, sure, I was just bringing up the the question, thats all :)

One high level comment and a few inline nits:

Can you split out the flags part of the patch first? I'm very certain that's not objectionable and will make the actual content of your patch much smaller.

Rest inline :)

Thanks!

-eric

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7822

Nit: No braces around single statement ifs.

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
406

Early exit?

416

Early exit?

spatel updated this revision to Diff 23890.Apr 16 2015, 6:23 PM

Let me ask a dumb question since I don't know any better. :)

Is the size of an SDNode so critical that we can't do the proper software engineering thing: add the flags as a member rather than riding on the bonus bits available in SubclassData? If we do that, all of the distasteful get/set and bit shifting replication disappears.

If we make a bag o' bits for the flags as in this updated patch, they'd just be single byte. As it stands, I'm showing that the sizeof(SDNode) and sizeof(SDNodeWithFlags) are both 80 bytes (building on MacOSX 10.10). Free bits!

That said, there's something very wrong with this patch now...I'm seeing *intermittent* failures on seemingly unrelated codegen regression tests. Can anyone spot the bug?

hfinkel edited edge metadata.Apr 16 2015, 7:49 PM

Let me ask a dumb question since I don't know any better. :)

Is the size of an SDNode so critical that we can't do the proper software engineering thing: add the flags as a member rather than riding on the bonus bits available in SubclassData? If we do that, all of the distasteful get/set and bit shifting replication disappears.

If we make a bag o' bits for the flags as in this updated patch, they'd just be single byte. As it stands, I'm showing that the sizeof(SDNode) and sizeof(SDNodeWithFlags) are both 80 bytes (building on MacOSX 10.10). Free bits!

The size is important, but if you don't increase the size on common architectures, then you should be good to go.

That said, there's something very wrong with this patch now...I'm seeing *intermittent* failures on seemingly unrelated codegen regression tests. Can anyone spot the bug?

Have you tried running under valgrind and/or asan/msan? Does this happen in a debug build, or just an optimized build?

Have you tried running under valgrind and/or asan/msan? Does this happen in a debug build, or just an optimized build?

Not yet; that's the next step. I wanted to see if there were structural objections to this patch before chasing down the errors.

I'm seeing the failures with both debug and optimized builds. The tests that fail regularly, but not always are:

LLVM :: CodeGen/AArch64/aarch64-address-type-promotion.ll
LLVM :: CodeGen/AArch64/arm64-vector-ldst.ll
LLVM :: CodeGen/X86/cmovcmov.ll
ab added a subscriber: ab.Apr 17 2015, 6:42 PM

Have you tried running under valgrind and/or asan/msan? Does this happen in a debug build, or just an optimized build?

Not yet; that's the next step. I wanted to see if there were structural objections to this patch before chasing down the errors.

Valgrind noted that the flags were being read uninitialized. I thought between bitfields, bools and shifts, I must've stepped into a pile of UB, but the actual bug was when creating the node. I was checking:

if (Flags && mayHaveOptimizationFlags(Opcode))

We can't use the nullptr Flags to decide the type of node to allocate; if it may have opt flags, then it must be an SDNodeWithFlags. We just have to fill in default flags in that case.

I had no luck with sanitizers (not enabled with the OS X system clang? And can't build LLVM trunk with that turned on?). But it pushed me to build on Linux x86-64 where I confirmed that the sizeof(SDNode) == sizeof(SDNodeWithFlags) == 80. So again, it looks like we can use a proper software class for the flags without any size overhead.

spatel updated this revision to Diff 24178.Apr 21 2015, 3:18 PM
spatel edited edge metadata.

Patch updated:

  1. Fixed bug when allocating SDNodesWithFlags.
  2. Fixed inline problems noted by Eric Christopher.

I didn't split off the flags class from the rest of the patch as suggested by Eric on the previous review because the class is now just a trivial bit container. But if that's still the preferred way to go, please let me know.

Hi Sanjay,

It's still my preference if possible because it'll take the API nfc change out of the mix on the patch. Is it possible for you to do that?

Thanks!

-eric

It's still my preference if possible because it'll take the API nfc change out of the mix on the patch. Is it possible for you to do that?

Hi Eric -

Sure - it's a little more work, but it's the right thing to do. I think we should actually break this into 3 steps:

  1. Introduce the new Flags struct and have the existing BinaryWithFlagsSDNode use them only for nsw/nuw/exact (NFC).
  2. Convert BinaryWithFlagsSDNode to the broader SDNodeWithFlags class and extend the potential users to the FP nodes (NFC).
  3. Update DAGCombiner to use the 'arcp' flag and add the new test cases.

First patch coming up...

Absolutely agreed. Thanks!

-eric

spatel updated this revision to Diff 24564.Apr 28 2015, 10:39 AM

Patch updated:

  1. We hacked off the head of this patch into D9325 ( http://reviews.llvm.org/rL235997 ) - move the flags into their own space.
  2. I've also removed the tail end of this patch; that's just the part that recognizes 'arcp' and optimizes based on it in DAGCombiner.

So this patch is now NFC-intended, but there are 2 structural changes:

  1. The main diff is that we're preparing to expand the reach of the optimization flags to affect more than just binary SDNodes. I think this is already necessary for intrinsics in IR ( PR21290 - https://llvm.org/bugs/show_bug.cgi?id=21290 ). And now that we have the flags in the backend, we can propagate them to non-binop nodes that don't even exist in IR such as FMA, FNEG, etc.
  2. The other change is that we're actually copying the FP fast-math-flags from the IR instructions to SDNodes. They're just not being used for anything yet...that's the next patch.
echristo accepted this revision.May 5 2015, 11:06 AM
echristo added a reviewer: echristo.

LGTM.

Thanks for splitting this out! :)

-eric

This revision is now accepted and ready to land.May 5 2015, 11:06 AM
This revision was automatically updated to reflect the committed changes.
spatel added a comment.May 5 2015, 2:46 PM

Thanks, Eric! Checked in at r236546.

For the record, r236546 was reverted by:
http://reviews.llvm.org/rL236600

The bug is independent of FMF or the other flags. It can be reproduced by allocating a plain SDNode for any instruction with 2 operands rather than a BinarySDNode. The bug manifests (visible under valgrind) after recycling the operand memory and/or morphing a node.

I didn't take any perf measurements, but I assume that we want to continue using the specialized operand classes (UnarySDNode, BinarySDNode, TernarySDNode) for their allocation optimizations. So I simplified this patch to keep the existing BinaryWithFlagsSDNode class that derives from BinarySDNode:
http://reviews.llvm.org/rL237046