This is an archive of the discontinued LLVM Phabricator instance.

Outliner: Add MIR-level outlining pass
ClosedPublic

Authored by paquette on Nov 18 2016, 3:01 PM.

Details

Summary

This is an updated patch for the outliner described in the RFC at: http://lists.llvm.org/pipermail/llvm-dev/2016-August/104170.html

The outliner is a code-size reduction pass which works by finding repeated sequences of instructions in a program, and replacing them with calls to functions. This would be especially useful to people working in low-memory environments, where sacrificing performance for space is acceptable.

This would add a interprocedural outliner directly before printing assembly. For reference on how this would work, this patch also includes X86 target hooks and an X86 test.

The outliner is run like so:

clang -mno-red-zone -mllvm -enable-machine-outliner file.c

I would love for people to test it out and tell me about how well it works for them, and maybe even play around with the provided target hooks. Tell me what you think!

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
paquette updated this revision to Diff 80470.Dec 6 2016, 1:21 PM

Thanks for the feedback on the previous patches! I've updated the outliner significantly from the previous version. In this version there is...

  • No suffix tree in ADT: The suffix tree is now a part of MachineOutliner.h
  • No TerminatedString: Everything is done just using std::vectors now with a couple helper methods.
  • Slightly different outlining technique: We don't have to keep track of offsets anymore by sorting candidates in descending order rather than ascending order.
  • Better documentation, comments, etc: I fixed up the Doxygen stuff, went through, and wrote detailed explanations of each function. I also added some FIXMEs and TODOs to help guide further development on the outliner.

Tell me what you think!

It might help to mark all comments that were addressed so far as "Done" in phabricator.

paquette updated this revision to Diff 80512.Dec 6 2016, 4:59 PM
paquette retitled this revision from Outliner: Add MIR-level outlining pass (resubmission) to Outliner: Add MIR-level outlining pass.
paquette marked 52 inline comments as done.

I realized I missed a bunch of stuff on MachineOutliner.cpp/MachineOutliner.h, so I went ahead and picked through the code and fixed a good chunk of it.

Highlights:

  • No more MachineOutliner.h!
  • Less gross X86 TargetInstrInfo!
  • Range-based for loops!
  • Proper pass initialization!

I'd also like to ask about the best way to handle custom epilogue/prologue insertion. I left in the three separate virtual functions in TargetInstrInfo.h. Does anyone think it'd be better to just smack them all together into one function which handles function call insertion? I feel like that'd be the best way to do it, but I left it as is for now.

Thanks!
Jessica

It would be awesome to have an option that serializes the Collection for use by external tools like Souper:
//std::vector<std::vector<unsigned> *> Collection;

Perhaps replace SuffixTree with SuffixContainer<SuffixTree> and SuffixContainter<SuffixArray>? Suffix arrays are more performant and are trivial to copy. Suffix trees are nice to walk.

Also, think about Vitanyai's approximation of Kolmogorov complexity using gzip, https://arxiv.org/pdf/cs/0111054.pdf

Dirty hack, but you could use the nesting of instructions that gzip comes up with.

MatzeB added a comment.Dec 8 2016, 4:06 PM

Hi Jessica,

this is great progress from the last patch! I think we need one more roundtrip until we can commit the code and keep improving it inside the repository (as this is an independent pass it shouldn't affect the rest of the compiler).

For the whole Outliner.cpp file: Internal types and functions should be marked as such. In this case it's probably best to open an anonymous namespace (except for the externally visible functions createOutlinerPass(), and InitializerMachineOutliner())

include/llvm/CodeGen/Passes.h
406

Indent.

include/llvm/Target/TargetInstrInfo.h
1563–1609

You should move this block a bit upwards (We usually try to keep private fields at the beginning or end of a class, this would move them in the middle).

1605

As this is a codegen API passing in a MachineFunction& parameter is more natural. (Implementations can always use MF.getFunction() to get back to the llvm::Function)

lib/CodeGen/MachineOutliner.cpp
2

MachineOutliner.cpp

109

std::pair<String *, size_t>?

179–187

Looks like this could be a constructor.

180

This looks like it could just be a constructor on the SuffixTreeNode struct.

190

Should this be called deleteSuffixTreeSubTree or similar as it deletes more than 1 node?

206–207

this has no effect

392

can be size_t size() const

398

\param

417–418

No need for to cast EndIdx.

469

use a reference instead of a pointer?

477

Could do an early exit:

if (MaxHeight == 0)
  return nullptr;
552

the braces around the cases are unnecessary

583

This can use an early exit so you do not need to indent everything inside the if.

596

This seems to be an impossible case as you already tested for N != nullptr.

618–622

This could be

for (SuffixTreeNode *T = N->Link; T && T != Root; T = T->Link)
  T->Valid = false;

(It is usually better to go with a for loop instead of while() { ... increment; } on principle, because you do not have to remember to duplicate the increment code when you want to use continue somewhere inside the loop)

631–632

Maybe do not initialize these to give a hint to the reader that they will be set by findString() anyway (If findString() doesn't always set them, then you should make it).

635

Could be an early exit.

648

Use const StringCollection &Strings

667

The deleteSuffixTreeNode() implementation already protects against nullptrs.

714

the llvm:: prefix should not be necessary (same for OccBB)

742

You probably only need to put INITIALIZE_PASS and createOutlinerPass() in the llvm namespace.

773

Maybe use "MIR Function Outlining" to be in sync with INITIALIZE_PASS. Most llvm pass 'names' read more like short descriptions.

800–802

Could use references instead of pointers.

844–846

This makes no sense.

854

Maybe use "machine-outliner" to be in sync with DEBUG_TYPE.

869

Maybe add assert(CurrLegalInstrMapping < CurrIllegalInstrMapping && "Overflow"); The same at the place where you increment CurrLegalInstrMapping.

876–878

The find(), if (I != end()) ... else insert() sequence walks over the datastructure in the find() and again in the insert() step.

Instead you can simply always insert() and check the return value to see whether an element was actually inserted or an existing one reused:

auto I = map.insert(make_pair(MI, CurrLegalInstrMapping));
// Newly inserted?
if (I.second)
  CurrLegalInstrMapping++;
unsigned MINumber = I.first;
896

Use a reference.

991

I don't think the value names need to be kept around.

1020

Can be begin() instead of instr_begin().

1025

Should probably document this with something like
`// The cloned memory operands reference the old function. Drop them.`

1026

can be MBB->begin()

1048–1051

Could use a range based for:

for (OutlinedFunction &OF : FunctionList)
  OF.MF = createOutlinedFunction(M, OF);
1109–1113

Constructing names should not be necessary here. You should be able to add a GlobalValue MachineOperand instead of a Symbol one when you construct the call instruction.

1148–1150

You can move those out of the loop.

1159

There seems to be an unnecessary duplication of the string.

silvas added a subscriber: silvas.Dec 8 2016, 10:41 PM

Some stylistic comments.

Also, the amount of testing seems small compared to the amount of code being added. Can you beef it up? Especially testing for the handling of "unsafe" cases would be good.

lib/CodeGen/MachineOutliner.cpp
17

A link to your devmtg talk might be good (together with some explanation about how it relates to what is implemented here).

55

This seems to have some nontrivial invariants, so you may want something a bit stronger than a typedef. Maybe a lightweight struct around this would be better? You already have a couple helpers that seems like they could quite naturally be methods of such a struct.

I guess the question is: does it ever make sense for a random piece of code that is using this typedef to actually operate on it using the std::vector API? Or would such code inherently risk violating some sort of invariant? If the latter, a lightweight struct encapsulating the underlying vector is likely to be a good choice.

61

This doesn't really make much sense to me. Can you explain this data structure a bit better?

74

This could use a better name.

125

Is there a good online resource you can link to about Ukkonen's algorithm? If this implementation is patterned off of a particular resource that might be good to link to if possible.

127

Small coding standard nit: use static/anonymous namespace on all the helper stuff here and elsewhere: http://llvm.org/docs/CodingStandards.html#anonymous-namespaces

142

I assume "start" was meant here. But "star index" (as in Kleene) sounds vaguely plausible in the context of a string algorithm so this typo is worth fixing.

153

Do you really need a std::map? http://llvm.org/docs/ProgrammersManual.html#map-like-containers-std-map-densemap-etc

Also, if the number of outgoing edges is small, a small-type container is probably better here.

158

As per the comment, maybe call this "is in tree" or "already pruned" or something like that?

164

Can you be a bit more specific about where this shortcut link comes in in Ukkonen's algorithm?

173

Both Start and End here are described as "index" but one is a pointer. That's worth explaining in the comment.

Also, SuffixIndex is also an "index" and gets an "Index" suffix to its variable name. Maybe these should be StartIndex and EndIndex?
Also, other places in this patch use Idx for variable names to mean "index". Maybe these should be StartIdx etc.?

307

Move the comment inside the braces so that this can use the more common } else { style, here and elsewhere.

338

There's quite a few naked new/delete in this code. Can you encapsulate the ownership better? If not, can you centralize the new/delete in helpers and add some comments about lifetime/ownership?

771

Can you hold this by value?

1176

Can you improve the ownership here? E.g. use a std::unique_ptr to manage the lifetime?

lib/CodeGen/TargetPassConfig.cpp
95

What is the official name? "machine outliner" or "MIR outliner". Please be consistent here and elsewhere.

paquette updated this revision to Diff 83675.Jan 9 2017, 12:57 PM
paquette marked 55 inline comments as done.

Hi everyone,

I think I addressed most if not all of the comments from the previous patch; thanks for all of the help with reviewing this! I'm quite happy with how this has been coming along in progressing from intern project to real code.

Major changes:

  • No use of raw new/delete in the outliner
  • Tried to decouple the "program strings" from normal strings-- that seemed to be somewhat confusing
  • New test which also checks if the outliner obeys basic block boundaries
  • Lots of type changes; lots of things that didn't have to be pointers are no longer pointers
  • More use of appropriate LLVM data structures for memory management, maps, etc.

Tell me what you think!

Great work!

This round of review feedback is mostly about comment accuracy and some implementation improvements like centralizing/encapsulating the numbering state machine, handling the EndIdx for leaf nodes in a simpler way, and avoiding copying too many vectors around. Also various style/readability nits.

The testing still seems light considering how much functionality this adds. Can we use MIR to test this? Ideally we'd have pretty good coverage of X86InstrInfo::isLegalToOutline
One way to think about this is: suppose we run this on a bunch of programs in the wild and find some bugs in the suffix tree construction algorithm or isLegalToOutline. How are we going to write tests for those fixes?
Since you've done such a nice job keeping SuffixTree separate (with the simple interface based on unsigned) it might even be possible to write a C++ unittest for it. That requires pulling it up into a header though and other headaches/boilerplate that might not be worth it right now.

lib/CodeGen/MachineOutliner.cpp
71

My reading of this comment interprets this as saying that this class e.g. holds a DenseMap of MI's to integers itself, but that is not the case. Can you make this more precise?

The term "mapping" is somewhat confusing since in common usage a "mapping" usually denotes a container. But throughout this patch the term "mapping" is used to refer to a "string". I think I get what sense it is meant in (e.g. "this is the mapping of this MBB through the our value numbering map"). I can't think of any really good names except for something horribly verbose like "StringOfInstructionNumbers" or something like that. Anyway, you probably want to beef up this comment to describe the sense in which "mapping" is used here and throughout this patch.

After staring at the patch for a while the term "mapping" has grown on me, so it's not a big deal.

73

Rather than a generic "this is used for compatibility with the suffix tree", can you be more precise. Something like "Our suffix tree implementation operates on this class" or something? After changing the constructor argument of SuffixTree to const std::vector<std::vector<unsigned>> & the only use of this class is as an internal data structure of the SuffixTree, which is an even more precise statement.

77

This description in terms of "hash" vs "unique" doesn't seem accurate. Once you pull out a class to encapsulate the numbering you can just mention that here. As far as users of this class are concerned the numbers are just unique symbols; I don't think you need to go into too much detail besides linking to the place where we assign the numbering.

100

Generally the term "program" is reserved for talking about the final linked executable, but this code (except during LTO) generally operates on a single Module/TU. Can you give this a better name? Maybe just BlockMappings or something like that.

144

Why is this comment talking about "2D mapping"? There is just a single index.

203

This comment is pure gold. Nice.

242

Comment on the + 1.

265

Isn't it a more like that the leaves "represent" suffixes rather than "contain" suffixes?

267

I don't think this is correct. It isn't the leaves representing suffixes per se that facilitates finding repeated substrings, but rather the fact that the internal nodes represent repeated substrings (shared prefixes of the suffixes). You may want to beef up this paragraph on suffix trees a bit to describe the basic invariants a bit more (leaves represent suffixes, internal nodes represent shared prefixes of the suffixes).

270

Same comment as on ProgramMapping. The integers themselves aren't hashes (otherwise this code would have to do something special for collisions).

280

I'm not sure what you are trying to convey with this paragraph. Maybe you can just mention that the implementation maintains parent links (maybe also describe what they are used for). I don't see the big picture for talking about cycles or explicit digraphs here.

289

The RAII behavior of BumpPtrAllocator is pretty well-known, so you don't need to explicitly mention it (that's the beauty of RAII; it just cleans up for you!). This comment can probably be reduced to "Allocator that owns all nodes in the tree".

293

This arrangement of adding an extra layer of indirection for the special "EndIdx" handling for leaf nodes is interesting but after staring at the code for a while it seems like it obscures things (or maybe I'm missing something).

It seems that the only thing that needs this extra layer of indirection is during construction in the test if (StrIdx > *(CurrSuffixTreeNode->EndIdx)) { which could be replaced by something like if (StrIdx > (CurrSuffixTreeNode->EndIdx == -1 : LeafEndIdx : CurrSuffixTreeNode->EndIdx) { or similar. To do that, SuffixTreeNode::EndIdx would just be an integer held by value, with -1 being a sentinel indicating that this is a leaf that needs the special handling in that one test.

295

what does the EndIdx even mean for an internal node, since they can be shared? Maybe explain that in the comment of the EndIdx member of SuffixTreeNode?

303

You can use the same BumpPtrAllocator for both of these if you want. It's kind of nice to have a place for these comments though so no biggie.

348

Can Parent just be a constructor argument?

367

This assert is inside an if(ChildPair.second != nullptr) so probably doesn't buy you much.

548

Nit: move the comment inside so you can use the coding-standard compliant } else if (...) {

596

Nit: Putting this CurrSuffixTreeNode->Children[QueryString[CurrIdx]] expression (used 3 times) in a variable will make things a bit shorter and also give you an opportunity to give that value a name. E.g. maybe if (Child && Child->IsInTree) {?

605

Nit: move this comment inside the else so that you have } else {.

614

Comment on the - 1 part of this. It's setting off my off-by-one error spidey sense.

674

You're copying quite a few std::vector's here. Can they be ArrayRef's?

726

This iterates over Strings.MBBMappings. In what sense does it treat Strings as "flat" if it looks at individual substrings?

Also, I find it a bit weird that we take a ProgramMapping as input to this constructor, but then all these append calls seem to build up a different ProgramMapping. Do the two ProgramMapping's end up being equivalent? It would be nice to see a bit more explanation about this. At least for the purposes of this constructor, maybe a const std::vector<std::vector<unsigned>> & is the natural interface because it doesn't use any of the fancy methods on ProgramMapping.

752

It seems a bit weird to me that this class is caring explicitly about the distinction between the "flat" and non-"flat" senses of ProgramMapping. I thought that ProgramMapping was just supposed to encapsulate a 2D ragged array and make it look flat, but here it seems that the external code still cares about the distinction between 2D-ness and flat-ness. Can you make it a bit clearer in the code and comments what ProgramMapping is supposed to represent and what its interactions with the rest of the code are?

798

I don't see much mention of the function Id's in ProgramMapping. Looking at the code, it seems like it should be unsigned and roughly represents the call instruction that jumps to the outlined function.

827

There are a couple members here related to the legal-instruction/illegal-instruction/function numbering that could stand to be pulled out into an isolated class (which can then be held by value in the pass) separate from the pass boilerplate. Such a class will also be a good place to authoritatively document the numbering scheme and encapsulate it.

944

Small readability nit: use std::tie(I, WasInserted) = ... or something like that to make this a bit clearer (this map interface returning the pair is always confusing without that).

1155

It isn't really the "function's id" but rather the id for a call instruction that jumps to it, right?

1156

Nit: remove commented out code.

silvas added inline comments.Jan 10 2017, 4:34 AM
lib/CodeGen/MachineOutliner.cpp
96

Can you add a high-level explanation of why we have a 2D vector in the first place. I.e. why do we need to "pretend" instead of just materializing the flattened vector?

One thing that may help make this clearer is encapsulating the mutation of MBBMappings a bit better. Right now there seem to be some un-encapsulated mutations, so just looking at the class it's not clear what the elementary operations on it are.

123

It shouldn't be too hard to bring this down to log(N) if necessary, but it surprises me that O(N) is fine when N = number of MBB's in the module (for reference, a typical FullLTO for a codebase ~ the size of clang has 10's of thousands of functions, and probably an order of magnitude more MBB's). Can you add a comment explaining that it takes linear time and why that's fine (or not fine, but fine for now)?

aprantl added inline comments.Jan 12 2017, 8:53 AM
lib/CodeGen/MachineOutliner.cpp
200

Did you get a chance to look into this?

I did a quick run on SPEC CPU 2006 with FullLTO and it seems like I ran into 3 different assertion failures across various programs: https://reviews.llvm.org/P7954
There seem to be 3 different assertions getting hit.

Here are some basic bugpoint-reduced test cases. Repro with llc -enable-machine-outliner -O3 foo.ll (though I expect these will be sensitive to minor codegen changes; I assume you have access to SPEC so you can reduce them again if needed)

https://reviews.llvm.org/P7957 :

Assertion `!NodePtr->isKnownSentinel()' failed.

https://reviews.llvm.org/P7958 :

Assertion `(isImpReg || Op.isRegMask() || MCID->isVariadic() || OpNo < MCID->getNumOperands() || isMetaDataOp) && "Trying to add an operand to a machine instr that is already done!"' failed.

https://reviews.llvm.org/P7960 :

Assertion `Occurrences.size() > 0 && "Longest repeated substring has no occurrences."' failed.

Also, here is a case that takes a really long time to compile (it does eventually finish) stuck in MachineOutliner::buildCandidateList :
https://reviews.llvm.org/P7959
Bugpoint found this one from the "Occurrences.size() > 0" assertion failure case.

For now, it's probably best to focus on the code review comments. Once the code is in good shape and committed, these sorts of bugs can be incrementally hammered out (tracked in bugzilla, fixed with clear individual patches (with test cases :) )).

include/llvm/CodeGen/Passes.h
407

Nit: comment and function aren't aligned.

include/llvm/Target/TargetInstrInfo.h
1564

Nit: inconsistent indent.

lib/CodeGen/MachineOutliner.cpp
49

Nit: this won't work on case-sensitive file systems.

In case you're wondering, that didn't take that long to reduce to that point. Just have to learn the tools and workflow. We have amazing tools for doing test case reduction and an easily understandable test-suite build system! (mad props to those who made bugpoint; the folks that made test-suite awesome; and LLD's -save-temps; and a bit of -globalopt and -metarenamer to clean things up)

  • I hear that you are planning to get rid of the ProgramMapping construct which would be my last real stopper.
  • There is still a lot of nitpicky stuff around.
  • This time I checked the suffix tree construction algorithm which looks good.
  • For future patches: Having a DenseMap<unsigned, SuffixTreeNode *> in every node is potentially wasteful. Intuitively I would expect the majority of nodes to only have a small number of entries, so it may be good to measure and explore alternative representations like a dynamically adapting datastructure (i.e. switching from linear list to map depending on number of children).
include/llvm/CodeGen/Passes.h
406–407

This should probably be createMachineOutlinerPass() to be consistent.

include/llvm/Target/TargetInstrInfo.h
1564

indentation.

You should also consider to move these functions towards the other functions so the "private:" part can stay at the end of the class definition.

1576

Indentation

1605

As this is a codegen API I would rather pass a MachineFunction&

lib/CodeGen/MachineOutliner.cpp
41

Move this below the #includes so you do not accidentally affect the headers.

64–65

Maybe drop the Stat suffix, esp. in a statistic dump that looks superfluous.

173

Suggestion (possibly for later patches):
As far as I see it a node is either a leaf or an inner node and never changes it nature. You could make this and the constraints on the End and Children members a bit more obvious when representing this in a type hierarchy (and safe a bit of memory):

struct SuffixTreeNode {
  bool IsLeaf; ...
};
struct SuffixTreeLeafNode : public SuffixTreeNode {
  size_t *EndIdx;
  size_t SuffixIdx;
};
struct SuffixTreeInternalNode : SuffixTreeNode {
  Map<SuffixTreeNode*> Children;
};
238

can be const.

239–244
  • Maybe handle the special case early:
if (StartIdx == EmptyIdx)
  return 0;

return *EndIdx - StartIdx + 1;
  • I assume this is not supposed to be called with *EndIdx == EmptyIdx? Add an assert()?
249–251

Interesting to see this packaged up in an own struct instead of just putting the members directly into the SuffixTree class. But doesn't hurt either I guess.

303

It looks like you can use a single allocator for nodes and EndIndexes.

346

Maybe add

else
  assert(EndIdx == EmptyIdx);

to make sure callers know what they are doing. An alternative would be to provide different functions for inserting leafs or inner nodes.

364

You can save an indentation level here with

if (ChildPair.second == nullptr)
  continue;
388

As you only have 1 "out" parameter you could simply return the new value instead. At the call side I find SuffixesToAdd = extend(x, y, SuffixesToAdd); easier to understand when you do not have to wonder whether a parameters is an "out" parameter.

The NeedsLink parameter is nullptr for all callers?

391–392

General note on comments: I would expect comments at this indentation level to talk about properties/situations at that level and not just inside the if. This gets clearer if you formulate the comment as a question, or an if-then statement (or move the comment into the if block):

// Look at the last character if the current mapping is 0.
if (Active.Len == 0)
  Active.Idx = EndIdx;

// Current mapping is 0?
if (Active.Len == 0) {
  // Look at the last added character.
  Active.Idx = EndIdx;
}
398

LastChar can be moved further down as it's not used by some paths through the function.

473

Add a comment that you check whether Active.Node is the root
or alternatively add a SuffixTreeNode::isRoot() function.

484–485

Indentation

491

This should rather be ArrayRef<unsigned>.

513

Add assert(SuffixesToAdd == 0);?

518

Maybe setSuffixIndices(..., /*LabelHeight =*/0) instead of the local variable so readers do not keep wondering whether it is an "out" parameter.

688

move this into the loop.

804–808

C++ does the right thing for Name(Name) etc. so you can drop the _ suffixes from the parameter names. Similar with some other constructors.

810

You should extend the anonymous namespace to include the MachineOutliner class. The only things that needs to be visible to the outside are initializeMachineOutlinerPass() (=the stuff coming out of INITIALIZE_PASS) and createOutlinerPass().

833

Because this relies on implementation details of DenseMapInfo, better play it safe with static_assert so the compilation fails if someone decides to change the values:

static_assert(DenseMapInfo<unsigned>::getEmptyKey() == (unsigned)-1);
static_assert(DenseMapInfo<unsigned>::getTombstoneKey() == (unsigned)-2);

that way things also explain themselves and you can get away with a shorter comment.

The module pass instance can in theory be reused for multiple programs. So the state here needs to be initialized and cleared in runOnModule().

835

It's the next number to be assigned, isn't it? Same with CurrIllegalInstrMapping.

922

Indentation, MBB can be const

953–954

The overflow could (in theory) be triggered by a user and not just by compiler bugs. So could use report_fatal_error() so it stays around in release builds:

if (CurrLegalInstrMapping < CurIllegalInstrMapping)
  report_fatal_error("Instruction mapping overflow!");
955

Use DenseMapInfo<unsigned>::get{Empty|Tombstone}Key() instead of hardcoding the values.

1009

Could use emplace_back(OccBB, StartIdxInBB, ...)

1018

Could use emplace_back().

1051

I think getOrInsertFunction() copies the name and does not take ownership of the passed string so this is an unnecessary copy and a memory leak.

1067–1069

Use references for variables that cannot be nullptr.

1104–1105

I think CandidateList and FunctionList can be const or better ArrayRef.

1214

Move to assignment.

1215–1216

As you have some state such as CurrentFunctionID, CurrLegalMapping in the class anyway, maybe the two vectors and the Worklist can move there as well so you do not need to pass them around? Just need to clear() them at the end of the function then.

And a few more for the X86 part.

lib/Target/X86/X86InstrInfo.cpp
10403–10405

Heh, in theory every single x86 instruction modifies RIP. But I assume we don't model it like that in LLVM. In any way restricting this to reads(RIP) should be enough.

10415–10420

Are those tests necessary given that you already throw out operations with FrameIndex operands?

10422

You could use MachineInstr::isPosition() instead of checking for isLabel() and isCFIInstruction()

10425

Better use const MachineOperand &MOP to avoid some copying.

lib/Target/X86/X86InstrInfo.h
634

This linebreak seems unnecessary.

pmatos added a subscriber: pmatos.Jan 30 2017, 4:28 AM
aprantl added inline comments.Feb 21 2017, 3:34 PM
lib/CodeGen/MachineOutliner.cpp
200

Ping :-)

Note that it is really important to skip over any DBG_VALUE intrinsics while deciding whether to outline a sequence of instructions. Otherwise compiling with -g will produce different code than without, which we generally consider to be serious bug in the compiler.

paquette updated this revision to Diff 89535.Feb 23 2017, 10:40 AM
paquette marked 71 inline comments as done.

Alright, it's been a while, but here's the next version of the outlining patch! As always, thanks to everyone taking the time to read through all this code. This version of the outliner is quite different from the previous one, since I've improved on it a lot since the last patch.

Major changes

  • More LLVM-ey and most comments addressed!
  • X86 target won't outline debug instructions anymore. There are other things to think about wrt debug info, which I'm currently working on.
  • More tests: Right now, MIR tests with the outliner make LLC unhappy, so I wrote a couple IR tests which should be easy enough to transition to MIR.
  • No more ProgramMapping: instead there's another vector which keeps track of the positions of each instruction in the program. It uses iterators because the delete function on MachineBasicBlocks takes a start and end iterator. (If that makes anyone uncomfortable, I can change it to pointers.)
  • Improved suffix tree pruning: the previous version was too aggressive and threw out too many candidates. The new version uses a vector of leaves.
  • No more unnecessary functions: the previous version had a FIXME stating that sometimes the outliner could create unnecessary functions when all of the candidates for a function were removed. Now overlap pruning happens directly before outlining, so functions are created as they're needed.
  • Outlined functions are now link once ODR: This allows the linker to dedupe outlined functions without LTO.
  • Mapper class: This class performs the instruction-integer mappings and is passed around the outliner
  • General suffix tree queries: Different targets will have different benefit functions, and even different types of functions to outline. For example, after this, I have a version of the outliner which supports tail-calling outlined functions. In the interest of keeping target-specific stuff out of the SuffixTree, the target now defines a benefit function, which is then maximized by the DFS query for repeated substrings. I think this will allow for more fine-grained outlining on various targets.

Tell me what you think!

aprantl added inline comments.Feb 23 2017, 11:13 AM
lib/Target/X86/X86InstrInfo.cpp
10412

This is not the right way to do this. We need to skip over DBG_VALUE instruction as if they didn't exist. Otherwise the presence of DBG_VALUEs in the instruction stream will have an effect on the outlining decision, which means that compiling with -g will generate different code than without.

Please also be sure to include a testcase that exercises this.

MatzeB accepted this revision.Feb 23 2017, 4:18 PM

Thanks for pushing this, this is coming along nicely!

At this point I don't see any correctness or compiletime problems (with the comments below addressed).
Let's commit this and keep improving it in-tree. This should also make reviewing easier as we can have smaller follow-up patches.

lib/CodeGen/MachineOutliner.cpp
782–783

An instance of this only describes a single outlined function.

800

doxygen ///

844

I don't think you need to initialize this, it gets overwritten anyway in the next line.

1225–1241

Hmm...

  • This hash doesn't seem collision free. Someone having two files with the same name (maybe in two different projects that he links together later) may happen.
  • Of course a collision shouldn't hurt as the linker will compare the contents anyway, but why even bother with a hash then?
  • I think the linker will only try to merge functions with the same name but the function name(-hash) is currently based on the name not the contents of the function so I would expect this to be not helpful in most cases.

Maybe stay with the previous internal linking and try the LinkOnce tricks in a follow-up commit (where it is based on the contents).

lib/Target/X86/X86InstrInfo.cpp
10424–10425

This is surprising, checking the MCInstrDesc should not be necessary.
This is most probably a bug somewhere else in codegen, so there is nothing we can do here.

However I'd be good if you could find the time later to create a reproducer and file a PR about it, reading and writing registers without having operands for it looks like a bug waiting to happen elsewhere.

test/CodeGen/X86/machine-outliner-basic.ll
1

Better use -mtriple=x86_64-- instead of -march so we also force the operating system etc.

24

You probably want to force this test to use the same outlined function everywhere. FileCheck allows assigning names and checking for repeated patterns:

; CHECK: callq [[_OUTLINED_FUNCTION[0-9]+_0:OUTLINEFUNC]]
...
; CHECK: callq [[OUTLINEFUNC]]
...
; CHECK-LABEL: [[OUTLINEFUNC]]:
test/CodeGen/X86/machine-outliner-bb-boundaries.ll
1

You can probably merge the tests together into a single file as they are all about the same pass and use the same llc flags.

22

I'd remove those standard dumping comments. If you actually care about the label give it a real name, if not the comment shouldn't be necessary either.

This revision is now accepted and ready to land.Feb 23 2017, 4:18 PM
silvas added inline comments.Feb 23 2017, 6:45 PM
lib/CodeGen/MachineOutliner.cpp
1225–1241

Of course a collision shouldn't hurt as the linker will compare the contents anyway, but why even bother with a hash then?

No. linkonce_odr requires that if the name matches then the contents are interchangeable, since one gets selected arbitrarily. So for correctness the hash must be collision-free.

(see also the discussion in D29512 which also involves finding a stable "name" for the TU)

Also, I don't see the point of doing this. The linker's content-based deduplication ("ICF") should handle this case without caring about the name. If you want to use the linker's comdat/linkonce (i.e. name-based) deduplication then you can just use the function's contents as the name (mangling away NUL bytes), or a *strong* hash (collisions are a correctness problem).

Presumably, if users are using this pass, then they care about code size and so they are likely to have ICF enabled already. So I don't see the point of doing this linkage trick.

test/CodeGen/X86/machine-outliner-interprocedural.ll
8

The leading underscore here is darwin-specific. Add an explicit triple to avoid this (otherwise non-Darwin bots will break).

FYI this doesn't build on Linux with gcc

[94/1782] Building CXX object lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o
FAILED: lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o 
/usr/bin/c++   -DGTEST_HAS_RTTI=0 -DLLVM_BUILD_GLOBAL_ISEL -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Ilib/CodeGen -I/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen -Iinclude -I/usr/local/google/home/silvasean/pg/llvm/llvm/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wno-maybe-uninitialized -Wdelete-non-virtual-dtor -Wno-comment -std=c++11 -ffunction-sections -fdata-sections -O2 -g -DNDEBUG    -fno-exceptions -fno-rtti -MD -MT lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o -MF lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o.d -o lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o -c /usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:171:18: error: enclosing class of constexpr non-static member function ‘bool {anonymous}::SuffixTreeNode::isLeaf() const’ is not a literal type
   constexpr bool isLeaf() const { return SuffixIdx != EmptyIdx; }
                  ^
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:91:8: note: ‘{anonymous}::SuffixTreeNode’ is not literal because:
 struct SuffixTreeNode {
        ^
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:91:8: note:   ‘{anonymous}::SuffixTreeNode’ has a non-trivial destructor
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:174:18: error: enclosing class of constexpr non-static member function ‘bool {anonymous}::SuffixTreeNode::isRoot() const’ is not a literal type
   constexpr bool isRoot() const { return StartIdx == EmptyIdx; }
                  ^
[143/1782] Building CXX object lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/RegAllocGreedy.cpp.o

The compiler is gcc 4.8.4

I just tested on SPECCPU2006 (FullLTO) and no assertion failures!

However, 403.gcc and 483.xalancbmk (at least) seem to have a huge compile time slowdown (superlinear behavior?). Some rough numbers comparing LLC runtime:
403.gcc 11s -> 66s
483.xalancbmk 16s -> 144s
(so about 5-10x slowdown of LLC due to the suffix tree)

Most of the time seems to be spent inside buildCandidateList. Sampling a couple stacks it seems like it is stuck in findBest, usually just 1 or 2 stack frames in findBest and so at least the problem isn't that it is recursing too deeply.
I added some printfs to print out the depth and vertex degree of each node in the suffix tree for 483.xalancbmk and I got this: https://reviews.llvm.org/F3114496


So it makes sense that typically one would be only 1 or 2 stack frames deep.

Modulo the pruning that is going on, we seem to do O(N) work in bestRepeatedSubstring once per outlining candidate. Is the pruning effective enough that the sum of all calls to bestRepeatedSubstring doesn't grow out of control? My suspicion is that it isn't, and I think a contrived case like AAABBBCCCDDD... (Assume "A" represents constant-size string large enough to be profitable to outline) will trigger O(N^2) behavior in the number of instructions in the module.
Is it possible to do algorithmically better? (exploiting suffix tree invariants maybe?)

Also, it looks like this pass actually increases (1-5%) text size on all of the SPEC binaries except for 401.bzip2: https://reviews.llvm.org/F3114409

(I double and triple checked and I don't have it switched around; the raw data (doublechecked the labels are right) is: https://reviews.llvm.org/P7971)

Can you please find out why this isn't helping (and in fact is hurting)? Are better heuristics needed? At the very least, the cost function seems like it needs to be amended to take into account the true overheads.

In particular, it seems that the cost function does not take into account that the outlined functions will have some minimum alignment applied to them (or can you mark them as not requiring this alignment? still, it would end up depending on linker placement (alignment of adjacent sections) and such as to how much padding actually is inserted).
On 483.xalancbmk, the suffix tree based outliner find 2311 functoins to outline, and almost all of them are 2 instructions, which is typically less than 16 bytes, which is the minimum alignment that will be imposed (just from looking at the output binary).
A naive approach which just looks for identical runs of outlinable instructions (ignoring substrings) outlines 2391 functions (slightly more). The total benefit is somewhat greater for the suffix tree though at 29379 vs 27994 for the naive approach.
This appears to be due to the outliner finding many more length-2 sequences to outline: https://reviews.llvm.org/F3114690

Overall, it seems like the vast majority of the benefit on 483.xalancbmk is due to extremely short instruction sequences. But if we are going to avoid very short instruction sequences because they actually aren't profitable, then most of the outlinable instructions disappear on this test case (and at a glance, the other SPEC benchmarks are pretty similar). I'd also like to note that this testing is with FullLTO, so it is a best-case scenario for the outliner (whole program visibility to the suffix tree). What kinds of programs does this outliner perform well on?

For reference, here are all the outlined functions from 483.xalancbmk: https://reviews.llvm.org/F3114805

One interesting thing is that they are almost all short sequences of mov instructions. Staring at the code that calls them, it's clear why this is: almost all of the outlined functions in 483.xalancbmk are in sequences like this:

...
  362a1e:`      e8 dd 20 0e 00       `  callq  444b00 <OUTLINED_FUNCTION2637142655534006531_61>
  362a23:`      e8 4c e8 09 00       `  callq  401274 <_ZN11xercesc_2_512XMLBufferMgr13releaseBufferERNS_9XMLBufferE>
...

(FWIW, I tried and IPRA doesn't actually decrease text size much on SPEC with FullLTO)

I.e. what has been outlined is function setup overhead. There are also quite a few outlined functions right before jumps, which are factoring out code sequences like this:

00000000004a2eb0 <OUTLINED_FUNCTION2637142655534006531_2458>:
  4a2eb0:`      48 8b 41 18          `  mov    0x18(%rcx),%rax
  4a2eb4:`      48 85 c0             `  test   %rax,%rax
  4a2eb7:`      c3                   `  retq␣␣␣
  4a2eb8:`      0f 1f 84 00 00 00 00 `  nopl   0x0(%rax,%rax,1)
  4a2ebf:`      00␣
lib/CodeGen/MachineOutliner.cpp
1031

If I understand what this is doing correctly, it can be easily made less than O(N^2) by sorting ascending by Start and descending by End (SROA does something similar to do efficient overlap calculations).

lib/Target/X86/X86InstrInfo.cpp
10387

This name does not follow the coding standard. Should be getOutliningBenefit or something

10400

isFunctionSafeToOutlineFrom

Also, this pass will almost surely introduce timing side-channel attacks into cryptography code (code that would otherwise by "constant time" and needs to be for security).

I'm not sure how heavily we care about this security aspect as a community, but I'm a slightly wary of having this on by default at any optimization level due to this issue. E.g. a size-constrained program for a secure processing element on a phone recompiles with this option and it silently breaks the security of the entire device. Hopefully the folks programming the secure element have some sort of testing to avoid this or at least have all critical primitives written in asm (or done by a hardware peripheral).

I can't think of any other optimizations we have that would move a program away from being "constant time"; is there any precedent?

I'm not sure how heavily we care about this security aspect as a community, but I'm a slightly wary of having this on by default at any optimization level due to this issue.

Not that it is on by default right now. Just a concern to keep in mind down the road.

Also, this pass will almost surely introduce timing side-channel attacks into cryptography code (code that would otherwise by "constant time" and needs to be for security).

I'm not sure how heavily we care about this security aspect as a community, but I'm a slightly wary of having this on by default at any optimization level due to this issue. E.g. a size-constrained program for a secure processing element on a phone recompiles with this option and it silently breaks the security of the entire device. Hopefully the folks programming the secure element have some sort of testing to avoid this or at least have all critical primitives written in asm (or done by a hardware peripheral).

I can't think of any other optimizations we have that would move a program away from being "constant time"; is there any precedent?

?!? This should be true for most compiler transformations. I don't know how these problems are handled in practice but I doubt they enable compiler optimizations. I don't see why we should start this discussion with this particular review.

?!? This should be true for most compiler transformations. I don't know how these problems are handled in practice but I doubt they enable compiler optimizations. I don't see why we should start this discussion with this particular review.

I agree that we don't want to discuss it in this review (that's why I said "down the road"), but most compiler transformations I can think of remove indirection or otherwise simplify things towards the set of "constant time" instructions (such as elementary reg-reg adds and such). This pass introduces call instructions into arbitrary code (and calls on x86 architecturally write to memory and are subject to branch prediction, etc.). I agree, let's not have this discussion here though.

btw, down the road you may want to have this pass really know in detail the encoded length of each instruction on x86. There are quite a few *single instructions* that would be beneficial from a code size perspective to outline (if the outlined function is set to have alignment of 1). A quick analysis of an LLD binary (which contains all of LLVM linked in for LTO) shows there is over 5% code size savings just from outlining single instructions (since many x86 instructions encode to be larger than a CALL instruction which is 5 bytes). About half of the benefit (so about 2-3% of the total on this test case) comes from instructions that reference the stack via %rsp (mostly zeroing out stack slots), which could still be outlined if the offset was rewritten.

paquette marked 15 inline comments as done.Feb 27 2017, 1:37 PM

... 403.gcc and 483.xalancbmk (at least) seem to have a huge compile time slowdown (superlinear behavior?). Some rough numbers comparing LLC runtime:
403.gcc 11s -> 66s
483.xalancbmk 16s -> 144s
(so about 5-10x slowdown of LLC due to the suffix tree)

Most of the time seems to be spent inside buildCandidateList. Sampling a couple stacks it seems like it is stuck in findBest...
...
Modulo the pruning that is going on, we seem to do O(N) work in bestRepeatedSubstring once per outlining candidate. Is the pruning effective enough that the sum of all calls to bestRepeatedSubstring doesn't grow out of control? My suspicion is that it isn't, and I think a contrived case like AAABBBCCCDDD... (Assume "A" represents constant-size string large enough to be profitable to outline) will trigger O(N^2) behavior in the number of instructions in the module.
Is it possible to do algorithmically better? (exploiting suffix tree invariants maybe?)

Yeah that'd be a nasty case, and it's worth looking into, for sure.

Some quick ideas off the top of my head:

  • Pre-prune nodes which can never lead to outlining candidates while setting suffix indices. This would still be vulnerable to programs that look like AABBCC, but may improve query time on average.
  • Keep track of a collection of prospective "Outlining points". During the first traversal, if we find anything beneficial remember where it was. On the next traversal, if we have a next best point, start at that point instead of the root.
  • Keep track of every beneficial substring, during one O(n) traversal, and prune overlaps choosing the most beneficial ones greedily.

... it seems that the cost function does not take into account that the outlined functions will have some minimum alignment applied to them (or can you mark them as not requiring this alignment? still, it would end up depending on linker placement (alignment of adjacent sections) and such as to how much padding actually is inserted).

I'll have to look into that and see what happens.

On 483.xalancbmk, the suffix tree based outliner find 2311 functoins to outline, and almost all of them are 2 instructions, which is typically less than 16 bytes, which is the minimum alignment that will be imposed (just from looking at the output binary).
...
Overall, it seems like the vast majority of the benefit on 483.xalancbmk is due to extremely short instruction sequences. But if we are going to avoid very short instruction sequences because they actually aren't profitable, then most of the outlinable instructions disappear on this test case (and at a glance, the other SPEC benchmarks are pretty similar). I'd also like to note that this testing is with FullLTO, so it is a best-case scenario for the outliner (whole program visibility to the suffix tree).

Did you modify the benefit function to verify that removing length-2 instruction sequences actually removes most candidates? We could have found, say BC as the most beneficial, which would prune out all instances of ABC. There could very well be repeated instances of ABC that would be beneficial to outline. It might be possible to impose a minimum length restriction on x86 without losing all of the candidates.

What kinds of programs does this outliner perform well on?

In the test suite, the x86 outliner tended to do well on programs with heavy macro usage or automatically-generated code.

As you found, x86 is a particularly hostile environment for this sort of pass. :) It was just used for a proof of concept and for ease of testing. Most work for this pass should be done for other targets, like, say ARM64.

btw, down the road you may want to have this pass really know in detail the encoded length of each instruction on x86. There are quite a few *single instructions* that would be beneficial from a code size perspective to outline (if the outlined function is set to have alignment of 1). A quick analysis of an LLD binary (which contains all of LLVM linked in for LTO) shows there is over 5% code size savings just from outlining single instructions (since many x86 instructions encode to be larger than a CALL instruction which is 5 bytes). About half of the benefit (so about 2-3% of the total on this test case) comes from instructions that reference the stack via %rsp (mostly zeroing out stack slots), which could still be outlined if the offset was rewritten.

I would really love to do this, but I'm not sure if it's possible in LLVM at the moment. If it is, then I'll gladly add it in since I think it's probably one of the main reasons that x86 tests can get larger rather than smaller. The only thing that's (architecturally) tricky is that the target would need to know about the instruction-integer mappings. This could be done by moving the InstructionMapper over to the target, but I'm not sure if that's the best approach. If it's okay to do that, I doubt it'd be too difficult.

paquette updated this revision to Diff 89928.Feb 27 2017, 1:55 PM

Okay, here's the next revision, everyone!

Changes

  • Debug info is now skipped over as if it doesn't exist
  • isLegalToOutline->getOutliningType, which returns Legal, Illegal, or Invisible. Invisible is used for instructions which should be ignored.
  • Combined outliner tests
  • Added test with debug info
  • Outlined functions are private again without the wonky "hash"
  • Style conformance changes, etc.

My LGTM still stands. Should I commit on your behalf or do you already have access?

aprantl added inline comments.Feb 27 2017, 2:02 PM
lib/Target/X86/X86InstrInfo.cpp
10439

Thanks!

test/CodeGen/X86/machine-outliner-debuginfo.ll
43 ↗(On Diff #89928)

There should also be a negative check to ensure no DBG_VALUE is in the outlined function and that that no debug locations are attached to the outlined function.

My LGTM still stands. Should I commit on your behalf or do you already have access?

I don't have commit access yet, so go ahead.

paquette updated this revision to Diff 89937.Feb 27 2017, 2:17 PM

Changes

  • Added negative test for debug info in machine-outliner-debuginfo.ll. The check allows for is_stmt debug stuff because that seems to be out of my control.
paquette updated this revision to Diff 89950.Feb 27 2017, 3:48 PM

Changes

  • Realized that the last debug test wasn't sufficient, so I fixed it up. It now handles debug values as well.
paquette updated this revision to Diff 89951.Feb 27 2017, 4:15 PM

Third time is the charm. Edits to the tests. The debug test now *truly* makes sure that debug values don't impact outlining. Also removed some cruft from the tests.

This revision was automatically updated to reflect the committed changes.

I went ahead and committed the current state as we believe all immediately actionable things are addressed. And I'd really like us to do further work on it upstream so we don't get a 2000 lines review for every little change.

We do appreciate all the discussion here and hope we can continue on llvm-dev and the upcoming patches.

Nice to see this finally land!

FWIW, I did talk to a security professional inside google (the type of person for which common advice "don't write your own crypto" doesn't apply) and they said that they weren't particularly worried about the transformation done by this pass. Phew!