Page MenuHomePhabricator

Outliner: Add MIR-level outlining pass
ClosedPublic

Authored by paquette on Nov 18 2016, 3:01 PM.

Details

Summary

This is an updated patch for the outliner described in the RFC at: http://lists.llvm.org/pipermail/llvm-dev/2016-August/104170.html

The outliner is a code-size reduction pass which works by finding repeated sequences of instructions in a program, and replacing them with calls to functions. This would be especially useful to people working in low-memory environments, where sacrificing performance for space is acceptable.

This would add a interprocedural outliner directly before printing assembly. For reference on how this would work, this patch also includes X86 target hooks and an X86 test.

The outliner is run like so:

clang -mno-red-zone -mllvm -enable-machine-outliner file.c

I would love for people to test it out and tell me about how well it works for them, and maybe even play around with the provided target hooks. Tell me what you think!

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
silvas added inline comments.Jan 10 2017, 4:34 AM
lib/CodeGen/MachineOutliner.cpp
95 ↗(On Diff #83675)

Can you add a high-level explanation of why we have a 2D vector in the first place. I.e. why do we need to "pretend" instead of just materializing the flattened vector?

One thing that may help make this clearer is encapsulating the mutation of MBBMappings a bit better. Right now there seem to be some un-encapsulated mutations, so just looking at the class it's not clear what the elementary operations on it are.

122 ↗(On Diff #83675)

It shouldn't be too hard to bring this down to log(N) if necessary, but it surprises me that O(N) is fine when N = number of MBB's in the module (for reference, a typical FullLTO for a codebase ~ the size of clang has 10's of thousands of functions, and probably an order of magnitude more MBB's). Can you add a comment explaining that it takes linear time and why that's fine (or not fine, but fine for now)?

aprantl added inline comments.Jan 12 2017, 8:53 AM
lib/CodeGen/MachineOutliner.cpp
199 ↗(On Diff #78585)

Did you get a chance to look into this?

I did a quick run on SPEC CPU 2006 with FullLTO and it seems like I ran into 3 different assertion failures across various programs: https://reviews.llvm.org/P7954
There seem to be 3 different assertions getting hit.

Here are some basic bugpoint-reduced test cases. Repro with llc -enable-machine-outliner -O3 foo.ll (though I expect these will be sensitive to minor codegen changes; I assume you have access to SPEC so you can reduce them again if needed)

https://reviews.llvm.org/P7957 :

Assertion `!NodePtr->isKnownSentinel()' failed.

https://reviews.llvm.org/P7958 :

Assertion `(isImpReg || Op.isRegMask() || MCID->isVariadic() || OpNo < MCID->getNumOperands() || isMetaDataOp) && "Trying to add an operand to a machine instr that is already done!"' failed.

https://reviews.llvm.org/P7960 :

Assertion `Occurrences.size() > 0 && "Longest repeated substring has no occurrences."' failed.

Also, here is a case that takes a really long time to compile (it does eventually finish) stuck in MachineOutliner::buildCandidateList :
https://reviews.llvm.org/P7959
Bugpoint found this one from the "Occurrences.size() > 0" assertion failure case.

For now, it's probably best to focus on the code review comments. Once the code is in good shape and committed, these sorts of bugs can be incrementally hammered out (tracked in bugzilla, fixed with clear individual patches (with test cases :) )).

include/llvm/CodeGen/Passes.h
402 ↗(On Diff #83675)

Nit: comment and function aren't aligned.

include/llvm/Target/TargetInstrInfo.h
1518 ↗(On Diff #83675)

Nit: inconsistent indent.

lib/CodeGen/MachineOutliner.cpp
48 ↗(On Diff #83675)

Nit: this won't work on case-sensitive file systems.

In case you're wondering, that didn't take that long to reduce to that point. Just have to learn the tools and workflow. We have amazing tools for doing test case reduction and an easily understandable test-suite build system! (mad props to those who made bugpoint; the folks that made test-suite awesome; and LLD's -save-temps; and a bit of -globalopt and -metarenamer to clean things up)

  • I hear that you are planning to get rid of the ProgramMapping construct which would be my last real stopper.
  • There is still a lot of nitpicky stuff around.
  • This time I checked the suffix tree construction algorithm which looks good.
  • For future patches: Having a DenseMap<unsigned, SuffixTreeNode *> in every node is potentially wasteful. Intuitively I would expect the majority of nodes to only have a small number of entries, so it may be good to measure and explore alternative representations like a dynamically adapting datastructure (i.e. switching from linear list to map depending on number of children).
include/llvm/CodeGen/Passes.h
401–402 ↗(On Diff #83675)

This should probably be createMachineOutlinerPass() to be consistent.

include/llvm/Target/TargetInstrInfo.h
1518 ↗(On Diff #83675)

indentation.

You should also consider to move these functions towards the other functions so the "private:" part can stay at the end of the class definition.

1530 ↗(On Diff #83675)

Indentation

1559 ↗(On Diff #83675)

As this is a codegen API I would rather pass a MachineFunction&

lib/CodeGen/MachineOutliner.cpp
40 ↗(On Diff #83675)

Move this below the #includes so you do not accidentally affect the headers.

63–64 ↗(On Diff #83675)

Maybe drop the Stat suffix, esp. in a statistic dump that looks superfluous.

172 ↗(On Diff #83675)

Suggestion (possibly for later patches):
As far as I see it a node is either a leaf or an inner node and never changes it nature. You could make this and the constraints on the End and Children members a bit more obvious when representing this in a type hierarchy (and safe a bit of memory):

struct SuffixTreeNode {
  bool IsLeaf; ...
};
struct SuffixTreeLeafNode : public SuffixTreeNode {
  size_t *EndIdx;
  size_t SuffixIdx;
};
struct SuffixTreeInternalNode : SuffixTreeNode {
  Map<SuffixTreeNode*> Children;
};
237 ↗(On Diff #83675)

can be const.

238–243 ↗(On Diff #83675)
  • Maybe handle the special case early:
if (StartIdx == EmptyIdx)
  return 0;

return *EndIdx - StartIdx + 1;
  • I assume this is not supposed to be called with *EndIdx == EmptyIdx? Add an assert()?
248–250 ↗(On Diff #83675)

Interesting to see this packaged up in an own struct instead of just putting the members directly into the SuffixTree class. But doesn't hurt either I guess.

302 ↗(On Diff #83675)

It looks like you can use a single allocator for nodes and EndIndexes.

345 ↗(On Diff #83675)

Maybe add

else
  assert(EndIdx == EmptyIdx);

to make sure callers know what they are doing. An alternative would be to provide different functions for inserting leafs or inner nodes.

363 ↗(On Diff #83675)

You can save an indentation level here with

if (ChildPair.second == nullptr)
  continue;
387 ↗(On Diff #83675)

As you only have 1 "out" parameter you could simply return the new value instead. At the call side I find SuffixesToAdd = extend(x, y, SuffixesToAdd); easier to understand when you do not have to wonder whether a parameters is an "out" parameter.

The NeedsLink parameter is nullptr for all callers?

390–391 ↗(On Diff #83675)

General note on comments: I would expect comments at this indentation level to talk about properties/situations at that level and not just inside the if. This gets clearer if you formulate the comment as a question, or an if-then statement (or move the comment into the if block):

// Look at the last character if the current mapping is 0.
if (Active.Len == 0)
  Active.Idx = EndIdx;

// Current mapping is 0?
if (Active.Len == 0) {
  // Look at the last added character.
  Active.Idx = EndIdx;
}
397 ↗(On Diff #83675)

LastChar can be moved further down as it's not used by some paths through the function.

472 ↗(On Diff #83675)

Add a comment that you check whether Active.Node is the root
or alternatively add a SuffixTreeNode::isRoot() function.

483–484 ↗(On Diff #83675)

Indentation

490 ↗(On Diff #83675)

This should rather be ArrayRef<unsigned>.

512 ↗(On Diff #83675)

Add assert(SuffixesToAdd == 0);?

517 ↗(On Diff #83675)

Maybe setSuffixIndices(..., /*LabelHeight =*/0) instead of the local variable so readers do not keep wondering whether it is an "out" parameter.

687 ↗(On Diff #83675)

move this into the loop.

803–807 ↗(On Diff #83675)

C++ does the right thing for Name(Name) etc. so you can drop the _ suffixes from the parameter names. Similar with some other constructors.

809 ↗(On Diff #83675)

You should extend the anonymous namespace to include the MachineOutliner class. The only things that needs to be visible to the outside are initializeMachineOutlinerPass() (=the stuff coming out of INITIALIZE_PASS) and createOutlinerPass().

832 ↗(On Diff #83675)

Because this relies on implementation details of DenseMapInfo, better play it safe with static_assert so the compilation fails if someone decides to change the values:

static_assert(DenseMapInfo<unsigned>::getEmptyKey() == (unsigned)-1);
static_assert(DenseMapInfo<unsigned>::getTombstoneKey() == (unsigned)-2);

that way things also explain themselves and you can get away with a shorter comment.

The module pass instance can in theory be reused for multiple programs. So the state here needs to be initialized and cleared in runOnModule().

834 ↗(On Diff #83675)

It's the next number to be assigned, isn't it? Same with CurrIllegalInstrMapping.

921 ↗(On Diff #83675)

Indentation, MBB can be const

952–953 ↗(On Diff #83675)

The overflow could (in theory) be triggered by a user and not just by compiler bugs. So could use report_fatal_error() so it stays around in release builds:

if (CurrLegalInstrMapping < CurIllegalInstrMapping)
  report_fatal_error("Instruction mapping overflow!");
954 ↗(On Diff #83675)

Use DenseMapInfo<unsigned>::get{Empty|Tombstone}Key() instead of hardcoding the values.

1008 ↗(On Diff #83675)

Could use emplace_back(OccBB, StartIdxInBB, ...)

1017 ↗(On Diff #83675)

Could use emplace_back().

1050 ↗(On Diff #83675)

I think getOrInsertFunction() copies the name and does not take ownership of the passed string so this is an unnecessary copy and a memory leak.

1066–1068 ↗(On Diff #83675)

Use references for variables that cannot be nullptr.

1103–1104 ↗(On Diff #83675)

I think CandidateList and FunctionList can be const or better ArrayRef.

1213 ↗(On Diff #83675)

Move to assignment.

1214–1215 ↗(On Diff #83675)

As you have some state such as CurrentFunctionID, CurrLegalMapping in the class anyway, maybe the two vectors and the Worklist can move there as well so you do not need to pass them around? Just need to clear() them at the end of the function then.

And a few more for the X86 part.

lib/Target/X86/X86InstrInfo.cpp
9764–9766 ↗(On Diff #83675)

Heh, in theory every single x86 instruction modifies RIP. But I assume we don't model it like that in LLVM. In any way restricting this to reads(RIP) should be enough.

9776–9781 ↗(On Diff #83675)

Are those tests necessary given that you already throw out operations with FrameIndex operands?

9783 ↗(On Diff #83675)

You could use MachineInstr::isPosition() instead of checking for isLabel() and isCFIInstruction()

9786 ↗(On Diff #83675)

Better use const MachineOperand &MOP to avoid some copying.

lib/Target/X86/X86InstrInfo.h
617 ↗(On Diff #83675)

This linebreak seems unnecessary.

pmatos added a subscriber: pmatos.Jan 30 2017, 4:28 AM
aprantl added inline comments.Feb 21 2017, 3:34 PM
lib/CodeGen/MachineOutliner.cpp
199 ↗(On Diff #78585)

Ping :-)

Note that it is really important to skip over any DBG_VALUE intrinsics while deciding whether to outline a sequence of instructions. Otherwise compiling with -g will produce different code than without, which we generally consider to be serious bug in the compiler.

paquette updated this revision to Diff 89535.Feb 23 2017, 10:40 AM
paquette marked 71 inline comments as done.

Alright, it's been a while, but here's the next version of the outlining patch! As always, thanks to everyone taking the time to read through all this code. This version of the outliner is quite different from the previous one, since I've improved on it a lot since the last patch.

Major changes

  • More LLVM-ey and most comments addressed!
  • X86 target won't outline debug instructions anymore. There are other things to think about wrt debug info, which I'm currently working on.
  • More tests: Right now, MIR tests with the outliner make LLC unhappy, so I wrote a couple IR tests which should be easy enough to transition to MIR.
  • No more ProgramMapping: instead there's another vector which keeps track of the positions of each instruction in the program. It uses iterators because the delete function on MachineBasicBlocks takes a start and end iterator. (If that makes anyone uncomfortable, I can change it to pointers.)
  • Improved suffix tree pruning: the previous version was too aggressive and threw out too many candidates. The new version uses a vector of leaves.
  • No more unnecessary functions: the previous version had a FIXME stating that sometimes the outliner could create unnecessary functions when all of the candidates for a function were removed. Now overlap pruning happens directly before outlining, so functions are created as they're needed.
  • Outlined functions are now link once ODR: This allows the linker to dedupe outlined functions without LTO.
  • Mapper class: This class performs the instruction-integer mappings and is passed around the outliner
  • General suffix tree queries: Different targets will have different benefit functions, and even different types of functions to outline. For example, after this, I have a version of the outliner which supports tail-calling outlined functions. In the interest of keeping target-specific stuff out of the SuffixTree, the target now defines a benefit function, which is then maximized by the DFS query for repeated substrings. I think this will allow for more fine-grained outlining on various targets.

Tell me what you think!

aprantl added inline comments.Feb 23 2017, 11:13 AM
lib/Target/X86/X86InstrInfo.cpp
10412 ↗(On Diff #89535)

This is not the right way to do this. We need to skip over DBG_VALUE instruction as if they didn't exist. Otherwise the presence of DBG_VALUEs in the instruction stream will have an effect on the outlining decision, which means that compiling with -g will generate different code than without.

Please also be sure to include a testcase that exercises this.

MatzeB accepted this revision.Feb 23 2017, 4:18 PM

Thanks for pushing this, this is coming along nicely!

At this point I don't see any correctness or compiletime problems (with the comments below addressed).
Let's commit this and keep improving it in-tree. This should also make reviewing easier as we can have smaller follow-up patches.

lib/CodeGen/MachineOutliner.cpp
782–783 ↗(On Diff #89535)

An instance of this only describes a single outlined function.

800 ↗(On Diff #89535)

doxygen ///

844 ↗(On Diff #89535)

I don't think you need to initialize this, it gets overwritten anyway in the next line.

1225–1241 ↗(On Diff #89535)

Hmm...

  • This hash doesn't seem collision free. Someone having two files with the same name (maybe in two different projects that he links together later) may happen.
  • Of course a collision shouldn't hurt as the linker will compare the contents anyway, but why even bother with a hash then?
  • I think the linker will only try to merge functions with the same name but the function name(-hash) is currently based on the name not the contents of the function so I would expect this to be not helpful in most cases.

Maybe stay with the previous internal linking and try the LinkOnce tricks in a follow-up commit (where it is based on the contents).

lib/Target/X86/X86InstrInfo.cpp
10424–10425 ↗(On Diff #89535)

This is surprising, checking the MCInstrDesc should not be necessary.
This is most probably a bug somewhere else in codegen, so there is nothing we can do here.

However I'd be good if you could find the time later to create a reproducer and file a PR about it, reading and writing registers without having operands for it looks like a bug waiting to happen elsewhere.

test/CodeGen/X86/machine-outliner-basic.ll
1 ↗(On Diff #89535)

Better use -mtriple=x86_64-- instead of -march so we also force the operating system etc.

24 ↗(On Diff #89535)

You probably want to force this test to use the same outlined function everywhere. FileCheck allows assigning names and checking for repeated patterns:

; CHECK: callq [[_OUTLINED_FUNCTION[0-9]+_0:OUTLINEFUNC]]
...
; CHECK: callq [[OUTLINEFUNC]]
...
; CHECK-LABEL: [[OUTLINEFUNC]]:
test/CodeGen/X86/machine-outliner-bb-boundaries.ll
1 ↗(On Diff #89535)

You can probably merge the tests together into a single file as they are all about the same pass and use the same llc flags.

22 ↗(On Diff #89535)

I'd remove those standard dumping comments. If you actually care about the label give it a real name, if not the comment shouldn't be necessary either.

This revision is now accepted and ready to land.Feb 23 2017, 4:18 PM
silvas added inline comments.Feb 23 2017, 6:45 PM
lib/CodeGen/MachineOutliner.cpp
1225–1241 ↗(On Diff #89535)

Of course a collision shouldn't hurt as the linker will compare the contents anyway, but why even bother with a hash then?

No. linkonce_odr requires that if the name matches then the contents are interchangeable, since one gets selected arbitrarily. So for correctness the hash must be collision-free.

(see also the discussion in D29512 which also involves finding a stable "name" for the TU)

Also, I don't see the point of doing this. The linker's content-based deduplication ("ICF") should handle this case without caring about the name. If you want to use the linker's comdat/linkonce (i.e. name-based) deduplication then you can just use the function's contents as the name (mangling away NUL bytes), or a *strong* hash (collisions are a correctness problem).

Presumably, if users are using this pass, then they care about code size and so they are likely to have ICF enabled already. So I don't see the point of doing this linkage trick.

test/CodeGen/X86/machine-outliner-interprocedural.ll
8 ↗(On Diff #89535)

The leading underscore here is darwin-specific. Add an explicit triple to avoid this (otherwise non-Darwin bots will break).

FYI this doesn't build on Linux with gcc

[94/1782] Building CXX object lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o
FAILED: lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o 
/usr/bin/c++   -DGTEST_HAS_RTTI=0 -DLLVM_BUILD_GLOBAL_ISEL -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Ilib/CodeGen -I/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen -Iinclude -I/usr/local/google/home/silvasean/pg/llvm/llvm/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wno-maybe-uninitialized -Wdelete-non-virtual-dtor -Wno-comment -std=c++11 -ffunction-sections -fdata-sections -O2 -g -DNDEBUG    -fno-exceptions -fno-rtti -MD -MT lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o -MF lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o.d -o lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MachineOutliner.cpp.o -c /usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:171:18: error: enclosing class of constexpr non-static member function ‘bool {anonymous}::SuffixTreeNode::isLeaf() const’ is not a literal type
   constexpr bool isLeaf() const { return SuffixIdx != EmptyIdx; }
                  ^
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:91:8: note: ‘{anonymous}::SuffixTreeNode’ is not literal because:
 struct SuffixTreeNode {
        ^
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:91:8: note:   ‘{anonymous}::SuffixTreeNode’ has a non-trivial destructor
/usr/local/google/home/silvasean/pg/llvm/llvm/lib/CodeGen/MachineOutliner.cpp:174:18: error: enclosing class of constexpr non-static member function ‘bool {anonymous}::SuffixTreeNode::isRoot() const’ is not a literal type
   constexpr bool isRoot() const { return StartIdx == EmptyIdx; }
                  ^
[143/1782] Building CXX object lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/RegAllocGreedy.cpp.o

The compiler is gcc 4.8.4

I just tested on SPECCPU2006 (FullLTO) and no assertion failures!

However, 403.gcc and 483.xalancbmk (at least) seem to have a huge compile time slowdown (superlinear behavior?). Some rough numbers comparing LLC runtime:
403.gcc 11s -> 66s
483.xalancbmk 16s -> 144s
(so about 5-10x slowdown of LLC due to the suffix tree)

Most of the time seems to be spent inside buildCandidateList. Sampling a couple stacks it seems like it is stuck in findBest, usually just 1 or 2 stack frames in findBest and so at least the problem isn't that it is recursing too deeply.
I added some printfs to print out the depth and vertex degree of each node in the suffix tree for 483.xalancbmk and I got this: https://reviews.llvm.org/F3114496


So it makes sense that typically one would be only 1 or 2 stack frames deep.

Modulo the pruning that is going on, we seem to do O(N) work in bestRepeatedSubstring once per outlining candidate. Is the pruning effective enough that the sum of all calls to bestRepeatedSubstring doesn't grow out of control? My suspicion is that it isn't, and I think a contrived case like AAABBBCCCDDD... (Assume "A" represents constant-size string large enough to be profitable to outline) will trigger O(N^2) behavior in the number of instructions in the module.
Is it possible to do algorithmically better? (exploiting suffix tree invariants maybe?)

Also, it looks like this pass actually increases (1-5%) text size on all of the SPEC binaries except for 401.bzip2: https://reviews.llvm.org/F3114409

(I double and triple checked and I don't have it switched around; the raw data (doublechecked the labels are right) is: https://reviews.llvm.org/P7971)

Can you please find out why this isn't helping (and in fact is hurting)? Are better heuristics needed? At the very least, the cost function seems like it needs to be amended to take into account the true overheads.

In particular, it seems that the cost function does not take into account that the outlined functions will have some minimum alignment applied to them (or can you mark them as not requiring this alignment? still, it would end up depending on linker placement (alignment of adjacent sections) and such as to how much padding actually is inserted).
On 483.xalancbmk, the suffix tree based outliner find 2311 functoins to outline, and almost all of them are 2 instructions, which is typically less than 16 bytes, which is the minimum alignment that will be imposed (just from looking at the output binary).
A naive approach which just looks for identical runs of outlinable instructions (ignoring substrings) outlines 2391 functions (slightly more). The total benefit is somewhat greater for the suffix tree though at 29379 vs 27994 for the naive approach.
This appears to be due to the outliner finding many more length-2 sequences to outline: https://reviews.llvm.org/F3114690

Overall, it seems like the vast majority of the benefit on 483.xalancbmk is due to extremely short instruction sequences. But if we are going to avoid very short instruction sequences because they actually aren't profitable, then most of the outlinable instructions disappear on this test case (and at a glance, the other SPEC benchmarks are pretty similar). I'd also like to note that this testing is with FullLTO, so it is a best-case scenario for the outliner (whole program visibility to the suffix tree). What kinds of programs does this outliner perform well on?

For reference, here are all the outlined functions from 483.xalancbmk: https://reviews.llvm.org/F3114805

One interesting thing is that they are almost all short sequences of mov instructions. Staring at the code that calls them, it's clear why this is: almost all of the outlined functions in 483.xalancbmk are in sequences like this:

...
  362a1e:`      e8 dd 20 0e 00       `  callq  444b00 <OUTLINED_FUNCTION2637142655534006531_61>
  362a23:`      e8 4c e8 09 00       `  callq  401274 <_ZN11xercesc_2_512XMLBufferMgr13releaseBufferERNS_9XMLBufferE>
...

(FWIW, I tried and IPRA doesn't actually decrease text size much on SPEC with FullLTO)

I.e. what has been outlined is function setup overhead. There are also quite a few outlined functions right before jumps, which are factoring out code sequences like this:

00000000004a2eb0 <OUTLINED_FUNCTION2637142655534006531_2458>:
  4a2eb0:`      48 8b 41 18          `  mov    0x18(%rcx),%rax
  4a2eb4:`      48 85 c0             `  test   %rax,%rax
  4a2eb7:`      c3                   `  retq␣␣␣
  4a2eb8:`      0f 1f 84 00 00 00 00 `  nopl   0x0(%rax,%rax,1)
  4a2ebf:`      00␣
lib/CodeGen/MachineOutliner.cpp
1031 ↗(On Diff #89535)

If I understand what this is doing correctly, it can be easily made less than O(N^2) by sorting ascending by Start and descending by End (SROA does something similar to do efficient overlap calculations).

lib/Target/X86/X86InstrInfo.cpp
10387 ↗(On Diff #89535)

This name does not follow the coding standard. Should be getOutliningBenefit or something

10400 ↗(On Diff #89535)

isFunctionSafeToOutlineFrom

Also, this pass will almost surely introduce timing side-channel attacks into cryptography code (code that would otherwise by "constant time" and needs to be for security).

I'm not sure how heavily we care about this security aspect as a community, but I'm a slightly wary of having this on by default at any optimization level due to this issue. E.g. a size-constrained program for a secure processing element on a phone recompiles with this option and it silently breaks the security of the entire device. Hopefully the folks programming the secure element have some sort of testing to avoid this or at least have all critical primitives written in asm (or done by a hardware peripheral).

I can't think of any other optimizations we have that would move a program away from being "constant time"; is there any precedent?

I'm not sure how heavily we care about this security aspect as a community, but I'm a slightly wary of having this on by default at any optimization level due to this issue.

Not that it is on by default right now. Just a concern to keep in mind down the road.

Also, this pass will almost surely introduce timing side-channel attacks into cryptography code (code that would otherwise by "constant time" and needs to be for security).

I'm not sure how heavily we care about this security aspect as a community, but I'm a slightly wary of having this on by default at any optimization level due to this issue. E.g. a size-constrained program for a secure processing element on a phone recompiles with this option and it silently breaks the security of the entire device. Hopefully the folks programming the secure element have some sort of testing to avoid this or at least have all critical primitives written in asm (or done by a hardware peripheral).

I can't think of any other optimizations we have that would move a program away from being "constant time"; is there any precedent?

?!? This should be true for most compiler transformations. I don't know how these problems are handled in practice but I doubt they enable compiler optimizations. I don't see why we should start this discussion with this particular review.

?!? This should be true for most compiler transformations. I don't know how these problems are handled in practice but I doubt they enable compiler optimizations. I don't see why we should start this discussion with this particular review.

I agree that we don't want to discuss it in this review (that's why I said "down the road"), but most compiler transformations I can think of remove indirection or otherwise simplify things towards the set of "constant time" instructions (such as elementary reg-reg adds and such). This pass introduces call instructions into arbitrary code (and calls on x86 architecturally write to memory and are subject to branch prediction, etc.). I agree, let's not have this discussion here though.

btw, down the road you may want to have this pass really know in detail the encoded length of each instruction on x86. There are quite a few *single instructions* that would be beneficial from a code size perspective to outline (if the outlined function is set to have alignment of 1). A quick analysis of an LLD binary (which contains all of LLVM linked in for LTO) shows there is over 5% code size savings just from outlining single instructions (since many x86 instructions encode to be larger than a CALL instruction which is 5 bytes). About half of the benefit (so about 2-3% of the total on this test case) comes from instructions that reference the stack via %rsp (mostly zeroing out stack slots), which could still be outlined if the offset was rewritten.

paquette marked 15 inline comments as done.Feb 27 2017, 1:37 PM

... 403.gcc and 483.xalancbmk (at least) seem to have a huge compile time slowdown (superlinear behavior?). Some rough numbers comparing LLC runtime:
403.gcc 11s -> 66s
483.xalancbmk 16s -> 144s
(so about 5-10x slowdown of LLC due to the suffix tree)

Most of the time seems to be spent inside buildCandidateList. Sampling a couple stacks it seems like it is stuck in findBest...
...
Modulo the pruning that is going on, we seem to do O(N) work in bestRepeatedSubstring once per outlining candidate. Is the pruning effective enough that the sum of all calls to bestRepeatedSubstring doesn't grow out of control? My suspicion is that it isn't, and I think a contrived case like AAABBBCCCDDD... (Assume "A" represents constant-size string large enough to be profitable to outline) will trigger O(N^2) behavior in the number of instructions in the module.
Is it possible to do algorithmically better? (exploiting suffix tree invariants maybe?)

Yeah that'd be a nasty case, and it's worth looking into, for sure.

Some quick ideas off the top of my head:

  • Pre-prune nodes which can never lead to outlining candidates while setting suffix indices. This would still be vulnerable to programs that look like AABBCC, but may improve query time on average.
  • Keep track of a collection of prospective "Outlining points". During the first traversal, if we find anything beneficial remember where it was. On the next traversal, if we have a next best point, start at that point instead of the root.
  • Keep track of every beneficial substring, during one O(n) traversal, and prune overlaps choosing the most beneficial ones greedily.

... it seems that the cost function does not take into account that the outlined functions will have some minimum alignment applied to them (or can you mark them as not requiring this alignment? still, it would end up depending on linker placement (alignment of adjacent sections) and such as to how much padding actually is inserted).

I'll have to look into that and see what happens.

On 483.xalancbmk, the suffix tree based outliner find 2311 functoins to outline, and almost all of them are 2 instructions, which is typically less than 16 bytes, which is the minimum alignment that will be imposed (just from looking at the output binary).
...
Overall, it seems like the vast majority of the benefit on 483.xalancbmk is due to extremely short instruction sequences. But if we are going to avoid very short instruction sequences because they actually aren't profitable, then most of the outlinable instructions disappear on this test case (and at a glance, the other SPEC benchmarks are pretty similar). I'd also like to note that this testing is with FullLTO, so it is a best-case scenario for the outliner (whole program visibility to the suffix tree).

Did you modify the benefit function to verify that removing length-2 instruction sequences actually removes most candidates? We could have found, say BC as the most beneficial, which would prune out all instances of ABC. There could very well be repeated instances of ABC that would be beneficial to outline. It might be possible to impose a minimum length restriction on x86 without losing all of the candidates.

What kinds of programs does this outliner perform well on?

In the test suite, the x86 outliner tended to do well on programs with heavy macro usage or automatically-generated code.

As you found, x86 is a particularly hostile environment for this sort of pass. :) It was just used for a proof of concept and for ease of testing. Most work for this pass should be done for other targets, like, say ARM64.

btw, down the road you may want to have this pass really know in detail the encoded length of each instruction on x86. There are quite a few *single instructions* that would be beneficial from a code size perspective to outline (if the outlined function is set to have alignment of 1). A quick analysis of an LLD binary (which contains all of LLVM linked in for LTO) shows there is over 5% code size savings just from outlining single instructions (since many x86 instructions encode to be larger than a CALL instruction which is 5 bytes). About half of the benefit (so about 2-3% of the total on this test case) comes from instructions that reference the stack via %rsp (mostly zeroing out stack slots), which could still be outlined if the offset was rewritten.

I would really love to do this, but I'm not sure if it's possible in LLVM at the moment. If it is, then I'll gladly add it in since I think it's probably one of the main reasons that x86 tests can get larger rather than smaller. The only thing that's (architecturally) tricky is that the target would need to know about the instruction-integer mappings. This could be done by moving the InstructionMapper over to the target, but I'm not sure if that's the best approach. If it's okay to do that, I doubt it'd be too difficult.

paquette updated this revision to Diff 89928.Feb 27 2017, 1:55 PM

Okay, here's the next revision, everyone!

Changes

  • Debug info is now skipped over as if it doesn't exist
  • isLegalToOutline->getOutliningType, which returns Legal, Illegal, or Invisible. Invisible is used for instructions which should be ignored.
  • Combined outliner tests
  • Added test with debug info
  • Outlined functions are private again without the wonky "hash"
  • Style conformance changes, etc.

My LGTM still stands. Should I commit on your behalf or do you already have access?

aprantl added inline comments.Feb 27 2017, 2:02 PM
lib/Target/X86/X86InstrInfo.cpp
10439 ↗(On Diff #89928)

Thanks!

test/CodeGen/X86/machine-outliner-debuginfo.ll
43 ↗(On Diff #89928)

There should also be a negative check to ensure no DBG_VALUE is in the outlined function and that that no debug locations are attached to the outlined function.

My LGTM still stands. Should I commit on your behalf or do you already have access?

I don't have commit access yet, so go ahead.

paquette updated this revision to Diff 89937.Feb 27 2017, 2:17 PM

Changes

  • Added negative test for debug info in machine-outliner-debuginfo.ll. The check allows for is_stmt debug stuff because that seems to be out of my control.
paquette updated this revision to Diff 89950.Feb 27 2017, 3:48 PM

Changes

  • Realized that the last debug test wasn't sufficient, so I fixed it up. It now handles debug values as well.
paquette updated this revision to Diff 89951.Feb 27 2017, 4:15 PM

Third time is the charm. Edits to the tests. The debug test now *truly* makes sure that debug values don't impact outlining. Also removed some cruft from the tests.

This revision was automatically updated to reflect the committed changes.

I went ahead and committed the current state as we believe all immediately actionable things are addressed. And I'd really like us to do further work on it upstream so we don't get a 2000 lines review for every little change.

We do appreciate all the discussion here and hope we can continue on llvm-dev and the upcoming patches.

Nice to see this finally land!

FWIW, I did talk to a security professional inside google (the type of person for which common advice "don't write your own crypto" doesn't apply) and they said that they weren't particularly worried about the transformation done by this pass. Phew!