This is an archive of the discontinued LLVM Phabricator instance.

TransformUtils: Introduce module splitter.
ClosedPublic

Authored by pcc on Aug 18 2015, 7:57 PM.

Details

Summary

The module splitter splits a module into linkable partitions. It will
be used to implement parallel LTO code generation.

This initial version of the splitter does not attempt to deal with the
somewhat subtle symbol visibility issues around module splitting. These
will be dealt with in a future change.

Diff Detail

Event Timeline

pcc updated this revision to Diff 32493.Aug 18 2015, 7:57 PM
pcc retitled this revision from to TransformUtils: Introduce module splitter..
pcc updated this object.
pcc added reviewers: mehdi_amini, chandlerc.
pcc added a subscriber: llvm-commits.
mehdi_amini edited edge metadata.Aug 18 2015, 10:52 PM

Hi Peter,

I wonder if we can't make the split really more efficient, especially memory wise. Ideally we should seek to:

  • Limit looping over the module content
  • Move content instead of duplicating it.

This is especially important since LTO deals with very large module.
So reusing the cloneModule() implementation is clever and nicely done, but on the long term I'm not sure it is the way to go. I can imagine many speed/memory improvements that could be done by splitting a Module in N partitions in one pass.
So I wonder: if we want to really productize this, I'm not sure that the current code sharing/intrusion in cloneModule() is the right way to go. What do you think?

If you still want to move forward with this patch as is for now, you may find some thoughts about the patch inline.

include/llvm/Transforms/Utils/SplitModule.h
32

What about adding a warning/todo/fixme about "some subtle symbol visibility issues" that have to be fixed?
(I notices you mention it'll be fixed later)

34

A function that "Split a Module in N partition" is very valuable. I'm not necessarily convinced that the interface doesn't intermix two different things:

  • Splitting the module
  • What to do with the partitions (including issuing threads, etc.)

What about providing the API that only cares about the actual split here?
(I understand that it does not provide some pipeline where the first thread starts before the other partitions are created, but still)

std::vector<std::unique_ptr<Module>> SplitModule(std::unique_ptr<Module> M, unsigned N);

std::vector<std::unique_ptr<Module>> SplitModule(std::unique_ptr<Module> M, unsigned N, std::function<unsigned(GlobalValue &GV)> PartitionSelection);

Where the callback allows the caller to control the partition logic.
The first one could call the second one with your current modulo-hash partition method as a callback.

Also if the Module is left in an undetermined state, the caller won't be able to do anything with it but destroying it, so we might better take ownership of it anyway. It might be possible to re-use it for partition 0 and save one clone?

lib/Transforms/Utils/CloneModule.cpp
41

I would found the callback call sites more readable with a name like "ShouldCloneDefinition", because "CloneDefinition" makes me think it will actually Clone something.
(just an opinion, feel free to do whatever seems best for you)

94

Function attributes are not required for correctness? I'm unsure about that.

lib/Transforms/Utils/SplitModule.cpp
37

Minor comment: the file is lib/Transforms/Utils/SplitModule.cpp but the name includes "lto"
Some context is leaking ;-)

tools/llvm-split/llvm-split.cpp
51

Not sure why the lambda needs to capture by copy?

pcc updated this revision to Diff 32723.Aug 20 2015, 12:14 PM
pcc edited edge metadata.
  • Address review comments
pcc added a comment.Aug 20 2015, 12:14 PM
In D12132#227433, @joker.eph wrote:

Hi Peter,

I wonder if we can't make the split really more efficient, especially memory wise. Ideally we should seek to:

  • Limit looping over the module content
  • Move content instead of duplicating it.

This is especially important since LTO deals with very large module.
So reusing the cloneModule() implementation is clever and nicely done, but on the long term I'm not sure it is the way to go. I can imagine many speed/memory improvements that could be done by splitting a Module in N partitions in one pass.

I agree with all of this, and I'd be interested in exploring any ideas around making this more efficient, but at the very least I think it would be useful to have this in the tree as a benchmark of comparison against more efficient approaches. I suppose the question is what the best way of structuring this code would be given that we intend to replace parts of it with a better design later.

So I wonder: if we want to really productize this, I'm not sure that the current code sharing/intrusion in cloneModule() is the right way to go. What do you think?

Whichever component we replace cloneModule() with will basically need to solve the same problem of providing some way to partition the module. It should be simple to revert the change to cloneModule() later once we have something more reasonable; it should mostly be a matter of constant folding the calls to ShouldCloneDefinition.

include/llvm/Transforms/Utils/SplitModule.h
32

Done

34

I found that pipelining can give a significant performance boost -- on my machine it can take around 40 seconds to create each partition for Chromium -- so I'd like to find a way to keep it if possible. I came up with a new design for the API that avoids this function having to deal with threads, which I've implemented in the new patch.

Regarding the partition selection callback, I would suggest that we hold back on that until there is a use case for it.

Also if the Module is left in an undetermined state, the caller won't be able to do anything with it but destroying it, so we might better take ownership of it anyway.

Done, but D12205 will be needed later for the code generator.

It might be possible to re-use it for partition 0 and save one clone?

Yes, good point, that should be possible. I've added a FIXME.

lib/Transforms/Utils/CloneModule.cpp
41

Done

94

Yes, hence "generally". Note that this is for external references only. Thread-local variables are the only case that springs to mind where attributes matter. But it turns out that aliases of thread-local variables are broken anyway:

$ cat tl1.ll
@x = thread_local global i32 0
@y = alias i32* @x

define void @f() {
  store i32 42, i32* @x
  store i32 42, i32* @y
  ret void
}
$ ~/src/llvm-build-rel/bin/llc -o - tl1.ll -mtriple=x86_64-pc-win32
	.text
	.def	 f;
	.scl	2;
	.type	32;
	.endef
	.globl	f
	.align	16, 0x90
f:                                      # @f
.Ltmp0:
.seh_proc f
# BB#0:
.Ltmp1:
	.seh_endprologue
	movl	_tls_index(%rip), %eax
	movq	%gs:88, %rcx
	movq	(%rcx,%rax,8), %rax
	movl	$42, x@SECREL32(%rax)
	movl	$42, y(%rip)
	retq
.Ltmp2:
	.seh_endproc

	.section	.tls$,"dw"
	.globl	x                       # @x
	.align	4
x:
	.long	0                       # 0x0


	.globl	y
y = x
$ ~/src/llvm-build-rel/bin/llc -o - tl1.ll -mtriple=x86_64-unknown-linux
	.text
	.file	"tl1.ll"
	.globl	f
	.align	16, 0x90
	.type	f,@function
f:                                      # @f
	.cfi_startproc
# BB#0:
	movl	$42, %fs:x@TPOFF
	movl	$42, y(%rip)
	retq
.Lfunc_end0:
	.size	f, .Lfunc_end0-f
	.cfi_endproc

	.type	x,@object               # @x
	.section	.tbss,"awT",@nobits
	.globl	x
	.align	4
x:
	.long	0                       # 0x0
	.size	x, 4


	.globl	y
y = x
	.section	".note.GNU-stack","",@progbits
$ ~/src/llvm-build-rel/bin/llc -o - tl1.ll -mtriple=x86_64-apple-darwin
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_f
	.align	4, 0x90
_f:                                     ## @f
	.cfi_startproc
## BB#0:
	pushq	%rax
Ltmp0:
	.cfi_def_cfa_offset 16
	movq	_x@TLVP(%rip), %rdi
	callq	*(%rdi)
	movl	$42, (%rax)
	movl	$42, _y(%rip)
	popq	%rax
	retq
	.cfi_endproc

.tbss _x$tlv$init, 4, 2                 ## @x

	.section	__DATA,__thread_vars,thread_local_variables
	.globl	_x
_x:
	.quad	__tlv_bootstrap
	.quad	0
	.quad	_x$tlv$init


	.globl	_y
_y = _x
.subsections_via_symbols
lib/Transforms/Utils/SplitModule.cpp
37

Fixed

tools/llvm-split/llvm-split.cpp
51

This protected against the possibility of std::string (OutName) not being thread safe, but with the API change this is no longer needed.

pcc added inline comments.Aug 20 2015, 12:22 PM
lib/Transforms/Utils/CloneModule.cpp
94

I spoke too soon. Looks like aliases can be thread-local, and their thread-localness is a property of the alias rather than being inherited from the aliasee. That seems really bizarre. But in any case we are copying the thread-local property from the alias, so we are fine as far as thread-localness goes.

pcc updated this revision to Diff 32734.Aug 20 2015, 1:34 PM
  • Remove unneeded dependency, rename CloneDefinition -> ShouldCloneDefinition, add makefile
pcc updated this revision to Diff 32741.Aug 20 2015, 2:00 PM
  • Update test
mehdi_amini accepted this revision.Aug 20 2015, 4:24 PM
mehdi_amini edited edge metadata.

LGTM!

tools/llvm-split/llvm-split.cpp
51

Why is this needed at all? Can't the lambda reference OutputFilename directly?

This revision is now accepted and ready to land.Aug 20 2015, 4:24 PM
pcc added inline comments.Aug 20 2015, 7:49 PM
tools/llvm-split/llvm-split.cpp
51

Just a leftover from the old code; removed.

This revision was automatically updated to reflect the committed changes.