This is intended to provide a parallel (threaded) ThinLTO scheme
for linker plugin use through the libLTO C API.
Thanks Teresa for the review. See answers inlined.
No guarantee provided by the API :)
PruningInterval controls the interval between two checks on the cache: i.e. if you set 2h, the plugin will not bother checking for cache invalidation for the next two hours.
It is not very helpful, but it matches LTOCodeGenerator indeed.
Aren't they almost immediately materialized anyway? I thought laziness might have a cost for no real benefit?
SourceFileName can fit! Good point.
I removed the internalization phase from this patch for now. This will change a lot with the graph in the summary.
Good point. Dunno.
This is copy/pasted from LTOCodeGenerator::applyRestriction. I read it as "you don't need to do anything: you won't internalize any further something that is already private".
OK, I'll keep this for a future version of the patch, I removed the internalize for now.
You need to block "readers" of OptimizedBuffer in case there is already a writer populating it.
No preserved or cross-referenced symbols means that you will internalize *everything* and then global DCE will remove *everything*. This is of little interest. So I just considered that empty() means the linker didn't provide any information at all. It also helps implement testing with llvm-lto.
The buffer identifiers are supplied by the linker and can be anything:
extern void thinlto_codegen_add_module(thinlto_code_gen_t cg, const char *identifier, const char *data, int length);
Here there is no good reason, I think. StringMap will allocate the entry together with the key, so you want a value that is quite small (for rehashing / growing the map).
Good point, will update to try using a set of flags as close as possible to LTOCodeGenerator.
The reason llvm-lto does not use run() is to test steps in isolation.
Is the max cache size set via the space that was available at the start of the compilation, or is the threshold updated so that if something else comes along and eats up some disk space the allowable max cache size is adjusted downward?
It just wasn't clear how they interact, although your explanation was what I guessed. Maybe for the thinlto_codegen_set_cache_entry_expiration interface, document that the module may be removed earlier due to the pruning interval.
Also, for the future, consider caching the value unless it is expected to be called only once per module.
When lazy metadata linking is enabled, metadata parsing should be postponed until we actually go to do the import and linkInModule during the bulk importing. Since the source module is destroyed at the end of linkInModule, this should reduce the maximum amount of live metadata to one module at a time, rather than all the modules we are importing from. I'm surprised it didn't reduce the max memory usage in a debug build.
If we ever go back to post-pass metadata linking in the function importer (i.e. if we decide to do iterative importing rather than a single bulk import from each module), this will become more critical. However, if we move to summary-only importing decisions it will obviate the need to do this.
Until we go to summary only import decisions, I would think it should reduce the max memory usage as per my first paragraph. Did you see a cost going with lazy metadata loading?
Is this a TODO? If so, please add TODO comment.
Add a doxygen comment describing ModuleMap.
Ok. When you put this support back, your description above ("you don't need to do anything: you won't internalize any further something that is already private") is better than the comment about restriction.
Ok, but I think you still want the linkonce/weak linkage changes? When you put this back, that is more reason to split the internalization and linkonce/weak linkage changes into two routines and only invoke the latter if the sets are empty.
What is the difference between this ModuleBuffer and the one loaded out of the ModuleMap in the below call to loadModuleFromBuffer? If they are the same, can loadModuleFromBuffer be called a single time and the resulting module optionally saved after?
Unfortunately this means that ThinLTOCodeGenerator::run() is currently untested. Consider adding a mode to llvm-lto that does all thinlto steps (thin-action=all?) to test all the steps via this interface.
Actual option below is "functionindex". But per the ref graph patch, this is broader than a function index, so I am looking at changing references to function index to something else anyway. So I would suggest changing the actual option below to "thinlto-index".
s/mentionned/mentioned/ (here and below for import()).
Adding all the modules isn't needed for promotion. Should be able to remove this loop.
There is a lot of code duplication between these various functions, consider refactoring possibly in a follow-on patch.
Why is the import() step a superset of promote and import, whereas the other steps (e.g. optimize()) only do one thing?
What if there is more than one occurrence?
Mmmm, this is yet to be implemented.
Now I'm unsure it was clear enough, because you wrote "the module may be removed earlier due to the pruning interval". A cache entry *can't* be removed earlier. The pruning interval means that we will only *check* the entries at that interval.
(I expect this to come back with a very different summary-based implementation)
Good point, this is a legacy from when the internalization stage was still here; will clean up!
It isn't needed... for now ;)
Same reason as above: when I wrote it against the pure summary-based importing, it was needed/helpful. I'll remove for now.
Any suggestion on how to test that? This is basically like testing opt -O3?
I'll add a check and error.
I don't really understand what you mean by this, particularly the last sentence about using extra space during the link without any limit. Isn't this set/used during the link (which is running the ThinLTO steps in process)?
Ah, so a clarifying comment would help then, since I misunderstood. It sounds like the modules will be checked for expiration (and pruned from the cache) at the pruning interval. So with the default pruning interval of -1, the expiration is unused? I took these to be separate, complementary mechanisms for keeping the cache size down. Another option beyond just documenting the interactions well is to combine these parameters into a single interface that takes both the pruning interval and the expiration.
Not sure, maybe just invoke this stage in your test after the importing action to make sure it succeeds (without necessarily checking for any specific optimization)?
Sorry for the late entry... Some of my questions may have already been answered.
Does this handle commons properly?
Can we theoretically have a mixture of opt levels? -O2/-Os? Should I be able to respect a per-library opt level?
I do not know if it is relevant at this point, but for what it's worth - my IR might have metadata that changes opt/codegen.
Default to /tmp?
Yes! ...and default should probably be O2...
So I assume I should be able to control all of those...
Sorry, I'm missing something - why is this unconditional?
In general - don't you want to verify modules as they progress through the stages? I do it in regular LTO and it did help on more than one occasion :)
+1 on Teresa's point about llvm-lto.
Thanks for all the comments!
(Please see inline for the discussion)
Can you be more specific? I'm not sure what is specific to commons in this respect.
Ideally I think we'd want this parameter to be recorded in the bitcode itself, what do you think?
Here it is about the *codegen* opt level though, it won't impact the optimizer.
slarin: how does it play with cross-module importing? If a function is defined in a module compiled with O3 but imported in a module compiled with O2?
The way the client is enabling dumping temporaries is by providing a path.
Yes, but we don't have an interface for these as of today. A serialization of the PMB options in the bitcode would help.
As mentioned above, this is conditional in the saveTempBitcode function itself, which starts with
if (SaveTempsDir.empty()) return;
We'll always verify the module once at the beginning of the optimizer pipeline, but I guess we could do it more frequently in assert builds.
This is done in optimize() (see ThinLTOCodeGenerator.cpp, line 143: PMB.VerifyInput = true;)
I probably don't fully understand how this list should work... but this is not the proper place to figure it out - please ignore this comment for now.
...yes. My target has codegen properties that are exposed to a user, which might produce drastically different results if not set properly, but once again, ideally it should come from bitcode.
If I know the settings for each module I can handle these situations in a platform-specific way. Besides rough optimization levels, I have different addressing modes used in different modules, and I might choose not to mix certain features at all... At this point my main concern seems to be revolving around general LTO issues (like mixing different optimization scopes into one) and might not be "thin"-LTO specific. We had to jump through hoops for regular LTO, and I see a very similar set of issues being designed in here as well...
Thanks for the great review. Hopefully I didn't forget anything with this update.
Tried to make the doc very explicit, let me know what you think.
It is supposed to be called once indeed. I think we can update the implementation in the future if needed.
So just tested:
With lazy loading of metadata: getLazyBitcodeModule takes 237ms and materializeMetadata takes 74ms (total 314ms)
So no perf diff.
I probably don't see any difference in memory because most metadata will leak to the context and be freed with the Module. So there is no real impact on peak memory (it is just delayed a little bit).
Done, it was valuable :)
Thanks for adding the description! I do think that the formula for the new cache size is not right, see suggested fix below.
A couple of misc comment typos, but LGTM once the above is addressed.
"left over half the available space"? (i.e. add "half") Either that or "left over the free space" (since AvailableSpace = FreeSpace + CacheSize)?
This doesn't seem right. I think P should be divided by 100 and not multiplied by it, since it is a percentage. And I think the percentage needs to be multiplied by something, not divided by AvailableSpace?
Should this be:
since it is described as the percentage of available space used for the cache.
For this and the above suggestion, any change needs to be replicated below in ThinLTOCodeGenerator.h.
move 'an' to next line.
Ok, thanks for checking!