This is an archive of the discontinued LLVM Phabricator instance.

[COFF] Add and use a zero-copy tokenizer for .drectve
ClosedPublic

Authored by rnk on May 1 2020, 1:40 PM.

Details

Summary

This generalizes the main Windows command line tokenizer to be able to
produce StringRef substrings as well as freshly copied C strings. The
implementation is still shared with the normal tokenizer, which is
important, because we have unit tests for that.

.drective sections can be very long. They can potentially list up to
every symbol in the object file by name. It is worth avoiding these
string copies.

This saves a lot of memory when linking chrome.dll with PGO
instrumentation:

BEFORE      AFTER      % IMP

peak memory: 6657.76MB 4983.54MB -25%
real: 4m30.875s 2m26.250s -46%

The time improvement may not be real, my machine was noisy while running
this, but that the peak memory usage improvement should be real.

This change may also help apps that heavily use dllexport annotations,
because those also use linker directives in object files. Apps that do
not use many directives are unlikely to be affected.

Diff Detail

Event Timeline

rnk created this revision.May 1 2020, 1:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 1 2020, 1:40 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
thakis accepted this revision.May 1 2020, 4:18 PM

Nice!

I don't believe the time improvement unless more data is collected, but this is still a very nice win :)

How involved would it be to move all callers to this new style api (posix too, I suppose)?

llvm/lib/Support/CommandLine.cpp
973

nit: Maybe add a comment along the lines of "// The token required unquoting, so it must always be save, regardless of AlwaysCopy".

This revision is now accepted and ready to land.May 1 2020, 4:18 PM
rnk marked an inline comment as done.May 1 2020, 7:18 PM

How involved would it be to move all callers to this new style api (posix too, I suppose)?

Very involved. I tried it. :)

The tokenizing API is mostly used for response file expansion, which is done before flag parsing. Both LLVM command line parsing libraries expect to see C strings, because they are set up to work on a raw argv list. I started StringRef-izing them, but it never ends.

The other thing is that, if we did zero-copy response file expansion, we would need to extend the lifetime of the response file MemoryBuffers. This is kind of cool in that we save one copy, but normally we read the response file and then throw away the memory, so there is no total reduction memory usage from updating the API.

Lastly, there is some inherent risk to passing around non-null-terminated StringRefs. Someone might call .data() on it to get a C string, but if it comes from a response file, it won't be one.

llvm/lib/Support/CommandLine.cpp
973

Will do.

This revision was automatically updated to reflect the committed changes.