Page MenuHomePhabricator

[profile] Add a mode to continuously sync counter updates to a file
ClosedPublic

Authored by vsk on Oct 2 2019, 11:39 AM.

Details

Summary

Add support for continuously syncing profile counter updates to a file.

The motivation for this is that programs do not always exit cleanly. On
iOS, for example, programs are usually killed via a signal from the OS.
Running atexit() handlers after catching a signal is unreliable, so some
method for progressively writing out profile data is necessary.

The approach taken here is to mmap() the __llvm_prf_cnts section onto
a raw profile. To do this, the linker must page-align the counter and
data sections, and the runtime must ensure that counters are mapped to a
page-aligned offset within a raw profile.

Continuous mode is (for the moment) incompatible with the online merging
mode. This limitation is lifted in https://reviews.llvm.org/D69586.

Continuous mode is also (for the moment) incompatible with value
profiling, as I'm not sure whether there is interest in this and the
implementation may be tricky.

As I have not been able to test extensively on non-Darwin platforms,
only Darwin support is included for the moment. However, continuous mode
may "just work" without modification on Linux and some UNIX-likes. AIUI
the default value for the GNU linker's --section-alignment flag is set
to the page size on many systems. This appears to be true for LLD as
well, as its no_nmagic option is on by default. Continuous mode will
not "just work" on Fuchsia or Windows, as it's not possible to mmap() a
section on these platforms. There is a proposal to add a layer of
indirection to the profile instrumentation to support these platforms.

rdar://54210980

Diff Detail

Event Timeline

vsk created this revision.Oct 2 2019, 11:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 2 2019, 11:39 AM
Herald added a subscriber: hiraditya. · View Herald Transcript

Not sure it belongs in this CL specifically, but would it be possible to later add a mode that defers the initialization of the mmap'd file until specifically requested by the process?

compiler-rt/lib/profile/InstrProfiling.h
179

I'm afraid I don't follow, could you explain in more detail why changing the filename is a problem?

vsk added a comment.Oct 2 2019, 12:51 PM

Not sure it belongs in this CL specifically, but would it be possible to later add a mode that defers the initialization of the mmap'd file until specifically requested by the process?

This sounds tricky to me, as the initialization requires some work to be done in each copy of the profile runtime linked into the process.

Could you clarify whether you're referring to processes which use the profiling runtime's static initializer, or not? My understanding is that for processes which do use the runtime's static initializer, it's always desirable to set up the mmap'd file in that initializer. If that's not true, I'd appreciate a counterexample. For processes that don't use the runtime's static initializer, the situation (at Apple, at least) is typically that it's a kernel extension or bare-metal piece of firmware, and the continuous mode isn't really applicable anyway.

compiler-rt/lib/profile/InstrProfiling.h
179

The problem is that there are N copies of __llvm_profile_set_filename in a process, one for each instrumented image (1 executable + (N-1) DSOs). When one copy of __llvm_profile_set_filename is called, it can change the filename visible to all images because of weak symbol coalescing. However, in continuous mode, that is not sufficient. What's needed is to remap the __llvm_prf_cnts -> FILE mappings in each image. And a call to __llvm_profile_set_filename cannot achieve that. At least, not unless each image publishes its section mappings -- imho that level of complexity would be excessive.

In D68351#1691915, @vsk wrote:

Not sure it belongs in this CL specifically, but would it be possible to later add a mode that defers the initialization of the mmap'd file until specifically requested by the process?

This sounds tricky to me, as the initialization requires some work to be done in each copy of the profile runtime linked into the process.

Could you clarify whether you're referring to processes which use the profiling runtime's static initializer, or not? My understanding is that for processes which do use the runtime's static initializer, it's always desirable to set up the mmap'd file in that initializer. If that's not true, I'd appreciate a counterexample. For processes that don't use the runtime's static initializer, the situation (at Apple, at least) is typically that it's a kernel extension or bare-metal piece of firmware, and the continuous mode isn't really applicable anyway.

We do use the static initializer, but in our use case the process doesn't necessarily know where to write coverage data until some other application-specific code has run. Ideally, I would like to be able to have the instrumented process receive a shared memory buffer from another process, do continuous profile sync into that memory region, and not care whether it is file-backed or anonymous mapping. So it would be great if application code could give the runtime some sort of handle to that object [1] and tell the runtime to use that space for profiling.

[1] This could be just a pointer to a buffer that the application has already mmap'd, or it could be a file descriptor that the runtime is responsible for calling mmap on. Not sure which one would make more sense.

vsk added a comment.Oct 2 2019, 1:29 PM
In D68351#1691915, @vsk wrote:

Not sure it belongs in this CL specifically, but would it be possible to later add a mode that defers the initialization of the mmap'd file until specifically requested by the process?

This sounds tricky to me, as the initialization requires some work to be done in each copy of the profile runtime linked into the process.

Could you clarify whether you're referring to processes which use the profiling runtime's static initializer, or not? My understanding is that for processes which do use the runtime's static initializer, it's always desirable to set up the mmap'd file in that initializer. If that's not true, I'd appreciate a counterexample. For processes that don't use the runtime's static initializer, the situation (at Apple, at least) is typically that it's a kernel extension or bare-metal piece of firmware, and the continuous mode isn't really applicable anyway.

We do use the static initializer, but in our use case the process doesn't necessarily know where to write coverage data until some other application-specific code has run.

I see, thanks for explaining.

Ideally, I would like to be able to have the instrumented process receive a shared memory buffer from another process, do continuous profile sync into that memory region, and not care whether it is file-backed or anonymous mapping. So it would be great if application code could give the runtime some sort of handle to that object [1] and tell the runtime to use that space for profiling.

I see. Is the lack of support for value profiling a concern, or would this just be for code coverage?

Regardless, I think it's possible to achieve this. If __llvm_profile_set_file_object is called in every copy of the profile runtime in a process, only minor changes to this patch would be required [1]. Judging by how setProfileFile is implemented, I suspect this is the case. If not, a much more invasive change would be needed.

[1] You could do this by recognizing a "%dc" filename pattern for "delayed continuous sync" mode (or some equivalent). The static initializer would detect this and defer the mmap() initialization. Later, when __llvm_profile_set_file_object is called, continuous mode can be initialized.

[1] This could be just a pointer to a buffer that the application has already mmap'd, or it could be a file descriptor that the runtime is responsible for calling mmap on. Not sure which one would make more sense.

I'm not sure it's possible to mmap() the contents of a section using MAP_FIXED over another non file-backed VM region. You may need to experiment with this.

sajjadm added a comment.EditedOct 2 2019, 1:47 PM
In D68351#1691969, @vsk wrote:
In D68351#1691915, @vsk wrote:

Not sure it belongs in this CL specifically, but would it be possible to later add a mode that defers the initialization of the mmap'd file until specifically requested by the process?

This sounds tricky to me, as the initialization requires some work to be done in each copy of the profile runtime linked into the process.

Could you clarify whether you're referring to processes which use the profiling runtime's static initializer, or not? My understanding is that for processes which do use the runtime's static initializer, it's always desirable to set up the mmap'd file in that initializer. If that's not true, I'd appreciate a counterexample. For processes that don't use the runtime's static initializer, the situation (at Apple, at least) is typically that it's a kernel extension or bare-metal piece of firmware, and the continuous mode isn't really applicable anyway.

We do use the static initializer, but in our use case the process doesn't necessarily know where to write coverage data until some other application-specific code has run.

I see, thanks for explaining.

Ideally, I would like to be able to have the instrumented process receive a shared memory buffer from another process, do continuous profile sync into that memory region, and not care whether it is file-backed or anonymous mapping. So it would be great if application code could give the runtime some sort of handle to that object [1] and tell the runtime to use that space for profiling.

I see. Is the lack of support for value profiling a concern, or would this just be for code coverage?

This would just be for code coverage.

Regardless, I think it's possible to achieve this. If __llvm_profile_set_file_object is called in every copy of the profile runtime in a process, only minor changes to this patch would be required [1]. Judging by how setProfileFile is implemented, I suspect this is the case. If not, a much more invasive change would be needed.

edit: deleted question that was answered in an inline comment

[1] You could do this by recognizing a "%dc" filename pattern for "delayed continuous sync" mode (or some equivalent). The static initializer would detect this and defer the mmap() initialization. Later, when __llvm_profile_set_file_object is called, continuous mode can be initialized.

[1] This could be just a pointer to a buffer that the application has already mmap'd, or it could be a file descriptor that the runtime is responsible for calling mmap on. Not sure which one would make more sense.

I'm not sure it's possible to mmap() the contents of a section using MAP_FIXED over another non file-backed VM region. You may need to experiment with this.

Good point, I'll look into it.

it seems like this approach can be extended down the road without much effort, so that's great. Thanks for humoring my questions. :)

vsk added a comment.Oct 2 2019, 1:52 PM
In D68351#1691969, @vsk wrote:
In D68351#1691915, @vsk wrote:

Not sure it belongs in this CL specifically, but would it be possible to later add a mode that defers the initialization of the mmap'd file until specifically requested by the process?

This sounds tricky to me, as the initialization requires some work to be done in each copy of the profile runtime linked into the process.

Could you clarify whether you're referring to processes which use the profiling runtime's static initializer, or not? My understanding is that for processes which do use the runtime's static initializer, it's always desirable to set up the mmap'd file in that initializer. If that's not true, I'd appreciate a counterexample. For processes that don't use the runtime's static initializer, the situation (at Apple, at least) is typically that it's a kernel extension or bare-metal piece of firmware, and the continuous mode isn't really applicable anyway.

We do use the static initializer, but in our use case the process doesn't necessarily know where to write coverage data until some other application-specific code has run.

I see, thanks for explaining.

Ideally, I would like to be able to have the instrumented process receive a shared memory buffer from another process, do continuous profile sync into that memory region, and not care whether it is file-backed or anonymous mapping. So it would be great if application code could give the runtime some sort of handle to that object [1] and tell the runtime to use that space for profiling.

I see. Is the lack of support for value profiling a concern, or would this just be for code coverage?

This would just be for code coverage.

Regardless, I think it's possible to achieve this. If __llvm_profile_set_file_object is called in every copy of the profile runtime in a process, only minor changes to this patch would be required [1]. Judging by how setProfileFile is implemented, I suspect this is the case. If not, a much more invasive change would be needed.

How does one get multiple copies of the runtime in the same process? Does that happen when you load a shared object file that was built with instrumentation?

Yes, exactly.

I think in our use case all our instrumented code is statically linked together, so that might not be a problem for me.

Oh, that makes sense then. There would just need to be a comment near __llvm_profile_set_file_object to explain that it only works on a per-image basis (i.e. that clients with instrumented DSOs should exercise caution with it).

[1] You could do this by recognizing a "%dc" filename pattern for "delayed continuous sync" mode (or some equivalent). The static initializer would detect this and defer the mmap() initialization. Later, when __llvm_profile_set_file_object is called, continuous mode can be initialized.

[1] This could be just a pointer to a buffer that the application has already mmap'd, or it could be a file descriptor that the runtime is responsible for calling mmap on. Not sure which one would make more sense.

I'm not sure it's possible to mmap() the contents of a section using MAP_FIXED over another non file-backed VM region. You may need to experiment with this.

Good point, I'll look into it.

it seems like this approach can be extended down the road without much effort, so that's great. Thanks for humoring my questions. :)

I think so, no worries!

ejvaughan added inline comments.Oct 2 2019, 4:40 PM
compiler-rt/lib/profile/InstrProfilingFile.c
406

Does FilenameBuf need to be free'd, or is it stack allocated?

vsk marked an inline comment as done.Oct 2 2019, 4:55 PM
vsk added inline comments.
compiler-rt/lib/profile/InstrProfilingFile.c
406

It's backed by a call to __builtin_alloca, which should just be a simple stack allocation.

Dor1s added a comment.Oct 2 2019, 10:26 PM

@vsk Vedant, sorry I'm a bit swamped right now and may not be able to review this promptly. Please let me know If my feedback is important here, I'll try to make up some time in that case. Sorry!

vsk added a comment.Oct 3 2019, 10:45 AM

@vsk Vedant, sorry I'm a bit swamped right now and may not be able to review this promptly. Please let me know If my feedback is important here, I'll try to make up some time in that case. Sorry!

No problem, I understand. Any feedback is appreciated.

+petr who has similar needs for Fuchia platform

phosek added a comment.Oct 3 2019, 2:49 PM

+petr who has similar needs for Fuchia platform

Thank you for including me, we have exactly the same use case in Fuchsia and this is the solution I've initially considered and been experimenting with locally based on the discussion with @davidxl. However, after internal discussion we've decided against this solution for two reasons:

  1. This requires the ability to mmap the output file over the (portion of) binary which is something we don't allow in Fuchsia for security reasons; once all modules have been mapped in by the dynamic linker, we don't allow any further changes to their mapping and using this solution would require special dynamic linker which is possible but highly undesirable.
  2. It introduces bloat to both the binary and the output file. It also complicates the runtime implementation due to alignment and padding requirements.

The alternative solution we've came up and that I've now been implementing is to change the instrumentation to allow extra level of indirection. Concretely, the instrumentation conceptually does the following: c = __profc_foo[idx]; c++; __profc_foo[idx] = c. We'd like to change this c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) = c where bias is a global variable set by the runtime to be the offset between the __llvm_prf_cnts section and the corresponding location in the file that's mmapped in. Initially, that offset would be 0 which would result in exactly the same behavior as today, but the runtime can mmap the output file into address space and then change the offset to make the counters be continuously updated.

The advantage of this solution is that there are no changes needed to the profile format. It also doesn't require mmapping the output file over the binary, the output file can be mmapped anywhere in the address space. The disadvantage is extra overhead since instrumentation is going to be slightly more complicated, although we don't have any numbers yet to quantify how much slower it's going to be. The implementation should be fairly minimal and my tentative plan was to gate it on compiler switch, so it wouldn't affect existing in any way (modulo introducing one new variable in the runtime to hold the bias). I'm hoping to have the prototype ready and uploaded for review within the next few days. What's your opinion on this idea? Would this be something that you'd be interested in as an alternative approach?

vsk added a comment.Oct 3 2019, 4:16 PM

+petr who has similar needs for Fuchia platform

Thank you for including me, we have exactly the same use case in Fuchsia and this is the solution I've initially considered and been experimenting with locally based on the discussion with @davidxl. However, after internal discussion we've decided against this solution for two reasons:

  1. This requires the ability to mmap the output file over the (portion of) binary which is something we don't allow in Fuchsia for security reasons; once all modules have been mapped in by the dynamic linker, we don't allow any further changes to their mapping and using this solution would require special dynamic linker which is possible but highly undesirable.
  2. It introduces bloat to both the binary and the output file. It also complicates the runtime implementation due to alignment and padding requirements.

The alternative solution we've came up and that I've now been implementing is to change the instrumentation to allow extra level of indirection. Concretely, the instrumentation conceptually does the following: c = __profc_foo[idx]; c++; __profc_foo[idx] = c. We'd like to change this c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) = c where bias is a global variable set by the runtime to be the offset between the __llvm_prf_cnts section and the corresponding location in the file that's mmapped in. Initially, that offset would be 0 which would result in exactly the same behavior as today, but the runtime can mmap the output file into address space and then change the offset to make the counters be continuously updated.

The advantage of this solution is that there are no changes needed to the profile format. It also doesn't require mmapping the output file over the binary, the output file can be mmapped anywhere in the address space. The disadvantage is extra overhead since instrumentation is going to be slightly more complicated, although we don't have any numbers yet to quantify how much slower it's going to be. The implementation should be fairly minimal and my tentative plan was to gate it on compiler switch, so it wouldn't affect existing in any way (modulo introducing one new variable in the runtime to hold the bias). I'm hoping to have the prototype ready and uploaded for review within the next few days. What's your opinion on this idea? Would this be something that you'd be interested in as an alternative approach?

@phosek thank you for sharing this alternative. I hope you don't mind my calling this the bias method, after the term in the proposed instrumentation.

The TLDR is that I don't think the bias method is a good fit for Linux/Darwin. Imho, this method won't reduce the complexity of continuous mode, and its instrumentation overhead is likely to outweigh the savings from reduced section alignment. Let me try and justify these claims :).

First and most critically, note that (on Linux & Darwin, at least), mmap() requires that the file offset argument be page-aligned. So, I don't believe there's a way to avoid changing the raw profile format, or to avoid the concomitant complexity necessary to calculate padding bytes in the runtime. I don't see how the bias method solves this problem: this seems to be a fundamental limitation of mmap().

Second, note that the size overhead of counter increment instrumentation dwarfs the alignment overhead for __llvm_prf_{cnts,data} for all but the smallest of programs. Assuming 16K pages, in the worst case, the section alignment costs 32K per image. Assuming an increment can be encoded in 7 bytes as incq l___profc_main(%rip), this section alignment cost is equivalent to ~4700 increments. For comparison, this is roughly the number of counter increments in FileCheck.cpp.o. If the size overhead of counter instrumentation were to double, as I believe it would with the bias method, it would rapidly erase the memory savings from eliminating the section alignment requirement.

Third, I'm not sure I understand the security constraints in Fuchsia that make it undesirable to map the contents of a section onto a file. Is the concern that an attacker process may update the file, thereby changing the in-memory contents of the section? I'll note that on Darwin, at least, the kernel does not permit this kind of sharing. If an attacker process modifies a file on disk that has the in-memory contents of a section MAP_SHARED over it, the in-memory contents of the mapped section are not changed (I verified this by making a small modification to darwin-proof-of-concept.c). If there is some security aspect to the problem I'm missing here, could you please elaborate on it? I'm hoping that the bias method will not be necessary to support continuous mode on Fuchsia, but it seems this depends on the answer to the security question. Either way, istm that the high level approach taken in this patch as-written is a better fit for Linux/Darwin, and moreover probably provides support that can be used to implement the bias method.

I appreciate your feedback and think we can definitely work together to build some kind of common solution!

vsk updated this revision to Diff 225959.Oct 21 2019, 2:42 PM

Updates:

  • Rebase.
  • I've added a caveat about the need for a different kind of instrumentation to support continuous mode on Fuchsia/Windows. Based on Petr's experiment, it looks like the current instrumentation is a better fit for Darwin.
  • I've contacted the Darwin Runtime team about the possibility of supporting continuous mode + on-line merging mode, using a shared memory object that's mmap'd to disk (mmap() alone appears to be insufficient). If this is feasible, I would prefer to revisit it as a follow-up.

PTAL, thanks.

vsk edited the summary of this revision. (Show Details)Oct 21 2019, 3:51 PM
vsk edited the summary of this revision. (Show Details)Oct 29 2019, 3:10 PM
vsk added a subscriber: rnk.
vsk added a comment.Oct 29 2019, 3:15 PM

David was correct (& the Darwin team confirmed this) -- there is no reason for continuous sync mode to be incompatible with online profile merging. That limitation is lifted in D69586.

davidxl added inline comments.Oct 29 2019, 4:06 PM
clang/lib/Driver/ToolChains/Darwin.cpp
1119

The host tool may not know the page size of the target. Also why making the alignment 16K instead of 4K?

1155

The cost of alignment is always paid even when %c is not specified?

compiler-rt/lib/profile/InstrProfilingFile.c
104

emit warning messages?

compiler-rt/test/profile/ContinuousSyncMode/darwin-proof-of-concept.c
108

What does this loop verify?

vsk updated this revision to Diff 226991.Oct 29 2019, 4:36 PM
vsk marked 3 inline comments as done.
  • Add a warning in __llvm_profile_set_file_object, some clarifying comments and rebase.
clang/lib/Driver/ToolChains/Darwin.cpp
1119

The maximum supported page size for iPhone 6+ is 16K, and 4K for Macs. I thought that the simplicity of using a common constant outweighed the memory savings from reduced alignment, but am happy to make this tighter if you'd prefer.

1155

Yes, this is done because %c is expected to become the default in Xcode, and to remove the need for recompilation when switching %c on/off.

compiler-rt/test/profile/ContinuousSyncMode/darwin-proof-of-concept.c
108

I've added some clarifying comments. This is to verify that the data written after the counters is equal to the "data[]" array, i.e. {1, 2, 3}.

davidxl accepted this revision.Oct 31 2019, 2:03 PM

lgtm

This revision is now accepted and ready to land.Oct 31 2019, 2:03 PM
This revision was automatically updated to reflect the committed changes.
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptOct 31 2019, 4:17 PM
Herald added subscribers: Restricted Project, cfe-commits. · View Herald Transcript
dmajor added a subscriber: dmajor.Nov 19 2019, 12:24 PM
dmajor added inline comments.
compiler-rt/lib/profile/InstrProfilingFile.c
377

@vsk Doesn't this unconditionally break profiling on Fuschia/Windows? Should there at least be a check for __llvm_profile_is_continuous_mode_enabled() before erroring out? I'm currently testing my project's Windows build with trunk, and it's failing here even though we didn't enable continuous mode.

vsk marked an inline comment as done.Nov 19 2019, 12:32 PM
vsk added inline comments.
compiler-rt/lib/profile/InstrProfilingFile.c
377

Yes, sorry about the breakage, this should be addressed by 1aacf58819a27f428a46ce839a0ee797af03c1fd.