Page MenuHomePhabricator

[PGO] Remove data races on Data->Values field

Authored by davidxl on Dec 4 2015, 8:00 PM.



The Values field of the per-function data structure is overloaded for two different purposes. At runtime time, the field points to the in-memory value profile data, but when the data is serialized to disk, its value will be changed to point to internals of value profile data's serialization buffer (continuous) -- which are later relocated during profile reading.

Current implementation simply overwrites the pointer during profile dumping -- this won't work well with multi-threaded program -- after Values field is modified, runtime code may either crash or corrupt the data. Zerolized Value field may also be re-initialized with addresses which will be garbage in reading time. Introducing synchronization mechanism will greatly affect runtime performance.

The solution is simple -- instead of overwriting the "Data" data structure, the per-func data array will be copied to a small buffer and written out after the Values are fixed up in the buffer. The Data's Values field will be kept intact. There is no need to clear the memory either. In some use cases, the user code will just clear the counters and all the data allocated for VP can be reused.

Diff Detail

Event Timeline

davidxl updated this revision to Diff 41977.Dec 4 2015, 8:00 PM
davidxl retitled this revision from to [PGO] Remove data races on Data->Values field.
davidxl updated this object.
davidxl added a reviewer: betulb.
davidxl added a subscriber: llvm-commits.
davidxl updated this revision to Diff 42457.Dec 10 2015, 12:39 PM

Patch rebase.

betulb added inline comments.Dec 10 2015, 2:25 PM

storek -> store ?


I do not think we really do need a DataValuePtrs array. I like the direction taken by the removal of the update of the Values field of the ProfileData struct, but I do not think this implementation makes the overall need for synchronization any less due to the global declaration of the DataValuePtr's array. This also resurfaces the comments made for a previous implementation i.e. "depending on the implicit ordering of the ProfileData nodes in memory for reading of ValueData nodes". Value profiling data is readable w/o this indirection. I do not think there is any need for this array.

DataValuePtrs won't introduce data races as it is only accessed by the dumping thread.

Other than that, I admit I don't like introducing this overhead either. I will prepare a different patch for this.

davidxl updated this revision to Diff 42541.Dec 11 2015, 10:32 AM

[Update the patch according to Betul's feedback)

By removing the dependency on ->Values field from the reader, the runtime patch now becomes quite clean. Lots of unnecessary things are removed:

  1. The value profile data no longer needs to be put into an contiguous array before writing (as there will no relocation needed at read time)
  2. This removes the need to compute the total size ahead of time.
  3. It eliminates the problem of possible dropping of hot value sites due to updates from other threads
  4. It eliminates the need to reserve extra bytes for buffer, thus the need to set reserved space at runtime
  5. It greatly simplifies dumping -- there will be no need for two pass data collection -- one for total size, one for actual data collect.

The only major interface change is the data gathering interface -- it will now gather an array of pointers to ValueProfData objects instead of a pointer to the contiguous array.

Any more comments? since this is a bug fix with good cleanups, I'd like to get this in sooner.

betulb added inline comments.Dec 17 2015, 3:40 PM

As an overall comment, I'm not too fond of using macros this heavily. Specific to this line, DEF_VALUE_RECORD is used only once and there is a macro definition for its contents. It's not necessary. Secondly, R is passed to the macro w/o being declared in this context. I'd rather have preferred that R is declared first and then passed to the macro (if macro usage is ever a requisite).


I think there is an overhead of making the syscall to write as many times as there are profile data variables. I think it was one of the reasons why the other profiling constants/variables were merged into a single global variable and into linker sections to reduce the many iterations needed to write the data.

davidxl added inline comments.Dec 17 2015, 3:46 PM

yes, after the cleanup, this one can be removed -- as it is not reused anywhere else.


We can do the write in batches of course. I will do some measurement to see if that is needed.

betulb added inline comments.Dec 17 2015, 4:35 PM

Is there a purpose for not directly using free, assigning it to a pointer and calling it through the function pointer?


Is there any need to return S in this implementation? We may also remove the need to pass the ValueDataSize around.


Instead of ValueDataSize, why not check for if ValueDataBegin is null?

davidxl added inline comments.Dec 17 2015, 4:45 PM

For the record, this macro was also introduced to 'mimic' C++ where variable declaration and construction are done together.


good point. It was needed to indicate if there is vp data to be written -- but the rewrite makes it not necessary. Will remove.


yes. Will change.

Regarding performance, fwrite actually buffers the data so the number of system calls to write is actually much fewer than the number of Data entry - the overall time is dominated by IO, not system calls.

Here is the design of the stress testing:

  1. total number of value data entries to be written out : 3 million
  2. total size of the value data 1.6G -- this is way larger than an average program can produce -- for instance clang's profile data raw size is about 100M

Test machine is a sandybridge machine.

a) write out VP data one by one without batching

  • total number of calls to fwrite: 3M,
  • total number of calls to write: ~390K.
  • real time: ~12s; sys time: ~2.5s

b) write out VP data in batches -- batch size is 1024 (i.e, copy 1024 VP data into a buffer and write out)

  • total number of calls to fwrite: 3K
  • total number of calls to write: ~6K (yes, it is more than calls to fwrite -- large a very large write can be split into smaller chunks).
  • real time: ~12s, sys time: ~2s

The savings from reduced number of sys calls is not much.

In another experiment, /dev/null is used as the output to remove IO.

a) for non batch case

  • real time: ~1.6s (average of 10 runs)
  • sys time: ~0.85s (average of 10 runs)

b) with batch:

  • real time: ~1.7s (average 10 runs)
  • sys time: ~0.73s (average of 10 runs).

Note that b has addition cost of memory copy

Based on the above data, we can probably go with non-batch for simplicity for now. The batch write can be easily added later.

davidxl updated this revision to Diff 43205.Dec 17 2015, 6:41 PM

Updated the patch according to Betul's feedback.

(VP gather interface is cleaned up -- the size is still needed by the header.)

davidxl updated this revision to Diff 43477.Dec 22 2015, 1:38 PM

Reimplement VP data writing using the Buffer Writer API. With the new change, the VP writing buffer size also becomes configurable with an environment variable.

davidxl updated this revision to Diff 43478.Dec 22 2015, 1:42 PM

Remove a redundant check.

betulb edited edge metadata.Dec 23 2015, 5:28 PM

LGTM overall.


nullptr check for ValueDataSize.


1 -> sizeof(uint8_t) .

sizeof(char), sizeof(uint8_t) and imm constant value 1 are all interspersed throughout the code. sizeof(uint8_t) should replace all necessary locations to be consistent throughout the code. Similarly both char and uint8_t as types are used interchangeably throughout the code. Using one of the two consistently is preferable.

davidxl updated this revision to Diff 43712.Dec 28 2015, 11:12 PM
davidxl edited edge metadata.

Address review comments from Betul

This revision was automatically updated to reflect the committed changes.