Distributed-memory systems, from small clusters to supercomputers, use MPI (http://www.mpi-forum.org/) to manage inter-process communication. On large systems, a single job might consist of millions of simultaneously running processes. Jobs with millions of processes are rare, but tens of thousands are now common. Profiling such applications with our current infrastructure is difficult because it requires generating, and then merging, many thousands of files. On the distributed file systems typically used on these machines, creating so many files often causes performance problems, and the files are hard to manage. This problem can be eliminated by teaching compiler-rt to aggregate the profiling counters so that only one process writes out the combined profiling data for all processes. This patch adds that functionality whenever compiler-rt is built with an mpi.h header available.
Any well-formed MPI program calls MPI_Finalize to shut down its communication resources, which makes them unavailable by the time atexit handlers are called. The counters therefore have to be aggregated before MPI_Finalize completes. By design, the MPI functions used by MPI applications are provided as weak symbols, and the strong symbols are provided under PMPI_* names. Thus MPI_Finalize, by default, is a weak function that calls PMPI_Finalize. This lets tools hook the MPI functions used by applications while still being able to call the underlying implementation. The new file InstrProfilingReduce.c uses this mechanism: each process’s counters are aggregated onto a single process before the parallel environment is terminated.
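The hooking pattern described above can be sketched roughly as follows. This is an illustration only, not the contents of InstrProfilingReduce.c: the names sketch_counters, sketch_write_profile and the SKETCH_USE_REAL_MPI macro are hypothetical, and the single-process stand-ins exist only so that the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef SKETCH_USE_REAL_MPI
#include <mpi.h>
#else
/* Single-process stand-ins (assumption: only for building without MPI). */
typedef int MPI_Comm;
typedef int MPI_Datatype;
typedef int MPI_Op;
#define MPI_COMM_WORLD 0
#define MPI_UINT64_T 0
#define MPI_SUM 0
#define MPI_SUCCESS 0
#define MPI_IN_PLACE ((void *)1)
static int PMPI_Comm_rank(MPI_Comm comm, int *rank) {
  (void)comm;
  *rank = 0; /* the only process is rank 0 */
  return MPI_SUCCESS;
}
static int PMPI_Reduce(const void *sendbuf, void *recvbuf, int count,
                       MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm) {
  /* With one process and MPI_IN_PLACE, the sum is the buffer itself. */
  (void)sendbuf; (void)recvbuf; (void)count;
  (void)type; (void)op; (void)root; (void)comm;
  return MPI_SUCCESS;
}
static int PMPI_Finalize(void) { return MPI_SUCCESS; }
#endif

/* Hypothetical counter storage; the real runtime locates its counters
 * through its own data structures. */
static uint64_t sketch_counters[4];
static int sketch_wrote_profile = 0;

/* Hypothetical placeholder for writing the combined profile. */
static void sketch_write_profile(void) { sketch_wrote_profile = 1; }

/* Strong definition that takes precedence over the library's weak
 * MPI_Finalize; the real work is delegated to the PMPI_* entry points. */
int MPI_Finalize(void) {
  int rank = -1;
  const int n = (int)(sizeof sketch_counters / sizeof sketch_counters[0]);

  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  assert(rank >= 0);

  if (rank == 0) {
    /* Rank 0 receives the element-wise sum in place, then writes. */
    PMPI_Reduce(MPI_IN_PLACE, sketch_counters, n, MPI_UINT64_T, MPI_SUM, 0,
                MPI_COMM_WORLD);
    sketch_write_profile();
  } else {
    /* Other ranks only contribute their counters. */
    PMPI_Reduce(sketch_counters, NULL, n, MPI_UINT64_T, MPI_SUM, 0,
                MPI_COMM_WORLD);
  }
  return PMPI_Finalize();
}
```

The reduction has to happen before PMPI_Finalize is called, since no communication is possible afterwards.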
To ensure compatibility across applications and computing environments, we rely on the order in which libraries appear on Clang’s link line. Programs that use MPI functions are linked against the actual MPI library, since compiler-rt libraries are linked last. Programs that do not use MPI simply link against the stub function implementations in InstrProfilingStub.c.
To limit file writing to one process (for applications that properly call MPI_Finalize), we needed to change the way existing files are truncated. Currently, we use fopen to force the truncation of any pre-existing profiling output at application startup, and later open the file again to actually write the collected profiling data. When using MPI, we want only the process identified as MPI rank 0 to create any profiling outputs. The rank cannot be determined at startup, so we need to delay the initial creation of the profiling outputs until MPI_Finalize is called. To do this, we change the current behavior (at least on POSIX systems) so that startup truncates a profiling output only if it already exists (i.e. the file is opened without O_CREAT). If an output does not exist, we do not create an empty file; instead, the file is created later, when it is opened for writing. This seems like a generally beneficial change for all systems.
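A minimal sketch of the truncate-only-if-present behavior on POSIX, assuming a hypothetical helper name (sketch_truncate_if_exists) that is not taken from the patch:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Truncate `path` only if it already exists.
 * Returns 1 if the file was truncated, 0 if it does not exist (nothing is
 * created), and -1 on any other error. */
int sketch_truncate_if_exists(const char *path) {
  assert(path != NULL);

  /* No O_CREAT: open() fails with ENOENT instead of creating the file. */
  int fd = open(path, O_WRONLY | O_TRUNC);
  if (fd < 0)
    return errno == ENOENT ? 0 : -1;

  close(fd);
  return 1;
}
```

With this shape, a process that never writes a profile (any rank other than 0) leaves no empty file behind.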
Lastly, the test file instrprof-reduce.c was created to check that only the process assigned rank 0 produces a profiling file when MPI_Finalize is called. Tests were also run to check that profiling counts increase proportionally with the number of processes involved in a program.
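The proportionality expectation can be illustrated without MPI by simulating the ranks in a single process. This is a sketch only; sketch_sum_counters is a hypothetical helper, not part of the patch or of instrprof-reduce.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise sum of per-rank counter arrays into `total`.
 * `per_rank` holds num_ranks consecutive arrays of num_counters entries. */
void sketch_sum_counters(uint64_t *total, const uint64_t *per_rank,
                         size_t num_ranks, size_t num_counters) {
  assert(total != NULL && per_rank != NULL);

  for (size_t c = 0; c < num_counters; ++c)
    total[c] = 0;

  for (size_t r = 0; r < num_ranks; ++r)
    for (size_t c = 0; c < num_counters; ++c)
      total[c] += per_rank[r * num_counters + c];
}
```

If every rank executes the same code path, each combined counter is the single-process count multiplied by the number of ranks.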
Make the name conform to the 'name space' convention:
--> __llvm_profile_write_data