Before this change, XRay would conservatively patch sections of the code
one sled at a time. Profiling showed that this takes an inordinate
amount of time and cycles: for an instrumented clang binary, the
patching/unpatching routines alone accounted for about 4% of total
cycles, and that figure does not include the time spent in the kernel
servicing the mprotect calls issued in quick succession.
With this change, the number of mprotect calls drops from linear in the
number of instrumentation points to a small constant when patching all
the sleds through __xray_patch() or __xray_unpatch(). When calling
__xray_patch_function() or __xray_unpatch_function(), we now issue a
single mprotect call covering all the sleds of that function, at least
a 2x reduction in calls to mprotect.