The initial motivation is providing fast, inline paths for memset and
memcpy with a dynamic size when that size happens to be small. Because
LLVM is *very* good at forming memset and memcpy out of raw loops and
many other constructs, it is especially important that these remain fast
even when used in circumstances where the library function call overhead
is unacceptably large.
The first attempt at addressing this was D35750, but that only
exacerbated the issue rather than fixing it.
It turns out that, at least for x86, we can emit a very minimal loop behind
a dynamic test on the size and dramatically improve performance for
sizes that happen to be small.
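As a rough source-level picture of the idea (not the IR the patch emits), the fast path behaves roughly like the sketch below. The function name and the threshold value are hypothetical and chosen only for illustration; the description above does not state the actual cutoff.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical cutoff for illustration only; not a value taken from the patch.
constexpr std::size_t kSmallSizeThreshold = 16;

// Behaves like memset, but handles small dynamic sizes with an inline loop
// instead of paying for the library call.
void *memset_with_inline_fast_path(void *dst, int value, std::size_t size) {
  // Dynamic test on the size: large sizes still go to the library routine.
  if (size > kSmallSizeThreshold)
    return std::memset(dst, value, size);

  // Minimal byte loop for small sizes. A size of zero simply falls through.
  unsigned char *p = static_cast<unsigned char *>(dst);
  for (std::size_t i = 0; i != size; ++i)
    p[i] = static_cast<unsigned char>(value);
  return dst;
}
```

Note that an optimizing compiler may recognize the loop in this sketch and turn it back into a memset call, which is the same hazard that motivates running the real transformation late.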
Making all of this work *well* requires a lot of careful logic:
- We need to analyze and discover scaling of the size fed to memset and memcpy.
- We can't widen the loop's memory accesses past the known alignment.
- We need to emit any loop with *exactly* the right IR to get efficient lowering from the backend.
- This transformation needs to run quite late so that other passes that try to "optimize" the loop do not perturb it.
- We need to avoid this in optsize and minsize functions.
- We need to generate checks for zero-length operations before the loop. This ends up being an even faster path.
- But we must not generate *redundant* checks, which means adding a mini predicate analysis just to find existing zero checks. These turn out to be incredibly common, because so many of these routines are created out of loops from which we have already extracted just such a predicate.
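To make a few of these points concrete, here is a minimal source-level sketch, assuming a memset whose byte size is an element count scaled by 4 and whose destination is 4-byte aligned. All names and the element cutoff are hypothetical; this illustrates the shape of the resulting code, not the IR the patch produces.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical cutoff, counted in elements rather than bytes.
constexpr std::size_t kMaxInlineElements = 6;

// Zero-fills `count` 32-bit elements. The byte size is count * 4, so the loop
// can step in 4-byte units, which is as wide as the 4-byte alignment allows.
void zero_fill_u32(std::uint32_t *dst, std::size_t count) {
  // Zero-length check before the loop: the fastest path of all.
  if (count == 0)
    return;

  // Larger operations still call the library routine.
  if (count > kMaxInlineElements) {
    std::memset(dst, 0, count * sizeof(std::uint32_t));
    return;
  }

  // count != 0 is already known here, so the loop can test at the bottom.
  std::size_t i = 0;
  do {
    dst[i] = 0; // One 4-byte store per iteration instead of four byte stores.
  } while (++i != count);
}

// A caller that already guards on the count. Once zero_fill_u32 is inlined
// here, its own count == 0 test is redundant, which is the situation the
// mini predicate analysis is meant to detect.
void clear_if_nonempty(std::uint32_t *dst, std::size_t count) {
  if (count != 0)
    zero_fill_u32(dst, count);
}
```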
There is still more we should do here, such as:
- Don't emit these for cold libcalls.
- Use value profile data (if available) to bias at least the branch weights and potentially the actual sizes.
However, for at least a few benchmarks here that end up hitting this very hard,
I'm seeing between 20% and 50% improvements already. Naturally, I'll be
gathering more data both on performance impact and code size impact, but
I wanted to go ahead and get this out for review.
nit: I find OpByteSize not intuitive. Perhaps DataByteWidth?