This is an archive of the discontinued LLVM Phabricator instance.

[x86, CGP] increase memcmp() expansion up to 4 load pairs
ClosedPublic

Authored by RKSimon on Jul 6 2017, 9:50 AM.

Details

Summary

It should be a win to avoid going out to the system library for all small memcmp() calls by expanding them with scalar ops. For 32-bit x86, that covers almost everything up to 16 bytes. For 64-bit, that doubles because we can do 8-byte loads.
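As a concrete illustration of what the expansion buys (a minimal C sketch of the idea, not the patch's actual CGP-generated IR; the helper name is made up): on a 64-bit target, an equality-only memcmp(x, y, 8) == 0 becomes one 8-byte load per side plus a single compare, instead of a library call.

```c
#include <stdint.h>
#include <string.h>

/* Hedged sketch (not the actual CGP output): an equality-only
 * memcmp(x, y, 8) on a 64-bit target reduces to one 8-byte load
 * per side and a single compare. memcpy of a known small size
 * folds to a plain load, so this is also safe for unaligned
 * pointers. */
int memcmp8_eq(const void *x, const void *y) {
    uint64_t a, b;
    memcpy(&a, x, 8);
    memcpy(&b, y, 8);
    return a == b;  /* nonzero iff all 8 bytes match */
}
```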

Notes:

  1. I don't have a strong opinion about the -Os behavior. Is 4 loads too much code for that case? It's effectively a question of how much we trust the system implementation. Linux and macOS (and presumably Windows, but I did not test it) have optimized memcmp() code for x86, so it's probably not bad either way. PPC uses 8/4 as its defaults for these. We do not expand at all for -Oz.
  2. There are still potential improvements to make to the CGP expansion IR and/or lowering, such as avoiding select-of-constants (D34904) and not zext'ing to the max load type before comparing.
  3. We have special-case SSE/AVX codegen for (memcmp(x, y, 16/32) == 0) that will no longer be produced after this patch. I've shown the experimental justification for that change in PR33329:

https://bugs.llvm.org/show_bug.cgi?id=33329#c12
TL;DR: While the vector code is a likely winner, we can't guarantee that it's a winner in all cases on all CPUs, so I'm willing to sacrifice it for the greater good of expanding all small memcmp() calls. If we want to resurrect that codegen, it can be done by adjusting the CGP params or by poking a hole to let those cases fall through the CGP expansion.

  4. I added the cmov attribute to the 32-bit codegen test because it removes some noise for that file. I think the intent of the SSE vs. no-SSE runs is to show the potential difference for the 16- and 32-byte cases rather than the lack of cmov (cmov has been available on all CPUs since SSE1, which is why it shows up automatically with -mattr=sse2).
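For reference, the special-case SSE codegen for the 16-byte equality case that this patch gives up is roughly equivalent to the following intrinsics sketch (a hand-written approximation with a made-up helper name, not the compiler's literal output; requires x86 SSE2): two 16-byte vector loads, a byte-wise compare, and a mask check.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Roughly what the special-cased codegen did for
 * memcmp(x, y, 16) == 0: compare all 16 bytes at once and check
 * that every byte position matched. */
int memcmp16_eq(const void *x, const void *y) {
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    __m128i eq = _mm_cmpeq_epi8(a, b);       /* 0xFF where bytes are equal */
    return _mm_movemask_epi8(eq) == 0xFFFF;  /* did all 16 lanes match? */
}
```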

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Jul 6 2017, 9:50 AM
RKSimon commandeered this revision.Jul 18 2017, 5:35 AM
RKSimon edited reviewers, added: spatel; removed: RKSimon.
RKSimon edited edge metadata.

Commandeering as @spatel is on holiday.

Does anyone have any reservations about accepting this? We're keen to get this in for the imminent release branch if at all possible.

courbet accepted this revision.Jul 18 2017, 6:25 AM

Commandeering as @spatel is on holiday.

Does anyone have any reservations about accepting this?

As mentioned before, I'm excited by this change and I'm convinced it's the right approach. I'm not sure about all the implications though, so it'd be great if other reviewers could take a look too.

This revision is now accepted and ready to land.Jul 18 2017, 6:25 AM
joerg added a subscriber: joerg.Jul 18 2017, 6:37 AM

i386: code requires three push instructions + call + potential stack cleanup.
x86_64: code requires three register loads + call

memcmp expansion requires roughly three instructions per unit of expansion. I'd say one unit of expansion for minsize, two for optsize, and four for normal optimization sounds about right, correct? That includes a small bonus for less register clobbering under size optimization. Maybe that's too pessimistic, though.

i386: code requires three push instructions + call + potential stack cleanup.
x86_64: code requires three register loads + call

memcmp expansion requires roughly three instructions per unit of expansion. I'd say one unit of expansion for minsize, two for optsize, and four for normal optimization sounds about right, correct? That includes a small bonus for less register clobbering under size optimization. Maybe that's too pessimistic, though.

Agreed, I'll limit optsize to two load-compare sequences instead of four; minsize is hard-coded to call memcmp, which is beyond the scope of this patch.
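That conclusion maps onto the memcmp-expansion knobs in the target lowering setup. The fragment below is a sketch of the resulting policy, not the verbatim committed diff; the field names are the TargetLowering members that control CGP memcmp expansion, and the values come from the discussion above (4 load pairs at normal -O levels, 2 under optsize).

```cpp
// In X86ISelLowering.cpp (sketch of the intent, not the exact diff):
// allow up to 4 load pairs at normal optimization levels, but only
// 2 under optsize. -Oz (minsize) keeps calling the library memcmp
// and is unaffected by these knobs.
MaxLoadsPerMemcmp = 4;
MaxLoadsPerMemcmpOptSize = 2;
```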

This revision was automatically updated to reflect the committed changes.