
[x86, MemCmpExpansion] allow 2 pairs of loads per block (PR33325)

Authored by spatel on Jan 3 2018, 1:31 PM.



This is the last step needed to fix PR33325.

We're trading branches and compares for loads and logic ops. This makes the code smaller and hopefully faster in most cases.
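As a rough illustration of that trade, here is a hand-written C sketch of a 16-byte equality check done both ways. It is not the IR or assembly that the expansion emits, and the names `load64`, `eq16_one_pair_per_block`, and `eq16_two_pairs_per_block` are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unaligned 8-byte load. */
static uint64_t load64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* One load pair per block: every 8-byte compare ends in its own branch. */
int eq16_one_pair_per_block(const void *a, const void *b) {
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    if (load64(pa) != load64(pb))
        return 0;
    if (load64(pa + 8) != load64(pb + 8))
        return 0;
    return 1;
}

/* Two load pairs per block: XOR each pair, OR the differences together,
   and do a single compare against zero at the end. */
int eq16_two_pairs_per_block(const void *a, const void *b) {
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    uint64_t d0 = load64(pa) ^ load64(pb);
    uint64_t d1 = load64(pa + 8) ^ load64(pb + 8);
    return (d0 | d1) == 0;
}
```

The second form always performs all four loads, but it replaces one compare-and-branch with an XOR/OR chain, which is the trade the summary describes.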

The 24-byte case shows an interesting construct: we load the trailing scalar elements into vector registers and generate the same pcmpeq+movmsk code that we expected for a pair of full vector elements (see the 32- and 64-byte tests).
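The 24-byte construct can be approximated by hand with SSE2 intrinsics. This is only a sketch of the idea, assuming an x86 target with SSE2; the function name `eq24_sse2` is hypothetical, and the instruction selection and ordering in the real test output may differ.

```c
#include <emmintrin.h> /* SSE2 intrinsics; x86 only */

/* Returns 1 if the 24 bytes at a and b are equal, 0 otherwise. */
int eq24_sse2(const void *a, const void *b) {
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    /* Bytes 0..15: one full unaligned vector load per side. */
    __m128i a_lo = _mm_loadu_si128((const __m128i *)pa);
    __m128i b_lo = _mm_loadu_si128((const __m128i *)pb);

    /* Bytes 16..23: the 8-byte scalar tail goes into the low half of a
       vector register; the upper half is zeroed on both sides. */
    __m128i a_hi = _mm_loadl_epi64((const __m128i *)(pa + 16));
    __m128i b_hi = _mm_loadl_epi64((const __m128i *)(pb + 16));

    /* Byte-wise equality on both pairs (pcmpeqb), combined with AND. */
    __m128i eq = _mm_and_si128(_mm_cmpeq_epi8(a_lo, b_lo),
                               _mm_cmpeq_epi8(a_hi, b_hi));

    /* pmovmskb: all 16 mask bits are set only if every byte matched. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Because the zeroed upper halves of the tail registers compare equal to each other, the tail behaves like a full vector element in the final mask test.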

Event Timeline

spatel created this revision. Jan 3 2018, 1:31 PM

Great! One minor comment.

3 ↗(On Diff #128553)

We're losing coverage for the case where we're testing for equality with only one load per block. What about adding a RUN line with
-memcmp-num-loads-per-block=1 to keep these alive?

spatel added inline comments. Jan 4 2018, 12:42 PM
3 ↗(On Diff #128553)

Yes, that sounds good. Update coming soon...

spatel updated this revision to Diff 128640. Jan 4 2018, 12:48 PM

Patch updated:
Added a RUN line to the IR test for expansion so we can see exactly how the IR differs between the old and new settings. I only did that for x64 because x32 would be the same in most cases.

courbet accepted this revision. Jan 4 2018, 11:28 PM
This revision is now accepted and ready to land. Jan 4 2018, 11:28 PM
This revision was automatically updated to reflect the committed changes.