This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Transform small unaligned memcmp calls used in zero equality tests
Needs Review · Public

Authored by msebor on Aug 30 2022, 11:00 AM.

Details

Reviewers
nikic
spatel
Summary

The transformation of memcmp calls used in equality tests with zero is restricted to arrays with power-of-two alignment and size, even though the middle end is capable of emitting efficient code for arrays that don't satisfy this restriction. As a result, some memcmp calls produce suboptimal code on strict-alignment targets even with the subsequent optimization in ExpandMemCmp.cpp.

The proposed change relaxes the restriction in the simplifier, letting even calls with unaligned arrays whose size is not a power of two expand to a series of loads followed by a single comparison (up to the maximum scalar register size). This might come with a small increase in the size of the emitted code, but in my testing with targets like mips64, riscv64, and sparcv9 the increase didn't seem excessive. The change might also result in slightly different code for targets that don't penalize unaligned accesses (because it preempts the transformation in ExpandMemCmp.cpp), but based on inspection the resulting object code appears comparably optimal.
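
As a rough illustration of the kind of call affected (the type and function below are made up for this example, not taken from the patch or its tests):

```cpp
#include <cstring>

// An odd-sized key: 5 bytes with no particular alignment guarantee.
struct Key { char bytes[5]; };

// Equality test via memcmp with a non-power-of-two size (5) and byte
// alignment.  Without the change the simplifier leaves the call alone;
// with it, the call can be folded into a few loads and a single wide
// comparison instead of a libcall.
bool sameKey(const Key *a, const Key *b) {
  return std::memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}
```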

Diff Detail

Event Timeline

msebor created this revision. · Aug 30 2022, 11:00 AM
Herald added a project: Restricted Project. · Aug 30 2022, 11:00 AM
msebor requested review of this revision. · Aug 30 2022, 11:00 AM
efriedma added a subscriber: efriedma.

To be clear, this is proposing to replace a memcmp of 8 bytes with up to 16 byte-sized loads on targets with strict alignment? That seems excessive to me.

I can see why you'd want to expand this earlier, but I think we need to be a bit more conservative. Maybe move it to AggressiveInstCombine, and use TTI to check the costs.
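
A minimal sketch of the kind of TTI gating being suggested, assuming it sat next to the fold; the helper and its call site are hypothetical, and only TargetTransformInfo::allowsMisalignedMemoryAccesses is an existing query (its exact signature varies between LLVM versions):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Support/Alignment.h"

using namespace llvm;

// Hypothetical helper: only fold the memcmp into wide unaligned loads
// when the target reports that a misaligned access of this width is
// both legal and fast; otherwise leave the call for ExpandMemCmp and
// its full TTI-driven cost model.
static bool shouldFoldUnalignedMemCmp(const TargetTransformInfo &TTI,
                                      LLVMContext &Ctx,
                                      unsigned LoadSizeInBits) {
  bool Fast = false;
  if (!TTI.allowsMisalignedMemoryAccesses(Ctx, LoadSizeInBits,
                                          /*AddressSpace=*/0, Align(1),
                                          &Fast))
    return false;
  return Fast;
}
```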

nikic added a comment. · Aug 30 2022, 2:11 PM

Agree with @efriedma, I don't think we want to introduce unaligned loads without some TTI driven cost modelling.

I believe the original plan was to move the MergeICmps and ExpandMemCmp passes from the backend into the (late) optimization pipeline, which would allow follow-on optimizations to apply. I believe this change also landed at some point, but was reverted (due to some sanitizer interaction possibly?) and nobody bothered bringing it back.

At a high level, I think that would still be the proper way to do things. ExpandMemCmp is the pass that implements the fully general, TTI-driven logic for these expansions, and we probably should not subsume it for non-trivial cases.

If you're interested in trying it again: https://reviews.llvm.org/D60318

msebor added a comment.

> To be clear, this is proposing to replace a memcmp of 8 bytes with up to 16 byte-sized loads on targets with strict alignment? That seems excessive to me.

No, the patch doesn't change the size limit (the size of a scalar register, or 8 on all the targets I tested). It just removes the restriction on alignment and size being a power of two.

nikic added a comment. · Aug 31 2022, 8:58 AM

>> To be clear, this is proposing to replace a memcmp of 8 bytes with up to 16 byte-sized loads on targets with strict alignment? That seems excessive to me.
>
> No, the patch doesn't change the size limit (the size of a scalar register, or 8 on all the targets I tested). It just removes the restriction on alignment and size being a power of two.

The point here is that these loads are going to be expanded into 16 byte-sized loads by the backend, if the target has strict alignment. See https://llvm.godbolt.org/z/MdqoEEv1P for an example.

msebor added a comment.

> The point here is that these loads are going to be expanded into 16 byte-sized loads by the backend, if the target has strict alignment. See https://llvm.godbolt.org/z/MdqoEEv1P for an example.

The example shows 8 byte loads. It would be the result of expanding a call to memcmp(a, b, 8) in the worst-case scenario of both arguments being unaligned. Without the patch the call goes to the Glibc memcmp.S implementation. I'm not an expert on aarch64, but the inline expansion looks like a clear win to me (except at -Os, where it might make sense to prefer making the call). I'd expect the early expansion to be especially helpful when memcmp is called more than once with the same unaligned argument, because the loaded values can then be reused instead of reloaded each time.
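
As a made-up illustration of that reuse point (not code from the patch or its tests):

```cpp
#include <cstring>

// Hypothetical example: 'key' points at 8 bytes with no alignment
// guarantee and is compared against two candidates.  With the calls
// expanded in the middle end, the (byte-wise, on a strict-alignment
// target) load of 'key' happens once and the combined value is reused
// for both tests; with two libcalls, memcmp re-reads the buffer each
// time.
int classify(const char *key, const char *k1, const char *k2) {
  if (std::memcmp(key, k1, 8) == 0)
    return 1;
  if (std::memcmp(key, k2, 8) == 0)
    return 2;
  return 0;
}
```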