This is an archive of the discontinued LLVM Phabricator instance.

[X86] Memset is lowered to rep stos if MinSize is present
Needs ReviewPublic

Authored by vdsered on Dec 12 2021, 11:12 AM.

Details

Summary

X86 does not to account for MinSize when lowering memset which leads to larger code size than it is possible to get, for example, with using rep stos which is already in use in GCC.

This patch must reduce code size for memset when MinSize is used.

Issue with some more details: https://github.com/llvm/llvm-project/issues/51196

Diff Detail

Unit TestsFailed

Event Timeline

vdsered created this revision.Dec 12 2021, 11:12 AM
vdsered requested review of this revision.Dec 12 2021, 11:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 12 2021, 11:12 AM
vdsered set the repository for this revision to rG LLVM Github Monorepo.Dec 12 2021, 11:13 AM
This comment was removed by vdsered.

I don't remember seeing this part of x86 codegen before, so adding some more reviewers. I'm also not sure what is recommended as the best perf x86 asm on recent CPUs. Should we be using this for "optsize" too?

I don't remember seeing this part of x86 codegen before, so adding some more reviewers. I'm also not sure what is recommended as the best perf x86 asm on recent CPUs. Should we be using this for "optsize" too?

Neither do I. I saw Clang only has the builtin support __builtin_memcpy_inline by D73543. But GCC has the tuning code: https://github.com/gcc-mirror/gcc/commit/a32452a5442cd05040af53787af0d8b537ac77a6
Like the descriptions in GCC patch, this is not all win solution. But for minsize and optsize, this should be good to go. https://godbolt.org/z/4r8zTqfhY

lebedev.ri added inline comments.
llvm/include/llvm/CodeGen/SelectionDAGTargetInfo.h
80–92

Why do you need shouldEmitTargetCodeForMemset() if there already is EmitTargetCodeForMemset?

I don't remember seeing this part of x86 codegen before, so adding some more reviewers. I'm also not sure what is recommended as the best perf x86 asm on recent CPUs. Should we be using this for "optsize" too?

Not sure too, but rep stos seems to be a good choice if we are directly asked for MinSize. I'd not add OptSize here as we will probably sacrifice performance in some cases. At least, I'd not do that in that patch.

  1. There is a pull request for dotnet runtime with some similar problem Use xmm for stack prolog zeroing rather than rep stos. Seems like in general rep stos is not the best choice for performance.
  1. Intel in their manual "F.8.3.7 Copy and String Copy" say to use specifically only rep stosb for performance or as it is written "Software wishing to have a simple default string copy or store routine that will work well on a range of implementations (including future implementations) should consider using REP MOVSB or REP STOSB on implementations that support Enhanced REP MOVSB and STOSB".
  1. On the other hand, I found that AMD says rep stos is in general less efficient, but this is mentioned in only two guides Software Optimization Guide for AMD Family 15h Processors, (see 9.3 Repeated String Instructions) and Software Optimization Guide for AMD64 Processors (this one it too old and probably outdated). Plus, on the contrary to Intel, AMD recommends to use rep stos(b/w/d/q) to improve performance.

To sum up, I'd emit only rep stosb for Intel, if possible, and rep stos(b/w/d) for AMD when MinSize is present for now

Does it seem to be a good solution if we add the comment from @pengfei to this? Waiting for other opinions...

vdsered added inline comments.Dec 16 2021, 7:45 AM
llvm/include/llvm/CodeGen/SelectionDAGTargetInfo.h
80–92
  1. EmitTargetCodeForMemset seems to me to be the best place to generate code specific to MinSize for a target
  2. I intended to avoid changing/touching code generation for other backends which do not want to handle cases like MinSize or something similar and I thought of a way to go directly to EmitTargetCodeForMemset
  3. I optimize for MinSize when the length is constant and for such a case there is another lowering in getMemsetStores and I need to skip it

So, I created a method to decide whether to go directly to EmitTargetCodeForMemset or not and this method is customizable for each target.

There is probably a better solution, but I don't see a way to do it with less code written. Any suggestions? I can clarify it better in the comments...

It doesn't look like the changed tests match the case in the linked issue. Should we have a test for that?