Page MenuHomePhabricator

[SROA] Only generate memcpy if the slices is large 'enough' (WIP).
Needs ReviewPublic

Authored by fhahn on Oct 6 2020, 6:11 AM.

Details

Summary

Currently SROA liberally creates llvm.memcpy calls when dealing with
small slices of allocas. Unfortunately there are multiple places in LLVM
that do not work too well with llvm.memcpy.

This can lead to surprising code gen, e.g. see PR47705 and PR47709 (LICM
does not hoist invariant llvm.memcyp calls).

We can side-step this issue in some cases, by letting SROA emit
loads/stores instead of memcpy if the slice is small and we can
reasonably expect vector versions of those loads and stores can be used.

The chosen threshold of 2 x widest vector register is somewhat
arbitrary, but should ensure that we can be reasonably confident that
those loads & stores will be lowered relatively efficiently.

The patch as is is not ideal, because it potentially results in a large
number of insert/extractvalue instructions to move the loaded/stored
values to and from the slice. We could (and maybe should) try to
directly emit the correct vector loads/stores.

At this stage I am mainly interested to see if there's a reason for not
doing so already. It might not be desirable to bake in too much
target-specific knowledge into something as general as SROA. I'll update
the tests if we settle on the final approach

This potentially provides some nice performance benefits, e.g. on ARM64
with -O3 -flto, 450.soplex runs roughly 2.2% faster and generates to
expected assembly for PR47705.

We should also work on improving the handling of llvm.memcpy in
different passes, but that might be tricky in some cases. For example,
it might be desirable to de-compose llvm.memcpy in separate load/store
parts if this would lead to the load part being loop-invariant.

Diff Detail

Unit TestsFailed

TimeTest
130 mslinux > Clang.CodeGen::thinlto-distributed-newpm.ll
Script: -- : 'RUN: at line 6'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -thinlto-bc -o /mnt/disks/ssd0/agent/llvm-project/build/tools/clang/test/CodeGen/Output/thinlto-distributed-newpm.ll.tmp.o /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGen/thinlto-distributed-newpm.ll
480 mslinux > Clang.CodeGenOpenCL::amdgpu-nullptr.cl
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/clang -cc1 -internal-isystem /mnt/disks/ssd0/agent/llvm-project/build/lib/clang/12.0.0/include -nostdsysteminc /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGenOpenCL/amdgpu-nullptr.cl -cl-std=CL2.0 -include opencl-c.h -triple amdgcn -emit-llvm -o - | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGenOpenCL/amdgpu-nullptr.cl
70 mslinux > LLVM.CodeGen/AMDGPU::memcpy-fixed-align.ll
Script: -- : 'RUN: at line 2'; /mnt/disks/ssd0/agent/llvm-project/build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
70 mslinux > LLVM.DebugInfo/X86::sroasplit-1.ll
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt /mnt/disks/ssd0/agent/llvm-project/llvm/test/DebugInfo/X86/sroasplit-1.ll -sroa -verify -S -o - | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/DebugInfo/X86/sroasplit-1.ll
180 mslinux > LLVM.DebugInfo/X86::sroasplit-4.ll
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -sroa < /mnt/disks/ssd0/agent/llvm-project/llvm/test/DebugInfo/X86/sroasplit-4.ll -S -o - | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/DebugInfo/X86/sroasplit-4.ll
View Full Test Results (9 Failed)

Event Timeline

fhahn created this revision.Oct 6 2020, 6:11 AM
fhahn requested review of this revision.Oct 6 2020, 6:11 AM
bjope added a subscriber: bjope.Oct 6 2020, 6:25 AM

This isn't really SROA specific is it?
Instcombine also expands small (16 bytes, hardcoded) memcpy's
I would think this should be done by some standalone pass, as a costmodel-driven optimization.

nikic added a subscriber: nikic.Oct 6 2020, 7:40 AM
fhahn added a comment.Oct 6 2020, 7:56 AM

This isn't really SROA specific is it?
Instcombine also expands small (16 bytes, hardcoded) memcpy's
I would think this should be done by some standalone pass, as a costmodel-driven optimization.

yeah we might as well do this independently, although we might be able to catch the most important cases at the 'source' as well. Deciding which memcpys needs to be target specific I think, in a way of cost-modeling, which this patch does, although very naively. Unfortunately it is going to be quite hard to estimate whether expanding memcpy will enable further optimizations.

Cost modeling is target-specific yes, and also depends on the alignment of the copy.

If I'm understanding correctly, there are two significant benefits here:

  1. Splitting apart the "load" and "store" parts of the memcpy, so, for example, the load can be hoisted out of a loop.
  2. We have generally better optimizations for load/store instructions.

Unfortunately it is going to be quite hard to estimate whether expanding memcpy will enable further optimizations.

There are cases we could detect directly relatively easily: for example, whether we can hoist the load out of a loop, or whether the load is loading a constant value. More generally it could be tricky, yes.

Personally honestly i'm not so much not okay with doing this in SROA,
but it's that i'm uneasy about introduction of a new TTI dependency to some pass.
Because that implicitly changes it from being a target-independent pass
into being a cost-model driven transform, with rest of transforms
in the pass being "forgotten" to be costmodelled.

Instcombine also expands small (16 bytes, hardcoded) memcpy's

Will it help to schedule another instcombine after SROA, or put SROA before an existing instcombine pass schedule?

Instcombine also expands small (16 bytes, hardcoded) memcpy's

Will it help to schedule another instcombine after SROA, or put SROA before an existing instcombine pass schedule?

I'm not sure i understand the question.
There's for sure already instcombine invocations after SROA, e.g.: https://github.com/llvm/llvm-project/blob/88241ffb5636ebc0579d3ab8eeec78446a769c54/llvm/test/Other/opt-O3-pipeline.ll#L142-L164