This is an archive of the discontinued LLVM Phabricator instance.

[MemCpyOpt] Expand two memcpy's with clobber inbetween (PR59116)
AbandonedPublic

Authored by lebedev.ri on Nov 26 2022, 5:08 PM.

Details

Summary

This is a WIP of a somewhat more principled attempt to solve https://github.com/llvm/llvm-project/issues/59116
Admittedly, i am not familiar neither with this pass, nor with MSSA.

We have something like this:

MDep: memcpy(tmp <- a)
...
... a is potentially modified inbetween, e.g.:
memcpy(a <- b)
...
M:    memcpy(b <- tmp)

Since we know that tmp is last modified by MDep,
what we can do, is expand MDep's memcpy into load+store pair,
and then expand M's memcpy into a store of the MDep's load:

reload = load a
store reload, tmp ; spill
...
store reload, b   ; final store

This pattern can happen e.g. when swapping contents of the a and b,
in which case tmp might go away completely, especially if it is an alloca.

This isn't always an obvious improvement, and in general, creating large
vectors can easily cause problematic compile-time implications, so there is a
profitability heuristic: loading said vector should not require more vectors
than theoretically available on the given target.

Conceptually, does this make sense? Thoughts on the profitability check?
Should we actually ensure that the clobber is a memcpy(a <- b) ?

I suspect it's still too lax, yet we will have compile-time implications regardless.

Diff Detail

Event Timeline

lebedev.ri created this revision.Nov 26 2022, 5:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 26 2022, 5:08 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
lebedev.ri requested review of this revision.Nov 26 2022, 5:08 PM
lebedev.ri edited the summary of this revision. (Show Details)Nov 26 2022, 5:18 PM

poke.
Does this conceptually make sense for this pass?

nikic added a comment.Dec 7 2022, 7:18 AM

At a high level, I'd say that this transform would be a better fit for SROA. The profitability is clearer if we can actually eliminate the alloca and spill from the first memcpy, making this a single load and store.

llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1195

So we want to use up *all* vector registers for the copy? That's like 64 * 32 = 2048 bytes for AVX-512. That seems *way* too aggressive.

lebedev.ri abandoned this revision.Dec 16 2022, 8:37 AM

Indeed, this originated as a SROA change, which is now back in.
I don't have an immediate need for non-SROA variant of this.