This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Maintain shadow/backing alloca when some slices are noncapturnig read-only calls to allow alloca partitioning/promotion
Changes PlannedPublic

Authored by lebedev.ri on Nov 9 2021, 3:01 PM.

Details

Summary

This is inspired by the original variant of D109749 by Graham Hunter,
but is a more general version.

Roughly, instead of promoting the alloca, we call it a shadow/backing alloca,
go through all it's slices, clone(!) instructions that operated on it,
but make them operate on the cloned alloca, and promote cloned alloca instead.

This keeps the shadow/backing alloca, and all the original instructions around,
which results in said shadow/backing alloca being a perfect mirror/representation
of the promoted alloca's content, so calls that take the alloca
as arguments (non-capturingly!) can be supported.

For now, we require that the calls also don't modify the alloca's content,
but that is only to simplify the initial implementation,
and that will be supported in a follow-up.

Overall, this leads to *smaller* codesize:
https://llvm-compile-time-tracker.com/compare.php?from=a8b4f5bbab62091835205f3d648902432a4a5b58&to=aeae054055b125b011c1122f82c86457e159436f&stat=size-total
and is roughly neutral compile-time wise:
https://llvm-compile-time-tracker.com/compare.php?from=a8b4f5bbab62091835205f3d648902432a4a5b58&to=aeae054055b125b011c1122f82c86457e159436f&stat=instructions

Diff Detail

Event Timeline

lebedev.ri created this revision.Nov 9 2021, 3:01 PM
lebedev.ri requested review of this revision.Nov 9 2021, 3:01 PM

This is really cool. There is a FIXME, are you expecting to address this? Other than that, reading through it seems sensible to me.

llvm/lib/Transforms/Scalar/CMakeLists.txt
101 ↗(On Diff #385987)

leftover, I assume

lebedev.ri marked an inline comment as done.

Some assortment of small improvements.

This is really cool.

Yep.

There is a FIXME, are you expecting to address this? Other than that, reading through it seems sensible to me.

I've added run line with -opaque-pointers, and everything appears to just work.
@nikic - can you confirm that this is fine?

llvm/lib/Transforms/Scalar/CMakeLists.txt
101 ↗(On Diff #385987)

yep.

lebedev.ri added inline comments.Nov 10 2021, 2:26 AM
llvm/test/Transforms/SROA/non-capturing-call.ll
659–661

We fail to drop some dead instructions, which prevents deletion of the alloca itself.
Not yet sure how to deal with this.

Some assortment of small improvements.

This is really cool.

Yep.

There is a FIXME, are you expecting to address this? Other than that, reading through it seems sensible to me.

I've added run line with -opaque-pointers, and everything appears to just work.
@nikic - can you confirm that this is fine?

Actually thinking about it, that is obviously right,
because what opaque pointers disallow is *extracting*
the pointee type out of the pointer type,
and here we only get/use the whole pointer type,
so it doesn't matter if it contains the pointee knowledge or not.

Ok, it turned out that we do need to delete no-longer-used instructions,
the added test would otherwise assert as not being promoteable.
Also, don't perform this fixup if there are no non-escaping uses of alloca.

There is at least one more problem, but i'm not sure what it is yet.

Introduce profitability heuristic so that the compiler doesn't take out
the whole machine while trying to build e.g. vanilla llvm test-suite :)

The current rule-of-thumb is:
For an escaped alloca to be worthy of promotion,
it must be used by loads/stores in this function (i.e., no point in promoting https://godbolt.org/z/c51Yavv9P)
and either have a small total size (32 bytes currently,
very much a guess), or most of the bytes of the alloca
(>=80%, also a guess) must be loaded/stored within this function (i.e. not much point in promoting https://godbolt.org/z/7KToqzb6z)

This results in these numbers:
https://llvm-compile-time-tracker.com/compare.php?from=8d35c054e31e5a2bee082f7a587a660eeb24bf99&to=c7226d770b1ffa1117f724dd25de7fa2881eed18&stat=instructions
... and i'm not sure what is going on with tramp3d-v4's NewPM-ReleaseThinLTO build,
i can't reproduce the horrible slowdown locally (do i need a production debug/assert-less build?).

This transform does not do anything crazy on that code,
the alloca's that happen to be promoted are very tiny (<=32b),
and we introduce only a single spill/reload per each,
so this isn't something i see as preventable with profitability checks.

tramp3d-v4 - codesize increased by 4% so more inlining(?), more IR, slower build.

Introduce profitability heuristic so that the compiler doesn't take out
the whole machine while trying to build e.g. vanilla llvm test-suite :)

Fixed the "occupancy" to only consider bytes that are both loaded and stored.
After all, we are mostly interested in allowing mem2reg,
aka, avoiding using stack instead of registers, but if not a single byte
is eligible, then we don't really care. Notably, the current check is *still*
too soft, as the last test shows.

This results in these numbers:
https://llvm-compile-time-tracker.com/compare.php?from=8d35c054e31e5a2bee082f7a587a660eeb24bf99&to=c7226d770b1ffa1117f724dd25de7fa2881eed18&stat=instructions
... and i'm not sure what is going on with tramp3d-v4's NewPM-ReleaseThinLTO build,
i can't reproduce the horrible slowdown locally (do i need a production debug/assert-less build?).

... and now we get https://llvm-compile-time-tracker.com/compare.php?from=2c91f48c48c42f9bd730d0791fcb19dbe0038d96&to=b4282b9b3cf6ebe17938a4041af21681b762b881&stat=instructions,
and the horrible tramp3d-v4 regression is gone :)

This is mostly the point where i'd want some feedback.

Ping.
I'm not really sure who is comfortable/okay with reviewing SROA changes nowadays, so the CC list is a bit too large.

Did some perf testing on RawSpeed, while the results will be heavily workload-depend,
here they seem to be mostly positively neutral with some improvements,
and no heavy losses.


reames requested changes to this revision.Nov 30 2021, 12:18 PM

Spent a decent amount of time looking at this, but haven't yet fully formed a recommendation. I'm going to summarize my findings to date in the hopes this is useful.

I think the basic idea here is valuable, but as implemented, I don't think this is a good idea. The current implementation can result in a serious pessimization in cases like the following:

a = alloca <4 x i8> ... // assume a is otherwise splittable
<init a to zero>
loop {
   if (very_rare) a[3] = 5;
   foo(a);
}

This transformation will have the effect of rewriting this as:

a = alloca <4 x i8> // all other uses of a split and ssa constructed
loop {
   *a =  very_rate ? <0,0,0,5> : <0,0,0,0> 
   foo(a);
}

The result is that we've made the store to a dramatically more expensive by moving it into the hotpath.

I think this is a fatal flaw with the approach. We could maybe patch around it with some profiling data, but in general, we don't want to be moving init logic.

However, this is where I get struck. I don't really have a suggestion on how to implement this better. Conceptually, what we want is something along the lines of D113289, but applied to SROA. That is, doing SSA formation to eliminate loads while leaving the stores in place for use by the calls which reads the allocas. However, there are two problems with that:

  1. If we want to support non-readonly calls, we'd need to materialize a load after the call. This has the same "move to hot" problem as above. Maybe we can just restrict it to readonly arguments?
  2. This really doesn't cleanly fit into the code structure of SROA. While SROA indirectly does SSA construction, it does so by forming another alloca, and then mem2regging that. We'd have to figure out something analogous to mem2reg which didn't remove the stores or the allocas.

Purely in terms of the code structure of the current patch, the integration of this is weird. The remat transform should be done over the AllocaSlices array after initially scanned, and should directly rebuild that data structure rather than relying on the code to handle duplicates in the array and double scanning. That doesn't address the algorithmic concern above though, so this ends up being mostly an aside.

This might be helpful to brainstorm offline. Let me know if you'd like to chat.

This revision now requires changes to proceed.Nov 30 2021, 12:18 PM

Spent a decent amount of time looking at this, but haven't yet fully formed a recommendation. I'm going to summarize my findings to date in the hopes this is useful.

Thank you for taking a look!

I think the basic idea here is valuable, but as implemented, I don't think this is a good idea. The current implementation can result in a serious pessimization in cases like the following:

a = alloca <4 x i8> ... // assume a is otherwise splittable
<init a to zero>
loop {
   if (very_rare) a[3] = 5;
   foo(a);
}

This transformation will have the effect of rewriting this as:

a = alloca <4 x i8> // all other uses of a split and ssa constructed
loop {
   *a =  very_rate ? <0,0,0,5> : <0,0,0,0> 
   foo(a);
}

The result is that we've made the store to a dramatically more expensive by moving it into the hotpath.

I think this is a fatal flaw with the approach. We could maybe patch around it with some profiling data, but in general, we don't want to be moving init logic.

Indeed: https://godbolt.org/z/n64ozqnsT

However, this is where I get struck. I don't really have a suggestion on how to implement this better.

Conceptually, what we want is something along the lines of D113289, but applied to SROA.

Right. Not sure how much relevant potential is left after that change, to be noted.

That is, doing SSA formation to eliminate loads while leaving the stores in place for use by the calls which reads the allocas. However, there are two problems with that:

  1. If we want to support non-readonly calls, we'd need to materialize a load after the call. This has the same "move to hot" problem as above.

True.

Maybe we can just restrict it to readonly arguments?

  1. This really doesn't cleanly fit into the code structure of SROA. While SROA indirectly does SSA construction, it does so by forming another alloca, and then mem2regging that. We'd have to figure out something analogous to mem2reg which didn't remove the stores or the allocas.

Well, this one seems straight-forward - instead of spilling before the call, just go through all the slices,
and duplicate each store to store into our new alloca, this will essentially preserve all the original stores.

Purely in terms of the code structure of the current patch, the integration of this is weird. The remat transform should be done over the AllocaSlices array after initially scanned, and should directly rebuild that data structure rather than relying on the code to handle duplicates in the array and double scanning. That doesn't address the algorithmic concern above though, so this ends up being mostly an aside.

Note that i only scan the newly-added uses, i do not rescan everything, so i don't believe this particular concern is significant.

This might be helpful to brainstorm offline. Let me know if you'd like to chat.

I think the obvious solution is:

  1. Keep all stores. (in terms of this patch, iterate over all slices and duplicate all store instructions to also store into the cloned alloca)
  2. Keep all loads (in terms of this patch, iterate over all slices and before the load from the original alloca, load from the cloned alloca and store into the original alloca) UNLESS we can omit particular load because we can tell that there was no taint (may-write calls) on every path from the every previous store.

@lebedev.ri Your suggested approach makes sense to me at a basic level. Forming the alloca is starting to seem more and more like a hack, but we can come back to implementing partial mem2reg as a follow on code improvement.

A couple of suggestions:

  • Please leave taint tracking to later patch. Start with the requirement that the non-escaping call use must also be readonly in that argument.
  • Instead of having the cloned alloca be the one kept, have it be the one which gets mem2reged. That is, duplicate all the uses (except the escaping call) and let the code handle it. The primary point here is just to keep the naming in the IR more obvious. :)

@lebedev.ri Your suggested approach makes sense to me at a basic level. Forming the alloca is starting to seem more and more like a hack, but we can come back to implementing partial mem2reg as a follow on code improvement.

A couple of suggestions:

  • Please leave taint tracking to later patch. Start with the requirement that the non-escaping call use must also be readonly in that argument.

That was my plan, yes.

  • Instead of having the cloned alloca be the one kept, have it be the one which gets mem2reged. That is, duplicate all the uses (except the escaping call) and let the code handle it. The primary point here is just to keep the naming in the IR more obvious. :)

Yeah, i was wondering about that, but didn't do that yet.

lebedev.ri retitled this revision from [SROA] Spill alloca's around non-capturing escapes via calls to allow alloca partitioning/promotion to [SROA] Maintain shadow/backing alloca when some slices are noncapturnig read-only calls to allow alloca partitioning/promotion.
lebedev.ri edited the summary of this revision. (Show Details)

Finally managed to subdue crippling fear of the potential complexity
of the alternative design, only to discover it to be rather simple :)

@lebedev.ri Your suggested approach makes sense to me at a basic level. Forming the alloca is starting to seem more and more like a hack, but we can come back to implementing partial mem2reg as a follow on code improvement.

A couple of suggestions:

  • Please leave taint tracking to later patch. Start with the requirement that the non-escaping call use must also be readonly in that argument.
  • Instead of having the cloned alloca be the one kept, have it be the one which gets mem2reged. That is, duplicate all the uses (except the escaping call) and let the code handle it. The primary point here is just to keep the naming in the IR more obvious. :)

Done, PTAL :)

Herald added a project: Restricted Project. · View Herald TranscriptMar 1 2022, 2:14 PM

The idea behind this seems very interesting/valuable, thanks for this!

I am wondering if you are able to get some benchmark data (SPEC, or phoronix) for this? IIUC, the code-size will be increased, so is that size reasonable?

llvm/lib/Transforms/Scalar/SROA.cpp
3688

Should we add a TODO here for clarifying that there is a plan to support calls that do modify the alloca's content?

lebedev.ri updated this revision to Diff 412688.Mar 3 2022, 5:40 AM
lebedev.ri marked an inline comment as done.

Thanks for taking a look!

The idea behind this seems very interesting/valuable, thanks for this!

I am wondering if you are able to get some benchmark data (SPEC, or phoronix) for this? IIUC, the code-size will be increased, so is that size reasonable?

I suppose the LICM change, that came after this was initially posted,
has rendered this being less crucial than it would be,
but important nonetheless.

Size-wise, this is actually a win i would say:
https://llvm-compile-time-tracker.com/compare.php?from=a8b4f5bbab62091835205f3d648902432a4a5b58&to=aeae054055b125b011c1122f82c86457e159436f&stat=size-total

LGTM for my side, but please wait a few days to see if someone has some additional comments (although it looks like all the comments are already addressed).

llvm/lib/Transforms/Scalar/SROA.cpp
297

I guess ; is unintentional here.

LGTM for my side, but please wait a few days to see if someone has some additional comments (although it looks like all the comments are already addressed).

Thank you!

I'm hoping @reames would find time to re-review this, but given that
we've settled on the approach, and it's rather straight-forward,
i don't know how long i should wait indeed.

lebedev.ri edited the summary of this revision. (Show Details)Mar 3 2022, 6:49 AM
djtodoro accepted this revision.Mar 3 2022, 7:00 AM

@reames i'm intending to land this in 24 hours if there are no further comments.

This revision was not accepted when it landed; it landed in state Needs Review.Mar 4 2022, 10:09 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
lebedev.ri reopened this revision.Mar 4 2022, 1:13 PM
This revision was not accepted when it landed; it landed in state Needs Review.Mar 4 2022, 1:14 PM
This revision was automatically updated to reflect the committed changes.
lebedev.ri reopened this revision.Mar 4 2022, 3:57 PM

Ok, i'm not yet sure how to deal with this, but somehow we end up trying to re-promote the alloca's here:

; Function Attrs: argmemonly nofree nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #0

define fastcc void @TraceLine(i64 %tmp) {
entry:
  %LDir = alloca i64, align 8
  %NewDir1 = alloca i64, align 8
  %LDir.cast = bitcast i64* %LDir to i8*
  %NewDir1.cast = bitcast i64* %NewDir1 to i8*
  store i64 %tmp, i64* %LDir

  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %NewDir1.cast, i8* %LDir.cast, i64 8, i1 false)
  ;%reload = load i64, i64* %LDir
  ;store i64 %reload, i64* %NewDir1

  %call33 = call fastcc double @IntersectObjs(i64* %NewDir1)
  %call38 = call fastcc double @IntersectObjs(i64* %LDir)
  ret void
}

declare fastcc double @IntersectObjs(i64* nocapture readonly)

; uselistorder directives
uselistorder double (i64*)* @IntersectObjs, { 1, 0 }

attributes #0 = { argmemonly nofree nounwind willreturn }
lebedev.ri planned changes to this revision.Mar 5 2022, 10:55 AM