This is an archive of the discontinued LLVM Phabricator instance.

Add IR constructs for inalloca replacement preallocated call setup
ClosedPublic

Authored by aeubanks on Feb 14 2020, 2:18 PM.

Details

Summary

Add llvm.call.setup and llvm.call.alloc instrinsics.
Add callsetup operand bundle which takes a token produced by llvm.call.setup.
Add preallocated parameter attribute, which is like byval but without the copy.

Verifier changes for these IR constructs.

See https://github.com/rnk/llvm-project/blob/call-setup-docs/llvm/docs/CallSetup.md

Event Timeline

aeubanks created this revision.Feb 14 2020, 2:18 PM
aeubanks retitled this revision from Add intrinsics and operand bundles for inalloca replacement llvm.call.setup to [WIP] Add intrinsics and operand bundles for inalloca replacement llvm.call.setup.Feb 14 2020, 2:25 PM
rnk added a subscriber: rnk.Feb 14 2020, 3:44 PM

Nice, that's the essence of the verifier checks, and the tests look pretty good. LLVM has an 80 character limit, and the easiest way to handle that is to use clang-format. I recommend setting up an editor integration so you can use it interactively as you edit code, or running git-clang-format before uploading.

llvm/lib/IR/Verifier.cpp
3057

You should add some tests for these checks.

3061–3062

The verifier shouldn't reject unknown bundles, the idea is to allow other bundles as an extension point.

4441

This is the right idea, but I'd make APInt locals for the getValue() results to make it shorter.

4449

I guess I'd try to do this with less conditionals:

auto CallSetupBundle = ...
Assert(CallSetupBundle, "using call site should have a call.setup bundle");
Assert(CallSetupBundle->Inputs.front().get() == &Call, ...);
aeubanks updated this revision to Diff 245288.Feb 18 2020, 3:08 PM
aeubanks marked 5 inline comments as done.

Simplify some code, add more tests, check operand bundle type instead of checking that the value is a ConstantTokenNone

llvm/lib/IR/Verifier.cpp
3061–3062

Done.
I had (and still have) no idea how "callsetup" was getting read as a LLVMContext::OB_callsetup so I wanted to make sure that this branch was actually taken.

aeubanks updated this revision to Diff 245669.Feb 20 2020, 9:02 AM

Add preallocated attribute, major cleanups

  • Fix some stuff
  • Add preallocated
  • Actually add callsetup as an operand bundle
  • Parse "preallocated" in ll files, add proper ptr casts in tests
  • Check that callsetup token is from llvm.call.setup
  • Remove "preallocated" from function declaration
  • Add comment to Attributes.td
  • Remove unused function
  • Trailing spaces
aeubanks retitled this revision from [WIP] Add intrinsics and operand bundles for inalloca replacement llvm.call.setup to Add IR constructs for inalloca replacement llvm.call.setup.Feb 20 2020, 9:06 AM
aeubanks edited the summary of this revision. (Show Details)
aeubanks updated this revision to Diff 246006.Feb 21 2020, 3:10 PM

Add and enforce matching preallocated attribute in callee parameter

aeubanks updated this revision to Diff 246021.Feb 21 2020, 3:31 PM

Fix bad arc diff

aeubanks updated this revision to Diff 247283.Feb 28 2020, 8:56 AM

Fix arc diff

rnk added a comment.Feb 28 2020, 2:17 PM

I would add a test to llvm/test/Bitcode/attributes.ll for preallocated. Other than that, I think we can assume that the attribute machinery works, and we don't need dedicated round-trip testing like we did for inalloca (llvm/test/(Assembler|Bitcode)/inalloca.ll).

I basically think this looks good, but I think we need rules in llvm/docs/LangRef.rst before landing this, which means we need to wrap up the discussion on llvm-dev with Eli.

llvm/include/llvm/Bitcode/LLVMBitCodes.h
636 ↗(On Diff #247283)

This set off my "is this used as the RHS of a left shift?" spidey sense, but I checked, and I think it's all good.

aeubanks updated this revision to Diff 247756.Mar 2 2020, 4:27 PM
  • Add preallocated attribute to function declaration parameters in tests
  • Add preallocated attribute to attributes test
  • Based on attribute test, found some places that I missed regarding reading/writing asm files
  • Format
aeubanks updated this revision to Diff 247957.Mar 3 2020, 10:36 AM
  • Skip preallocated in addRawAttributeValue()
  • Add extra CHECK-NEXT in operand-bundles-bc-analyzer.ll
aeubanks updated this revision to Diff 255854.Apr 7 2020, 4:35 PM

Add extra verifier check

rnk added a comment.Apr 15 2020, 5:36 PM

I looked through this again, and basically think all the code looks good, but I want to get input from Eli, so I added him as a reviewer.

Eli, any concerns with the design or naming before landing this? I was surprised we didn't get more name suggestions upstream.

Arthur, now that you've worked with the names a bit, do you feel like they make sense? Would you change them?

We still need a LangRef patch at some point.

I suspect we might end up tweaking the signature of llvm.call.setup, based on the llvm-dev discussion, but we can experiment in-tree, I guess.

llvm/lib/IR/Verifier.cpp
1672

Do you need to check Attrs.getPreallocatedType()->isSized(), or something like that? Or is that checked elsewhere?

aeubanks updated this revision to Diff 257922.Apr 15 2020, 6:12 PM

Check isSized()
Add missing preallocated type handling in AttrBuilder

aeubanks marked 2 inline comments as done.Apr 15 2020, 6:15 PM
In D74651#1985474, @rnk wrote:

Arthur, now that you've worked with the names a bit, do you feel like they make sense? Would you change them?

@llvm.call.alloc is weird because it's not allocating anything, the allocation happens in @llvm.call.setup. Maybe @llvm.call.arg is better?

llvm/lib/IR/Verifier.cpp
1672

Piggybacked off existing checks below, thanks for catching.

rnk added a comment.Apr 16 2020, 9:52 AM
In D74651#1985474, @rnk wrote:

Arthur, now that you've worked with the names a bit, do you feel like they make sense? Would you change them?

@llvm.call.alloc is weird because it's not allocating anything, the allocation happens in @llvm.call.setup. Maybe @llvm.call.arg is better?

I think when I came up with this terminology, I was thinking that alloc would actually do the allocation, so there would be a staircase of ESP adjustments. However, it's true, in the current implementation, alloc is a misnomer, and we might never implement SP staircasing (I don't have a better name for this code pattern...).

Consider the code MSVC generates for a call like this:
https://gcc.godbolt.org/z/TeLRRa

struct Foo { Foo(int x); Foo(const Foo &o); ~Foo(); int x, y; };
void callee(int, Foo, int, Foo, int, Foo);
void bar() {
  callee(1, Foo(2), 3, Foo(4), 5, Foo(6));
}

Each ESP adjustment happens immediately before the argument is evaluated. It lets the compiler pass scalar memory arguments with the PUSH instruction, which is good for code size, and I hear performance, although I have no first hand evidence of this.


In any case, I think we should use a name that makes sense for either implementation strategy.

Should we double down on the preallocated terminology? llvm.call.preallocated.arg? The number arguments make more sense this way, they refer to the N'th "preallocated" argument. The pattern would be:

%cs = call token @llvm.call.setup(i32 3)
%x = call i8* @llvm.call.preallocated.arg(token %cs, i32 0)
%y = call i8* @llvm.call.preallocated.arg(token %cs, i32 1)
%z = call i8* @llvm.call.preallocated.arg(token %cs, i32 2)
...

(I suppose that if we ever wanted to do staircasing, we would have to require that the calls be evaluated right to left.)

aeubanks marked an inline comment as done.Apr 16 2020, 11:43 AM

In any case, I think we should use a name that makes sense for either implementation strategy.

Should we double down on the preallocated terminology? llvm.call.preallocated.arg? The number arguments make more sense this way, they refer to the N'th "preallocated" argument. The pattern would be:

%cs = call token @llvm.call.setup(i32 3)
%x = call i8* @llvm.call.preallocated.arg(token %cs, i32 0)
%y = call i8* @llvm.call.preallocated.arg(token %cs, i32 1)
%z = call i8* @llvm.call.preallocated.arg(token %cs, i32 2)
...

(I suppose that if we ever wanted to do staircasing, we would have to require that the calls be evaluated right to left.)

I like adding preallocated to the intrinsic name, but I think we should do it for all the intrinsics, especially since the @llvm.call prefix seems fairly generic.
Do @llvm.call.preallocated.setup, @llvm.call.preallocated.arg, and @llvm.call.preallocated.teardown sound good?

We still need a LangRef patch at some point.

I suspect we might end up tweaking the signature of llvm.call.setup, based on the llvm-dev discussion, but we can experiment in-tree, I guess.

Does the LangRef patch typically come before, after, or in the same commit? After doing some experimenting?

rnk added a comment.Apr 16 2020, 12:31 PM

I like adding preallocated to the intrinsic name, but I think we should do it for all the intrinsics, especially since the @llvm.call prefix seems fairly generic.
Do @llvm.call.preallocated.setup, @llvm.call.preallocated.arg, and @llvm.call.preallocated.teardown sound good?

I like that nomenclature. The the whole feature can be referred to as "calls with preallocated argument memory" or "preallocated call sites", and it sounds like grammatically correct English.

We still need a LangRef patch at some point.

I suspect we might end up tweaking the signature of llvm.call.setup, based on the llvm-dev discussion, but we can experiment in-tree, I guess.

Does the LangRef patch typically come before, after, or in the same commit? After doing some experimenting?

I guess in this case it would be best to put together a LangRef patch to land before this one, in case anyone wants to look at it alone, without the boilerplate.

The main thing we need in LangRef is to document the verifier rules that were added, and the rules for how to use the intrinsics without causing UB. The intrinsics depend on some unobservable state (ESP), and have to be called in a particular order to work at runtime.

In D74651#1987364, @rnk wrote:

Does the LangRef patch typically come before, after, or in the same commit? After doing some experimenting?

I guess in this case it would be best to put together a LangRef patch to land before this one, in case anyone wants to look at it alone, without the boilerplate.

A nice feature of landing LangRef changes coincidentally with IR changes is having the documentation at https://www.llvm.org/docs/LangRef.html match the code in https://github.com/llvm/llvm-project. But if there's a good reason to split them up I agree with @rnk that "docs before" is the right order.

aeubanks updated this revision to Diff 258434.Apr 17 2020, 2:58 PM

Add LangRef
Update some failing tests
Rename "callsetup" operand bundle to "preallocated"
Rename intrinsics to "llvm.call.preallocated.{setup,arg}"
Remove teardown intrinsic (will come later)

In D74651#1987364, @rnk wrote:

Does the LangRef patch typically come before, after, or in the same commit? After doing some experimenting?

I guess in this case it would be best to put together a LangRef patch to land before this one, in case anyone wants to look at it alone, without the boilerplate.

A nice feature of landing LangRef changes coincidentally with IR changes is having the documentation at https://www.llvm.org/docs/LangRef.html match the code in https://github.com/llvm/llvm-project. But if there's a good reason to split them up I agree with @rnk that "docs before" is the right order.

I added the LangRef change to this change, it shouldn't be too hard to just look at the LangRef changes.
(also, how do you look at the rendered version before submitting?)

In D74651#1987364, @rnk wrote:

I like adding preallocated to the intrinsic name, but I think we should do it for all the intrinsics, especially since the @llvm.call prefix seems fairly generic.
Do @llvm.call.preallocated.setup, @llvm.call.preallocated.arg, and @llvm.call.preallocated.teardown sound good?

I like that nomenclature. The the whole feature can be referred to as "calls with preallocated argument memory" or "preallocated call sites", and it sounds like grammatically correct English.

I renamed the "callsetup" operand bundle to "preallocated" because of this, lmk if that's too confusing since the attribute is also called "preallocated".

efriedma added inline comments.Apr 17 2020, 3:24 PM
llvm/docs/LangRef.rst
1069 ↗(On Diff #258434)

Are there any rules for what value has to be passed to a "preallocated" argument? Or is the value ignored?

11939 ↗(On Diff #258434)

Probably worth mentioning the rules for nested llvm.call.preallocated.setup .

11963 ↗(On Diff #258434)

Maybe worth mentioning what happens if you call llvm.call.preallocated.arg after the memory is deallocated?

llvm/lib/IR/Verifier.cpp
4455

Do you need to check that there aren't any calls to llvm.call.preallocated.arg with the wrong token?

4471

"Prelalocated"?

aeubanks updated this revision to Diff 258844.Apr 20 2020, 2:30 PM
aeubanks marked 7 inline comments as done.

Rebase
Address code review comments

aeubanks added inline comments.Apr 20 2020, 2:32 PM
llvm/docs/LangRef.rst
1069 ↗(On Diff #258434)

It's ignored, clarified.

11939 ↗(On Diff #258434)

Added blurb about how

t1 = setup()
t2 = setup()
foo() [t2]
foo() [t1]

is ok but not

t1 = setup()
t2 = setup()
foo() [t1]
foo() [t2],

is that good enough?

11963 ↗(On Diff #258434)

Said it's UB if you call llvm.call.preallocated.arg after the call, or after another llvm.call.preallocated.setup.

llvm/lib/IR/Verifier.cpp
4455

Good catch, I missed that. Done and added test case.

aeubanks marked an inline comment as done.Apr 20 2020, 2:33 PM
aeubanks added inline comments.
llvm/lib/IR/Verifier.cpp
4471

Done.

efriedma added inline comments.Apr 20 2020, 3:36 PM
llvm/docs/LangRef.rst
1069 ↗(On Diff #258434)

For the purpose of actually generating code, saying it's ignored is fine. It might be inconvenient for IPO transforms/analysis, though: any transform that examines a preallocated argument would have to explicitly blacklist optimizations on the argument. It might make sense to require that the argument has the "correct" value, even if code generation won't use it, for the sake of optimizations.

llvm/lib/IR/Verifier.cpp
1726

Missing comma

aeubanks updated this revision to Diff 258868.Apr 20 2020, 5:01 PM
aeubanks marked an inline comment as done.

Address code review comments

aeubanks marked an inline comment as done.Apr 20 2020, 5:02 PM
aeubanks added inline comments.
llvm/docs/LangRef.rst
1069 ↗(On Diff #258434)

Done, PTAL at new wording.

llvm/lib/IR/Verifier.cpp
1726

Done.

efriedma accepted this revision.Apr 20 2020, 5:25 PM

LGTM. Please give Reid a chance to comment before you merge, though.

llvm/docs/LangRef.rst
1069 ↗(On Diff #258434)

New wording seems fine.

This revision is now accepted and ready to land.Apr 20 2020, 5:25 PM
aeubanks updated this revision to Diff 259411.Apr 22 2020, 3:10 PM

Require preallocated call site attribute on llvm.call.preallocated.arg

Reid and I talked a bit and decided that this hadn't handled the case of when there's a setup without a call (e.g. via DCE). The stack adjustment will happen, but without the call the stack won't be cleaned up. One solution is turn the calls to llvm.call.preallocated.arg into allocas. But we need to know the type of the argument, and without the call we don't have that info. One solution is to add a call site attribute to llvm.call.preallocated.arg with the type. I reused the preallocated attribute for this purpose.
One downside of this approach is that alloca returns a pointer of the proper type, but llvm.call.preallocated.arg returns i8*. And we likely are casting the i8* to the actual pointer type we want, so there will likely be two unnecessary casts when cleaning up setups without calls. But maybe that doesn't matter so much.

Thoughts on this approach?

aeubanks retitled this revision from Add IR constructs for inalloca replacement llvm.call.setup to Add IR constructs for inalloca replacement preallocated call setup.Apr 22 2020, 3:53 PM

Your proposed solution to finding the correct type of an allocation with a corresponding call seems fine.

Not sure about rewrite llvm.call.preallocated.arg to an alloca. Not sure you'd actually want to explicitly rewrite it; probably simpler to come up with some fake stack layout and go through the normal lowering.

On a related note, I just realized there isn't any discussion of alignment in the documentation. Have you thought about how the alignment of preallocated arguments is specified?

rnk added a comment.Apr 22 2020, 4:56 PM

Thanks for reviewing Eli! I had some wording suggestions and syntax nits worth reuploading for.

llvm/docs/LangRef.rst
1971–1977 ↗(On Diff #259411)

I think this wording could be improved. I made a draft, didn't like it, and threw it away. :(

I think "actual argument" is vague. I tried to say something like, "The type used on this attribute has to match the type used by the attribute of the corresponding argument at the final call site."

I think it might be better to take the rest of this documentation and move it to the description of the intrinsic, and then refer to it here, after documenting the requirement on the attribute type.

I'd replace the fragment "which would result in the stack being adjusted..." with something about the fact that, without a call site, we can't do any stack adjustments at all, since we don't know the argument types, and therefore don't know how much to adjust the stack by.

From the standpoint of an optimizer, llvm.call.preallocated.arg is a special alloca: it allocates stack memory with some type. This attribute carries that type. If there is no corresponding call site, then it is legal and desirable to transform these intrinsics to plain static allocas using the type on this attribute.

2200 ↗(On Diff #259411)

I guess I wrote the words on "on Win32", but perhaps we should make it more vague and say "... compatible with MSVC on some targets." I think this call setup stuff is also necessary on arm32/thumb.

2204 ↗(On Diff #259411)

I think .rst files need double backticks, or it's a link, not monospace text.

11959 ↗(On Diff #259411)

You have to add call token between = @ here to make it legal-ish IR

2203 ↗(On Diff #258868)

This should take the token as the first argument. I guess now it should also use the new call site attribute as well.

11951 ↗(On Diff #258868)

You should close the bundle name string in the example, "preallocated"(token...

11973 ↗(On Diff #258868)

We discussed the need to add the argument size here

rnk added a comment.Apr 22 2020, 5:11 PM

Your proposed solution to finding the correct type of an allocation with a corresponding call seems fine.

Not sure about rewrite llvm.call.preallocated.arg to an alloca. Not sure you'd actually want to explicitly rewrite it; probably simpler to come up with some fake stack layout and go through the normal lowering.

Based on what I know of the codegen design, that is feasible, but it will be discovered post-isel, which is pretty late. During instruction finalizations, the DenseMap entry for the call site will be missing. We can make a stack object of appropriate size and alignment at that point, but it's much less convenient than assuming we've taken care of it by CGP.

We already have to write an IR transform utility to rewrite these intrinsics to allocas, at the very least for the inliner. Functionattrs (or the new Attributor) should remove this preallocated attributes when possible, so that pass will want this utility. And, if a call site gets eliminated because it is unreachable, we'll want to call this utility from standard cleanup passes like instcombine. Once we have the utility, we might as well just call it from CGP and rely on it later, instead of implementing it a second time.

On a related note, I just realized there isn't any discussion of alignment in the documentation. Have you thought about how the alignment of preallocated arguments is specified?

The alignment can be surprising. For example, MSVC does not align double arguments or argument structs containing double, despite alignof(double) == 8. MSVC will align vectors and structs that carry __declspec(align) or alignas attributes, but in those cases, it passes the argument by address using a regular alloca.

Is it possible to put an align attribute on a return value? If so, we could put it on the return value at the call site. However, for our purposes, it would always be 4. That's never enough to do any interesting transforms, so it's almost as good as leaving it unspecified.

aeubanks updated this revision to Diff 259488.Apr 22 2020, 11:33 PM
aeubanks marked 6 inline comments as done.

LangRef changes from code review

aeubanks added inline comments.Apr 22 2020, 11:35 PM
llvm/docs/LangRef.rst
11973 ↗(On Diff #258868)

I thought we didn't need it since now we have the type as part of the preallocated call site attribute.

aeubanks updated this revision to Diff 259596.Apr 23 2020, 9:03 AM

Overlooked a comment about wording the reasoning behind adding the call site attribute

rnk accepted this revision.Apr 23 2020, 10:48 AM

lgtm, I think this is ready to land.

llvm/docs/LangRef.rst
11973 ↗(On Diff #258868)

Sorry, that was a stale comment.

We already have to write an IR transform utility to rewrite these intrinsics to allocas, at the very least for the inliner. Functionattrs (or the new Attributor) should remove this preallocated attributes when possible, so that pass will want this utility. And, if a call site gets eliminated because it is unreachable, we'll want to call this utility from standard cleanup passes like instcombine. Once we have the utility, we might as well just call it from CGP and rely on it later, instead of implementing it a second time.

I edited my comment a couple of times and I think I dropped one important bit.

I'm not sure it's safe to rewrite an call_preallocated_setup to an alloca in the entry block, in general. There is no rule which prevents multiple calls to the same call_preallocated_setup in a loop, without deallocating the memory in between. Because of that, reusing the memory could lead to a miscompile. There wouldn't be any way to deallocate the result, but that isn't necessarily a problem. You could prove this doesn't happen in common cases, and clang wouldn't generate code like that. But LLVM optimizations could introduce it, I think, if there isn't a rule specifically preventing it.

Is it possible to put an align attribute on a return value? If so, we could put it on the return value at the call site. However, for our purposes, it would always be 4. That's never enough to do any interesting transforms, so it's almost as good as leaving it unspecified.

Saying that the alignment of preallocated arguments is always 4 without any way to tweak it seems a little obnoxious, if we ever want to use the intrinsics for any other purpose. I guess it works, though.

Is it possible to put an align attribute on a return value? If so, we could put it on the return value at the call site. However, for our purposes, it would always be 4. That's never enough to do any interesting transforms, so it's almost as good as leaving it unspecified.

Yes, you can put align on a return value. Probably more important would be putting align on the call arguments itself.

aeubanks updated this revision to Diff 259681.Apr 23 2020, 12:40 PM

Add blurb about alignment

I copied the blurb from byval about align to preallocated, since they should share the same codepath.

rnk added a comment.Apr 23 2020, 3:12 PM

I'm not sure it's safe to rewrite an call_preallocated_setup to an alloca in the entry block, in general. There is no rule which prevents multiple calls to the same call_preallocated_setup in a loop, without deallocating the memory in between. Because of that, reusing the memory could lead to a miscompile. There wouldn't be any way to deallocate the result, but that isn't necessarily a problem. You could prove this doesn't happen in common cases, and clang wouldn't generate code like that. But LLVM optimizations could introduce it, I think, if there isn't a rule specifically preventing it.

I think if we model calls with preallocated bundles as modifying inaccessible state (i.e. the stack pointer), LLVM transforms won't do this.

I'm imagining some kind of loop result code sinking transform that wants to sink a readonly function call out of a loop, because the result is only used after the last iteration.

I think if we model calls with preallocated bundles as modifying inaccessible state (i.e. the stack pointer), LLVM transforms won't do this.

I don't think that's enough in general. Well, maybe it's enough for the transforms we actually implement... LLVM currently has very few transforms that fold code together.

But a simple example, suppose we had a transform that took void g(); void f() { g();g();g();g();g(); } and transformed it into something like void g(); void f() { for (int i =0; i < n; ++i) g(); }. Looks fine generally. If the loop body is a sequence containing calls to llvm.call.preallocated.setup()/llvm.call.preallocated.arg(), maybe not so fine. (I think this is only an issue if there is no call actually consuming the stack; if there were, the token would block the transform.)

aeubanks added a comment.EditedApr 27 2020, 12:03 PM

I think if we model calls with preallocated bundles as modifying inaccessible state (i.e. the stack pointer), LLVM transforms won't do this.

I don't think that's enough in general. Well, maybe it's enough for the transforms we actually implement... LLVM currently has very few transforms that fold code together.

But a simple example, suppose we had a transform that took void g(); void f() { g();g();g();g();g(); } and transformed it into something like void g(); void f() { for (int i =0; i < n; ++i) g(); }. Looks fine generally. If the loop body is a sequence containing calls to llvm.call.preallocated.setup()/llvm.call.preallocated.arg(), maybe not so fine. (I think this is only an issue if there is no call actually consuming the stack; if there were, the token would block the transform.)

What about if we directly replace the llvm.call.preallocated.arg() with an alloca (as opposed to an alloca in the entry block)?

What about if we directly replace the llvm.call.preallocated.arg() with an alloca (as opposed to an alloca in the entry block)?

There can be more than one llvm.call.preallocated.arg referring to the same argument, so probably you'd want to insert the alloca at the point of the llvm.call.preallocated.setup. But yes, that should work.

aeubanks updated this revision to Diff 260450.Apr 27 2020, 1:45 PM

Update LangRef about replacing setup/arg with alloca/gep
Rebase
Rename call_setup*.ll -> preallocated*.ll
Preallocated EnumAttr -> TypeAttr

What about if we directly replace the llvm.call.preallocated.arg() with an alloca (as opposed to an alloca in the entry block)?

There can be more than one llvm.call.preallocated.arg referring to the same argument, so probably you'd want to insert the alloca at the point of the llvm.call.preallocated.setup. But yes, that should work.

We could go down the route of doing something similar to inalloca where the llvm.call.preallocated.setup becomes an alloca of a structure containing all the arguments, and the llvm.call.preallocated.arg becomes a gep of the alloca. I've changed the LangRef to say that. I'm not sure if I'm overspecifying in the LangRef though, maybe that should be an implementation detail?

I've changed the LangRef to say that. I'm not sure if I'm overspecifying in the LangRef though, maybe that should be an implementation detail?

LangRef can't require that lowering be implemented some specific way; that would break the as-if rule. The important bit is that it can be implemented that way, and that might be the simplest way to explain the point.

But maybe we don't actually need to include that sentence in LangRef at all; the fact that it's legal to lower that way should be implied from the other rules. And if it isn't implied, it's probably better to clarify the other rules.

aeubanks added a comment.EditedApr 27 2020, 3:58 PM

I've changed the LangRef to say that. I'm not sure if I'm overspecifying in the LangRef though, maybe that should be an implementation detail?

LangRef can't require that lowering be implemented some specific way; that would break the as-if rule. The important bit is that it can be implemented that way, and that might be the simplest way to explain the point.

But maybe we don't actually need to include that sentence in LangRef at all; the fact that it's legal to lower that way should be implied from the other rules. And if it isn't implied, it's probably better to clarify the other rules.

I think it is derivable from the other rules. I've removed those lines (and we can experiment later with where to put the alloca).

Thanks for the reviews!

aeubanks updated this revision to Diff 260490.Apr 27 2020, 3:59 PM

Remove implementation detail lines

rnk accepted this revision.Apr 27 2020, 4:05 PM

I think in the end, the description of how this is lowered was left out, so this looks good and we can commit it as is.

However, I'd like to ensure that it is valid to replace preallocated call sites with static allocas (allocas in the entry block).

I think if we model calls with preallocated bundles as modifying inaccessible state (i.e. the stack pointer), LLVM transforms won't do this.

I don't think that's enough in general. Well, maybe it's enough for the transforms we actually implement... LLVM currently has very few transforms that fold code together.

But a simple example, suppose we had a transform that took void g(); void f() { g();g();g();g();g(); } and transformed it into something like void g(); void f() { for (int i =0; i < n; ++i) g(); }. Looks fine generally. If the loop body is a sequence containing calls to llvm.call.preallocated.setup()/llvm.call.preallocated.arg(), maybe not so fine. (I think this is only an issue if there is no call actually consuming the stack; if there were, the token would block the transform.)

I don't think I understand this example. This looks like loop re-rolling to me, though, to give the transform a name.

If g() uses a preallocated call site, the loop re-roller would have to execute the exact same sequence of preallocated call intrinsics as it would in the unrolled form. That seems like it's all fine.

What about this transform rearranges the call site intrinsics to make them overlap, so that we would need two argument memory locations alive at the same time?

This revision was automatically updated to reflect the committed changes.

This looks like loop re-rolling to me, though, to give the transform a name.

Yes, that's the idea. (LLVM has a reroll pass, but it only handles rerolling loops.)

I guess I'll try to construct an actual C++ example. Suppose you have something like this:

extern "C" int printf(const char*, ...);
extern "C" void abort(void) __attribute((noreturn));
struct A;
A* ptrs[10];
struct A {
    int z;
    __attribute((noinline))
    A(int x) { z = 1; ptrs[z] = this; }
    A(const A&);
};
    __attribute((noinline))
void sum() { int sum = 0; for(A* p : ptrs) sum += p->z; printf("%d\n", sum); }
int g(int, A);
void f() { g(g(g(g(g((sum(), abort(), 6), A(0)), A(1)), A(2)), A(3)), A(4)); }

On 32-bit Windows, the IR looks something like this:

%1 = alloca inalloca <{ i32, %struct.A }>, align 4
%2 = getelementptr inbounds <{ i32, %struct.A }>, <{ i32, %struct.A }>* %1, i32 0, i32 1
%3 = call x86_thiscallcc %struct.A* @"??0A@@QAE@H@Z"(%struct.A* nonnull %2, i32 4)
%4 = alloca inalloca <{ i32, %struct.A }>, align 4
%5 = getelementptr inbounds <{ i32, %struct.A }>, <{ i32, %struct.A }>* %4, i32 0, i32 1
%6 = call x86_thiscallcc %struct.A* @"??0A@@QAE@H@Z"(%struct.A* nonnull %5, i32 3)
%7 = alloca inalloca <{ i32, %struct.A }>, align 4
%8 = getelementptr inbounds <{ i32, %struct.A }>, <{ i32, %struct.A }>* %7, i32 0, i32 1
%9 = call x86_thiscallcc %struct.A* @"??0A@@QAE@H@Z"(%struct.A* nonnull %8, i32 2)
%10 = alloca inalloca <{ i32, %struct.A }>, align 4
%11 = getelementptr inbounds <{ i32, %struct.A }>, <{ i32, %struct.A }>* %10, i32 0, i32 1
%12 = call x86_thiscallcc %struct.A* @"??0A@@QAE@H@Z"(%struct.A* nonnull %11, i32 1)
%13 = alloca inalloca <{ i32, %struct.A }>, align 4
%14 = getelementptr inbounds <{ i32, %struct.A }>, <{ i32, %struct.A }>* %13, i32 0, i32 1
%15 = call x86_thiscallcc %struct.A* @"??0A@@QAE@H@Z"(%struct.A* nonnull %14, i32 0)
call void @"?sum@@YAXXZ"()
call void @abort() #5
unreachable

Notice the repeated alloca/gep/call pattern. In the new world, the "alloca" will be replaced with preallocated.setup/preallocated.arg. Now say your reroller decides to reroll that. What's the result of "sum()"?

rnk added a comment.Apr 29 2020, 1:28 PM

I see, the noreturn part of this is key here. I guess we solved this problem for Windows exception handling by threading the token through all the call instructions, so we can still color EH funclet regions in cases like this:

void maythrow();
void f() {
  try {
    maythrow();
  } catch(...) {
    throw;
  }
}

Funclet bundles have been an ongoing maintenance burden that I would like to ease at some point, so I don't want to use that solution here.

So far we've identified two kinds of transforms that should be semantics preserving according to the rules in LangRef, but present problems with our planned lowering. Neither transform exists in LLVM today:

  1. Tail merging blocks ending in unreachable where some blocks are "inside" a preallocated call region and some are outside.
  2. Rolling up loops of intrinsics that establish half-open preallocated call regions. The lack of the "region close" marker in the IR is what makes this transform possible.

Similar to the case of SEH try, I think we can really only solve this with a first class scope/region concept. In the meantime, I think we should move ahead with the implementation, and try to come up with a scope/region design that works for EH, call setup, SEH __try, and perhaps stack variable lifetime. I have a draft doc with a problem statement, but I don't have a good solution yet.

Similar to the case of SEH try, I think we can really only solve this with a first class scope/region concept. In the meantime, I think we should move ahead with the implementation, and try to come up with a scope/region design that works for EH, call setup, SEH __try, and perhaps stack variable lifetime. I have a draft doc with a problem statement, but I don't have a good solution yet.

Okay, that makes sense. (I still don't really see why we'd want this for variable lifetimes. It isn't fundamentally problematic if variable lifetimes don't properly nest.)