This is an archive of the discontinued LLVM Phabricator instance.

Allow DataLayout to specify addrspace for allocas.
ClosedPublic

Authored by arsenm on Mar 16 2017, 10:54 AM.

Details

Reviewers
hfinkel
efriedma
Summary

LLVM makes several assumptions about address space 0. However,
alloca is presently constrained to always return this address space.
There's no real way to avoid using alloca, so without this
there is no way to opt out of these assumptions.

The problematic assumptions include:

  • That the pointer size used for the stack is the same size as the code size pointer, which is also the maximum sized pointer.
  • That 0 is an invalid, non-dereferencable pointer value.

These are problems for AMDGPU because alloca is used to
implement the private address space, which uses a 32-bit
index as the pointer value. Other pointers are 64-bit
and behave more like LLVM's notion of generic address
space. By changing the address space used for allocas,
we can change our generic pointer type to be LLVM's generic
pointer type which does have similar properties.

Diff Detail

Event Timeline

arsenm created this revision.Mar 16 2017, 10:54 AM
arsenm updated this revision to Diff 92080.Mar 16 2017, 3:52 PM

Print as , addrspace(num), but require it match the datalayout

efriedma added inline comments.Mar 16 2017, 4:10 PM
include/llvm/IR/IRBuilder.h
1106

I'm not sure this overload is a good idea; it saves a little typing, but it hides the reason we need the datalayout.

arsenm added inline comments.Mar 16 2017, 4:50 PM
include/llvm/IR/IRBuilder.h
1106

I'm not sure that's a problem here. This is the most common version, and the point of IRBuilder is to be the convenient way of creating IR. A DataLayout argument is less error prone than an unsigned argument, where it easier to accidentally pass the alignment or something else.

It would also be able to hide this more by getting the parent module's DataLayout from the insertion block (which is what the CHERI patches did). However there are already functions using DataLayout input arguments, so I figured there might be some reason to pass it in rather than doing that.

efriedma added inline comments.Mar 16 2017, 5:07 PM
include/llvm/IR/IRBuilder.h
1106

I guess. The part that concerns me is that the type of the result might not be what someone expects if they aren't paying attention, and the resulting assertion failure is a bit unfriendly... but I guess we can go with this for now, and see if it ends up being confusing in practice.

Getting the DataLayout from the insertion block doesn't work for all methods on IRBuilder; an IRBuilder isn't guaranteed to always have an insertion point. This can be used, for example, if you're building a constant. I guess for alloca in particular, it would be safe to assume we have an insertion point.

efriedma edited edge metadata.Mar 23 2017, 11:12 AM

LGTM except for a minor comment about the text in LangRef.

docs/LangRef.rst
1816

"the address space of the result for the stack" is a little unclear.

arsenm added inline comments.Mar 23 2017, 11:22 AM
docs/LangRef.rst
1816

I wasn't sure about referring to the stack at all. Should I rename everything added that says stack with alloca (e.g DL.getStackAddrSpace() -> DL.getAllocaAddrSpace()?)

efriedma added inline comments.Mar 23 2017, 11:47 AM
docs/LangRef.rst
1816

Well, in many contexts, it isn't precisely clear what the stack refers to... and there's a bunch of other stuff conventionally stored on the stack which could be stored elsewhere. Maybe better to just call it the address-space of allocas, I guess?

Even assuming the meaning of "the stack" is clear, the sentence is missing something: it doesn't mention allocation anywhere.

arsenm updated this revision to Diff 92983.Mar 24 2017, 11:23 AM

Change LangRef description, don't call stack

LGTM, but I'd to see a second approval for an IR change like this.

What else in LLVM actually hard-codes 0 as the assumed address space? Intrinsic function declarations? Optimizations that create new globals (if there are any?) Wouldn't it be much less invasive to change those and leave the assumption that the stack is in addrspace 0?

What else in LLVM actually hard-codes 0 as the assumed address space? Intrinsic function declarations? Optimizations that create new globals (if there are any?) Wouldn't it be much less invasive to change those and leave the assumption that the stack is in addrspace 0?

I'm not sure what you are suggesting. The concept of assumed address space doesn't make any sense on AMDGPU. Everything has an address space. This isn't changing how address space 0 is treated, and is specifically avoiding doing so. AMDGPU does want the assumed address space 0 properties, just for some of the other address spaces that are not the stack address space, which is one of the advantages of making this change. The problem isn't really places hard coding 0 as the address space, it's the properties of the stack address space do not behave like LLVM's notion of the generic address space as I mentioned in the RFC.

As far as 0 being a valid stack pointer, we currently have a workaround of not allocating user objects there. The bigger issues in my mind is that the pointer size for the stack does not match the code pointer size, which is a problematic area which keeps popping up. Another is it is not valid to addrspacecast any other address space to the stack address space. One partial alternative would be to introduce a code pointer size to the datalayout or code address space or something along those lines. I think that would still leave a conceptual gap between AMDGPU's stack and alloca, and be more of a hassle. (For example you still cannot addrspacecast any pointer to the address space of the stack).

There are some intrinsics which hard code 0, some of which need to be fixed (such as lifetime intrinsics which D31041 and D31043 take care of) but other than that one that's mostly a different question. There is a jump table optimization which creates new globals (which is disabled if BR_JT is illegal). Moving AMDGPU's flat address space to match LLVM's generic address space would also have the advantage of allowing legal addrspacecasts to address space 0 for some of these intrinsics.

What else in LLVM actually hard-codes 0 as the assumed address space? Intrinsic function declarations? Optimizations that create new globals (if there are any?) Wouldn't it be much less invasive to change those and leave the assumption that the stack is in addrspace 0?

I'm not sure what you are suggesting. The concept of assumed address space doesn't make any sense on AMDGPU. Everything has an address space. This isn't changing how address space 0 is treated, and is specifically avoiding doing so. AMDGPU does want the assumed address space 0 properties, just for some of the other address spaces that are not the stack address space, which is one of the advantages of making this change. The problem isn't really places hard coding 0 as the address space, it's the properties of the stack address space do not behave like LLVM's notion of the generic address space as I mentioned in the RFC.

As far as 0 being a valid stack pointer, we currently have a workaround of not allocating user objects there. The bigger issues in my mind is that the pointer size for the stack does not match the code pointer size, which is a problematic area which keeps popping up. Another is it is not valid to addrspacecast any other address space to the stack address space. One partial alternative would be to introduce a code pointer size to the datalayout or code address space or something along those lines. I think that would still leave a conceptual gap between AMDGPU's stack and alloca, and be more of a hassle. (For example you still cannot addrspacecast any pointer to the address space of the stack).

There are some intrinsics which hard code 0, some of which need to be fixed (such as lifetime intrinsics which D31041 and D31043 take care of) but other than that one that's mostly a different question. There is a jump table optimization which creates new globals (which is disabled if BR_JT is illegal). Moving AMDGPU's flat address space to match LLVM's generic address space would also have the advantage of allowing legal addrspacecasts to address space 0 for some of these intrinsics.

You keep talking about LLVM having a "generic address space". I'm trying to figure out what you mean by that, because in LLVM, global values can be declared in an arbitrary address space, pointer values all carry an address space, instructions work on arbitrary address spaces, the memory intrinsics are generally overloaded by pointer address spaces, etc. And in general, the idea that there exists a generic address space which subsumes every other address space is just wrong. As an example, on x86-64 we use address spaces as a way of requesting segmented addressing; even if the memory there was addressable in the standard segment (which is not guaranteed), AFAIK there's no way to actually perform that conversion.

Given that, there's nothing wrong with LLVM deciding to hard-code a specific address space index as the address space of the stack. That's worth doing if it significantly simplifies the IR, e.g. by not requiring a data layout to be passed to CreateAlloca. It just means that frontends that *do* assume there's a generic address space may need to insert addrspace conversions in some cases.

it's okay to call out some specific address space, like addrspace 0, and say that it's always the address space of the stack.

I certainly hope that nothing in LLVM is ever implicitly inserting addrspace conversions or assuming anything about them without target-specific information. In general, it is not true that there exists an address space that subsumes all others.

What else in LLVM actually hard-codes 0 as the assumed address space? Intrinsic function declarations? Optimizations that create new globals (if there are any?) Wouldn't it be much less invasive to change those and leave the assumption that the stack is in addrspace 0?

I'm not sure what you are suggesting. The concept of assumed address space doesn't make any sense on AMDGPU. Everything has an address space. This isn't changing how address space 0 is treated, and is specifically avoiding doing so. AMDGPU does want the assumed address space 0 properties, just for some of the other address spaces that are not the stack address space, which is one of the advantages of making this change. The problem isn't really places hard coding 0 as the address space, it's the properties of the stack address space do not behave like LLVM's notion of the generic address space as I mentioned in the RFC.

As far as 0 being a valid stack pointer, we currently have a workaround of not allocating user objects there. The bigger issues in my mind is that the pointer size for the stack does not match the code pointer size, which is a problematic area which keeps popping up. Another is it is not valid to addrspacecast any other address space to the stack address space. One partial alternative would be to introduce a code pointer size to the datalayout or code address space or something along those lines. I think that would still leave a conceptual gap between AMDGPU's stack and alloca, and be more of a hassle. (For example you still cannot addrspacecast any pointer to the address space of the stack).

There are some intrinsics which hard code 0, some of which need to be fixed (such as lifetime intrinsics which D31041 and D31043 take care of) but other than that one that's mostly a different question. There is a jump table optimization which creates new globals (which is disabled if BR_JT is illegal). Moving AMDGPU's flat address space to match LLVM's generic address space would also have the advantage of allowing legal addrspacecasts to address space 0 for some of these intrinsics.

You keep talking about LLVM having a "generic address space". I'm trying to figure out what you mean by that, because in LLVM, global values can be declared in an arbitrary address space, pointer values all carry an address space, instructions work on arbitrary address spaces, the memory intrinsics are generally overloaded by pointer address spaces, etc. And in general, the idea that there exists a generic address space which subsumes every other address space is just wrong. As an example, on x86-64 we use address spaces as a way of requesting segmented addressing; even if the memory there was addressable in the standard segment (which is not guaranteed), AFAIK there's no way to actually perform that conversion.

Given that, there's nothing wrong with LLVM deciding to hard-code a specific address space index as the address space of the stack. That's worth doing if it significantly simplifies the IR, e.g. by not requiring a data layout to be passed to CreateAlloca. It just means that frontends that *do* assume there's a generic address space may need to insert addrspace conversions in some cases.

it's okay to call out some specific address space, like addrspace 0, and say that it's always the address space of the stack.

I certainly hope that nothing in LLVM is ever implicitly inserting addrspace conversions or assuming anything about them without target-specific information. In general, it is not true that there exists an address space that subsumes all others.

I'd meant to edit out these last two paragraphs; alas, I'm making these comments through Phabricator, which uses a very small text edit box, and I just failed to see that there was more text below. That's why they seem to be redundant with the earlier comments, but at least they're not contradictory.

You keep talking about LLVM having a "generic address space". I'm trying to figure out what you mean by that, because in LLVM, global values can be declared in an arbitrary address space, pointer values all carry an address space, instructions work on arbitrary address spaces, the memory intrinsics are generally overloaded by pointer address spaces, etc. And in general, the idea that there exists a generic address space which subsumes every other address space is just wrong. As an example, on x86-64 we use address spaces as a way of requesting segmented addressing; even if the memory there was addressable in the standard segment (which is not guaranteed), AFAIK there's no way to actually perform that conversion.

Given that, there's nothing wrong with LLVM deciding to hard-code a specific address space index as the address space of the stack. That's worth doing if it significantly simplifies the IR, e.g. by not requiring a data layout to be passed to CreateAlloca. It just means that frontends that *do* assume there's a generic address space may need to insert addrspace conversions in some cases.

it's okay to call out some specific address space, like addrspace 0, and say that it's always the address space of the stack.

I certainly hope that nothing in LLVM is ever implicitly inserting addrspace conversions or assuming anything about them without target-specific information. In general, it is not true that there exists an address space that subsumes all others.

LLVM treats 0 as the "generic" address space. This is how the LangRef refers to it. In the AMDGPU sense this is a pointer that can be used to access any memory. For AMDGPU, globals declared without an address space are invalid. All objects must have a definite address space.

There have been bugs in the past from random places inserting addrspacecasts but I don't think I've seen one recently

LLVM treats 0 as the "generic" address space. This is how the LangRef refers to it.

You're revising LangRef here one way or the other. I'm suggesting that a better revision would be to stop saying that address space 0 is the "generic" address space (because there is no such thing), and instead say that it is the address space of the stack. Is there anything actually preventing you from doing that?

LLVM treats 0 as the "generic" address space. This is how the LangRef refers to it.

You're revising LangRef here one way or the other. I'm suggesting that a better revision would be to stop saying that address space 0 is the "generic" address space (because there is no such thing), and instead say that it is the address space of the stack. Is there anything actually preventing you from doing that?

Yes. The pointer size of the stack smaller than the pointer size for code or most other memory. 0 is a valid, dereferencable pointer for the stack address space

LLVM treats 0 as the "generic" address space. This is how the LangRef refers to it.

You're revising LangRef here one way or the other. I'm suggesting that a better revision would be to stop saying that address space 0 is the "generic" address space (because there is no such thing), and instead say that it is the address space of the stack. Is there anything actually preventing you from doing that?

Yes. The pointer size of the stack smaller than the pointer size for code or most other memory. 0 is a valid, dereferencable pointer for the stack address space

I don't see why the pointer size has anything to do with it. In the absence of target-specific information, LLVM is not allowed to assume that one address space is subsumed by another address space at all. It doesn't matter if one of them is address space 0.

The special treatment of zero pointers is an interesting point, but again, I feel like that transformation could be made target-dependent *much* more easily than this. This is an incredibly invasive change.

LLVM treats 0 as the "generic" address space. This is how the LangRef refers to it.

You're revising LangRef here one way or the other. I'm suggesting that a better revision would be to stop saying that address space 0 is the "generic" address space (because there is no such thing), and instead say that it is the address space of the stack. Is there anything actually preventing you from doing that?

Yes. The pointer size of the stack smaller than the pointer size for code or most other memory. 0 is a valid, dereferencable pointer for the stack address space

I don't see why the pointer size has anything to do with it. In the absence of target-specific information, LLVM is not allowed to assume that one address space is subsumed by another address space at all. It doesn't matter if one of them is address space 0.

The special treatment of zero pointers is an interesting point, but again, I feel like that transformation could be made target-dependent *much* more easily than this. This is an incredibly invasive change.

I think being able to opt out of any assumptions is CHERI's use case for this patch, but I don't work on that so I'm not sure.

The relationships between different address spaces is only indirectly related to adding an address space to alloca, and is more a useful side effect. I think LLVM's assumptions about pointers and address space 0 (and worse so, clang) are more widespread than you think. There is no concept of a code pointer size. This is one area where address space 0 is treated as a lack of an address space. The code size and stack pointer do not have the same size for AMDGPU, and it is important to be able to distinguish these. There is a lot of code that assumes the default address space size is "the one true pointer size". You can't pass around a 64-bit function pointer properly if it's required to have the same properties as a 32-bit stack pointer. It isn't really OK to start treating function pointer types differently than other types. Fixing that is an uphill battle in the frontend and backend. It is less work to align the AMDGPU generic address space notion with LLVM's "default" address space and the assumptions made about it than to fix everything else. Adding address spaces to functions would be a much more invasive change. I don't think adding this DataLayout argument is very invasive, and was less work than I anticipated (The IRBuilder changes could also be avoided by getting the DataLayout from the insertion block's parent module). There is no big semantic change, just simple bookkeeping updates.

The pressing reason for this is C++ GPU support. Clang's handling of address spaces is more problematic. The assumptions about address space 0 are more widespread and tied to LLVM's address space 0. The work to start annotating every pointer there as required would be a much more invasive change.

The relationships between different address spaces is only indirectly related to adding an address space to alloca, and is more a useful side effect. I think LLVM's assumptions about pointers and address space 0 (and worse so, clang) are more widespread than you think. There is no concept of a code pointer size.

Well, address space is carried by PointerType, so it isn't true that LLVM IR has no concept of code pointer sizes. But I hadn't realized that Function doesn't have a way to actually set an address space on creation; I can see how that would make that change more involved. On the other hand, it seems like something that could easily just be added as a default argument to a couple of APIs; I don't see why *that's* seen as an invasive change but adding a new mandatory argument to the interfaces for creating one of the most common instruction classes is not.

This is one area where address space 0 is treated as a lack of an address space. The code size and stack pointer do not have the same size for AMDGPU, and it is important to be able to distinguish these.

Yes, I did understand that the first time it was explained.

There is a lot of code that assumes the default address space size is "the one true pointer size".

Are you suggesting that LLVM's address space 0 must always be at least as large as every other address space? Where is that assumption made? Again, that seems like something that should be fixed independent of what we decide here.

You can't pass around a 64-bit function pointer properly if it's required to have the same properties as a 32-bit stack pointer. It isn't really OK to start treating function pointer types differently than other types.

Of course. I am not suggesting that we should do that.

I don't think adding this DataLayout argument is very invasive, and was less work than I anticipated

That's because you aren't changing any of the frontends, most of which are not in-tree. I understand that LLVM does not promise to keep the C++ API stable, but there's surely *some* responsibility to not break every frontend for features they don't care about. At the very least, yes, you should be picking up this value from the Module instead of requiring it to be passed in.

The pressing reason for this is C++ GPU support. Clang's handling of address spaces is more problematic. The assumptions about address space 0 are more widespread and tied to LLVM's address space 0.

Clang's assumptions about address space 0 are pervasive because it represents an actual concept in the source language: C has a generic address space. My point is that LLVM generally should not.

I don't think the assumptions in Clang that the Clang AST address space 0 equals LLVM's address space 0 are nearly as pervasive as you think. At any rate, since you require stack allocations to be in a different LLVM address space from the generic address space, but (per C) taking the address of a local variable yields a pointer in the generic address space, you're going to have to teach Clang about these differences between address spaces anyway, because it'll have to insert address-space conversions in a number of places.

There is a lot of code that assumes the default address space size is "the one true pointer size".

Are you suggesting that LLVM's address space 0 must always be at least as large as every other address space? Where is that assumption made? Again, that seems like something that should be fixed independent of what we decide here.

Not necessarily, it's just an occasional sticking point in some remaining that code isn't aware of multiple pointer sizes. One area was in debug info. Patches to fix that were rejected a few times in the past. If you grep for argumentless getPointerSizes in CodeGen+MC, most of them seem to be for debug info. I have another patch I've been holding on to which works around the DAG picking the wrong value type for pointers to function types. Pretty much everything in the DAG for dealing specifically with pointer types is unworkable with different sized pointers, but most of that is possible to avoid using.

You can't pass around a 64-bit function pointer properly if it's required to have the same properties as a 32-bit stack pointer. It isn't really OK to start treating function pointer types differently than other types.

Of course. I am not suggesting that we should do that.

I don't think adding this DataLayout argument is very invasive, and was less work than I anticipated

That's because you aren't changing any of the frontends, most of which are not in-tree. I understand that LLVM does not promise to keep the C++ API stable, but there's surely *some* responsibility to not break every frontend for features they don't care about. At the very least, yes, you should be picking up this value from the Module instead of requiring it to be passed in.

So your objection is just to the API change? I can change that if you insist. The actual mechanical adding of the argument isn't the source of the work. As someone who has maintained out of tree uses, this class of change is the least irritating. Most of the clang IR gen work is in updating the downstream users which use hardcoded pointer types, which users not using this don't need to care about it.

The pressing reason for this is C++ GPU support. Clang's handling of address spaces is more problematic. The assumptions about address space 0 are more widespread and tied to LLVM's address space 0.

Clang's assumptions about address space 0 are pervasive because it represents an actual concept in the source language: C has a generic address space. My point is that LLVM generally should not.

I don't think the assumptions in Clang that the Clang AST address space 0 equals LLVM's address space 0 are nearly as pervasive as you think. At any rate, since you require stack allocations to be in a different LLVM address space from the generic address space, but (per C) taking the address of a local variable yields a pointer in the generic address space, you're going to have to teach Clang about these differences between address spaces anyway, because it'll have to insert address-space conversions in a number of places.

Yes, I expect there will be many addrspacecasts inserted from the alloca, which then the InferAddressSpaces pass would hopefully optimize away.

I don't think the assumptions in Clang that the Clang AST address space 0 equals LLVM's address space 0 are nearly as pervasive as you think. At any rate, since you require stack allocations to be in a different LLVM address space from the generic address space, but (per C) taking the address of a local variable yields a pointer in the generic address space, you're going to have to teach Clang about these differences between address spaces anyway, because it'll have to insert address-space conversions in a number of places.

Yes, I expect there will be many addrspacecasts inserted from the alloca, which then the InferAddressSpaces pass would hopefully optimize away.

Okay. So, to summarize:

In amdgpu OpenCL, at source level, sizeof(void private *) == sizeof(void {global,local,constant} *) == 8. However, at an implementation level, stack addresses are actually 32-bit, and you want that to be modeled correctly in the IR. Therefore, allocas need to return a value in a 32-bit address space, which the frontend will immediately widen to a 64-bit address space in order to match the rules for __private. In order to maintain performance, you have a pass that recognizes when an address value is the result of that kind of widening and re-narrows it to the 32-bit stack address space.

Because LLVM currently makes some unfortunate assumptions about address space 0, this stack address space cannot be address space 0. These assumptions include:
(1) As far as debug info is concerned, the size of a pointer is the size of a pointer in address space 0. (Is this assumption why sizeof(void private *) == 8 despite private being clearly intended in the OpenCL spec to be the address space of stack pointers?)
(2) Functions are currently always defined in address space 0. A function pointer in your implementation must be in a 64-bit address space. (My understanding was that OpenCL generally forbade function pointers, so I'm not sure if this is actually important?)
(3) Some optimizations assume that 0 is an invalid address in address space 0, but in fact it's a valid stack address.

It's entirely possible that we should, indeed, change all of these assumptions. For the time being, you feel that's not a reasonable fight, and so you would like to allow IR to use an explicitly non-0 address space for the stack.

I think I'm prepared to accept your argument.

I don't think adding this DataLayout argument is very invasive, and was less work than I anticipated

That's because you aren't changing any of the frontends, most of which are not in-tree. I understand that LLVM does not promise to keep the C++ API stable, but there's surely *some* responsibility to not break every frontend for features they don't care about. At the very least, yes, you should be picking up this value from the Module instead of requiring it to be passed in.

So your objection is just to the API change? I can change that if you insist. The actual mechanical adding of the argument isn't the source of the work. As someone who has maintained out of tree uses, this class of change is the least irritating. Most of the clang IR gen work is in updating the downstream users which use hardcoded pointer types, which users not using this don't need to care about it.

Yes, I would like you to avoid changing the IRBuilder API if you can. I can see why you need to change the AllocaInst constructors, though.

Do you actually need this to be printed/parsed in .ll files? It's globally consistent across all alloca instructions, and the data layout is always parsed before anything else in the file. It's not the end of the world, just seems odd to imply that this is actually something that can be different on different instructions.

Do you actually need this to be printed/parsed in .ll files? It's globally consistent across all alloca instructions, and the data layout is always parsed before anything else in the file. It's not the end of the world, just seems odd to imply that this is actually something that can be different on different instructions.

No, strictly speaking it isn't necessary: every module has a datalayout, so we can compute the type from that. I requested it to make debugging more straightforward. Without it, there isn't any obvious indication if you dump() an instruction or look at an IR file that the address-space of the result is non-zero.

Do you actually need this to be printed/parsed in .ll files? It's globally consistent across all alloca instructions, and the data layout is always parsed before anything else in the file. It's not the end of the world, just seems odd to imply that this is actually something that can be different on different instructions.

No, strictly speaking it isn't necessary: every module has a datalayout, so we can compute the type from that. I requested it to make debugging more straightforward. Without it, there isn't any obvious indication if you dump() an instruction or look at an IR file that the address-space of the result is non-zero.

Well, since IR assembly is primarily a debugging / testing aid, and you won't get similar test output anyway because the address space will show up in every use of the instruction, I suppose it's fine.

I don't think the assumptions in Clang that the Clang AST address space 0 equals LLVM's address space 0 are nearly as pervasive as you think. At any rate, since you require stack allocations to be in a different LLVM address space from the generic address space, but (per C) taking the address of a local variable yields a pointer in the generic address space, you're going to have to teach Clang about these differences between address spaces anyway, because it'll have to insert address-space conversions in a number of places.

Yes, I expect there will be many addrspacecasts inserted from the alloca, which then the InferAddressSpaces pass would hopefully optimize away.

Okay. So, to summarize:

In amdgpu OpenCL, at source level, sizeof(void private *) == sizeof(void {global,local,constant} *) == 8. However, at an implementation level, stack addresses are actually 32-bit, and you want that to be modeled correctly in the IR. Therefore, allocas need to return a value in a 32-bit address space, which the frontend will immediately widen to a 64-bit address space in order to match the rules for __private. In order to maintain performance, you have a pass that recognizes when an address value is the result of that kind of widening and re-narrows it to the 32-bit stack address space.

This is mostly accurate except for the detail about the cast. For OpenCL 2.0 or C++ the alloca pointer will most likely need to be addrspacecasted to the OpenCL generic address space. Private pointers in memory may need to be zero extended to 64-bit in memory. For OpenCL 1.x there is no generic address space, and the old subtargets don't have the necessary hardware features to support this.This is one of reasons I don't want to use the same workaround NVPTX uses for this problem. There would be a pointer that if not optimized out would require creative codegen work to artificially support it on those targets.

Because LLVM currently makes some unfortunate assumptions about address space 0, this stack address space cannot be address space 0. These assumptions include:
(1) As far as debug info is concerned, the size of a pointer is the size of a pointer in address space 0. (Is this assumption why sizeof(void private *) == 8 despite private being clearly intended in the OpenCL spec to be the address space of stack pointers?)

This is a point that the spec is vague on. While the spec explicitly allows different size pointers for different address spaces, there isn't really any detailed description of what this entails. This wasn't addressed at all by the SPIR spec, which just defined a single 64-bit datalayout string. sizeof really indicates the in-memory representation's size, so by using a DataLayout with 32-bit pointers with an ABI alignment of 64-bit, we get the necessary codegen properties of a 32-bit pointer while maintaining the required ABI and single sizeof().

(2) Functions are currently always defined in address space 0. A function pointer in your implementation must be in a 64-bit address space. (My understanding was that OpenCL generally forbade function pointers, so I'm not sure if this is actually important?)

It is true that OpenCL doesn't allow function pointers at the source level. I'm working on implementing calls without inlining everything now, and to be able to correctly lower LLVM IR for them requires the DAG picking the right MVT when emitting the global address so it still comes up there. Function pointer support is also helpful for implementing other languages being worked on like C++ and python. I believe recent version of CUDA support them as well.

(3) Some optimizations assume that 0 is an invalid address in address space 0, but in fact it's a valid stack address.

It's entirely possible that we should, indeed, change all of these assumptions. For the time being, you feel that's not a reasonable fight, and so you would like to allow IR to use an explicitly non-0 address space for the stack.

Yes. Some day I would like to be able to apply some of these assumptions to other non-0 address spaces as well (i.e. 0 is null in some address spaces, but not others) but I think that is a larger task.

I think I'm prepared to accept your argument.

I don't think adding this DataLayout argument is very invasive, and was less work than I anticipated

That's because you aren't changing any of the frontends, most of which are not in-tree. I understand that LLVM does not promise to keep the C++ API stable, but there's surely *some* responsibility to not break every frontend for features they don't care about. At the very least, yes, you should be picking up this value from the Module instead of requiring it to be passed in.

So your objection is just to the API change? I can change that if you insist. The actual mechanical adding of the argument isn't the source of the work. As someone who has maintained out of tree uses, this class of change is the least irritating. Most of the clang IR gen work is in updating the downstream users which use hardcoded pointer types, which users not using this don't need to care about it.

Yes, I would like you to avoid changing the IRBuilder API if you can. I can see why you need to change the AllocaInst constructors, though.

Do you actually need this to be printed/parsed in .ll files? It's globally consistent across all alloca instructions, and the data layout is always parsed before anything else in the file. It's not the end of the world, just seems odd to imply that this is actually something that can be different on different instructions.

No, this was just a request in the RFC thread to make it more clear. I can maybe envision a use case in the future for having a different address space on individual allocas, but that is not a concern I have today.

arsenm updated this revision to Diff 93281.Mar 28 2017, 12:58 PM

Don't change IRBuilder API

I don't think the assumptions in Clang that the Clang AST address space 0 equals LLVM's address space 0 are nearly as pervasive as you think. At any rate, since you require stack allocations to be in a different LLVM address space from the generic address space, but (per C) taking the address of a local variable yields a pointer in the generic address space, you're going to have to teach Clang about these differences between address spaces anyway, because it'll have to insert address-space conversions in a number of places.

Yes, I expect there will be many addrspacecasts inserted from the alloca, which then the InferAddressSpaces pass would hopefully optimize away.

Okay. So, to summarize:

In amdgpu OpenCL, at source level, sizeof(void private *) == sizeof(void {global,local,constant} *) == 8. However, at an implementation level, stack addresses are actually 32-bit, and you want that to be modeled correctly in the IR. Therefore, allocas need to return a value in a 32-bit address space, which the frontend will immediately widen to a 64-bit address space in order to match the rules for __private. In order to maintain performance, you have a pass that recognizes when an address value is the result of that kind of widening and re-narrows it to the 32-bit stack address space.

This is mostly accurate except for the detail about the cast. For OpenCL 2.0 or C++ the alloca pointer will most likely need to be addrspacecasted to the OpenCL generic address space.

It looks to me like the language spec says that the address of a local variable is in the __private address space, but that a pointer in any address space can be promoted into the generic address space. Clang's AST should call out that second promotion as a separate ImplicitCastExpr. Of course, you may decide that you want to peephole it into a single addrspacecast instruction.

Private pointers in memory may need to be zero extended to 64-bit in memory. For OpenCL 1.x there is no generic address space, and the old subtargets don't have the necessary hardware features to support this.

By "this", you mean an efficiently-accessible generic address space? Presumably because the different memory-access operations work on specific address spaces and so, as you say, you would need creative codegen to figure out which case the pointer originally belonged to. That makes sense.

Because LLVM currently makes some unfortunate assumptions about address space 0, this stack address space cannot be address space 0. These assumptions include:
(1) As far as debug info is concerned, the size of a pointer is the size of a pointer in address space 0. (Is this assumption why sizeof(void private *) == 8 despite private being clearly intended in the OpenCL spec to be the address space of stack pointers?)

This is a point that the spec is vague on. While the spec explicitly allows different size pointers for different address spaces, there isn't really any detailed description of what this entails.

I don't see what the problem is. The spec doesn't allow pointers to be converted between address spaces at all (except for the promotion into the generic address space in OpenCL 2.0). Pointers into different address spaces are just completely different types.

I mean, in general, the OpenCL spec is rather hilariously poorly-drafted, and as a compiler programmer you have to read between the lines to figure out how things are supposed to be generalized. The old 1.1 spec used to say that "variables declared as pointers are assumed to point to the __private address space if an address space qualifier is not specified", and everyone just assumes that that's meant to be a general rule for pointer types, not something that only applies to variables and only at the outermost level of pointer. Similarly, you have to read the paragraph about assignability as a general conversion rule, and you have to guess that probably when the 2.0 specification says that "[c]asting a pointer to address space A to a pointer to address space B is illegal if A and B are named address spaces and A is not the same as B", that it's okay to both cast into *and out of* the generic address space, and you have to guess at what the semantics of the latter are supposed to be. The best guidance I can offer is to instead read the Embedded C specification (WG14 N1169 might be the most version), which goes into address spaces somewhat more rigorously; it's what I've always used as a baseline for understanding what OpenCL is trying to do. From this point of view, the four OpenCL 1.0 address spaces are disjoint, and OpenCL 2.0's generic address space is simply one that is a superset of all the others. Embedded C allows casts between address spaces, but the cast has undefined behavior if the pointer doesn't actually point into the destination address space. Since the four named address spaces are formally disjoint, casts directly between them are always invalid, and so OpenCL's rule that they're just forbidden makes sense. Anyway, that's what I would suggest as an approach for understanding OpenCL.

This wasn't addressed at all by the SPIR spec, which just defined a single 64-bit datalayout string. sizeof really indicates the in-memory representation's size, so by using a DataLayout with 32-bit pointers with an ABI alignment of 64-bit, we get the necessary codegen properties of a 32-bit pointer while maintaining the required ABI and single sizeof().

Well, SPIR is an abstract representation. If you were emitting SPIR, you wouldn't want any of these assumptions about stack addressing to leak into it. If you're lowering from SPIR, of course, you just have to invent an ABI rule for how your 32-bit pointers are passed around as 64-bit quantities, and then you globally rewrite the IR to implement that.

(2) Functions are currently always defined in address space 0. A function pointer in your implementation must be in a 64-bit address space. (My understanding was that OpenCL generally forbade function pointers, so I'm not sure if this is actually important?)

It is true that OpenCL doesn't allow function pointers at the source level. I'm working on implementing calls without inlining everything now, and to be able to correctly lower LLVM IR for them requires the DAG picking the right MVT when emitting the global address so it still comes up there. Function pointer support is also helpful for implementing other languages being worked on like C++ and python. I believe recent version of CUDA support them as well.

Well, if you don't have real function pointers, and therefore the only use of functions is in direct calls, it feels like you can always hack around pointer-size problems in your backend pretty easily. If the source language has real function pointers, it also has to decide what address space they're in — it could be a dedicated address space for functions, which would be legal under C rules, but which would take some effort to support in Clang.

(3) Some optimizations assume that 0 is an invalid address in address space 0, but in fact it's a valid stack address.

It's entirely possible that we should, indeed, change all of these assumptions. For the time being, you feel that's not a reasonable fight, and so you would like to allow IR to use an explicitly non-0 address space for the stack.

Yes. Some day I would like to be able to apply some of these assumptions to other non-0 address spaces as well (i.e. 0 is null in some address spaces, but not others) but I think that is a larger task.

Yes, that seems like a good goal. In fact, it would be nice if the assumptions about address space 0 were just a default behavior that could be overridden by the data layout.

I don't think the assumptions in Clang that the Clang AST address space 0 equals LLVM's address space 0 are nearly as pervasive as you think. At any rate, since you require stack allocations to be in a different LLVM address space from the generic address space, but (per C) taking the address of a local variable yields a pointer in the generic address space, you're going to have to teach Clang about these differences between address spaces anyway, because it'll have to insert address-space conversions in a number of places.

Yes, I expect there will be many addrspacecasts inserted from the alloca, which then the InferAddressSpaces pass would hopefully optimize away.

Okay. So, to summarize:

In amdgpu OpenCL, at source level, sizeof(void private *) == sizeof(void {global,local,constant} *) == 8. However, at an implementation level, stack addresses are actually 32-bit, and you want that to be modeled correctly in the IR. Therefore, allocas need to return a value in a 32-bit address space, which the frontend will immediately widen to a 64-bit address space in order to match the rules for __private. In order to maintain performance, you have a pass that recognizes when an address value is the result of that kind of widening and re-narrows it to the 32-bit stack address space.

This is mostly accurate except for the detail about the cast. For OpenCL 2.0 or C++ the alloca pointer will most likely need to be addrspacecasted to the OpenCL generic address space.

It looks to me like the language spec says that the address of a local variable is in the __private address space, but that a pointer in any address space can be promoted into the generic address space. Clang's AST should call out that second promotion as a separate ImplicitCastExpr. Of course, you may decide that you want to peephole it into a single addrspacecast instruction.

Private pointers in memory may need to be zero extended to 64-bit in memory. For OpenCL 1.x there is no generic address space, and the old subtargets don't have the necessary hardware features to support this.

By "this", you mean an efficiently-accessible generic address space? Presumably because the different memory-access operations work on specific address spaces and so, as you say, you would need creative codegen to figure out which case the pointer originally belonged to. That makes sense.

Yes

Because LLVM currently makes some unfortunate assumptions about address space 0, this stack address space cannot be address space 0. These assumptions include:
(1) As far as debug info is concerned, the size of a pointer is the size of a pointer in address space 0. (Is this assumption why sizeof(void private *) == 8 despite private being clearly intended in the OpenCL spec to be the address space of stack pointers?)

This is a point that the spec is vague on. While the spec explicitly allows different size pointers for different address spaces, there isn't really any detailed description of what this entails.

I don't see what the problem is. The spec doesn't allow pointers to be converted between address spaces at all (except for the promotion into the generic address space in OpenCL 2.0). Pointers into different address spaces are just completely different types.

The problem is mostly what happens if the host and device pointer size don't match? Struct layouts etc. still need to be compatible for whatever memory buffer was passed into the kernel. There isn't much practical reason to do it, but you could have a struct with private pointer members in it changing the offsets of the other items, not that you could do anything valid with the contents.

Because LLVM currently makes some unfortunate assumptions about address space 0, this stack address space cannot be address space 0. These assumptions include:
(1) As far as debug info is concerned, the size of a pointer is the size of a pointer in address space 0. (Is this assumption why sizeof(void private *) == 8 despite private being clearly intended in the OpenCL spec to be the address space of stack pointers?)

This is a point that the spec is vague on. While the spec explicitly allows different size pointers for different address spaces, there isn't really any detailed description of what this entails.

I don't see what the problem is. The spec doesn't allow pointers to be converted between address spaces at all (except for the promotion into the generic address space in OpenCL 2.0). Pointers into different address spaces are just completely different types.

The problem is mostly what happens if the host and device pointer size don't match? Struct layouts etc. still need to be compatible for whatever memory buffer was passed into the kernel. There isn't much practical reason to do it, but you could have a struct with private pointer members in it changing the offsets of the other items, not that you could do anything valid with the contents.

The easy lowering would be to leave a 64-bit private address space around but consider it "illegal". Address space conversions between it and the 32-bit private address space would simply be zexts and truncs. But your 8-byte-aligned pointer idea works, too, assuming it doesn't give LLVM too many fits to have an over-aligned primitive type.

The more complex lowering would be to rewrite all the struct types to replace T addrspace(private64)* with { T addrspace(private32)*, i32 }. There's actually relatively few IR constructs that work with struct types — GEP, insertvalue/extractvalue, constant aggregates, I think that's it. The IR type would change alignment, but only by getting weaker, and memory accesses should all have explicit alignments anyway.

Anyway, I guess we're getting a bit afield.

John.

yaxunl added a subscriber: yaxunl.Mar 30 2017, 11:22 AM
whchung added a subscriber: whchung.Apr 2 2017, 7:49 PM

Any more comments for the patch?

Not from me.

t-tye added a subscriber: t-tye.Apr 6 2017, 8:45 AM

Is this good to commit?

efriedma accepted this revision.Apr 10 2017, 11:14 AM

Yes, I think you've addressed everything.

This revision is now accepted and ready to land.Apr 10 2017, 11:14 AM
arsenm closed this revision.Apr 10 2017, 3:41 PM

r299888